Solr configuration for autocompletion implementation with PHP

How do I have to index my data and configure Solr and my search options so that autocompletion (like Google's) with the following requirements is possible:
Products:
- We have products with their titles, descriptions and IDs, e.g. for the title: toshiba tecra s1: centrino 1.5 ghz/xp pro/15.0" tft/40 gb/256 mb+256mb/cd-rw-dvd-rom/lan/wi-fi
- These products, or fields of these products, have to be indexed in such a way that the following is possible (with no differentiation in how a user types the search term, e.g. TOSHIBA or tOSHiba):
- If a user starts entering the first three characters "tos", at most 20 results (the complete title (phrase), e.g. "toshiba tecra s1: centrino 1.5 ghz/xp pro/15.0" tft/40 gb/256 mb+256mb/cd-rw-dvd-rom/lan/wi-fi") should appear in the autocomplete box.
- If a user enters two terms, e.g. "toshiba tecra", the search result must be more precise, and only documents that contain the (coherent) terms "toshiba tecra" should be shown.
It would be great to get any hints on what kind of tokenizer/search component etc. to use.
I'm using Solr version 3.5.
Thank you for your thoughts,
Ramo

Solr 3.x has a built-in Suggester component, which allows you to build suggestions on selected fields.
The following links provide the implementation details:
1. http://lucidworks.lucidimagination.com/display/solr/Suggester
2. http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/
For alternative approaches, you can check the EdgeNGram implementation or the Terms Component.
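For the EdgeNGram route, a minimal sketch of the field analysis in schema.xml for Solr 3.5 could look like the following - the field and type names (title_ac, text_autocomplete) and the gram sizes are illustrative choices, not fixed values:

    <fieldType name="text_autocomplete" class="solr.TextField">
      <analyzer type="index">
        <!-- KeywordTokenizer keeps the whole title as one token, so a
             multi-word prefix like "toshiba tecra" stays coherent -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- lowercasing makes TOSHIBA / tOSHiba / toshiba all match -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index every prefix from 3 to 30 characters -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="title_ac" type="text_autocomplete" indexed="true" stored="true"/>
    <copyField source="title" dest="title_ac"/>

A query such as q=title_ac:"toshiba tec"&rows=20 would then return up to 20 complete stored titles whose beginning matches the typed prefix, case-insensitively; note that maxGramSize caps how long a typed prefix can still match.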

Related

Sphinx: incorrect relevance?

I have a project where users can search for electrical goods. Search is implemented with Sphinx (note: the Sphinx version is 2.0.4 and I can't update it).
For example, we have the query Светильник Е27 (lamp e27). The results are the following.
To me, the results are not correct, because I think results 6-11 are far more relevant than results 1-5.
Is it possible to fix this issue?
P.S. I already tried applying SPH_RANK_WORDCOUNT and SPH_RANK_SPH04 as the ranking mode, but the results are the same.
Having now clarified in comments, I can say:
1) Check which fields you have indexed for each document; it might be that Светильник is used a lot in those fields, boosting the ranking, whereas you seem to want most of the ranking weight to be on the title. You could omit less relevant fields.
2) You can also specifically make the title play a bigger part in ranking with setFieldWeights().
3) Finally, you can even match against the title only, using the extended match mode query @title Светильник Е27 - the words would then have to be in the title, so results 1-5 wouldn't even show.
... basically, it's all about manipulating which fields match and are used for ranking - see the sketch below.
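As a sketch with the classic Sphinx PHP API (sphinxapi.php) - the index name "products" and the weight values are assumptions, not values from the question:

    <?php
    // Sketch: weight the title field heavily and/or restrict matching to it.
    // Assumes the bundled Sphinx PHP API and an index named "products".
    require_once 'sphinxapi.php';

    $cl = new SphinxClient();
    $cl->SetServer('localhost', 9312);

    // point 2: make title count far more than other fields in ranking
    $cl->SetFieldWeights(array('title' => 10, 'description' => 1));

    // point 3: extended match mode allows the @field operator
    $cl->SetMatchMode(SPH_MATCH_EXTENDED2);
    $result = $cl->Query('@title Светильник Е27', 'products');

    if ($result === false) {
        echo 'Query failed: ' . $cl->GetLastError() . "\n";
    } elseif (!empty($result['matches'])) {
        foreach ($result['matches'] as $docId => $match) {
            echo $docId . ' weight=' . $match['weight'] . "\n";
        }
    }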

Finding images in Wikipedia that are being used across various articles

I'm trying to query Wikipedia using the MediaWiki API with PHP (and cURL), in order to search for images that are being used in various articles, by a specific search term. For example: search for 'panda', but get only images that are being used somewhere, and be able to go to the articles.
I am able to search for images generally using:
https://en.wikipedia.org/w/api.php?action=query&list=allimages&ailimit=100&aifrom=Panda&aiprop=url&format=xmlfm
and I know that basically this should show the usage:
https://commons.wikimedia.org/w/api.php?action=query&prop=images&list=imageusage&iutitle=File:MY_IMAGE_NAME&format=xmlfm
Trying the above does not give me the result I need - I can see a list of images, but I cannot tell whether or where they are being used.
Can anyone assist?
list=imageusage does not show cross-wiki usage; you'll need prop=globalusage for that. It is conveniently also a prop module, so it can be folded into the first query by using allimages as a generator:
action=query&generator=allimages&gailimit=100&gaifrom=Panda&prop=globalusage
(I omitted prop=images since it does not seem to serve any useful purpose here.)
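In PHP with cURL, that combined query could look like the sketch below (format=json instead of xmlfm for easy decoding; the Commons endpoint and the User-Agent string are assumptions):

    <?php
    // Sketch: list images from "Panda" onwards together with where they are
    // used, by folding prop=globalusage into a generator=allimages query.
    $url = 'https://commons.wikimedia.org/w/api.php?' . http_build_query(array(
        'action'    => 'query',
        'generator' => 'allimages',
        'gailimit'  => 100,
        'gaifrom'   => 'Panda',
        'prop'      => 'globalusage',
        'format'    => 'json',
    ));

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'image-usage-demo/0.1'); // the API asks for a descriptive UA
    $json = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($json, true);
    foreach ($data['query']['pages'] as $page) {
        if (empty($page['globalusage'])) {
            continue; // image exists but is not used in any article
        }
        echo $page['title'], "\n";
        foreach ($page['globalusage'] as $usage) {
            // each entry names the wiki and the article the image appears in
            echo '  used on ', $usage['wiki'], ' in ', $usage['title'], "\n";
        }
    }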

Most used words on website using Solr etc

I want to generate a list of the most used words on a website. The application should crawl the content of the site.
Does anyone know if this can be done with Solr or some other technique?
The list can be PHP objects/arrays or an XML file.
You might want to check http://wiki.apache.org/solr/TermsComponent
Example:
http://host:port/solr/core/terms?terms.fl=title&terms.sort=count
This will give you all the terms for the field title, ordered by count (the default).
terms.fl - the field you want to check the terms on
terms.sort={count|index} - if count, sorts the terms by term frequency (highest count first); if index, returns the terms in index order. The default is to sort by count.
This returns the indexed terms, which have gone through the tokenizer and filters, so if you need the terms as-is you can vary the field analysis (probably use the string field type).
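As a PHP sketch (host, port and core name are placeholders; this assumes a /terms request handler is registered, as in the example solrconfig.xml):

    <?php
    // Sketch: fetch the 50 most frequent terms of the "title" field as a PHP array.
    // json.nl=map makes Solr render the term list as term => count pairs.
    $url = 'http://localhost:8983/solr/core/terms?' . http_build_query(array(
        'terms.fl'    => 'title',
        'terms.sort'  => 'count',
        'terms.limit' => 50,
        'wt'          => 'json',
        'json.nl'     => 'map',
    ));

    $response = json_decode(file_get_contents($url), true);
    foreach ($response['terms']['title'] as $term => $count) {
        echo $term . ' => ' . $count . "\n";
    }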
Solr is a search engine; it doesn't crawl websites. You need to make a simple website crawler using Scrapy (http://scrapy.org/) or some similar tool, design a Solr schema to record the data, crawl the websites, and send record updates to Solr. Your specific question would probably be answered by the SCHEMA BROWSER choice in the Solr admin menu of the web admin interface: click on DYNAMIC FIELDS, select the field you are interested in, and see the top 10 terms. Change the number to 50 and press ENTER to get the top 50.

PHP Detect Pages Genre/Category

I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any ideas so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT @Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).
Then search through a webpage (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google AdSense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of thing is beyond simple programming/development and would require significant resources to get it to work "right".
A basic system might work along the following lines (a PHP sketch follows the list):
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If the cumulative score > threshold, classify the site as belonging to the category
Rinse and repeat
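A toy PHP sketch of that loop - the keyword lists, the URL, and the threshold are invented for illustration, and the weighting step is reduced to raw word counts:

    <?php
    // Toy sketch: count keyword hits per category and classify the page
    // when the best score clears a threshold. All values are illustrative.
    $categories = array(
        'Funny' => array('laughter', 'funny', 'joke'),
        'Tech'  => array('php', 'server', 'software'),
    );
    $threshold = 3;

    $html  = file_get_contents('http://example.com/some-page');
    $text  = strtolower(strip_tags($html));
    $words = array_count_values(str_word_count($text, 1)); // word => occurrences

    $scores = array();
    foreach ($categories as $category => $keywords) {
        $scores[$category] = 0;
        foreach ($keywords as $keyword) {
            if (isset($words[$keyword])) {
                $scores[$category] += $words[$keyword];
            }
        }
    }
    arsort($scores);

    $best = key($scores);
    echo $scores[$best] >= $threshold
        ? "Classified as: $best\n"
        : "No category cleared the threshold\n";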
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, with a little more complexity, you may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.

Which third party search engine (free) should I use?

As the title says, I need a search engine... for MySQL searching.
My website is PHP based.
I was going with Sphinx, but my hosting company doesn't support full-text indexes!
So, a search engine to be used without full-text!
It should be pretty powerful, and it must include at least the functions below:
When searching for 'bmw 520', only matches where these two words come in exactly this order are returned, not matches for only 'bmw' or only '520'.
When searching for 'bmw 330ci', results as above will be returned, but WITH AND WITHOUT the ci extension. There are a number of extensions in cars, as you all know (i, ci, si, fi, etc.).
I want the minus sign to exclude all results containing the word after the sign, e.g. 'bmw -330' will return all 'bmw' results without the '330' ones. (a NOT instead of a minus sign is also OK)
All special character accents like 'é' are converted to their simple values, in this case 'e'.
A list of words to ignore completely in the search.
Thanks guys!
The Zend_Lucene search component works fairly well. I am not sure how it would cope with your second requirement; however, if you customize the tokenizer you should be able to do it by treating a change from letters to numbers as a new word.
The one I am really not sure about is the top requirement. Given how the content is indexed, word order becomes irrelevant in the search, so you may not be able to do it without heavy editing of Lucene, writing a filter (using Lucene to pull the matches, then checking the order), or writing your own solution. All of these will slow the search down and add load to your server - a sketch of the filter idea is below.
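A sketch of that filter approach with Zend_Search_Lucene - the index path and the stored field name title are assumptions:

    <?php
    // Sketch: let Zend_Search_Lucene find documents containing both words,
    // then keep only hits whose stored title has them in the searched order.
    require_once 'Zend/Search/Lucene.php';

    $index = Zend_Search_Lucene::open('/path/to/index');
    $hits  = $index->find('bmw AND 520'); // both words, any order at this stage

    $phrase  = 'bmw 520';
    $ordered = array();
    foreach ($hits as $hit) {
        // cheap, case-insensitive order/adjacency check on the stored field
        if (stripos($hit->title, $phrase) !== false) {
            $ordered[] = $hit;
        }
    }
    // $ordered now holds only exact-order matches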
There is also Solr, but I have never used it and don't know anything about it. Sphinx was another option, but I see you have already ruled that out.
Xapian is very good (very comprehensive) if you have the time for the initial setup.
It functions as you would expect a search engine to: you tell the indexer what bits of information to index under what namespace/table/object (Page, Profile, Products, etc.), then issue a query for your users based on keywords. It also supports Google-style tags, e.g. "profile:Mark icecream" would search my profile for the word icecream; I seem to remember it supporting ranges too, for data you specify as numeric.
It can be used in local mode, which can offer spelling modifications ("Did you mean?"), or in remote mode, which many sites can index to and query from.
What really saved me one time was the ability to attach transient, non-searchable data to an indexed item, e.g. attaching the DB id to all data indexed for that record - very good for then going and getting the whole record from the DB when your matches come back from Xapian. A sketch of that pattern is below.
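That pattern, sketched with the Xapian PHP bindings (the index path, term, and id value are placeholders):

    <?php
    // Sketch: attach the DB row id as non-searchable document data at index
    // time, and read it back from each match at query time.
    include 'xapian.php'; // the PHP bindings shipped with Xapian

    // indexing
    $db  = new XapianWritableDatabase('/path/to/index', Xapian::DB_CREATE_OR_OPEN);
    $doc = new XapianDocument();
    $doc->add_term('icecream');  // searchable term
    $doc->set_data('42');        // transient payload: the MySQL row id
    $db->add_document($doc);
    $db->flush();

    // searching
    $enquire = new XapianEnquire($db);
    $enquire->set_query(new XapianQuery('icecream'));
    $matches = $enquire->get_mset(0, 10);
    for ($m = $matches->begin(); !$m->equals($matches->end()); $m->next()) {
        // use this id to fetch the full record from MySQL
        echo 'DB id: ', $m->get_document()->get_data(), "\n";
    }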
I have used a couple of search engines on my site during its time, but in the next rebuild I'm planning to move to Google Site Search.
There are several reasons for this:
Users are very familiar with the Google style of search result listings which improves usability and hence click-through rates
The Google engine is very good at guessing when to use the page description and when to use a fragment of the page (it also very good at getting relevant fragments compared to some other engines)
It's used by thousands of very popular websites
Google is the most popular search engine around so you know their technology is both reliable and accurate
Google Site Search begins at $100 per annum for 1000 pages or less (and a limit on queries)
or you can use the free Google Custom Search Engine (but this has much less customizability)
