Sphinx: incorrect relevance? - php

I have a project where users can search for electrical goods. Search is implemented with Sphinx (note: the Sphinx version is 2.0.4 and I can't update it).
For example, we have the query Светильник Е27 (lamp E27). The results are as follows:
To me the results look incorrect, because results 6-11 seem far more relevant than 1-5.
Is it possible to fix this issue?
P.S. I have already tried the SPH_RANK_WORDCOUNT and SPH_RANK_SPH04 ranking modes, but the results are the same.

Having now clarified things in the comments, I can say:
1) Check which fields you have indexed for each document; it may be that Светильник occurs a lot in those fields and so boosts the ranking, whereas you seem to want most of the ranking weight on the title. You could omit the less relevant fields.
2) You can also specifically make the title play a bigger part in ranking with setFieldWeights().
3) Finally, you can even match against the title only, using the extended match mode query @title Светильник Е27 - the words would then have to appear in the title, so results 1-5 wouldn't even show.
... basically it's all about manipulating which fields match and are used for ranking (see the sketch below).
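A minimal sketch of points 2) and 3) using the PHP sphinxapi client; the index name products and the field names title/description are assumptions, so adjust them to your schema:
require_once 'sphinxapi.php';
$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);            // needed for the @field syntax
$cl->SetRankingMode(SPH_RANK_SPH04);
$cl->SetFieldWeights(array('title' => 10, 'description' => 1)); // title counts 10x

// Point 2: weight the title heavily but still match against all fields
$result = $cl->Query('Светильник Е27', 'products');

// Point 3: require the words to appear in the title
$resultTitleOnly = $cl->Query('@title (Светильник Е27)', 'products');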

Related

Common attributes from Search Results

We're using SphinxSearch (likely not really relevant, as we're returning the resulting objects from MySQL) to return users' search results. This part is working fine. We're displaying 30 items per page, but there may be up to 20k results that match.
What we're trying to do is add the ability to filter search results based on the total search results attributes and options. Take this amazon search for instance:
https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=tablet
If you look at the left side, you can filter by brand, category, keywords, discount percentage, memory capacity, screen size, etc. Obviously this doesn't just apply to the currently displayed search results, but to the entire result set (which in this Amazon example maxes out at 400 pages).
If we were to do that, how can we avoid loading and looping through all 400*30 results to display relevant attribute/category filters? We've tried looping just to see how long that would take, and it's easily above 15 seconds. We've also tried caching common search terms (such as tablet in this case) but obviously, most user searches won't fall neatly into easily cacheable result sets.
Also, is there a name for this kind of post-search, entire-result-set filtering?
Often called faceted search, i.e. you can filter the results by facets.
A good overview:
http://sphinxsearch.com/blog/2013/06/21/faceted-search-with-sphinx/
In short, let Sphinx calculate the facet lists and counts, rather than doing it in post-processing.
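A minimal sketch of the multi-query approach from that article, using the PHP sphinxapi client; the index name products and the attributes brand_id and category_id are assumptions, so use whatever attributes your index actually defines:
require_once 'sphinxapi.php';
$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetLimits(0, 30);                        // the page the user actually sees
$cl->AddQuery('tablet', 'products');          // query #0: the normal result list

// Facet queries: the same search, grouped by an attribute, so Sphinx returns
// one row per facet value with @count holding how many documents match it.
$cl->SetLimits(0, 20);
$cl->SetGroupBy('brand_id', SPH_GROUPBY_ATTR, '@count desc');
$cl->AddQuery('tablet', 'products');          // query #1: brand facet
$cl->SetGroupBy('category_id', SPH_GROUPBY_ATTR, '@count desc');
$cl->AddQuery('tablet', 'products');          // query #2: category facet

$results = $cl->RunQueries();                 // one round trip for all three
This way the facet counts come from the whole matching set, not just the 30 rows on the current page, and you never loop over 400*30 results in PHP.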

how to compare parts of 2 strings in php

Good evening,
I am facing a small problem whilst trying to build a little search algorithm.
I have a database table containing video game names and software names. Now I would like to add new offers by fetching and parsing xml files on other servers. The issue is:
How can I compare the strings for the product name so that it works even if the offer name doesn't match the product name stored in my database 100%?
As an example I am currently using this PHP + SQL code to compare the strings:
$query_GID = "select ID,game from gkn_catalog where game like '%$batch_name%' or meta like '%$batch_name%' ";
I am currently using the like operator in conjunction with two wild-cards to compare the offer name (batch_name) with the name in the database (game).
I would like to know how I can improve on this, as this method isn't very failsafe (or whatever you want to call it). What happens is:
If the database says the game title is:
Deus Ex Human Revolution Missing Link
and the batch_name says:
Deus Ex Human Revolution Missing Link DLC
the result will be empty/wrong/false ... well it won't find the game in my database at all.
Same goes for something like this:
Database = Lego Star Wars The Complete Saga
batch_name = Lego Star Wars : The Complete Saga
Result: False
Is there a better way to do the SQL query? Or how can I get that query working so it can deal with strings that contain special characters (like -minus- & [brackets]) and/or words which aren't included in the names within the database (like DLC, CE...)?
You're looking for fuzzy search algorithms and fuzzy search results. This is a whole field of study. However, there are also some straightforward tutorials to get you started if you take a quick google around.
You might be tempted to try something like PHP's wonderful levenshtein() function, which calculates the "closeness" of two strings. However, this would require matching against every record. If there will be thousands of records, that's out of the question.
MySQL has some matching tools which may help. I see that as I'm writing this, somebody has already mentioned FULLTEXT and MATCH() in the comments. Those are a great way to go.
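A rough sketch of that FULLTEXT approach using PDO with bound parameters instead of interpolating $batch_name into the SQL; it assumes a FULLTEXT index exists on (game, meta), e.g. ALTER TABLE gkn_catalog ADD FULLTEXT idx_game_meta (game, meta), and the connection details are placeholders:
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    "SELECT ID, game, MATCH(game, meta) AGAINST(? IN NATURAL LANGUAGE MODE) AS score
     FROM gkn_catalog
     WHERE MATCH(game, meta) AGAINST(? IN NATURAL LANGUAGE MODE)
     ORDER BY score DESC
     LIMIT 10"
);
$stmt->execute(array($batch_name, $batch_name));
$candidates = $stmt->fetchAll(PDO::FETCH_ASSOC);
// "Deus Ex Human Revolution Missing Link DLC" will now rank the catalogue row
// "Deus Ex Human Revolution Missing Link" highly even though it isn't an exact match.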
There are a few other good solutions to look into as well. Storing an index of keywords (with all the articles and helpers like of/the/an/am/is/are/was/from removed) and then searching on each word in the search is a simple solution. However, it doesn't produce great results, in that the returned values are not weighted well, and it doesn't localize at all.
There are lots of cheap and wonderful third-party search tools (Lucene comes to mind) as well that will do most of this work for you. You just call an API and they manage the caching, keywords, indexing, fuzzying, etc. for searches.
Here are some SO questions that are related to fuzzy searches, which will help you find more terminology and ideas:
Lightweight fuzzy search library
Fuzzy queries to database
Fuzzy matching on string
fuzzy searching an array in php
MySQL queries, as you found out, can use the percent character (%) as a wildcard in conjunction with the LIKE operator.
You have multiple solutions depending on what you want exactly.
you can make a fulltext search
you can search using language algorithm like soundex
you can search by keywords
Remember that you can run the search in multiple passes (search for an exact match, then with % on each side, then explode into words and insert % between every word, then search by keyword, etc.), depending on whether an exact match should have priority over a close match, and so on.
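A simplified sketch of that multi-pass idea, wrapped in a hypothetical helper around PDO; each pass is only tried if the previous one found nothing:
function findGame(PDO $pdo, $name) {
    $words  = preg_split('/\W+/', $name, -1, PREG_SPLIT_NO_EMPTY);
    $passes = array(
        array("SELECT ID, game FROM gkn_catalog WHERE game = ?", $name),                 // exact match
        array("SELECT ID, game FROM gkn_catalog WHERE game LIKE ?", '%' . $name . '%'),  // substring match
        array("SELECT ID, game FROM gkn_catalog WHERE game LIKE ?",
              '%' . implode('%', $words) . '%'),                                         // words in order, anything between
    );
    foreach ($passes as $pass) {
        $stmt = $pdo->prepare($pass[0]);
        $stmt->execute(array($pass[1]));
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        if ($rows) {
            return $rows;
        }
    }
    return array();
}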

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all the words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation into separate words first, e.g. "Something, like this." -> "Something , like this ." Or you can just remove all punctuation.
$content = strtolower($content);                      // lowercase first, so the regex below keeps the letters
$content = preg_replace('/[^a-z\s]/', '', $content);  // remove punctuation and digits
$content = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the text into words

$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords);                  // flip so isset() gives fast lookups

$result = array();
$temp = array();
foreach ($content as $s)
{
    if (isset($stopwords[$s]) OR strlen($s) < 3)
    {
        // a stopword (or very short word) ends the current phrase
        if (sizeof($temp) > 0)
        {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    }
    else
    {
        $temp[] = $s;
    }
}
if (sizeof($temp) > 0) $result[] = implode(' ', $temp);

$phrases = array_count_values($result);               // phrase => number of occurrences
arsort($phrases);                                      // most frequent phrases first
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
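A rough sketch of that grouping step, assuming (hypothetically) that you have run the snippet above once per search result and kept each result's $phrases array in $allPhrases, keyed by result id:
$topKeys = array();
foreach ($allPhrases as $id => $phrases) {
    $topKeys[$id] = array_slice(array_keys($phrases), 0, 3);  // top 3 phrases per result
}

$groups = array();
foreach ($topKeys as $id => $keys) {
    foreach ($keys as $phrase) {
        $groups[$phrase][] = $id;          // collect result ids under each of their top phrases
    }
}

// keep only phrases shared by more than one result - those are your named groups
$groups = array_filter($groups, function ($ids) { return count($ids) > 1; });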
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir - Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass a block of text in, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, it might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.

PHP Detect Pages Genre/Category

I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke etc).
Then search through a webpage (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines (a rough PHP sketch follows the list):
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify the site as belonging to that category
Rinse and repeat
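A very rough sketch of steps 2-6, assuming you have already fetched the page HTML (e.g. with cURL) into $html and defined your own category => keywords map; all the names and the threshold value here are made up for illustration:
$categories = array(
    'Funny' => array('laughter', 'funny', 'joke'),
    'Tech'  => array('software', 'gadget', 'programming'),
);
$threshold = 3;  // tune to taste

$text  = strtolower(strip_tags($html));
$words = array_count_values(preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY));

$scores = array();
foreach ($categories as $category => $keywords) {
    $score = 0;
    foreach ($keywords as $keyword) {
        if (isset($words[$keyword])) {
            $score += $words[$keyword];   // weight by how often the keyword appears
        }
    }
    $scores[$category] = $score;
}
arsort($scores);
$best = key($scores);
$pageCategory = ($scores[$best] >= $threshold) ? $best : 'Uncategorised';
This skips the heading/anchor weighting from step 4, but it shows the general shape.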
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on the unique word values on a page. So you would need to do the following (a rough sketch follows the list)...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.
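A minimal sketch of steps 3-5, assuming hypothetical tables page_words(page_url, word, word_count) and labels(label, word) that you would design yourself, and an existing PDO connection in $pdo:
$counts = array_count_values(
    preg_split('/\W+/', strtolower(strip_tags($html)), -1, PREG_SPLIT_NO_EMPTY)
);
$insert = $pdo->prepare('INSERT INTO page_words (page_url, word, word_count) VALUES (?, ?, ?)');
foreach ($counts as $word => $count) {
    if (strlen($word) < 3) continue;      // crude stopword/short-word filter
    $insert->execute(array($url, $word, $count));
}

// later: derive the page's labels by joining the stored counts against your labels table
$labelStmt = $pdo->prepare(
    'SELECT l.label, SUM(pw.word_count) AS score
     FROM page_words pw JOIN labels l ON l.word = pw.word
     WHERE pw.page_url = ?
     GROUP BY l.label ORDER BY score DESC'
);
$labelStmt->execute(array($url));
$labels = $labelStmt->fetchAll(PDO::FETCH_ASSOC);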

Which third party search engine (free) should I use?

As the title says, I need a search engine... for mysql searching.
My website is PHP based.
I was going with sphinx but my hosting company doesn't support full-text indexes!
So a search engine to be used without full-text!
It should be pretty powerful, and it must include at least the functions below:
When searching for 'bmw 520', only matches where these two words appear in exactly this order are returned, not matches for only 'bmw' or only '520'.
When searching for 'bmw 330ci', results as above will be returned, but WITH AND WITHOUT the ci extension. There are a number of extensions in cars, as you all know (i, ci, si, fi etc).
I want the minus sign to exclude all results containing the word after the sign, e.g. 'bmw -330' will return all 'bmw' results without the '330' ones (a NOT instead of a minus sign is also OK).
All special character accents like 'é' are converted to their simple values, in this case 'e'.
A list of words to ignore completely in the search.
Thanks guys!
The Zend_Search_Lucene component works fairly well. I am not sure how it would cope with your second requirement; however, if you customized the tokenizer you should be able to do it by treating a change from letters to numbers as a new word.
The one I am really not sure about is the top requirement. Given how it is indexed, word order becomes irrelevant in the search, so you may not be able to do it without heavy editing of Lucene, writing a filter (using Lucene to pull the matches, then checking the order), or writing your own solution. All of these will slow the search down and add load to your server.
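A minimal sketch of indexing and searching with Zend_Search_Lucene from Zend Framework 1; the index path, field names and values are made up for illustration, and it assumes ZF1 is on the include path. The minus-sign exclusion (your third requirement) works out of the box, while the first two would need the custom work described above:
require_once 'Zend/Search/Lucene.php';

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive()
);  // index digits as well as letters, so terms like '520' and '330' survive tokenization

// build the index once - it lives on disk, no MySQL full-text indexes needed
$index = Zend_Search_Lucene::create('/path/to/index');
foreach (array(1 => 'BMW 520', 2 => 'BMW 330') as $dbId => $title) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('db_id', $dbId)); // stored, not searchable
    $index->addDocument($doc);
}
$index->commit();

// query it later
$index = Zend_Search_Lucene::open('/path/to/index');
$hits = $index->find('bmw -330');   // matches 'BMW 520' but excludes 'BMW 330'
foreach ($hits as $hit) {
    echo $hit->title . ' (db id ' . $hit->db_id . ")\n";
}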
There is also Solr, but I have never used it and don't know anything about it. Sphinx was another one, but I see you have already ruled that out.
Xapian is very good (very comprehensive) if you have the time for the initial setup.
It functions as you would expect a search engine to: you tell the indexer what bits of information to index under what namespace/table/object (Page, Profile, Products etc), then issue a query for your users based on keywords. It also supports Google-style tags, e.g. "profile:Mark icecream" would search my profile for the word icecream, and I seem to remember it supporting ranges too for data you specify as numeric.
It can be used in local mode, which can offer spelling suggestions (Did you mean?), or in remote mode, which many sites can index to and query from.
What really saved me one time was the ability to attach transient, non-searchable data to an indexed item, e.g. attaching the DB id to all data indexed for that record; very handy for then going and getting the whole record from the DB when your matches come back from Xapian.
I have used a couple of search engines on my site during its time, but in the next rebuild I'm planning to move to Google Site Search.
There are several reasons for this:
Users are very familiar with the Google style of search result listings which improves usability and hence click-through rates
The Google engine is very good at guessing when to use the page description and when to use a fragment of the page (it is also very good at picking relevant fragments compared to some other engines)
It's used by thousands of very popular websites
Google is the most popular search engine around so you know their technology is both reliable and accurate
Google Site Search begins at $100 per annum for 1000 pages or less (and a limit on queries)
or you can use the free Google Custom Search Engine (but this has much less customizability)
