Match a word with similar words using Solr? - php

I want to search for threads in my MySQL database with Solr.
But I want it to search not just for the words in the thread, but for similar words too.
E.g. if a thread title is "dog for sale" and the user searches for "dogs", the title should be in the results.
Also, if a user searches for "mac os x", threads containing "snow leopard" should appear.
I would also like the ability to link words the application thinks are related, e.g. house and apartment.
How is this kind of logic done?
I know that with Solr you can look up words in a dictionary file you create/add, so Solr will look up "dogs" and see what related words there are (e.g. "dog").
But where do you find such a dictionary?
I have no idea how to implement this.
Please point me in the right direction.
Thanks

I think you'll have to build such a dictionary yourself, since it's very application-specific. "House" and "Apartment" might be similar terms for your application but very distant in another application.
Once you have this dictionary you can use it through the SynonymFilterFactory.
Matching "dog" when the user searches for "dogs" is managed by the stemmer and doesn't require any dictionary.
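To give a rough idea of what such an application-specific dictionary does, here is a minimal PHP sketch that expands a query with related terms before it is sent to the search engine. The dictionary contents and the expandQuery() helper are made up for this example; in Solr itself the mapping would normally live in a synonyms file read by SynonymFilterFactory rather than in PHP.

```php
<?php
// Hypothetical, application-specific dictionary: each key maps to related terms.
$synonyms = array(
    'mac os x'  => array('snow leopard'),
    'house'     => array('apartment'),
    'apartment' => array('house'),
);

/**
 * Return the original query plus any related terms found in the dictionary.
 * Illustrative helper only, not part of Solr or of any PHP Solr client.
 */
function expandQuery($query, array $synonyms)
{
    $query = strtolower(trim($query));
    $terms = array($query);

    foreach ($synonyms as $phrase => $related) {
        // Match the dictionary phrase as whole words inside the query.
        if (preg_match('/\b' . preg_quote($phrase, '/') . '\b/', $query)) {
            $terms = array_merge($terms, $related);
        }
    }

    return array_values(array_unique($terms));
}

print_r(expandQuery('Mac OS X', $synonyms));
```

The expanded list can then be OR-ed together in the query. Plural forms such as "dogs" are deliberately not handled here, since that is the stemmer's job.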

You could use the synonyms.txt file and create your own dictionary.
Another option for you could be fuzzy search.

Related

Optimize my search engine

I am trying to optimize my search engine. Right now, I am running a strcmp between the search words the user entered and keywords stored in the database. I am trying to come up with a way so that the more matches the user's search words have with the keywords, the sooner the entry will show up in the search results.
For example, if the user searches for "red apple painting" and I have two entries for that item with the following keywords 1. "old apple painting green" 2. "apple painting red new york" I would like the second entry to come up first in the search results, because all of the user's search words were found in the keywords stored in the db.
Any help on how I can achieve this?
Take a look at full text search.
You may also want to consider an external text search engine such as Lucene or Sphinx.
You need to create an index of words. The index would contain word id, doc id, number of hits, and position of hits. Then the searcher will be able to give results like you want. There are free indexing tools available. But if you want to develop your own, then follow the original paper by Google's founders:
http://infolab.stanford.edu/~backrub/google.html
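As a small-scale illustration of the word index idea, here is a PHP sketch that indexes the two keyword strings from the question and ranks entries by how many distinct search words they contain. The entry ids and the function names buildIndex() and rankByMatches() are invented for this example, and a real index would be stored in the database rather than rebuilt in memory.

```php
<?php
// Example data taken from the question; the entry ids 1 and 2 are arbitrary.
$docs = array(
    1 => 'old apple painting green',
    2 => 'apple painting red new york',
);

// Build the index: word => entry id => list of positions where the word occurs.
function buildIndex(array $docs)
{
    $index = array();
    foreach ($docs as $docId => $text) {
        $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $position => $word) {
            $index[$word][$docId][] = $position;
        }
    }
    return $index;
}

// Score each entry by how many distinct query words it contains.
function rankByMatches($query, array $index)
{
    $scores = array();
    $words = array_unique(preg_split('/\s+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $word) {
        if (!isset($index[$word])) {
            continue; // word does not occur in any entry
        }
        foreach ($index[$word] as $docId => $positions) {
            $scores[$docId] = isset($scores[$docId]) ? $scores[$docId] + 1 : 1;
        }
    }
    arsort($scores); // highest score first, entry ids preserved as keys
    return $scores;
}

$index = buildIndex($docs);
print_r(rankByMatches('red apple painting', $index));
```

For "red apple painting", entry 2 matches all three words and entry 1 only two, so entry 2 is listed first. The stored positions are not used for ranking here, but they are what you would need for phrase or proximity matching.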
Find relevant keywords with good search traffic potential.
Create and optimize pages for search engines and users alike.
Make sure your site is accessible to both bots and humans.
Build relevant links from other reputable websites.

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation to be a separate word first, e.g. "Something, like this." -> "Something , like this ." OR, you can just remove all punctuation.
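$content = strtolower($content); // lowercase first, otherwise the next line strips capital letters
$content = preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
$words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the text into words
$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords); // flip so that isset() can be used for fast lookups
$result = array(); $temp = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) || strlen($s) < 3) {
        // a stopword or very short word ends the current phrase
        if (count($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (count($temp) > 0) $result[] = implode(' ', $temp); // flush the last phrase
$phrases = array_count_values($result); // phrase => number of occurrences
arsort($phrases); // most frequent phrases first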
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
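A minimal sketch of that grouping step, assuming you have already run the frequency count for each item and kept its top 3 phrases. The item ids, the phrases and the groupByPhrase() helper are invented for illustration.

```php
<?php
// Assumed input: for each search result, its top 3 phrases from the
// frequency count. Ids and phrases are made up for this example.
$topPhrases = array(
    'result1' => array('apple pie', 'recipe', 'baking'),
    'result2' => array('apple pie', 'dessert', 'cinnamon'),
    'result3' => array('laptop', 'battery', 'review'),
    'result4' => array('review', 'laptop', 'screen'),
);

// Group results under every phrase they share with at least one other result.
function groupByPhrase(array $topPhrases)
{
    $groups = array();
    foreach ($topPhrases as $id => $phrases) {
        foreach ($phrases as $phrase) {
            $groups[$phrase][] = $id;
        }
    }
    // Keep only phrases that occur in two or more results.
    return array_filter($groups, function ($ids) {
        return count($ids) > 1;
    });
}

print_r(groupByPhrase($topPhrases));
```

Each surviving phrase doubles as the group's name. Note that a result can land in more than one group this way (here "laptop" and "review" hold the same two results), so you may want to merge groups with identical members.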
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit too vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off but check out OpenCalais. They have a web service which allows you to pass a block of text in and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in PHP and it's always been quite easy to work with.
Again, might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.
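In the meantime, here is a rough PHP sketch of the filter step, using an in-memory array as a stand-in for the many-to-many URL/tag table. The URLs, tags and the topFilters() helper are invented for this example; against a real database this would be a join with GROUP BY and ORDER BY on the tag count.

```php
<?php
// Stand-in for the many-to-many join table: each row links one URL to one tag.
// All URLs and tags are made up for this example.
$urlTags = array(
    array('url' => 'http://example.com/a', 'tag' => 'laptops'),
    array('url' => 'http://example.com/a', 'tag' => 'reviews'),
    array('url' => 'http://example.com/b', 'tag' => 'laptops'),
    array('url' => 'http://example.com/c', 'tag' => 'phones'),
    array('url' => 'http://example.com/c', 'tag' => 'reviews'),
);

// Return the most frequent tags among the URLs in the current result set.
function topFilters(array $resultUrls, array $urlTags, $limit = 2)
{
    $wanted = array_flip($resultUrls); // url => index, for fast isset() checks
    $counts = array();
    foreach ($urlTags as $row) {
        if (isset($wanted[$row['url']])) {
            $tag = $row['tag'];
            $counts[$tag] = isset($counts[$tag]) ? $counts[$tag] + 1 : 1;
        }
    }
    arsort($counts); // most common tags first
    return array_slice($counts, 0, $limit, true);
}

$results = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');
print_r(topFilters($results, $urlTags));
```

The returned tag => count pairs are the named groups you would show as filters next to the results.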

PHP Detect Pages Genre/Category

I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then creating keywords for these categories (Funny = Laughter, Funny, Joke etc).
Then searching through a webpage (maybe using cURL) for these keywords and assigning it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
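The steps above can be sketched in PHP roughly as follows. This assumes the page has already been fetched and split into heading text and body text; the category keywords, the weights (3 for headings, 1 for body) and the threshold are arbitrary values picked for illustration, and classifyPage() is a made-up helper name.

```php
<?php
// Hypothetical category dictionary; a real one would be far larger.
$categories = array(
    'Funny' => array('laughter', 'funny', 'joke'),
    'Tech'  => array('software', 'computer', 'programming'),
);

// Score a page against each category. Words in headings count more than body words.
function classifyPage($headingText, $bodyText, array $categories, $threshold = 3)
{
    // Accumulate a weight per word: heading words weigh 3, body words weigh 1.
    $weights = array();
    $sources = array(array($headingText, 3), array($bodyText, 1));
    foreach ($sources as $source) {
        list($text, $weight) = $source;
        foreach (str_word_count(strtolower($text), 1) as $word) {
            $weights[$word] = isset($weights[$word]) ? $weights[$word] + $weight : $weight;
        }
    }

    // Sum the weights of each category's keywords and keep scores above the threshold.
    $matches = array();
    foreach ($categories as $name => $keywords) {
        $score = 0;
        foreach ($keywords as $keyword) {
            if (isset($weights[$keyword])) {
                $score += $weights[$keyword];
            }
        }
        if ($score > $threshold) {
            $matches[$name] = $score;
        }
    }
    arsort($matches); // best-scoring category first
    return $matches;
}

$result = classifyPage('Joke of the day', 'A funny joke about a computer.', $categories);
print_r($result);
```

Here the page scores 5 for "Funny" (the heading "joke" counts 3, the body "joke" and "funny" count 1 each) and only 1 for "Tech", so only "Funny" passes the threshold. Common words like "and" or "or" never need to be stripped explicitly in this version because they simply do not appear in any category's keyword list.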
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.

So I've been recommended Zend Search Lucene (PHP) for search functionality, but is it better than MySQL full-text?

I've been playing around with MySQL's full-text search and I'm not sure if it is the best.
Let's say I search for "how do I book an appointment?" It will give me the correct results from the database, such as "how can I book appointments?" I think it is because "book" is an uncommon word.
What if the user searches "how can I schedule an appointment?" I find it will not retrieve any records.
I think it is because there are so many records with "appointment" in them.
So am I to understand a user can only get information on how to book an appointment if they use "book" in their question?
Lastly, should I be cleaning out words like "how", "I" etc. in a MySQL full-text search? Should I be stemming words as well?
Take a look at Solr as well.

php / sql search

I want to find articles when searching on the following keywords:
"maruti sx4 maintenance costs against honda city"
I want a query or PHP regular expression which can find an article containing the text below:
"SX4 maintenance cost is lesser because of Maruti. Honda City maintenance is also okay."
i.e. I want a function/code which can find the article by matching "maintenance cost" (which is the common text).
Please guide me on how to do it.
Thanks
Satish Kalepu
Straightforward solution
The not-so-efficient solution is to match each of your search terms one by one against your complete set of articles. For each new query you would repeat this process.
Use explode to split your query string into an array of individual search terms, and stripos to test if a term occurs within the text of an article.
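A minimal sketch of this approach, using the query and article text from the question. The countMatchingTerms() helper is a name made up for this example.

```php
<?php
// Count how many of the query's terms occur (case-insensitively) in the text.
function countMatchingTerms($query, $article)
{
    $matched = 0;
    foreach (array_unique(explode(' ', strtolower(trim($query)))) as $term) {
        // Skip empty strings left by repeated spaces; stripos() returns false on no match.
        if ($term !== '' && stripos($article, $term) !== false) {
            $matched++;
        }
    }
    return $matched;
}

$query   = 'maruti sx4 maintenance costs against honda city';
$article = 'SX4 maintenance cost is lesser because of Maruti. Honda City maintenance is also okay.';

echo countMatchingTerms($query, $article), "\n";
```

Five of the seven terms match. "costs" does not match "cost" because stripos only does substring matching, which is one reason stemming comes up further down. To rank several articles, call the function for each one and sort by the returned count.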
Document Retrieval System
If you want to create a full document retrieval system from scratch, you probably should start by creating an inverted index mapping search terms to documents (articles).
Then for each individual search term you can retrieve the matching documents.
The document which has most matches would be the output of the search system, or you can rank the found documents by the number of search terms matched.
This simple idea can become more advanced if you take into account word stemming and document/term frequency (i.e. the word "the" is less interesting as a search query than "honda" as almost all documents contain "the" but few contain "honda").
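To illustrate the document/term frequency part, here is a small PHP sketch that weights each query term by its inverse document frequency, so that "the" contributes nothing while "honda" dominates the score. The toy corpus and all function names are invented for this example, the weighting is the plain tf * log(N/df) variant, and stemming is left out.

```php
<?php
// Toy corpus, invented for illustration.
$docs = array(
    1 => 'the honda city is the best car',
    2 => 'the maruti sx4 maintenance cost',
    3 => 'the weather in the city',
);

function tokenize($text)
{
    return preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Inverse document frequency: rare terms get a higher weight than common ones.
function inverseDocumentFrequencies(array $docs)
{
    $docCounts = array(); // term => number of documents containing it
    foreach ($docs as $text) {
        foreach (array_unique(tokenize($text)) as $term) {
            $docCounts[$term] = isset($docCounts[$term]) ? $docCounts[$term] + 1 : 1;
        }
    }
    $idf = array();
    foreach ($docCounts as $term => $count) {
        $idf[$term] = log(count($docs) / $count);
    }
    return $idf;
}

// Score = sum over query terms of (term frequency in the document * idf).
function scoreDocuments($query, array $docs, array $idf)
{
    $scores = array();
    foreach ($docs as $docId => $text) {
        $tf = array_count_values(tokenize($text));
        $score = 0.0;
        foreach (tokenize($query) as $term) {
            if (isset($tf[$term], $idf[$term])) {
                $score += $tf[$term] * $idf[$term];
            }
        }
        $scores[$docId] = $score;
    }
    arsort($scores); // best match first
    return $scores;
}

$idf = inverseDocumentFrequencies($docs);
print_r(scoreDocuments('the honda', $docs, $idf));
```

Because "the" occurs in every document its weight is log(3/3) = 0, so only document 1, the one containing "honda", gets a non-zero score.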
