I have a very simple company index built with Zend Lucene, using this code to create the index:
// store company primary key to identify it in the search results
$doc->addField(Zend_Search_Lucene_Field::Keyword('pk', $this->getId()));
// index company fields
$doc->addField(Zend_Search_Lucene_Field::Unstored('zipcode', $this->getZipcode(), 'utf-8'));
$doc->addField(Zend_Search_Lucene_Field::Unstored('name', $this->getName(), 'utf-8'));
I can search on the company name but not the zipcode. Is there a problem with Zend Lucene Search indexing integers? If someone with experience could shed some light on this, please help me out. I can only imagine that using Lucene to search by zipcode is pretty common.
I believe the default text analyzer for Zend Lucene does not index numbers. Zend comes packaged with several different text analyzers. Use the TextNum analyzer to search both numbers and letters. There are also a handful of other analyzers in the zend/search/lucene/analysis/analyzer/common folder that you may find useful.
You can change your default analyzer with the following code:
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum());
I believe your problem is with the Analyzer.
I suggest you use Zend_Search_Lucene_Field::Keyword,
instead of Zend_Search_Lucene_Field::Unstored for the zip code field.
This way, the Lucene analyzer will not modify the zip code while indexing.
The Java Lucene has explain() which can be used to debug searches.
You may have to print some interim values to simulate explain(), and see whether this is indeed the problem.
If you are searching for 123, you will get all hits with 123 as well as 34123, for instance. So you have to make sure that your index and your query string are unambiguous.
I suggest indexing the zipcode as a fixed-width string such as "000123". After that you can search the index with "000123" and you will get the correct result set and nothing like 34123. You only have to translate the zipcode into the "correct" query string.
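A minimal sketch of that padding step, assuming a fixed width of six characters as in the "000123" example. normalizeZipcode() is a hypothetical helper, not part of Zend Lucene; the idea is to run the same function when indexing and when building the query string:

```php
<?php
// Hypothetical helper: turn a zip code into a fixed-width, zero-padded string.
// The width of 6 is an assumption taken from the "000123" example above.
function normalizeZipcode($zipcode, $width = 6)
{
    // Drop everything that is not a digit (spaces, dashes, ...).
    $digits = preg_replace('/\D/', '', (string) $zipcode);
    // Left-pad with zeros so 123 and 34123 can no longer be confused.
    return str_pad($digits, $width, '0', STR_PAD_LEFT);
}

echo normalizeZipcode(123), "\n";     // 000123
echo normalizeZipcode('34123'), "\n"; // 034123
```

The padded value is what you would pass to the zipcode field at index time and use again in the query.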
Related
We have a website written in the CodeIgniter framework. Now we want to add a nice, fast soundex-based search function to the site. It's just a micro blog, so we would only search in the titles of the posts.
So what would be the best for us?
I have two ideas:
Create another column in the post table with the soundex copy of the title and simply have FULL-TEXT index on it.
Explode the words from the titles and save the soundex equivalent of the words in a new table with the id of the post. Just like an automatic tag system.
Which method is better, and why? Can you suggest a better way?
Thanks for all the answers!
Soundex is great - but it usually doesn't meet user expectations for search (established by Google etc.).
The common solution to text searching, including fuzzy searches and stemming, is to use something like SOLR; it's relatively easy to integrate with PHP using web service calls.
The Zend Framework has Lucene integration (I've never used it, but it might save you some time). Lucene is a free, open source text search platform.
You could also use the Double Metaphone algorithm.
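As a rough illustration of the second idea from the question (one phonetic key per title word), here is a sketch using PHP's built-in soundex() and metaphone(). Note that metaphone() is plain Metaphone, not Double Metaphone, which is not built into PHP. phoneticKeys() is a hypothetical helper name, and it assumes English titles:

```php
<?php
// Hypothetical helper: compute phonetic keys for every word in a post title.
// Each (word, key) pair could be stored in a separate table with the post id.
function phoneticKeys($title)
{
    // Lowercase and split on anything that is not a letter.
    $words = preg_split('/[^a-z]+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

    $keys = array();
    foreach ($words as $word) {
        $keys[$word] = array(
            'soundex'   => soundex($word),   // built-in Soundex
            'metaphone' => metaphone($word), // built-in (single) Metaphone
        );
    }
    return $keys;
}

$keys = phoneticKeys('Robert, Rupert!');
echo $keys['robert']['soundex'], "\n"; // R163
echo $keys['rupert']['soundex'], "\n"; // R163
```

At search time you would compute the same key for each query word and look it up in that table.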
What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation to be a separate word first, e.g. "Something, like this." -> "Something , like this ." OR, you can just remove all punctuation.
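$content=strtolower($content); // lowercase first, otherwise the regex below strips capital letters
$content=preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
$words=preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the text into words
$stopwords='the|and|is|your|me|for|where|etc...';
$stopwords=explode('|',$stopwords);
$stopwords=array_flip($stopwords); // flip so that isset() can be used for fast lookups
$result=array(); $temp=array();
foreach ($words as $s)
{
if (isset($stopwords[$s]) OR strlen($s)<3)
{
// a stopword or very short word ends the current phrase
if (sizeof($temp)>0)
{
$result[]=implode(' ',$temp);
$temp=array();
}
} else $temp[]=$s;
}
if (sizeof($temp)>0) $result[]=implode(' ',$temp); // flush the last phrase
$phrases=array_count_values($result); // phrase => number of occurrences
arsort($phrases); // most frequent phrases first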
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you do the matching is up to you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
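One possible way to do that matching, as a sketch only. It assumes you have already built one sorted phrase array per search result (in the shape produced by arsort above) and collected them in an array keyed by a result id; groupByTopPhrases() and that input layout are my own assumptions:

```php
<?php
// Hypothetical helper: group results that share one of their top phrases.
// $phraseCounts maps a result id to its sorted "phrase => count" array.
function groupByTopPhrases(array $phraseCounts, $top = 3)
{
    $groups = array();
    foreach ($phraseCounts as $id => $phrases) {
        // Only the $top most frequent phrases of each result are considered.
        foreach (array_slice(array_keys($phrases), 0, $top) as $phrase) {
            $groups[$phrase][] = $id;
        }
    }
    // A phrase only names a group if at least two results share it.
    return array_filter($groups, function ($ids) {
        return count($ids) > 1;
    });
}

$groups = groupByTopPhrases(array(
    'a' => array('zend lucene' => 3, 'search' => 2),
    'b' => array('zend lucene' => 1, 'php' => 1),
    'c' => array('python' => 2),
));
print_r($groups); // one group, 'zend lucene', containing results a and b
```

The phrase itself then doubles as the name of the group.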
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
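To give an idea of what K-Means involves, here is a naive sketch in plain PHP. It assumes each search result has already been turned into a numeric vector of equal length (for example term counts), that there are at least $k points, and it simply uses the first $k points as starting centroids. A real implementation would pick the starting points more carefully. The function names are my own:

```php
<?php
// Squared Euclidean distance between two numeric vectors of equal length.
function squaredDistance(array $a, array $b)
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $diff = $value - $b[$i];
        $sum += $diff * $diff;
    }
    return $sum;
}

// Naive K-Means sketch: returns "point id => cluster index".
function kMeans(array $points, $k, $maxIterations = 100)
{
    // Assumption: the first $k points are good enough as initial centroids.
    $centroids = array_slice(array_values($points), 0, $k);
    $assignments = array();

    for ($iter = 0; $iter < $maxIterations; $iter++) {
        // Assignment step: attach every point to its nearest centroid.
        $newAssignments = array();
        foreach ($points as $id => $point) {
            $best = 0;
            $bestDist = INF;
            foreach ($centroids as $c => $centroid) {
                $dist = squaredDistance($point, $centroid);
                if ($dist < $bestDist) {
                    $bestDist = $dist;
                    $best = $c;
                }
            }
            $newAssignments[$id] = $best;
        }
        if ($newAssignments === $assignments) {
            break; // nothing moved, so the clustering has converged
        }
        $assignments = $newAssignments;

        // Update step: move each centroid to the mean of its members.
        foreach ($centroids as $c => $centroid) {
            $members = array_keys($assignments, $c, true);
            if (count($members) === 0) {
                continue; // keep the old centroid for an empty cluster
            }
            $mean = array_fill(0, count($centroid), 0.0);
            foreach ($members as $id) {
                foreach ($points[$id] as $i => $value) {
                    $mean[$i] += $value / count($members);
                }
            }
            $centroids[$c] = $mean;
        }
    }
    return $assignments;
}

$clusters = kMeans(array(
    'a' => array(0, 0), 'b' => array(0, 1),
    'c' => array(10, 10), 'd' => array(10, 11),
), 2);
print_r($clusters); // a and b share one cluster, c and d the other
```

Note that K-Means only gives you the groups; you would still need something like the most frequent phrase per cluster to name them.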
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
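A small sketch of that last step, using plain arrays in place of the database tables. $urlTags stands in for the many-to-many join (URL => list of tags); that layout and the topFilters() name are assumptions for illustration:

```php
<?php
// Hypothetical helper: pick the most common tags among the current results.
// $resultUrls: URLs returned for the user's search.
// $urlTags:    URL => list of tags (stand-in for the url/tag join table).
function topFilters(array $resultUrls, array $urlTags, $limit = 5)
{
    $counts = array();
    foreach ($resultUrls as $url) {
        if (!isset($urlTags[$url])) {
            continue; // this result has not been tagged
        }
        foreach ($urlTags[$url] as $tag) {
            $counts[$tag] = isset($counts[$tag]) ? $counts[$tag] + 1 : 1;
        }
    }
    arsort($counts); // most frequent tags first
    return array_slice($counts, 0, $limit, true);
}

$urlTags = array(
    'u1' => array('php', 'search'),
    'u2' => array('php'),
    'u3' => array('php', 'search', 'lucene'),
);
print_r(topFilters(array('u1', 'u2', 'u3'), $urlTags, 2)); // php => 3, search => 2
```

In SQL the same thing would be a GROUP BY on the tag over the joined rows of the current result set.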
I'll work on query examples if you want.
What are the best practices to configure Zend Lucene to make the search results more relevant?
I have the following fields and field types:
productname (Text)
description (Text)
category (Keyword)
Please give some sample code.
Two concepts come to mind with your question, though I'm not sure exactly what you're looking for.
Score: A rating that indicates to what extent a document matches the search query. From the manual:
Zend_Search_Lucene uses the same scoring algorithms as Java Lucene. All hits in the search result are ordered by score by default.
$hits = $index->find($query);
foreach ($hits as $hit) {
echo $hit->id;
echo $hit->score;
}
The results are already ordered by score by default, from most to least relevant, so I assume you need something more.
Term Boosting: Used to influence the relevance of individual terms within a query. Quoting the manual once more:
Boosting allows you to control the relevance of a document by boosting individual terms. For example, if you are searching for
PHP framework
and you want the term "PHP" to be more relevant boost it using the ^ symbol along with the boost factor next to the term. You would type:
PHP^4 framework
This will make documents with the term PHP appear more relevant. You can also boost phrase terms and subqueries as in the example:
"PHP framework"^4 "Zend Framework"
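Applied to the fields from the question, one way to use this is to build a query string that boosts matches in productname over matches in description. This is only a sketch: buildBoostedQuery() is a hypothetical helper, the boost values are arbitrary, and I am assuming that the field:"phrase"^boost form is accepted by the query parser in the same way as the manual's examples:

```php
<?php
// Hypothetical helper: build a query string that searches the same phrase
// in several fields, each with its own boost factor.
function buildBoostedQuery($phrase, array $fieldBoosts)
{
    // Remove double quotes so the user's input cannot break the phrase syntax.
    $phrase = trim(str_replace('"', '', $phrase));

    $parts = array();
    foreach ($fieldBoosts as $field => $boost) {
        $parts[] = sprintf('%s:"%s"^%s', $field, $phrase, $boost);
    }
    return implode(' ', $parts);
}

// Arbitrary example boosts: a hit in the product name counts more.
echo buildBoostedQuery('zend framework', array('productname' => 4, 'description' => 1)), "\n";
// productname:"zend framework"^4 description:"zend framework"^1
```

The resulting string would then be passed to $index->find() as in the score example.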
Does this help at all?
Getting relevant results from any search engine is hard work.
With the level of detail you specify, it is hard to give you any specific advice.
I suggest you start with this paper.
As the title says, I need a search engine... for mysql searching.
My website is PHP based.
I was going with sphinx but my hosting company doesn't support full-text indexes!
So a search engine to be used without full-text!
It should be pretty powerful, and must include at least the functions below:
When searching for 'bmw 520', only matches where these two words appear in exactly this order are returned, not matches for only 'bmw' or only '520'.
When searching for 'bmw 330ci', results as above will be returned, but WITH AND WITHOUT the ci extension. There are a number of extensions in car models, as you all know (i, ci, si, fi etc).
I want the 'minus sign' to 'exclude' all results containing the word after the sign, e.g. 'bmw -330' will return all 'bmw' results without the '330' ones. (a NOT instead of the minus sign is also ok)
All special characters with accents, like 'é', are converted to their simple values, in this case 'e'.
list of words to ignore completely in the search
Thanks guys!
The Zend_Lucene search component works fairly well. I am not sure how it would cope with your second requirement; however, if you customized the tokenizer you should be able to do it by treating a change from letters to numbers as a new word.
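The splitting rule itself is simple. Here is a sketch in plain PHP, outside of any Lucene analyzer class, just to show the idea (splitAlphaNumeric() is a made-up name, and it only handles unaccented a-z and digits):

```php
<?php
// Hypothetical helper: split text into tokens, starting a new token
// whenever the input switches between letters and digits.
function splitAlphaNumeric($text)
{
    // Each match is either a run of letters or a run of digits.
    preg_match_all('/[a-z]+|[0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

print_r(splitAlphaNumeric('BMW 330ci')); // bmw, 330, ci
print_r(splitAlphaNumeric('bmw330'));    // bmw, 330
```

With tokens like these in the index, a search for '330' would match both '330ci' and 'bmw330'.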
The one I am really not sure about is the top requirement. Given how it is indexed, order becomes irrelevant in the search, so you may not be able to do it without heavy editing of Lucene, writing a filter (using Lucene to pull the matches, then checking the order), or writing your own solution. All of these will slow the search down, and add load to your server.
There is also Solr, but I have never used it and don't know anything about it. Sphinx was another one, but I see you have already ruled that out.
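The filter option could look roughly like this: let the search engine return candidate headlines, then keep only those that contain the words next to each other in the requested order. This is a sketch with made-up function names, and it works on plain strings rather than on Lucene hit objects:

```php
<?php
// Lowercase and collapse whitespace so comparisons are consistent.
function normalizeText($text)
{
    return trim(preg_replace('/\s+/', ' ', strtolower($text)));
}

// True if $phrase occurs in $text as whole words, in the given order.
function containsPhraseInOrder($text, $phrase)
{
    $pattern = '/\b' . preg_quote(normalizeText($phrase), '/') . '\b/';
    return preg_match($pattern, normalizeText($text)) === 1;
}

// Post-filter: keep only the candidate headlines that pass the order check.
function filterByPhrase(array $headlines, $phrase)
{
    return array_values(array_filter($headlines, function ($headline) use ($phrase) {
        return containsPhraseInOrder($headline, $phrase);
    }));
}

$candidates = array('BMW 520 touring', '520 BMW parts', 'bmw 5200');
print_r(filterByPhrase($candidates, 'bmw 520')); // only 'BMW 520 touring' remains
```

As said, this runs once per candidate, so it is extra work on top of the search itself.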
Xapian is very good (very comprehensive) if you have the time for the initial setup.
It works the way you would expect a search engine to work: tell the indexer which pieces of information to index under which namespace/table/object (Page, Profile, Products etc.), then issue queries for your users based on keywords. It also supports Google-style tags, e.g. "profile:Mark icecream" would search my profile for the word icecream. I seem to remember it supporting ranges too, for data you specify as numeric.
Can be used in local mode which can offer spelling modifications (Did you mean?), or remote mode that many sites can index to and query from.
What really saved me one time was the ability to attach transient non searchable data to an indexed item, e.g. attaching the DB id to all data indexed for that record, very good for then going and getting the whole record from the DB when your matches come back from xapian.
I have used a couple of search engines on my site during its time, but in the next rebuild I'm planning to move to Google Site Search.
There are several reasons for this:
Users are very familiar with the Google style of search result listings which improves usability and hence click-through rates
The Google engine is very good at guessing when to use the page description and when to use a fragment of the page (it is also very good at getting relevant fragments compared to some other engines)
It's used by thousands of very popular websites
Google is the most popular search engine around so you know their technology is both reliable and accurate
Google Site Search begins at $100 per annum for 1000 pages or less (and a limit on queries)
or you can use the free Google Custom Search Engine (but this has much less customizability)
I am creating a search engine for my php based website. I need to search a mysql table.
Thing is, the search engine must be pretty 'smart', so that users can easily find their items (it's a classifieds website).
I have currently set up a FULLTEXT search with this piece of code:
MATCH (headline) AGAINST ($querystring)
But this isn't enough...
For instance, let's say the field headline contains something like Bmw 330ci.
If I search for 330, I won't get any results. The ending ('ci') is just one of many endings in car models which must be taken into account when searching the table.
Or what if the headline field is bmw330? Also no results, because it only matches full words.
Also, what if the headline is bmw 330 and I search for bmw 520? With FULLTEXT I will still get bmw 330 as a result, even though I searched for bmw 520... Not good!
How should I solve this problem?
When it comes to fulltext search, people who want free solutions often tend to use either Sphinx or Solr.
I've not used any of those two, but I've read several times that they were great, and easy to use from/with PHP and MySQL.
Don't reinvent the wheel: inverted-index search engines are already there, free of charge, open source, easy and powerful. They have everything you need for this kind of search requirement.
Depending on your context, you can choose between a search library like Apache Lucene or a search platform like Apache Solr or Elasticsearch.
All of them have great documentation and are widely used. That keeps the learning curve low, even if you have never worked with full-text search before.