Spelling correction in Sphinx? - php

I was about to integrate Sphinx-based search into a website, but I've found that there's no built-in support for spelling correction.
Folks on the web suggest using pspell or other third-party libraries to get things done, but the data I'm going to search contains mostly "technical" terms like brand names, so I don't think common dictionaries will include them.
On the other hand, Xapian claims to have spelling correction support based on the indexed data, which is exactly what I want. Is it worth using Xapian instead? I'm still quite confused about which full-text search engine I should use: Sphinx seems to be quite good but lacks some cool features of Xapian (or maybe Lucene?), while it looks like the latter has a smaller community and less documentation.
I think I can solve the problem of words missing from the pspell dictionary by using a custom dictionary, but I'm not sure whether that will impose a noticeable performance penalty. I'm going to use the search system for spotlight search (a separate search via AJAX on every letter entered) on a pretty popular website, so performance matters.
Ideally, I'd like some fields like brand names to take priority over the common dictionary, but I guess that's not really important since most brand names are quite distinct from other words.
Any suggestions on the general design of the custom full-text search engine are welcome too.
Thanks

Sphinx has no built-in spelling correction, but it can be implemented using Sphinx. The only how-to article about this (by the Sphinx author) is at http://habrahabr.ru/blogs/sphinx/61807 (in Russian; you can use Google Translate to read it. Look at the second part of the article, titled "Я понял, это намек.", roughly "I get it, that's a hint.")
I implemented that method recently and it works perfectly!
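For readers who cannot follow the linked article, here is a minimal sketch of one common way to build index-driven suggestions: filter candidates by shared trigrams, then rank the survivors by edit distance. It is not the article's code. The function names and the brand list are hypothetical, and the in-memory loop stands in for what would be a query against a trigram index in a Sphinx-based setup.

```php
<?php
// Hypothetical helpers; $indexedWords stands in for terms exported from the search index.

/** Split a word into overlapping 3-character pieces, padded so that word edges count. */
function trigrams(string $word): array
{
    $padded = '__' . strtolower($word) . '__';
    $grams = [];
    for ($i = 0; $i <= strlen($padded) - 3; $i++) {
        $grams[substr($padded, $i, 3)] = true;
    }
    return array_keys($grams);
}

/** Return the indexed word closest to $query, or null if nothing is close enough. */
function suggest(string $query, array $indexedWords, int $maxDistance = 2): ?string
{
    $queryGrams = trigrams($query);
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($indexedWords as $word) {
        // Cheap filter: skip words that share no trigram with the query.
        if (count(array_intersect($queryGrams, trigrams($word))) === 0) {
            continue;
        }
        // The more expensive edit-distance check runs only on surviving candidates.
        $distance = levenshtein(strtolower($query), strtolower($word));
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$brands = ['Panasonic', 'Samsung', 'Sennheiser', 'Philips'];
echo suggest('samsnug', $brands), PHP_EOL;
```

Because the candidates come from the indexed data itself, brand names that no stock dictionary contains can still be suggested.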

Sphinx allows you to use morphology preprocessors and word forms dictionaries. Both of these combined could get you closer to what you want to achieve. You can read more about both topics here: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-morphology and further below.
There are several "flavours" of morphology preprocessors available; choose the one that best fits your needs. The docs also mention the Snowball project, which can be used to add stemmers for languages other than the built-in English and Russian, if needed. The project website: http://snowball.tartarus.org/
Sphinx is a very fast full text search engine and using stemmers is not likely to slow it down to the extent that you start noticing it.

Related

How to enhance SilverStripe FullTextSearchable search results?

I have enabled SilverStripe's FulltextSearchable in my _config.php file. I want to enhance the results of FulltextSearchable's default search.
The default search results are as follows:
If I search a single word that exists, it shows a result. OK
If I change only a letter from this word, it does not find anything. BAD
If I search multiple words, it does not find anything, except if these words are exactly like in the database. BAD
I don't want to use a Google custom Search module in my site.
Is there an easy way to enhance FullTextSearchable to find multiple words and to return better results?
Take a look at the Fulltextsearch module (Different from FullTextSearchable): https://github.com/silverstripe-labs/silverstripe-fulltextsearch. It uses Solr, and allows many different and flexible ways to index and subsequently search your SiteTree and DataObject subclasses using the Lucene search-syntax (Which is abstracted away from you).
Warning: While the module is stable, and flexible, with this comes the potential for complexity. Read the docs (well!) and don't be afraid to ask more questions on silverstripe.org or SO :-)

Loose searching approach

I want to make a searching option for my site, and for fun I decided I should at least try to make it myself (If I fail, there's always Google Custom Search).
The problem is, I don't even know how to approach this monster! Here are the requirements:
Not all keywords will be required in the search (should one search for "Big happy world", it would also search for "Big world", "happy world", etc.)
Common spelling mistakes should be considered (from a database, via edit distance, or from a predefined list of common mistakes such as rather then => rather than, etc.)
Search in both content and titles of posts, with an emphasis on titles.
Don't suck
I've searched my old pal Google for it, but the only reasonable things I found were academic-level papers on the subject (English isn't my native language; I'm good but not that good =( ).
So in short: does anyone know of a good place to start, a tutorial, an article, an example?
Thanks in advance.
There are several options you could try:
Apache Lucene (A PHP based implementation exists in the Zend Framework)
ElasticSearch (provides a REST-like API on top of Lucene)
Xapian
Sphinx
Probably a bunch of others too.
If you want to create your own search engine, Apache Lucene is a mature open source library that can take care of a big part of the functionality for you.
Using Lucene, you first index your information [using an IndexWriter]. This is done offline, to create the index.
On search, you use an IndexSearcher to find documents that match your query.
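The index-then-search flow can be illustrated with a toy in-memory inverted index. This is a sketch only and not Lucene's API: the TinyIndex class, its method names, and the title weight of 3 are all assumptions made for illustration. It also shows two of the requirements from the question, optional keywords and an emphasis on titles.

```php
<?php
// Toy inverted index; TinyIndex and its methods are hypothetical, not Lucene classes.
final class TinyIndex
{
    /** @var array<string, array<int, int>> term => [docId => accumulated weight] */
    private array $postings = [];

    private static function tokenize(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    /** Indexing step: record which documents contain which terms. */
    public function add(int $docId, string $title, string $body): void
    {
        // Assumed weighting: a title hit counts three times as much as a body hit.
        foreach ([[$title, 3], [$body, 1]] as [$text, $weight]) {
            foreach (self::tokenize($text) as $term) {
                $this->postings[$term][$docId] = ($this->postings[$term][$docId] ?? 0) + $weight;
            }
        }
    }

    /** Search step: any query term may match; more and heavier matches rank higher. */
    public function search(string $query): array
    {
        $scores = [];
        foreach (array_unique(self::tokenize($query)) as $term) {
            foreach ($this->postings[$term] ?? [] as $docId => $weight) {
                $scores[$docId] = ($scores[$docId] ?? 0) + $weight;
            }
        }
        arsort($scores); // highest score first
        return array_keys($scores);
    }
}

$index = new TinyIndex();
$index->add(1, 'Big world', 'A post about travel.');
$index->add(2, 'Cooking', 'It is a big happy world of recipes.');
$index->add(3, 'Gardening', 'Nothing relevant here.');
print_r($index->search('Big happy world'));
```

A real library adds stemming, better scoring, and on-disk storage on top of this basic structure.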
If you want some theoretical knowledge on "how it works", you should read more on information retrieval. A good place to start is Stanford's Introduction to Information Retrieval.

What's the best way to implement typo correction into a search in php/mysql?

I have a site that lists movies. Naturally people make spelling mistakes when searching for movies, and of course there is the fact that some movies have apostrophes, use letters to spell out numbers in the title, etc.
How do I get my search script to overlook these errors? Probably need something that's a little more intelligent than WHERE mov_title LIKE '%keyword%'.
It was suggested that I use a fulltext search engine, but all of those things look really complicated, and I feel that building them into my application will be like hell on earth. If I do have to use one, what's the least invasive one, that will be most painless to implement into existing code?
I think you'll have to implement an external fulltext search engine. MySQL just isn't good at fulltext search. I'd say you should give Lucene a go (tutorials). Zend Framework has an API that plugs into Lucene, making it easier to learn and utilize.
Presuming that you use MySQL - MySQL has no in-built functionality that is capable of doing this.
This means you will have to implement a full-text search yourself, or use a third party full text search tool.
If you implement it yourself, you should look into the metaphone or double metaphone algorithms (I'd recommend them over soundex, which is not nearly as good at this type of task) to store phonetic representations of all your words. However, building your own full-text search is no task for the faint-hearted. Don't attempt it if you don't consider yourself a database wizard.
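As a rough sketch of the phonetic idea, PHP's built-in metaphone() can turn each word into a key, and titles can then be matched on keys instead of spellings. The helper names and the title list below are hypothetical; in a real setup the keys would be stored in an indexed column rather than recomputed on every search.

```php
<?php
// Hypothetical helpers built on PHP's metaphone(); not a complete search engine.

/** Phonetic keys for every word in a string. */
function phoneticKeys(string $text): array
{
    // Drop apostrophes first so that "Schindler's" is treated as one word.
    $clean = str_replace("'", '', strtolower($text));
    $words = preg_split('/[^a-z]+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_map('metaphone', $words)));
}

/** Titles whose words cover every phonetic key of the query. */
function phoneticSearch(array $titles, string $query): array
{
    $wanted = phoneticKeys($query);
    $matches = [];
    foreach ($titles as $title) {
        if ($wanted !== [] && array_diff($wanted, phoneticKeys($title)) === []) {
            $matches[] = $title;
        }
    }
    return $matches;
}

$titles = ['The Matrix', 'Gladiator', 'Jurassic Park'];
print_r(phoneticSearch($titles, 'matrex'));
```

Phonetic keys tolerate wrong vowels and doubled letters well, but they do not catch every typo, so they are often combined with an edit-distance check.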
If you want a third party tool, Lucene is the way to go. It is ported into tons of different languages/platforms including PHP - you don't have to use Java.
I've used neither php nor mysql, but an alternative to full text search might be soundex searches.

PHP-based search frameworks

I'm going to make a small site which requires advanced search capabilities. Since reinventing the wheel isn't such a worthwhile activity, I've done a little googling and found there are some PHP based search frameworks, one of which is integrated into Zend framework.
What I would like to have in the framework:
Both full-text and catalogue search capabilities
Display results sorted by relevance
Ability to filter results by category
Sorting results by various criteria
Fast search
Fast insertion not required
Since the site will feature pretty much static content (some text and a product catalogue), I might go with some pre-generated index.
Are there any (free) frameworks that could meet the above requirements? Suggestions, tips and ideas are more than welcome. It'd be great if you could share your experiences implementing a search system.
Have a look at Omega (based on Xapian) - a link to the Xapian project page
You can integrate it cgi-wise. Because it's based on the blindingly fast Xapian it will be one of the fastest options if you set it up correctly. It can do everything you ask for (including relevance for search results, index web server documents (html, pdf, word, excel, sql databases...) do 'stemming' etc...)
Another (also very good) option would of course be Apache Lucene: it's the one that is included in the Zend Framework you referenced ("Zend Search"). It can do all the same tricks, although I personally prefer Xapian.
Edit: be aware that Omega (and Xapian) are GPL, whereas Apache Lucene is released under the Apache License 2.0.
You may want to go with a CMS such as Joomla or Drupal if the site will have static content only. Both have good search systems. However, search really depends on what sort of content you have. If it's simply searching the HTML of the page, that's one thing, but searching the database for a particular model number of a product is another, in which case you need a shopping cart/e-commerce system rather than a CMS.
Definitely use Solr. Solr uses Lucene. This can be useful for a medium/big site.
The good thing is that you can request results in PHP serialized format from Solr.
EDIT:
This is what you are looking for, I completely forgot about it: Lucene Port To PHP by Zend
I recently developed a suggestive fulltext search to use with my Zend Framework based web application. I couldn't find any ready-made solution that fit my requirements, so I went all out and developed a simple (fulltext) keyword search mechanism from scratch. I found the following articles helpful:
http://devzone.zend.com/node/view/id/1304
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
What I have now is a system that matches items based on a 'text summary' that is generated at the time the item is saved (or updated) in the database. I have a table called kw_search_summary that contains the text summary of each item (script generated), its id and its category id. The 'summary' column is a mysql fulltext index, so I simply MATCH() the summary column AGAINST() a given expression, and display the results by relevancy. The code that builds this query looks a bit like this:
// $safeExp is assumed to be already quoted (for example with $this->db->quote()).
$match = "match(summary) against($safeExp in boolean mode)";
$select = $this->db->select()
    ->from(array('kwi' => 'kw_search_index'),
           array('id', 'prodcatid', 'itemid', 'useradid', 'summary', 'relevance' => $match))
    ->where($match)
    ->order('relevance desc')
    ->limitPage($currentPage, self::RESULTS_PER_PAGE);
Hope that was at least a bit helpful.

How would I implement a simple site search with php and mySQL?

I'm creating a site that allows users to submit quotes. How would I go about creating a (relatively simple?) search that returns the most relevant quotes?
For example, if the search term was "turkey" then I'd return quotes where the word "turkey" appears twice before quotes where it only appears once.
(I would add a few other rules to help filter out irrelevant results, but my main concern is that.)
Everyone is suggesting MySQL fulltext search, however you should be aware of a HUGE caveat. In MySQL versions before 5.6, the fulltext search engine is only available for the MyISAM engine (not InnoDB, which is the most commonly used engine due to its referential integrity and ACID compliance).
So you have a few options:
1. The simplest approach is outlined by Particle Tree. You can actually get ranked searches from pure SQL (no fulltext, no nothing). The SQL query below will search a table and rank results by the number of occurrences of a string in the search fields:
SELECT
    p.id,
    -- each term's hit count = (length removed by deleting the term) / (term length)
    SUM(((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'term', ''))) / 4) +
        ((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'search', ''))) / 6))
        AS Occurrences
FROM
    posts AS p
GROUP BY
    p.id
ORDER BY
    Occurrences DESC
(I edited their example to provide a bit more clarity.)
Variations on the above SQL query, adding WHERE statements (WHERE p.body LIKE '%whatever%you%want'), etc. will probably get you exactly what you need.
2. You can alter your database schema to support full text. To keep InnoDB's referential integrity, ACID compliance, and speed without having to install plugins like the Sphinx Fulltext Search Engine for MySQL, what is often done is to split the quote data into its own table. Basically you would have a table Quotes that is an InnoDB table that, rather than having your TEXT field "data", has a reference "quote_data_id" which points to the ID on a Quote_Data table, which is a MyISAM table. You can do your fulltext search on the MyISAM table, join the IDs returned with your InnoDB tables, and voila, you have your results.
3. Install Sphinx. Good luck with this one.
Given what you described, I would HIGHLY recommend you take the 1st approach I presented since you have a simple database-driven site. The 1st solution is simple and gets the job done quickly. Lucene will be a bitch to set up, especially if you want to integrate it with the database, as Lucene is designed mainly to index files, not databases. Google custom site search just makes your site lose tons of reputation (makes you look amateurish and hacked), and MySQL fulltext will most likely cause you to alter your database schema.
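The occurrence-counting query from the 1st approach hard-codes its terms and their lengths. A small helper can generate the same expression for any list of terms with bound parameters; this is a sketch, and the function name is hypothetical. It assumes the posts table and body column from the query above, and it drops SUM and GROUP BY on the assumption that p.id is unique.

```php
<?php
// Hypothetical helper; the posts table and body column follow the query above.

/**
 * Build the occurrence-ranking SQL plus its bound parameters for any list of terms.
 * Returns [string $sql, array $params].
 */
function buildOccurrenceQuery(array $terms): array
{
    $parts = [];
    $params = [];
    $usable = array_values(array_filter($terms, fn ($t) => $t !== ''));
    foreach ($usable as $i => $term) {
        // (length before - length after deleting the term) / term length = hit count.
        // The divisor is an integer computed here, so it is safe to inline.
        $parts[] = sprintf(
            "(LENGTH(p.body) - LENGTH(REPLACE(p.body, :t%d, ''))) / %d",
            $i,
            strlen($term)
        );
        $params[":t$i"] = $term;
    }
    if ($parts === []) {
        throw new InvalidArgumentException('At least one non-empty search term is required.');
    }
    $sql = 'SELECT p.id, (' . implode(' + ', $parts) . ') AS occurrences'
         . ' FROM posts AS p ORDER BY occurrences DESC';
    return [$sql, $params];
}

[$sql, $params] = buildOccurrenceQuery(['term', 'search']);
echo $sql, PHP_EOL;
```

The resulting pair can be passed to a prepared statement, for example $pdo->prepare($sql) followed by execute($params). Note that REPLACE is case-sensitive in MySQL, so "Term" and "term" are counted separately unless the column is lowercased first.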
Use Google Custom Site Search. I've heard they know a thing or two about searching.
Stackoverflow plans to use the Lucene search engine. There is a PHP port of this written for the Zend Framework but can be downloaded as a separate entity without needing all the ZF bloat. This is called Zend_Search_Lucene, documentation for which can be found here.
Your SQL for that will look something like this (where you're trying to find quotes with 'turkey' in them):
SELECT * FROM Quotes
WHERE the_quote LIKE '%turkey%';
From there you can figure out what to do with whatever it spits out at you.
Be careful to properly handle cases where a malicious user might inject malicious SQL into your database, especially if you're planning on putting this on the www. If you're doing this for fun though, I guess it's just about what you want to learn.
If you're new to databases and sql, I recommend sqlite over mysql. Much easier to set up and work with, as in no set up. It'll get you around the potential headaches of having to install and set up mysql for the first time.
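To make the injection warning concrete, here is a minimal sketch of the same LIKE search done with a PDO prepared statement. It uses an in-memory SQLite database so that it runs without any setup. The findQuotes function and the sample rows are hypothetical, and the "!" escape character is an arbitrary choice.

```php
<?php
// In-memory SQLite keeps the sketch self-contained; table and column names follow the query above.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE Quotes (id INTEGER PRIMARY KEY, the_quote TEXT)');

$insert = $pdo->prepare('INSERT INTO Quotes (the_quote) VALUES (?)');
foreach (['Cold turkey', 'Talk turkey to me', 'An apple a day'] as $quote) {
    $insert->execute([$quote]);
}

/** Find quotes containing $term; the term is bound, never concatenated into the SQL. */
function findQuotes(PDO $pdo, string $term): array
{
    // Escape LIKE wildcards so that input such as "%" is matched literally.
    $escaped = str_replace(['!', '%', '_'], ['!!', '!%', '!_'], $term);
    $stmt = $pdo->prepare(
        "SELECT the_quote FROM Quotes WHERE the_quote LIKE :pattern ESCAPE '!' ORDER BY id"
    );
    $stmt->execute([':pattern' => '%' . $escaped . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

print_r(findQuotes($pdo, 'turkey'));
```

Swapping the DSN for a MySQL one should leave findQuotes unchanged, since the query uses only standard LIKE ... ESCAPE syntax.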
I'd go with Full Text Search, look at it here: http://hockinson.com/fulltext-search-of-mysql-database-table.html
If you want to write your own, take a look at phpBB's implementation. They have two tables, the first is a unique list of all the words that appear in entries, and the second is a many-to-many reference between the words and the entries. You could then do a group and count to sort the entries in the manner you're looking for.
It's a lot more work than implementing a third-party search engine (or full text search), but it will allow you greater control over the results.
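The two-table layout can be sketched as follows. This is not phpBB's actual schema: the table names, column names, and helper functions are hypothetical, and in-memory SQLite is used only so that the example runs on its own (INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE).

```php
<?php
// Hypothetical schema: a unique word list plus a many-to-many link table.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE)');
$pdo->exec('CREATE TABLE word_entry (word_id INTEGER, entry_id INTEGER)');

/** Record every word occurrence of an entry in the two index tables. */
function indexEntry(PDO $pdo, int $entryId, string $text): void
{
    $addWord = $pdo->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');
    $link = $pdo->prepare(
        'INSERT INTO word_entry (word_id, entry_id) SELECT id, ? FROM words WHERE word = ?'
    );
    foreach (preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $addWord->execute([$word]);
        $link->execute([$entryId, $word]); // one link row per occurrence
    }
}

/** Entry ids containing $word, most occurrences first. */
function searchWord(PDO $pdo, string $word): array
{
    $stmt = $pdo->prepare(
        'SELECT we.entry_id FROM word_entry AS we
         JOIN words AS w ON w.id = we.word_id
         WHERE w.word = ?
         GROUP BY we.entry_id
         ORDER BY COUNT(*) DESC, we.entry_id'
    );
    $stmt->execute([strtolower($word)]);
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}

indexEntry($pdo, 1, 'Turkey for dinner');
indexEntry($pdo, 2, 'Turkey, turkey and more turkey');
print_r(searchWord($pdo, 'turkey'));
```

The GROUP BY with COUNT(*) is the "group and count" step: entries that mention the word more often sort first.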
As an alternative to Sphinx and Lucene, a relatively simple search engine can be created using the Xapian library.
+ Supports many advanced search features (such as relevancy ranking)
+ Fast
- You would need to learn the API to create your interface
- Requires a php extension to be installed
Note also that Xapian stores its data in a separate index to mysql.
You might also be interested in Forage which is a wrapper for Solr, Xapian and Lucene.
The Xapian people also created the Omega search engine which is a frontend to Xapian, and can be called via cgi.
Here's a much simpler and easier to operate open source alternative to Solr / Lucene:
http://github.com/typesense/typesense
Google Custom Site Search is great if you don't query it much (I think you get 1k queries/day for free) or if you're willing to pay.
MySQL's fulltext search is also a great resource (as has been mentioned previously).
Yahoo's BOSS is an intriguing project -- I'm going to give it a shot during my next search project.
And, finally, Lucene is a great resource if you need more power than fulltext, but want to tweak your own search engine. http://lucene.apache.org
I came across the Zoom Search Engine a few days ago and think this might be the simplest search engine I have ever used.
The Windows-based tool creates a database of the site, then asks which language (PHP, ASP.NET, JavaScript, etc.) you want to use. I picked PHP and it built the PHP code for me. All I had to do then was upload the files to the server and (optionally) customize the template, and site search was working.
This is free to small sites, and the only con I can find is that the spider tool (database builder) has to run on Windows.
