I want to add a search feature to my site, and for fun I decided I should at least try to build it myself (if I fail, there's always Google Custom Search).
The problem is, I don't even know how to approach this monster! Here are the requirements:
Not all keywords have to match (a search for "Big happy world" should also match "Big world", "happy world", etc.)
Handling of common spelling mistakes (from a database, via edit distance, or from a predefined list of common mistakes such as "rather then" => "rather than")
Search in both the content and the titles of posts, with an emphasis on titles.
Don't suck
I've asked my old pal Google, but the only reasonable things I found were academic-level papers on the subject (English isn't my native language; I'm good, but not that good =( ).
So in short: does anyone know of a good place to start, a tutorial, an article, an example?
Thanks in advance.
There are several options you could try:
Apache Lucene (A PHP based implementation exists in the Zend Framework)
ElasticSearch (provides a REST-like API on top of Lucene)
Xapian
Sphinx
Probably a bunch of others too.
If you want to create your own search engine, Apache Lucene is a mature open source library that can take care of a big part of the functionality for you.
Using Lucene, you first index your information using an IndexWriter. This is done offline, to create the index.
At search time, you use an IndexSearcher to find documents that match your query.
If you want some theoretical knowledge on "how it works", you should read more on information retrieval. A good place to start is Stanford's Introduction to Information Retrieval.
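To get a feel for what a ranking function does with the requirements from the question (keywords are optional, titles count more than content), here is a toy scorer in Python. It is only a sketch and not how Lucene scores documents: the weight `TITLE_WEIGHT`, the tokenizer and the sample posts are all made up for illustration.

```python
import re
from collections import Counter

TITLE_WEIGHT = 3.0  # assumed boost for title hits; tune for your own data


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, title, content):
    """Sum weighted term counts; query terms missing from the post add nothing."""
    title_counts = Counter(tokenize(title))
    content_counts = Counter(tokenize(content))
    return sum(
        TITLE_WEIGHT * title_counts[term] + content_counts[term]
        for term in set(tokenize(query))
    )


def search(query, posts):
    """Return (score, title) pairs for posts matching at least one term, best first."""
    scored = [(score(query, p["title"], p["content"]), p["title"]) for p in posts]
    return sorted((s for s in scored if s[0] > 0), reverse=True)


# Hypothetical sample data.
posts = [
    {"title": "Big world", "content": "A post about a big place."},
    {"title": "Cooking", "content": "Nothing relevant here."},
    {"title": "Weather", "content": "What a happy, happy day in this world."},
]
results = search("Big happy world", posts)
```

Here "Big world" ranks first because two query terms appear in its title, "Weather" still matches through its content, and "Cooking" is dropped. A real engine would also account for how rare each term is and for document length.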
Related
I have a MySQL database with two main tables that contain the data I need to index. I am looking for a search engine API that can index the data and return appropriate search results, as close as possible to Google quality. The application uses the keywords and creates pages based on the search results.
I have tried Solr but am not sure if that is the best one. Are there any other paid or open source alternatives you have come across? The project is LAMP based.
Thanks,
Sameer
Solr/Lucene are definitely the de facto standard in the open source search world. I love Solr. And no, you don't need to go for anything paid :). In my opinion, if you want to go for something else, you should try the Sphinx search engine. It's absolutely amazing and integrates extremely well with LAMP. In fact, the PHP API that ships with it is really good, and you can get started with search using Sphinx in almost no time.
I am creating a social site and want to try Solr or Lucene for search, as I need very in-depth searches. The platform is PHP (CodeIgniter) and MySQL. However, my PHP developers have zero experience outside of PHP/MySQL. So before I make them implement this I need to know:
1) How easy or how much time would it normally take to setup and get it implemented?
2) Is there coding involved, or is it ready out of the box? (I know there will be some to link it with my system objects.)
3) Which one to use out of the two?
For your use, I would suggest Solr. To use Lucene directly, you will need in-depth Java knowledge, whereas with Solr you don't necessarily need this.
Solr will be ready out of the box, but you will need to do some configuration to "describe" your search index. You need to configure it so that it understands what your documents look like, what fields within that document to search on, how to search them, etc. This does have a learning curve. However, it's not overly difficult. The time this takes is greatly affected by how complex you want your searches to be.
For simple searches, I would think a developer should be able to insert documents and perform searches within a week of starting with Solr. Depending on how in depth your searches are, a developer could spend weeks or months learning and fiddling to tweak things. However, the bulk of the work should be doable within a few weeks of concentrated effort.
For what it's worth, the wiki and mailing lists for Solr are great resources, and the developers themselves are very responsive.
EDIT: The coding involved with Solr would be on the PHP side. You need to write something that puts your data into the XML format Solr needs to insert documents into its index, as all of this is done via XML over HTTP.
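As a rough illustration of that XML step, the sketch below (Python, standard library only) builds an `<add>` message in Solr's XML update format. The field names `id` and `title` are hypothetical and would have to match the fields declared in your Solr schema; sending the result is left out because the update URL depends on your installation.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field_name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must exist in your schema.
xml_body = build_add_xml([{"id": 1, "title": "Tom & Jerry"}])
```

The resulting string is what you would POST to Solr's update handler with an XML content type, followed by a commit so the new documents become searchable.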
I was about to integrate Sphinx-based search into the website, but I've found that there's no built-in support for spelling correction.
Folks on the web suggest using pspell or other third-party libraries to get things done, but the problem is that the data I'm going to search contains mostly "technical" terms like brand names, so I don't think common dictionaries will include them.
On the other hand, Xapian claims to support spelling correction based on the indexed data, which is exactly what I want. Is it worth using Xapian instead? I'm still quite confused about which full-text search engine I should use: Sphinx seems to be quite good but lacks some cool features of Xapian (or maybe Lucene?), while the latter looks like it has a smaller community and less documentation.
I think I can solve the problem of words missing from the pspell dictionary by using a custom dictionary, but I'm not sure whether that will cause noticeable performance losses. I'm going to use the search system for spotlight search (a separate search via AJAX on every letter entered) on a pretty popular website, so performance matters.
Ideally, I'd like some fields such as brand names to have priority over the common dictionary, but I guess that's not really important, since most brand names are quite distinct from other words.
Any suggestions on the general design of the custom full-text search engine are welcome too.
Thanks
Sphinx has no built-in spelling correction, but it can be implemented using Sphinx. The only how-to article about this that I know of (by the Sphinx author) is at http://habrahabr.ru/blogs/sphinx/61807 (in Russian; you can use Google Translate to read it. Look at the second part of the article, titled "Я понял, это намек.", roughly "I get it, that's a hint.")
I implemented that method recently and it works perfectly!
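To show the general idea of correcting against your own indexed vocabulary rather than a stock dictionary (so that brand names are covered), here is a small Python sketch based on plain edit distance. It is not the method from the linked article, and the vocabulary with its counts is invented for the example.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars are equal)
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest known word, preferring frequent ones, or None."""
    word = word.lower()
    if word in vocabulary:
        return word
    candidates = [
        (edit_distance(word, known), -count, known)
        for known, count in vocabulary.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else None


# Hypothetical vocabulary: word -> number of occurrences in the indexed data.
vocabulary = Counter({"panasonic": 40, "samsung": 55, "sennheiser": 12})
```

For example, `suggest("samsng", vocabulary)` returns `"samsung"`. Scanning the whole vocabulary on every keystroke would be too slow for a large word list, so in practice you would first narrow the candidates (for instance with an index over character n-grams) and only then compute edit distances.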
Sphinx allows you to use morphology preprocessors and word forms dictionaries. Both of these combined could get you closer to what you want to achieve. You can read more about both topics here: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-morphology and further below.
There are several "flavours" of morphology preprocessors available; choose the one that best fits your needs. The docs also mention the Snowball project, which can be used to add stemmers for languages other than the built-in English and Russian, if needed. The project website: http://snowball.tartarus.org/
Sphinx is a very fast full text search engine and using stemmers is not likely to slow it down to the extent that you start noticing it.
I am building a forum from scratch in PHP. I have reused most of phpBB's database structure.
But now I am thinking about the search functionality: what is a good design for searching really fast through all posts? I guess there must be some better way than just LIKE '%query_string%' in MySQL :)
Maybe explode all sentences into words, let the words be keys in a hash table, and let the value be a comma-separated list of all the posts the word appears in? Then there is a little more trouble if you delete a post, but I think that approach is better.
To start with I guess I can use the simple solution, but I don't want to change the code when the forum grows bigger.
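The structure described here is usually called an inverted index. A minimal in-memory sketch in Python, assuming posts are identified by integer ids; the class and method names are invented for the example, and a second map from post id to its words is one way to keep deletion cheap:

```python
import re
from collections import defaultdict


def words_of(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of posts containing it
        self.words_by_post = {}           # post id -> its words, for cheap deletion

    def add(self, post_id, text):
        self.remove(post_id)              # re-adding a post replaces the old entry
        words = set(words_of(text))
        self.words_by_post[post_id] = words
        for word in words:
            self.postings[word].add(post_id)

    def remove(self, post_id):
        for word in self.words_by_post.pop(post_id, set()):
            self.postings[word].discard(post_id)
            if not self.postings[word]:
                del self.postings[word]   # drop words no post uses any more

    def search(self, query):
        """Ids of posts containing every query word, in any order."""
        words = words_of(query)
        if not words:
            return set()
        return set.intersection(*(self.postings.get(w, set()) for w in words))


index = InvertedIndex()
index.add(1, "How to search a forum")
index.add(2, "Forum rules")
```

With this data, `index.search("forum search")` finds post 1 even though the words appear in a different order, and `index.remove(1)` cleans up every word that only post 1 used. In a database-backed forum the same idea becomes a table of (word, post id) rows with an index on the word column.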
Thanks for any ideas or if you can point me to the right direction!
Zend Lucene is a powerful way to add search to a PHP site.
Here's an article about how to do exactly that: Roll Your Own Search Engine with Zend_Search_Lucene
The best option for me today is Sphinx search. It can be used with PHP, Rails and Perl, and so far it has worked like a charm for me. You can check out a PHP solution. Craigslist, for example, uses it.
Don't reinvent the wheel. Have a look at Lucene. There is also a port for php:
Zend Lucene
Lucene does the parsing and indexing for you and the queries are fast as lightning.
Most forum users will want more than just a string-search. They might not know the exact phrase they need and when they search for "forum search" they would be delighted to find a result for "How to search a forum", which contains the relevant terms but in a different order and separated by other words.
They may also need some fuzzy searching if they don't know the spelling of what they need. They might search for "sequal" and want "sql".
All of this points towards a more complex solution than your LIKE search.
The most important pointer for now is that whatever you implement, you should make sure it is easy to switch it out in favour of something better later. Make sure your search is hot-swappable as you know you will want to change it later.
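One way to keep the search swappable is to let the rest of the forum depend only on a small interface. The sketch below shows the pattern in Python; the names `SearchBackend`, `SubstringBackend` and `Forum` are invented for the example, and the same shape works with a PHP interface.

```python
from __future__ import annotations

from typing import Protocol


class SearchBackend(Protocol):
    def search(self, query: str) -> list[int]:
        """Return the ids of matching posts."""
        ...


class SubstringBackend:
    """Naive starting point: case-insensitive substring match, like SQL LIKE."""

    def __init__(self, posts: dict[int, str]):
        self.posts = posts

    def search(self, query: str) -> list[int]:
        needle = query.lower()
        # No relevance information here, so results are simply ordered by id.
        return sorted(pid for pid, text in self.posts.items() if needle in text.lower())


class Forum:
    """Depends only on the SearchBackend interface, never on a concrete engine."""

    def __init__(self, backend: SearchBackend):
        self.backend = backend

    def find_posts(self, query: str) -> list[int]:
        return self.backend.search(query)


forum = Forum(SubstringBackend({1: "How to search a forum", 2: "Forum rules"}))
```

Moving to Lucene, Sphinx or anything else later then means writing one new class with the same `search` method and passing it to `Forum`; the calling code stays untouched.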
I'm going to make a small site which requires advanced search capabilities. Since reinventing the wheel isn't a worthwhile activity, I've done a little googling and found that there are some PHP-based search frameworks, one of which is integrated into Zend Framework.
What I would like to have in the framework:
Both full-text and catalogue search capabilities
Display results sorted by relevance
Ability to filter results by category
Sorting results by various criteria
Fast search
Fast insertion not required
Since the site will feature pretty much static content (some text and a product catalogue), I might go with some pre-generated index.
Are there any (free) frameworks that could meet the above requirements? Suggestions, tips and ideas are more than welcome. It'd be great if you could share your experiences implementing a search system.
Have a look at Omega (based on Xapian); see the Xapian project page.
You can integrate it as a CGI program. Because it's based on the blindingly fast Xapian, it will be one of the fastest options if you set it up correctly. It can do everything you ask for, including relevance ranking of search results, indexing of web server documents (HTML, PDF, Word, Excel, SQL databases...), stemming, etc.
Another (also very good) option would of course be Apache Lucene. It's the one that is included in the Zend Framework you referenced ("Zend Search"). It can do all the same tricks, although I personally prefer Xapian.
Edit: be aware that Omega (and Xapian) are GPL, whereas Apache Lucene is released under the Apache License.
You may want to go with a CMS such as Joomla or Drupal if the site will have static content only. Both have good search systems. However, search really depends on what sort of content you have. If it's simply searching the HTML of the page, that's one thing, but searching the database for a particular model number of a product is another, in which case you need a shopping cart/e-commerce system rather than a CMS.
Definitely use Solr. Solr uses Lucene. This can be useful for a medium or big site.
The good thing is that you can request results from Solr in PHP serialized format.
EDIT:
This is what you are looking for, I completely forgot about it: Lucene Port To PHP by Zend
I recently developed a suggestive full-text search to use with my Zend Framework based web application. I couldn't find any ready-made solution that fit my requirements, so I went all out and developed a simple (full-text) keyword search mechanism from scratch. I found the following articles helpful:
http://devzone.zend.com/node/view/id/1304
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
What I have now is a system that matches items based on a 'text summary' that is generated at the time the item is saved (or updated) in the database. I have a table called kw_search_summary that contains the script-generated text summary of each item, its id and its category id. The 'summary' column has a MySQL FULLTEXT index, so I simply MATCH() the summary column AGAINST() a given expression, and display the results by relevance. The code that builds this query looks a bit like this:
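    // $safeExp is assumed to be already quoted for SQL (e.g. with $this->db->quote()).
    $match = "match(summary) against($safeExp in boolean mode)";
    $select = $this->db->select()
        ->from(
            array('kwi' => 'kw_search_index'),
            // select the match score as an extra "relevance" column
            array('id', 'prodcatid', 'itemid', 'useradid', 'summary', 'relevance' => $match)
        )
        ->where($match)                // keep only rows that match the expression
        ->order('relevance desc')      // best matches first
        ->limitPage($currentPage, self::RESULTS_PER_PAGE);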
Hope that was at least a bit helpful.
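How the search expression itself is built is not shown above. One possible approach, sketched in Python and not taken from the code in this answer, is to strip everything except words from the user's input and turn each word into a prefix term for MySQL's boolean mode, where a leading `+` makes a word mandatory and a trailing `*` matches any word starting with it. The function name and the `require_all` flag are made up for the example.

```python
import re


def to_boolean_expression(user_input, require_all=False):
    """Turn free text into a MySQL boolean-mode expression with prefix matching."""
    # Keeping only word characters also drops boolean operators typed by the user.
    words = re.findall(r"\w+", user_input.lower())
    prefix = "+" if require_all else ""
    return " ".join(f"{prefix}{word}*" for word in words)


optional_terms = to_boolean_expression("Big happy-world!")
mandatory_terms = to_boolean_expression("Big happy-world!", require_all=True)
```

Leaving out the `+` gives the "not all keywords are required" behaviour, while still ranking rows that contain more of the words higher. The result still has to be quoted or bound as a parameter before it goes into the query, and very short or very common words may be ignored depending on the server's full-text settings.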