Text search technologies for mysql data

The task is to implement text search in MySQL in my project (PHP/Zend Framework 2 + MySQL). The issue is that the text fields are not big at all: they are mostly VARCHAR fields or joined fields like city names, company names and so on, about 5-10 fields for each entity.
So far I have decided on Lucene (the Zend Framework 2 module, Zend Search), but is it effective to use technologies like Lucene or Sphinx for small VARCHAR fields?
Thank you.

Sure, Lucene or Sphinx can work with any varchar columns that contain text.* They don't have to be huge.
Any fulltext indexing solution is hundreds or thousands of times better than using LIKE '%word%'!
You might be interested in my presentation, Fulltext Search Throwdown.
You can also watch a recording of me delivering that presentation as a webinar: http://www.percona.com/webinars/2012-08-22-full-text-search-throwdown (it's free but requires registration).
* Lucene and Sphinx can do some things with numeric columns too.
PS: I was the project lead on Zend Framework 1.0. Zend_Search_Lucene was an interesting experiment circa 2007, but it's way outdated relative to Apache Lucene/Solr, and Zend_Search_Lucene is orders of magnitude slower than the Java implementation. I wouldn't bother with it.

Related

Building a fast semantic MySQL search engine for private articles from scratch

I am working on a project that will involve full-text and semantic searches of articles within the site (if it's not possible to combine it, the user can select either option). These articles are subscription-based and can only be searched after logging in; so they are not accessible to external search engines or their APIs.
I read about Sphinx for full text keywords searches (and I intend to implement it for that aspect) but I am not sure how to go about building a semantic search engine out of this. e.g. Searching for "U.S. President" should list articles that contain references to the actual names of the U.S. presidents e.g. George Washington, Bill Clinton (or William Jefferson Clinton).
I have ideas that maybe a sort of tagging system can be used to relate various keywords e.g. relate President to George Washington and President to Bill Clinton, but since the data is really huge and many such relations will exist I don't know how to further this idea.
Please advise me on how to go about building a semantic search engine (I guess Sphinx can handle the full-text keyword search) from scratch. Otherwise, please point me to any internet-based resources, or to existing software in any language that I can integrate into my application.
P.S. My database of choice is MySQL (please advise if another database system is more suitable for the task), and I prefer to program in PHP, but if I need to learn Python or any other language that is more effective for this task, I am willing to.
I already searched at answers.semanticweb.com

I would use Apache Solr. I think it's more flexible than Sphinx. Solr supports full-text search and, I believe, has add-ons for semantic support (like siren). Solr is a search server built on top of Lucene.
Solr supports a SynonymFilter: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
This post discusses some strategies for optimizing content retrieval http://www.lucidimagination.com/devzone/technical-articles/optimizing-findability-lucene-and-solr
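To make the synonym idea concrete, here is a minimal sketch of query-time synonym expansion in plain Python. It is not Solr's SynonymFilter or its configuration format; the SYNONYMS table, expand_query and search_articles are hypothetical names made up for illustration, and a real system would let the search engine do the matching.

```python
# Hypothetical synonym table (an assumption for illustration); in practice
# it would be curated by hand or generated from a knowledge base.
SYNONYMS = {
    "u.s. president": [
        "george washington",
        "bill clinton",
        "william jefferson clinton",
    ],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the normalized query plus every phrase mapped to it."""
    key = query.strip().lower()
    return [key] + synonyms.get(key, [])


def search_articles(articles, query):
    """Return ids of articles that contain the query or any expansion.

    `articles` is a dict of {article_id: text}. Matching is a plain
    case-insensitive substring test, standing in for a real index lookup.
    """
    phrases = expand_query(query)
    return [
        article_id
        for article_id, text in articles.items()
        if any(phrase in text.lower() for phrase in phrases)
    ]
```

With a mapping like this, a search for "U.S. President" also finds articles that only mention Bill Clinton by name. The hard part, as the question notes, is building and maintaining the mapping at scale.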

This book may be useful for someone reading this thread. I just found it on Amazon.
http://www.amazon.com/E-Librarian-Service-User-Friendly-Libraries-X-media-publishing/dp/3642177425

Choosing search engine for tube site? (SilverStripe specific or in general)

I'm developing a site that could be compared with a tube site (like YouTube). I'm in the design phase and am trying to figure out what search method to go with.
I'm using SilverStripe framework which has modules for Sphinx, Solr, and Lucene so they are obviously interesting. Another option is to simply query the database (MySQL) and not use any search engine.
What would you do? And why?
Any input is appreciated! Thanks in advance!

"simply query the database (MySQL) and not use any search engine"
I assume you want to use MyISAM's full-text search capabilities? This is possible, since SilverStripe's default configuration is currently (at least until version 2.4) set to MyISAM and not InnoDB. However, this is only recommended for simple, small tasks that are not performance-hungry, and I assume that's not what you want.
Dedicated search services are more powerful, both in speed and in features.
For a general overview, take a look at, for example, "ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?"
With the details you've given, any of the five should get your job done, but you might give that some more consideration.
However, I would also consider which search services already have SilverStripe modules available, how well they fit your requirements, and how much you "like" them. Unless you want to write a module for ElasticSearch, for example. That would be pretty cool, but I'm not sure it's really worth the effort.
Personally, I'd probably go with https://code.google.com/p/lucene-silverstripe-plugin/ as it's easy to set up and seems to be working well (haven't tried it myself, but I have only heard good things from others about it).

Zend lucene and MySql database

I have a PHP web site with data stored in a MySQL database (approximately 50,000 articles).
I want to improve the results of the full text search functionality and stop using just a simple LIKE query.
I found Zend_Search_Lucene from the Zend Framework, which seems to be a great tool.
Do you think Zend_Search_Lucene is a good choice in my case?
After indexing all my articles with Lucene, do I need to keep the data in MySQL, or is Zend_Search_Lucene enough to hold all the data?
Thanks in advance,

I would first investigate whether MySQL's native Full-Text Searching meets your needs before jumping to a Lucene-based solution. It is a major improvement over LIKE statements, without the additional implementation work Lucene requires.
Zend_Search_Lucene is a pure PHP implementation of Lucene and can therefore be pretty slow when used with large datasets. I would skip it and look at implementing Apache Solr. There is a PECL extension for it, which is documented here.

I have used MySQL's fulltext on over 200,000 docs with a good amount of data and my search times are around .5 seconds to 2 seconds on popular terms and a very rare 5 or 6 second response every so often. I update some data each day so long term caching doesn't work the best but if I could cache searches I could be looking at .2 second times or lower after caching.
I am testing moving over to Zend Lucene and so far the same searches come in under 1.5 seconds for the most used terms.
All of the above is on a dedicated server with 2 gigs of ram and a core 2 duo.
I am no expert, but for 50,000 articles I agree with Treffynnon: check out fulltext searching instead of using LIKE. If you do move to a new version of Zend Lucene, I believe the indexes are compatible with the Java version, so it may make a good gateway if down the road you add more articles and need more speed.

PHP-based search frameworks

I'm going to make a small site which requires advanced search capabilities. Since reinventing the wheel isn't such a worthwhile activity, I've done a little googling and found there are some PHP based search frameworks, one of which is integrated into Zend framework.
What I would like to have in the framework:
Both full-text and catalogue search capabilities
Display results sorted by relevance
Ability to filter results by category
Sorting results by various criteria
Fast search
Fast insertion not required
Since the site will feature pretty much static content (some text and a product catalogue), I might go with some pre-generated index.
Are there any (free) frameworks that could meet the above requirements? Suggestions, tips and ideas are more than welcome. It'd be great if you could share your experiences implementing a search system.

Have a look at Omega (based on Xapian) - a link to the Xapian project page
You can integrate it via CGI. Because it's based on the blindingly fast Xapian, it will be one of the fastest options if you set it up correctly. It can do everything you ask for, including relevance ranking of search results, indexing web server documents (HTML, PDF, Word, Excel, SQL databases...), stemming, etc.
Another (also very good) option would of course be Apache Lucene. This is the one included in the Zend Framework you referenced ("Zend Search"). It can do all the same tricks, although I personally prefer Xapian.
Edit: be aware that Omega (and Xapian) are GPL, whereas Apache Lucene is under the Apache License 2.0.

You may want to go with a CMS such as Joomla or Drupal if the site will have static content only. Both have good search systems. However, search really depends on what sort of content you have. If it's simply searching the HTML of the page, that's one thing, but searching the database for a particular model number of a product is another, in which case you need a shopping cart/e-commerce system rather than a CMS.

Definitely use Solr. Solr uses Lucene. This can be useful for a medium/big site.
The good thing is that you can request results in PHP serialized format from Solr.
EDIT:
This is what you are looking for, I completely forgot about it: Lucene Port To PHP by Zend

I recently developed a suggestive fulltext search to use with my Zend Framework based web application. I couldn't find any ready-made solution that fit my requirements, so I went all out and developed a simple (fulltext) keyword search mechanism from scratch. I found the following articles helpful:
http://devzone.zend.com/node/view/id/1304
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
What I have now is a system that matches items based on a 'text summary' that is generated at the time the item is saved (or updated) in the database. I have a table called kw_search_summary that contains the text summary of each item (script generated), its id and its category id. The 'summary' column has a MySQL fulltext index, so I simply MATCH() the summary column AGAINST() a given expression, and display the results by relevance. The code that builds this query looks a bit like this:
// $safeExp is assumed to be the search expression, already quoted for SQL
$select = $this->db->select()
    ->from(array('kwi' => 'kw_search_index'),
           array('id', 'prodcatid', 'itemid', 'useradid', 'summary',
                 'relevance' => "match(summary) against($safeExp in boolean mode)"))
    ->where("match(summary) against($safeExp in boolean mode)")
    ->order('relevance desc')
    ->limitPage($currentPage, self::RESULTS_PER_PAGE);
Hope that was at least a bit helpful.

How would I implement a simple site search with php and mySQL?

I'm creating a site that allows users to submit quotes. How would I go about creating a (relatively simple?) search that returns the most relevant quotes?
For example, if the search term was "turkey" then I'd return quotes where the word "turkey" appears twice before quotes where it only appears once.
(I would add a few other rules to help filter out irrelevant results, but my main concern is that.)

Everyone is suggesting MySQL fulltext search, however you should be aware of a HUGE caveat. Before MySQL 5.6, the fulltext search engine is only available for the MyISAM engine (not InnoDB, which is the most commonly used engine due to its referential integrity and ACID compliance). InnoDB supports FULLTEXT indexes from MySQL 5.6 onward.
So you have a few options:
1. The simplest approach is outlined by Particle Tree. You can actually get ranked searches out of pure SQL (no fulltext, no nothing). The SQL query below will search a table and rank results based on the number of occurrences of a string in the search fields:
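-- Characters removed by REPLACE, divided by the term's length
-- (4 for 'term', 6 for 'search'), give the number of occurrences.
SELECT
    p.id,
    SUM(((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'term', ''))) / 4) +
        ((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'search', ''))) / 6))
        AS Occurrences
FROM
    posts AS p
GROUP BY
    p.id
ORDER BY
    Occurrences DESC;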
(I edited their example to provide a bit more clarity.)
Variations on the above SQL query, adding WHERE statements (WHERE p.body LIKE '%whatever%you%want'), etc. will probably get you exactly what you need.
2. You can alter your database schema to support full text. What is often done to keep InnoDB's referential integrity, ACID compliance, and speed, without having to install plugins like the Sphinx Fulltext Search Engine for MySQL, is to split the quote data into its own table. Basically, you would have a table Quotes that is an InnoDB table and that, rather than holding your TEXT field "data", holds a reference "quote_data_id" pointing to the ID in a Quote_Data table, which is a MyISAM table. You can run your fulltext search on the MyISAM table, join the IDs returned with your InnoDB tables, and voila, you have your results.
3. Install Sphinx. Good luck with this one.
Given what you described, I would HIGHLY recommend you take the 1st approach I presented, since you have a simple database-driven site. The 1st solution is simple and gets the job done quickly. Lucene will be a pain to set up, especially if you want to integrate it with the database, as Lucene is designed mainly to index files, not databases. Google custom site search just makes your site lose tons of reputation (makes you look amateurish and hacked), and MySQL fulltext will most likely cause you to alter your database schema.
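For what it's worth, here is a minimal sketch of the 1st approach with the search term passed in as a parameter instead of hard-coded. It uses Python's built-in sqlite3 module purely so it runs without a server; the quotes table, its columns and the rank_by_occurrences function are assumptions for illustration. On MySQL the idea is the same, except that string concatenation is CONCAT() rather than ||.

```python
import sqlite3


def rank_by_occurrences(conn, term):
    """Return (id, occurrences) rows for quotes containing `term`.

    Occurrences are counted as: characters removed by REPLACE divided
    by the length of the term. Matching is on substrings, so searching
    for 'turkey' also counts 'turkeys'.
    """
    term = term.lower()
    if not term:
        raise ValueError("search term must not be empty")
    sql = """
        SELECT id,
               (LENGTH(LOWER(body)) - LENGTH(REPLACE(LOWER(body), ?, '')))
                   / LENGTH(?) AS occurrences
        FROM quotes
        WHERE LOWER(body) LIKE '%' || ? || '%'
        ORDER BY occurrences DESC, id
    """
    # Note: % and _ inside `term` still act as LIKE wildcards in this sketch.
    return conn.execute(sql, (term, term, term)).fetchall()


def demo():
    """Build a tiny in-memory table and rank it for 'turkey'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO quotes (id, body) VALUES (?, ?)",
        [
            (1, "Turkey is tasty"),
            (2, "turkey, turkey, and more Turkey"),
            (3, "chicken"),
        ],
    )
    return rank_by_occurrences(conn, "turkey")
```

Calling demo() ranks the quote that mentions the word three times ahead of the one that mentions it once, and leaves out the quote that never mentions it.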

Use Google Custom Site Search. I've heard they know a thing or two about searching.

Stack Overflow plans to use the Lucene search engine. There is a PHP port of it written for the Zend Framework, which can be downloaded as a separate component without needing all the ZF bloat. It is called Zend_Search_Lucene, documentation for which can be found here.

Your sql for that will look something like this (where you're trying to find quotes with 'turkey' in it):
SELECT * FROM Quotes
WHERE the_quote LIKE '%turkey%';
From there you can figure out what to do with whatever it spits out at you.
Be careful to properly handle cases where a malicious user might inject malicious SQL into your database, especially if you're planning on putting this on the www. If you're doing this for fun though, I guess it's just about what you want to learn.
If you're new to databases and SQL, I recommend SQLite over MySQL. It is much easier to set up and work with, as in no setup at all. It'll get you around the potential headaches of having to install and set up MySQL for the first time.
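As a rough illustration of the injection point, here is a sketch that runs the LIKE search with a bound parameter rather than pasting the user's text into the SQL string. It uses Python's built-in sqlite3 module so it needs no setup; the find_quotes function and the table layout are assumptions for illustration. The same principle (placeholders plus bound values) applies to prepared statements in PHP.

```python
import sqlite3


def find_quotes(conn, term):
    """Return (id, the_quote) rows whose text contains `term` literally."""
    # Escape LIKE wildcards so user input such as '%' is matched as text.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    sql = (
        "SELECT id, the_quote FROM Quotes "
        "WHERE the_quote LIKE ? ESCAPE '\\' ORDER BY id"
    )
    # The value is bound by the driver, never concatenated into the SQL,
    # so quotes or semicolons in `term` cannot change the statement.
    return conn.execute(sql, ("%" + escaped + "%",)).fetchall()


def make_demo_db():
    """Create an in-memory Quotes table with three sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Quotes (id INTEGER PRIMARY KEY, the_quote TEXT)"
    )
    conn.executemany(
        "INSERT INTO Quotes (id, the_quote) VALUES (?, ?)",
        [(1, "Talk turkey"), (2, "100% effort"), (3, "no birds here")],
    )
    return conn
```

A search term like "'; DROP TABLE Quotes; --" is then just an odd string that matches nothing, and the table is left alone.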

I'd go with Full Text Search, look at it here: http://hockinson.com/fulltext-search-of-mysql-database-table.html

If you want to write your own, take a look at phpBB's implementation. They have two tables, the first is a unique list of all the words that appear in entries, and the second is a many-to-many reference between the words and the entries. You could then do a group and count to sort the entries in the manner you're looking for.
It's a lot more work than implementing a third-party search engine (or full text search), but it will allow you greater control over the results.
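A minimal sketch of that two-table idea, using Python's built-in sqlite3 module so it runs as-is. This is not phpBB's actual schema or code; the table names (words, word_matches), the tokenizer and both functions are assumptions for illustration. It stores one match row per occurrence, so the group-and-count gives the number of times the word appears in each entry.

```python
import re
import sqlite3

# Hypothetical schema: a unique word list plus a many-to-many link table.
SCHEMA = """
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE word_matches (word_id INTEGER, entry_id INTEGER);
"""


def index_entry(conn, entry_id, text):
    """Split `text` into lowercase words and record one row per occurrence."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT word_id FROM words WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO word_matches (word_id, entry_id) VALUES (?, ?)",
            (word_id, entry_id),
        )


def search_words(conn, term):
    """Return (entry_id, hits) rows, most occurrences first."""
    sql = """
        SELECT m.entry_id, COUNT(*) AS hits
        FROM word_matches AS m
        JOIN words AS w ON w.word_id = m.word_id
        WHERE w.word = ?
        GROUP BY m.entry_id
        ORDER BY hits DESC, m.entry_id
    """
    return conn.execute(sql, (term.lower(),)).fetchall()
```

Unlike the LIKE-based approaches, this matches whole words only, and the lookup goes through the word table instead of scanning every entry's text.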

As an alternative to Sphinx and Lucene, a relatively simple search engine can be created using the Xapian library.
+ Supports many advanced search features (such as relevancy ranking)
+ Fast
- You would need to learn the API to create your interface
- Requires a php extension to be installed
Note also that Xapian stores its data in a separate index to mysql.
You might also be interested in Forage which is a wrapper for Solr, Xapian and Lucene.
The Xapian people also created the Omega search engine which is a frontend to Xapian, and can be called via cgi.

Here's a much simpler and easier to operate open source alternative to Solr / Lucene:
http://github.com/typesense/typesense

Google Custom Site Search is great if you don't query it much (I think you get 1k queries/day for free) or if you're willing to pay.
MySQL's fulltext search is also a great resource (as has been mentioned previously).
Yahoo's BOSS is an intriguing project -- I'm going to give it a shot during my next search project.
And, finally, Lucene is a great resource if you need more power than fulltext, but want to tweak your own search engine. http://lucene.apache.org

I came across the Zoom Search Engine a few days ago and think this might be the simplest search engine I have ever used.
The Windows-based tool creates a database of the site, then asks you what language (PHP, ASP.NET, JavaScript, etc.) you want to use. I picked PHP and it built the PHP code for me. All I had to do then was upload the files to the server and (optionally) customize the template, and site search was working.
This is free for small sites, and the only con I can find is that the spider tool (database builder) has to run on Windows.
