Which DB/DB Engine supports search well?

I'm starting a site which relies heavily on search. While it's probably going to search basic meta data in the beginning, it might grow to something bigger in the future.
So which DB/DB Engine is best in your opinion when it comes to search performance and future scalability?
Appreciate your help

It depends on what you are searching.
If you are doing a lot of text searching then you want more than just a database - you also want a search algorithm. You can find them around the web and they can use several databases as backends.
However, if you only need simple text searches, then MySQL MyISAM offers full-text searching, which I use for small amounts of text (less than a few GB).
Other searches rely on keys and indexes, which might lead you to PostgreSQL for its rock-solid ACID compliance or to MySQL with InnoDB.

What is "search" ? What are you looking for and what kind of queries do you expect?
PostgreSQL is very powerful: it has full-text search and B-tree, hash, GIN and GiST indexes. You can also define your own types and operators, so everything is there to optimize your searches in the database. It's up to you to use it and tune it for your situation.
PostgreSQL is easy to use with PHP, no problem at all. And it's free, released under a BSD-style license.

Depending on what you mean by “search”, any database system might work. (MySQL is a well-known and fast RDBMS.)
If what you are looking for really is “full text search” then you should take a look at MySQL FULLTEXT indexes (only usable with the MyISAM backend, IIRC), Lucene or Xapian.
The Zend Framework (written in PHP) has a ready-made adapter for Lucene, see: http://devzone.zend.com/article/91

Related

Handling simple grammar in a PHP search engine

I am creating a simple search function for my website using MySQL and PHP. Right now, if I type the word "cat" into the search bar, I will NOT be able to retrieve articles with the word "cats", and vice versa. It is the same with the ending "ed".
The only way that I can think of to solve this problem is by removing all "s" and "ed" from the end of each word that is longer than a certain length (to avoid turning "Ted" into "T", etc). However, this simple solution is nowhere near perfect. I'm hoping someone can provide me with a better solution.

The technique you are referring to is called stemming. Because of the great many influences on languages this is a difficult thing to handle on your own at the application level. If you do not want to deal with this you can let MySQL do the heavy lifting for you depending on what version of MySQL you are running. If you are on version 5.6.4 or later it is built into the full-text search mechanism for both MyISAM tables and InnoDB tables. In versions 5.5 through 5.6.3 it is built in for MyISAM but not InnoDB tables. For version 5.1 there is a plugin available from mnoGoSearch. Prior to 5.1 I think you need to handle it at the application level but I have not confirmed that.
These links might help get you started.
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_stemming
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_full_text_search
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_fulltext_index
http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
Be aware of the stopword list which is a list of very common and often short words that are ignored in your search text when the query is processed. There are settings to control the stopword list if it is preventing you from getting expected results. You will likely want to set the minimum word length to 2 or 3 (default is 4) and remove many of the words on the default list.
If you do want to handle stemming on your own or with PHP there is a detailed technical discussion of the Porter Stemming Algorithm by Martin Porter and there are at least two PHP implementations available, an older one in PHP4 by Jon Abernathy that may have some flaws and a newer one in PHP5 by Richard Heyes.
I am assuming that you are primarily concerned with English but I believe that there is some support for other languages as well.
As mentioned by rnmccall if you need more advanced search capabilities you may need to go with Sphinx or Apache Lucene.

The strategy of removing suffixes described in the question is generally called stemming. If you are still interested in pursuing that strategy, you should check out http://tartarus.org/~martin/PorterStemmer/ for the background of stemming. That page also has a PHP implementation of the Porter stemmer and links to more modern algorithms.
This stemming search approach is used by Sphinx, which is used for pydoc among other things.
The main benefit of the stemming approach is that it is straightforward and can be lightweight.
But, if you want more sophisticated search capabilities, you probably should use something like Apache Lucene.
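As a rough illustration of the suffix stripping described in the question, here is a minimal Python sketch. It is deliberately naive and is not the Porter algorithm; the function names and the minimum length of 4 are assumptions made for the example.

```python
def naive_stem(word, min_length=4):
    """Strip a trailing "ed" or "s" from words of at least min_length characters."""
    lowered = word.lower()
    if len(lowered) < min_length:
        return lowered  # short words such as "Ted" are left alone
    if lowered.endswith("ed"):
        return lowered[:-2]
    if lowered.endswith("s") and not lowered.endswith("ss"):
        return lowered[:-1]
    return lowered


def matches(query, text):
    """True if any stemmed query word equals a stemmed word in the text."""
    query_stems = {naive_stem(w) for w in query.split()}
    text_stems = {naive_stem(w.strip('.,!?;:"')) for w in text.split()}
    return bool(query_stems & text_stems)
```

With this, a search for "cat" matches a text containing "cats" and vice versa. The flaws show up quickly, though: "used" is reduced to "us", for example, which is why a real stemmer such as Porter's is the better choice.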

I'd recommend using Lucene. It will also cause less stress on your db as you aren't running complex queries - just looking up an index. You can also run fuzzy searches with Lucene.

You can simply use
SELECT * FROM topics WHERE Title LIKE '%cat%'
in your query to find topics whose title contains "cat" or "cats". If you need to search large text content, you can use full-text search; in that case you have to use MyISAM tables. You can read the full-text search documentation here

There is no need to remove "ed" or anything else. Because you are searching for a string within a paragraph, you need to provide a keyword to search for. That keyword can be a full word or a substring (part of a word).
Example:-
You are in a black hole.
Now suppose you want to find "black" by providing "bla" as the search string. Then the query looks like this:-
SELECT * FROM TABLE_NAME WHERE YOUR_FIELD_NAME LIKE '%BLA%'
Use the query above to match against your content. You can provide any substring of the paragraph or passage that you want to search.
Hope it will help you.

Possible solutions:
1. Simplest to implement -> use the % wildcard with LIKE, e.g.
like '%cats%'
2. Use Solr for a fast implementation, as well-optimized algorithms are already implemented there.
Note: you can also cache your results.

A simple query will be:
select * from table where item like '%name%'
To avoid the problem of "Ted" turning into "T", use the substr() function to cut the search string down to a fixed length, then put that string in the WHERE clause.

Search engine for website

I'm trying to build a search engine for a website. It's mostly a collection of HTML/CSS pages with some PHP. That's all there is for now. All of my content is on the pages.
From what I understand to be able to do this I would need to have the content on a Database, am I correct?
If so I was considering doing as such, creating a MySQL table with four columns "Keywords" "Titles" "Content" and "Link".
Keywords - will hold words that, if they appear in the query, mark this row as the most likely result.
Titles - searched after Keywords; title matches produce the next most relevant results.
Content - should be a last resort for finding something, as I believe it will be messier.
Link - is just the link that belongs to the particular row.
I will be implementing it with PHP and MySQL, and it will be tiresome to put all the content, titles etc into a db. Is this a good method or should I be looking at something else?
Thanks.
---------------EDIT-------------------
Lucene seems like a good option; however, even after reading the Getting Started guide and looking around a bit on the web, I can't understand how it works. Can someone point me somewhere that explains this in a very, very basic manner? Especially taking into consideration that I do not know how to compile anything.
Thank you.

Building a search engine from scratch is painful. It is an interesting task, indeed, so if it is for learning, then do it!
However, if you just need a good search function for your web site, please use something that others have done for you. Apache Lucene is one option.

Sphinxsearch is an open-source full-text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind.
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as a database server.

I'm assuming your pages are static HTML. You can do two things at once and move the content of the pages into the DB, so that the pages are generated on the fly by reading their content from the DB.
Anyway, I think your strategy is OK at least for a basic search engine. Also have a look into MySQL fulltext search.
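The Keywords / Titles / Content priority from the question could be sketched as a weighted score per page. The following Python sketch is only an illustration: the weights, the field names and the in-memory list of pages are assumptions, and in practice the rows would come from the MySQL table.

```python
# Hypothetical weights: a keyword hit counts most, then title, then content.
FIELD_WEIGHTS = {"keywords": 10, "title": 5, "content": 1}


def score_page(page, query):
    """Add a field's weight once for every query term found in that field."""
    terms = query.lower().split()
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        field_words = set(page.get(field, "").lower().split())
        score += weight * sum(1 for term in terms if term in field_words)
    return score


def search(pages, query):
    """Return the links of matching pages, best score first."""
    scored = [(score_page(page, query), page["link"]) for page in pages]
    matching = [(s, link) for s, link in scored if s > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in matching]
```

A page whose keywords and title both contain the search term ranks above a page that only mentions it in its content, which is the ordering the question describes.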

MySQL fulltext search will be the easiest to set up but it will be a lot slower than Sphinxsearch. Even Lucene is slower than Sphinx. So if speed is a criterion, I would suggest taking the time to learn and implement Sphinx.
In one of his presentations, Andrew Aksyonoff (creator of Sphinx) presented the following benchmarking results. Approximately 3.5 million records with around 5 GB of text were used for the purpose.
                          MySQL   Lucene   Sphinx
Indexing time, min         1627      176       84
Index size, MB             3011     6328     2850
Match all, ms/q             286       30       22
Match phrase, ms/q         3692       29       21
Match bool top-20, ms/q      24       29       13
Apart from basic search, there are many features that make Sphinx a better solution for searching. These features include multi-value attributes, tokenizing settings, wordforms, HTML processing, geosearching, ranking, and many others.

Zend Lucene is a pure PHP implementation of search which is quite useful.
Another search option is Solr, which is based on Lucene but does a lot of the heavy lifting for you in order to produce more Google-like results. This is probably your easiest option, besides using MySQL MyISAM fulltext search capabilities.

Exploring search options for PHP

I have an InnoDB table using numerous foreign keys, but we just want to look up some basic info out of it.
I've done some research but still lost.
How can I tell if my host has Sphinx installed already? I don't see it as an option for table storage method (i.e. innodb, myisam).
Zend_Search_Lucene, responsive enough for AJAX functionality of millions of records?
Mirror my innoDB with a myisam? Make every innodb transaction end with a write to the myisam, then use 1:1 lookups? How would I do this automagically? This should make MyISAM ACID-compliant and free(er) from corruption no?
PostgreSQL fulltext queries don't even look like SQL to me wtf, I don't have time to learn a new SQL syntax I need noob options
????????????????????
This is a high-volume site on a decently equipped VPS.
Thanks very much for any ideas.

Sphinx is a very good choice. Very scalable, with built-in clustering and sharding.

Your question is very vague about what you actually want to accomplish here, but I can tell you to stay away from Zend_Search_Lucene with record counts that high. In my experience (and that of many others, including Zend Certified Engineers), ZSL's performance on large record sets is poor at best. Use a tool like Apache Lucene instead if you go that route.

Article search engine in php

I am using Sphinx as a search engine on my website. It's working perfectly and I have no complaints about it. The only thing it lacks is that it does not allow me to search for articles with a query longer than 15 words. I know that in reality people don't use more than 3-4 words, but I want to use it for finding duplicate content.
I was wondering if there is any alternative solution to Sphinx. I want to cope with duplicate content.
My main articles table is in InnoDB, but I am also caching articles in a MyISAM table for full-text searching. When I search for an article, though, a single search takes ages. It's not a problem with the query; I think MySQL's full-text searching facility is lacking.
Thanks
Jason

Apache Solr is an alternative. It's based on Apache's Lucene project...
you might want to check Lucene as well.
And since you're using MySQL, check its full-text searching: MySQL Full Text Searching

Check Zend_Search_Lucene as well: http://framework.zend.com/manual/en/zend.search.lucene.html
Though it's slower than Sphinx.

Perhaps not helpful, but could you simply add a unique index to the MySQL field to prevent insertion of duplicates?
I have not come across any query length limitations in the Sphinx version I'm using (0.9.9), but maybe I have not tried hard enough.
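One way to make the unique-index idea workable for long article bodies is to index a fingerprint of the text rather than the text itself. The Python sketch below is an illustration under assumptions: the function names are made up, the normalization (lowercasing and collapsing whitespace) is just one possible choice, and it only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib
import re


def content_fingerprint(text):
    """SHA-256 hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(articles):
    """Return (duplicate_id, original_id) pairs for articles whose
    fingerprint equals that of an earlier article."""
    seen = {}
    duplicates = []
    for article_id, text in articles:
        fingerprint = content_fingerprint(text)
        if fingerprint in seen:
            duplicates.append((article_id, seen[fingerprint]))
        else:
            seen[fingerprint] = article_id
    return duplicates
```

Storing the fingerprint in its own column with a unique index would then let the database reject a duplicate on insert, without any full-text query at all.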

Is Full Text search the answer?

OK, I have a MySQL database that looks something like this:
ID - an int and the unique ID of the record
Title - The name of the item
Description - The items description
I want to search both title and description for keywords. Currently I'm using:
SELECT * FROM item WHERE title LIKE '%key%'
This works, as there’s not much in the database; however, searching for “this key” doesn’t find “this that key”. I want to improve the search engine of the site, and maybe even add some kind of ranking system to it (but that’s a long time away).
So to the question: I’ve heard about something called “Full text search”. It is (as far as I can tell) a staple of database design, but being a newbie to this subject I know nothing about it, so…
1) Do you think it would be useful?
And an additional question…
2) What can I read about database design / search engine design that will point me in the right direction.
If it’s of relevance, the site is currently written in straight PHP (i.e. without a framework), though the thought of converting it to Ruby on Rails has crossed my mind.
update
Thanks all, I'll go for Fulltext search.
And for anyone finding this later, I found a good tutorial on fulltext search as well.

The problem with the '%keyword%' type search is that there is no way to efficiently search on it in a regular table, even if you create an index on that column. Think about how you would look that string up in the phone book. There is actually no way to optimize it - you have to scan the entire phone book - and that is what MySQL does, a full table scan.
If you change that search to 'keyword%' and use an index, you can get very fast searching. It sounds like this is not what you want, though.
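The phone book analogy can be made concrete with a small Python sketch. It is not how MySQL is implemented; a sorted list with binary search stands in for an index here, and the function names are assumptions made for the example.

```python
from bisect import bisect_left


def substring_search(titles, fragment):
    """Rough equivalent of LIKE '%fragment%': every title has to be inspected."""
    return [t for t in titles if fragment in t]


def prefix_search(sorted_titles, prefix):
    """Rough equivalent of LIKE 'prefix%' on an indexed column: binary search
    jumps to the first candidate, then neighbours are read until the prefix
    no longer matches."""
    i = bisect_left(sorted_titles, prefix)
    results = []
    while i < len(sorted_titles) and sorted_titles[i].startswith(prefix):
        results.append(sorted_titles[i])
        i += 1
    return results
```

The prefix search only touches the matching entries plus a handful of comparisons to find where they start, while the substring search grows linearly with the size of the list.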
So with that in mind, I have used fulltext indexing/searching quite a bit, and here are a few pros and cons:
Pros
Very fast
Returns results sorted by relevance (by default, although you can use any sorting)
Stop words can be used.
Cons
Only works with MyISAM tables
Words that are too short are ignored (default minimum is 4 letters)
Requires different SQL in where clause, so you will need to modify existing queries.
Does not match partial strings (for example, 'word' does not match 'keyword', only 'word')
Here is some good documentation on full-text searching.
Another option is to use a searching system such as Sphinx. It can be extremely fast and flexible. It is optimized for searching and integrates well with MySQL.

You might also consider Zend_Lucene. It's slightly easier to integrate than Sphinx, because it is pure PHP.

I would guess that MySQL fulltext is sufficient for your needs, but it's worth noting that the built-in support doesn't scale very well. For average-size documents it starts to become unusable at table sizes as small as a few hundred thousand rows. If you think that this might become a problem further on, you should probably look into Sphinx already. It's becoming the de facto standard for MySQL users, even though I personally prefer to implement my own solution using Java Lucene. :)
Also, I'd like to mention that full text search is fundamentally different from the standard LIKE '%keyword%' search. Unlike the LIKE search, full text indexing allows you to search for several keywords that don't have to appear right next to each other. Standard search engines such as Google are full text search engines, for example.
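To show why a full text index can match keywords that are not adjacent, here is a minimal inverted index in Python. It is a sketch of the general idea only, not of MySQL's or Lucene's actual implementation, and the function names are assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every query word, in any position."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Searching for "this key" finds a document containing "this that key", because each word is looked up separately and the resulting sets are intersected. A LIKE '%this key%' search would miss it.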
