I am creating a simple search function for my website using MySQL and PHP. Right now, if I type the word "cat" into the search bar, I will NOT be able to retrieve articles with the word "cats", and vice versa. The same happens with the ending "ed".
The only way that I can think of to solve this problem is by removing all "s" and "ed" from the end of each word that is longer than a certain length (to avoid turning "Ted" into "T", etc). However, this simple solution is nowhere near perfect. I'm hoping someone can provide me with a better solution.
The technique you are referring to is called stemming. Because of the great many influences on languages this is a difficult thing to handle on your own at the application level. If you do not want to deal with this you can let MySQL do the heavy lifting for you depending on what version of MySQL you are running. If you are on version 5.6.4 or later it is built into the full-text search mechanism for both MyISAM tables and InnoDB tables. In versions 5.5 through 5.6.3 it is built in for MyISAM but not InnoDB tables. For version 5.1 there is a plugin available from mnoGoSearch. Prior to 5.1 I think you need to handle it at the application level but I have not confirmed that.
These links might help get you started.
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_stemming
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_full_text_search
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_fulltext_index
http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
Be aware of the stopword list, which is a list of very common and often short words that are ignored in your search text when the query is processed. There are settings to control the stopword list if it is preventing you from getting expected results. You will likely want to set the minimum word length to 2 or 3 (the default is 4 for MyISAM and 3 for InnoDB) and remove many of the words on the default list.
If you do want to handle stemming on your own or with PHP there is a detailed technical discussion of the Porter Stemming Algorithm by Martin Porter and there are at least two PHP implementations available, an older one in PHP4 by Jon Abernathy that may have some flaws and a newer one in PHP5 by Richard Heyes.
I am assuming that you are primarily concerned with English but I believe that there is some support for other languages as well.
As mentioned by rnmccall if you need more advanced search capabilities you may need to go with Sphinx or Apache Lucene.
The strategy of removing suffixes described in the question is generally called stemming. If you are still interested in pursuing that strategy, you should check out http://tartarus.org/~martin/PorterStemmer/ for the background of stemming. That page also has a PHP implementation of the Porter stemmer and links to more modern algorithms.
This stemming search approach is used by Sphinx, which is used for pydoc among other things.
The main benefit of the stemming approach is that it is straightforward and can be lightweight.
But, if you want more sophisticated search capabilities, you probably should use something like Apache Lucene.
I'd recommend using Lucene. It will also cause less stress on your db as you aren't running complex queries - just looking up an index. You can also run fuzzy searches with Lucene.
You can simply use
SELECT * FROM topics WHERE Title LIKE '%cat%'
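in your query to find topics whose title contains "cat" or "cats". Note that this matches any title containing those three letters, so "category" is returned as well. You can use full-text search if you need to search large text content. In that case you have to use MyISAM tables (InnoDB only gained full-text indexes in MySQL 5.6.4). You can read the full-text search documentation here.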
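To see what the LIKE pattern above actually returns, here is a minimal sketch. It is written in Python with an in-memory SQLite table standing in for the MySQL topics table, and the sample titles are made up; the same query logic ports directly to PHP and MySQL.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL "topics" table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (Title TEXT)")
conn.executemany(
    "INSERT INTO topics (Title) VALUES (?)",
    [("My cat",), ("Two cats",), ("Product category",), ("A dog",)],
)

def search_titles(term):
    # Bind the pattern as a parameter instead of concatenating user input.
    rows = conn.execute(
        "SELECT Title FROM topics WHERE Title LIKE ? ORDER BY Title",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]

matches = search_titles("cat")
print(matches)
```

The substring match finds both "cat" and "cats", but it also picks up "Product category", which is the main weakness of this approach.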
There is no need to strip "ed" or any other suffix. Because you are searching for a string within a paragraph, you only need to supply a keyword to search for. That keyword can be a full word or a substring (part of a word).
Example:-
You are in a black hole.
Now suppose you want to find "black" by providing "bla" as the search string. The query would look like this:
SELECT * FROM TABLE_NAME WHERE YOUR_FIELD_NAME LIKE '%BLA%'
Use the above query to match any row whose content contains that substring. You can supply any substring of the paragraph or passage you want to search.
Hope it will help you.
Possible solutions:
1. Simplest to implement: use the % wildcard with LIKE,
e.g. LIKE '%cats%'
2. Use Solr for a fast implementation, since well-optimized algorithms are already implemented there.
Note: you can also cache your results.
A simple query will be:
select * from table where item like '%name%'
To avoid the "T" and "Ted" problem, use the substr() function to trim the search string to a fixed length, and then put that string in the WHERE clause.
Related
I'm trying to build a search engine for a website. It's mostly a collection of HTML/CSS pages with some PHP. Now that's all there is. All of my content is on the pages.
From what I understand, to be able to do this I would need to have the content in a database, am I correct?
If so, I was considering creating a MySQL table with four columns: "Keywords", "Titles", "Content" and "Link".
Keywords - will hold a word that, if it is in the query, marks this row as the most likely result.
Titles - after searching Keywords, searching the titles should produce the most relevant results.
Content - should be a last resort for finding something, as I believe it will be messier.
Link - is just the link that belongs to the particular row.
I will be implementing it with PHP and MySQL, and it will be tiresome to put all the content, titles etc into a db. Is this a good method or should I be looking at something else?
Thanks.
---------------EDIT-------------------
Lucene seems like a good option. However, even after reading the Getting Started guide and looking around a bit on the web, I can't understand how it works. Can someone point me somewhere that explains this in a very, very basic manner? Especially taking into consideration that I do not know how to compile anything.
Thank you.
Building a search engine from scratch is painful. It is an interesting task, indeed, so if it is for learning, then do it!
However, if you just need a good search function for your web site, please use something that others have done for you. Apache Lucene is one option.
Sphinxsearch is an open-source full-text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind.
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as a database server.
I'm assuming your pages are static HTML. You can do two things at once and transfer the content of the pages in the DB, so that they will be generated on the fly by reading their content from the DB.
Anyway, I think your strategy is OK at least for a basic search engine. Also have a look into MySQL fulltext search.
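The Keywords / Titles / Content priority described in the question can be sketched as a simple weighted score. This is only an illustration in Python (it ports easily to PHP): the weights, the sample rows and the function names are assumptions, not something from the question.

```python
# Assumed weights: a Keywords hit counts most, a Content hit least.
WEIGHTS = {"keywords": 3, "title": 2, "content": 1}

def score(row, term):
    """Add up the weight of every column that contains the term."""
    term = term.lower()
    return sum(weight for column, weight in WEIGHTS.items() if term in row[column].lower())

def search(rows, term):
    """Return links of matching rows, best score first."""
    scored = [(score(row, term), row["link"]) for row in rows]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in hits]

# Hypothetical rows standing in for the four-column table.
rows = [
    {"keywords": "contact", "title": "Contact us", "content": "Email or phone.", "link": "/contact.html"},
    {"keywords": "about", "title": "About", "content": "How to contact the team.", "link": "/about.html"},
    {"keywords": "home", "title": "Welcome", "content": "Nothing relevant here.", "link": "/index.html"},
]
results = search(rows, "contact")
print(results)
```

A page that matches in Keywords and Title outranks one that only mentions the term in its content, which is the ordering the question asks for.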
MySQL fulltext search will be the easiest to set up, but it will be a lot slower than Sphinxsearch. Even Lucene is slower than Sphinx. So if speed is a criterion, I would suggest taking time out to learn and implement Sphinx.
In one of his presentations, Andrew Aksyonoff (creator of Sphinx) presented the following benchmarking results. Approximately 3.5 million records with around 5 GB of text were used for the purpose.
                          MySQL   Lucene   Sphinx
Indexing time, min         1627      176       84
Index size, MB             3011     6328     2850
Match all, ms/q             286       30       22
Match phrase, ms/q         3692       29       21
Match bool top-20, ms/q      24       29       13
Apart from a basic search, there are many features that make Sphinx a better solution for searching. These features include multi-valued attributes, tokenizing settings, wordforms, HTML processing, geosearching, ranking, and many others.
Zend Lucene is a pure PHP implementation of search which is quite useful.
Another search option is Solr, which is based on Lucene but does a lot of the heavy lifting for you in order to produce more Google-like results. This is probably your easiest option, besides using MySQL MyISAM fulltext search capabilities.
I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields looks like:
Field_id  Field_type  Field_name   Field_Data
101       text        Name         Intel i7
102       integer     Cores        4 physical, 4 virtual
103       select      Vendor       Intel
104       multitext   Description  The i7 is intel's next gen range of cpus.
The indexer would generate the following results/index:
Keyword   Occurrences
intel     101, 103, 104
i7        101, 104
physical  102
virtual   102
next      104
gen       104
range     104
cpus      104 (*)
cpu       104 (*)
So it somewhat looks all nice and fine, however, there are some issues which I'd like to sort out:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
There were other issues which I probably forgot about, if you spot any, you're welcome to yell at me ;-)
I do not need the search engine to crawl links, in fact, I specifically want it to not crawl links.
(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )
Grab a list of stop words (non-keywords) from here; the guy has even formatted them in PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
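As a rough illustration of stop-word filtering while indexing, here is a small sketch in Python (the same idea works with preg_replace or preg_split in PHP). The stop-word list is a tiny made-up sample, the tokenizer and the possessive handling are assumptions, and the field data is copied from the question.

```python
import re

# Tiny illustrative stop-word list (assumption); a real one would be much longer.
STOP_WORDS = {"the", "is", "of", "a", "an", "and"}

def tokenize(text):
    # Lowercase, keep runs of letters, digits and apostrophes,
    # then drop a trailing possessive so "intel's" is indexed as "intel".
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word[:-2] if word.endswith("'s") else word for word in words]

def build_index(fields):
    """Map each non-stop-word keyword to the sorted field ids containing it."""
    index = {}
    for field_id, data in fields.items():
        for word in tokenize(data):
            if word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(field_id)
    return {word: sorted(ids) for word, ids in index.items()}

fields = {
    101: "Intel i7",
    102: "4 physical, 4 virtual",
    103: "Intel",
    104: "The i7 is intel's next gen range of cpus.",
}
index = build_index(fields)
print(index["intel"], index["i7"])
```

Run the search string through the same tokenize() step, otherwise a query containing a stop word or a possessive will not line up with the index.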
What I've done in past is remove suffixes like 's', 'ed' etc with regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
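The suffix-removal idea can be sketched roughly as follows. This is a naive stripper, not the Porter algorithm; it uses plain string checks instead of a regex, and the minimum length of 3 is an assumed threshold. It is written in Python, but the logic is the same in PHP.

```python
MIN_STEM_LENGTH = 3  # assumed threshold so that "Ted" does not become "T"

def naive_stem(word):
    """Strip a trailing "ed" or "s" if at least MIN_STEM_LENGTH letters remain."""
    word = word.lower()
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for sample in ("cat", "cats", "walked", "Ted", "leaf", "leaves"):
    print(sample, "->", naive_stem(sample))
```

Apply the same function to the indexed words and to the search string. It handles "cat"/"cats" but still treats "leaf" and "leaves" as different words, which is why it is not ideal.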
If you are concerned about performance, you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
11.2.9. wordforms
Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
@vendor intel
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics.
The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:
"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.
"total_found":
Total amount of matching documents in index (that were found and processed on server).
"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").
"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.
"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
Find (or create) a list of common words and filter user input.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Create an Inflector method or class, e.g. Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language; look them up.
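A very small inflector along those lines might look like the sketch below. It is written in Python for brevity; the word tables are deliberately tiny and hypothetical, and the suffix rules cover only the most common English patterns.

```python
# Hypothetical, deliberately small tables; a real inflector needs many more entries.
UNCOUNTABLE = {"fish", "sheep", "series"}
IRREGULAR = {"leaf": "leaves", "man": "men", "child": "children"}

def plural(word):
    """Return a best-guess English plural for a singular noun."""
    word = word.lower()
    if word in UNCOUNTABLE:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

for sample in ("test", "fish", "leaf", "box", "category"):
    print(sample, "->", plural(sample))
```

The exception tables are checked before the suffix rules, so irregular words never reach the generic "add s" fallback.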
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Having good schema and code design helps, but I can't really give you much advice on that one.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
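One way to allow "vendor:intel" while only permitting known columns is to check the prefix against a whitelist before it gets anywhere near the SQL. A minimal sketch in Python follows; the ALLOWED_FIELDS mapping and the parse_query name are hypothetical.

```python
# Hypothetical whitelist mapping user-facing prefixes to stored field names.
ALLOWED_FIELDS = {"vendor": "Vendor", "name": "Name", "cores": "Cores"}

def parse_query(raw):
    """Split 'field:term' into (field_name, term).

    field_name is None when no prefix is given or the prefix is not whitelisted,
    in which case the whole input is treated as a plain search term.
    """
    prefix, separator, rest = raw.partition(":")
    key = prefix.strip().lower()
    if separator and key in ALLOWED_FIELDS and rest.strip():
        return ALLOWED_FIELDS[key], rest.strip()
    return None, raw.strip()

print(parse_query("vendor:intel"))
print(parse_query("intel i7"))
print(parse_query("password:secret"))
```

Only the whitelisted field name should be used to pick a column; the search term itself should still be passed as a bound parameter.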
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
Not many options here. To help here and in performance, you should consider having some sort of caching.
I would heartily suggest you take a look at Solr. It's a Java-based, self-contained search and index system and probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many are suggesting to use an existing package, (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather waste some time tweaking my own system than fixing someone else's code, which I would have to waste far more time understanding first.
Call me conservative, but I rarely stick to someone else's code/programs, and when I did, it was because of a desperate situation - and I usually end up somehow contributing to said project.
There's a PHP implementation of a Brill Part of Speech tagger on php/ir. This might provide a framework for identifying those words that should be discarded and those you want to index, while it also identifies plurals (and the root singular). It's not perfect, though with a custom dictionary to handle technical terms it could prove useful for resolving your first three questions.
I want to build a product-search engine.
I was thinking of using google-site-search but that really searches Google's index of your site. I do not want to search that. I want to search a specific table (all the fields, even ones the user never sees) on my data-base for given keywords.
But I want this search to be as robust as possible. Is there something already out there I could use? If not, what's the best way to go about making it myself?
You can try using Sphinx full-text search for MySQL.
Here's also a tutorial from IBM using PHP.
I'd focus on MySQL Full-Text search first. Take a look at these links:
http://dev.mysql.com/doc/refman/4.1/en/fulltext-search.html
http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html
Here is a snippet from the first link:
Full-text searching is performed using MATCH() ... AGAINST syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a literal string, not a variable or a column name. There are three types of full-text searches:
As far as stuff that's already out there, take a look at these :
Search all tables (for SQL Server, but you could probably adapt it to MySQL)
Another search all tables (for SQL Server, but you could probably adapt it to MySQL)
Search all varchar columns in database
MySQL Full-Text Search
Using MySQL Full-Text Search
SELECT * FROM table WHERE value REGEXP 'searchterm'
Allows you to use many familiar search tricks such as +, "", etc
This is a native function of MySQL. There is no need to switch to a new language or plugin, which might be faster but also means extra time for maintenance, troubleshooting, etc.
It may be a little slower than doing some crazy C++ based mashup, but users don't generally notice a difference of a few milliseconds.
One thing you might also want to look into (if you're not going to utilize sphinx), is stemming your keywords. It will make matching keywords a bit easier (as stemming 'cheese' and 'cheesy' would end up producing the same stemmed word) which makes your keyword matching a bit more flexible.
I am using Sphinx as the search engine on my website. It works perfectly and I have no complaints about it. The only thing it lacks is that it does not let me run queries longer than 15 words. I know that in reality people don't use more than 3-4 words, but I want to use it for finding duplicate content.
I was wondering if there is any alternative to Sphinx. I want to cope with duplicate content.
My main articles table is in InnoDB, but I am also caching articles in a MyISAM table for full-text searching. When I search for an article, however, a single search takes ages. It's not a query problem; I think MySQL's full-text search facility is lacking.
Thanks
Jason
Apache Solr is an alternative. It's based on Apache's Lucene project...
you might want to check Lucene as well.
And since you're using MySQL, check its full-text searching: MySQL Full Text Searching
Check Zend_Search_Lucene as well: http://framework.zend.com/manual/en/zend.search.lucene.html
Though it's slower than sphinx.
Perhaps not helpful, but could you simply add a unique index to the MySQL field to prevent insertion of duplicates?
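A unique index directly on a long TEXT column needs a prefix length in MySQL, so one common workaround is to store a hash of the article in its own column and put the unique index on that. The sketch below shows the hashing side in Python; the normalization rules and the content_fingerprint name are assumptions, and it only catches exact duplicates (after normalization), not near-duplicates.

```python
import hashlib
import re

def content_fingerprint(text):
    """Return a SHA-256 hex digest of the text after case and whitespace normalization."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

first = content_fingerprint("Hello   World\n")
second = content_fingerprint("hello world")
print(first == second)
```

The 64-character digest would be stored in a fixed-width column with a unique index, so inserting an article with the same normalized text fails instead of creating a duplicate.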
I have not come across any query length limitations in the Sphinx version I'm using (0.9.9), but maybe I have not tried hard enough.
OK, I have a MySQL database that looks something like this:
ID - an int and the unique ID of the record
Title - The name of the item
Description - The items description
I want to search both title and description for keywords. Currently I'm using:
SELECT * FROM `item` WHERE title LIKE '%key%'
This works, as there's not much in the database. However, searching for "this key" doesn't find "this that key". I want to improve the site's search engine, and maybe even add some kind of ranking system to it (but that's a long way off).
So, to the question: I've heard about something called "full-text search". It is (as far as I can tell) a staple of database design, but being a newbie to this subject I know nothing about it, so...
1) Do you think it would be useful?
And an additional question...
2) What can I read about database design / search engine design that will point me in the right direction?
If it's of relevance, the site is currently written in straight PHP (i.e. without a framework), though the thought of converting it to Ruby on Rails has crossed my mind.
update
Thanks all, I'll go for Fulltext search.
And for any one finding this later, I found a good tutorial on fulltext search as well.
The problem with the '%keyword%' type search is that there is no way to efficiently search on it in a regular table, even if you create an index on that column. Think about how you would look that string up in the phone book. There is actually no way to optimize it - you have to scan the entire phone book - and that is what MySQL does, a full table scan.
If you change that search to 'keyword%' and use an index, you can get very fast searching. It sounds like this is not what you want, though.
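The phone book analogy can be made concrete with a sorted list, which behaves roughly like an index. This is only a Python sketch with made-up words, not how MySQL is implemented internally: a prefix lookup can jump to the right place with a binary search, while a substring lookup has to inspect every entry.

```python
from bisect import bisect_left

# A sorted list plays the role of the index (or the phone book).
words = sorted(["apple", "key", "keyboard", "keyword", "monkey", "turkey"])

def prefix_search(prefix):
    # Binary search jumps to the first candidate, then only neighbors are read.
    start = bisect_left(words, prefix)
    result = []
    for word in words[start:]:
        if not word.startswith(prefix):
            break
        result.append(word)
    return result

def substring_search(fragment):
    # Ordering does not help here: every entry has to be inspected.
    return [word for word in words if fragment in word]

print(prefix_search("key"))
print(substring_search("key"))
```

The prefix search stops as soon as it leaves the "key..." range, whereas the substring search also finds "monkey" and "turkey" but only by scanning the whole list.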
So with that in mind, I have used fulltext indexing/searching quite a bit, and here are a few pros and cons:
Pros
Very fast
Returns results sorted by relevance (by default, although you can use any sorting)
Stop words can be used.
Cons
Only works with MyISAM tables
Words that are too short are ignored (default minimum is 4 letters)
Requires different SQL in where clause, so you will need to modify existing queries.
Does not match partial strings (for example, 'word' does not match 'keyword', only 'word')
Here is some good documentation on full-text searching.
Another option is to use a searching system such as Sphinx. It can be extremely fast and flexible. It is optimized for searching and integrates well with MySQL.
You might also consider Zend_Lucene. It's slightly easier to integrate than Sphinx, because it is pure PHP.
I would guess that MySQL fulltext is sufficient for your needs, but it's worth noting that the built-in support doesn't scale very well. For average-size documents it starts to become unusable for table sizes as small as a few hundred thousand rows. If you think that this might become a problem further on, you should probably look into Sphinx already. It's becoming the de facto standard for MySQL users, even though I personally prefer to implement my own solution using Java Lucene. :)
Also, I'd like to mention that full-text search is fundamentally different from the standard LIKE '%keyword%' search. Unlike the LIKE search, full-text indexing allows you to search for several keywords that don't have to appear right next to each other. Standard search engines such as Google are full-text search engines, for example.
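The difference can be shown with the "this key" example from the question. The sketch below is plain Python with made-up documents, not MySQL itself: like_search mimics LIKE '%...%', and keyword_search requires every word to be present in any position, similar in spirit to a boolean-mode query such as +this +key.

```python
def like_search(docs, phrase):
    # Mimics LIKE '%phrase%': the phrase must appear as one contiguous substring.
    return [doc_id for doc_id, text in docs.items() if phrase.lower() in text.lower()]

def keyword_search(docs, query):
    # Mimics a keyword index: every query word must occur, in any position.
    wanted = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items() if wanted <= set(text.lower().split())]

docs = {1: "this key", 2: "this that key", 3: "another item"}
print(like_search(docs, "this key"))
print(keyword_search(docs, "this key"))
```

The LIKE-style search only finds document 1, while the keyword search also finds "this that key".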