I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.
Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).
The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.
SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.
Does anyone has any experience with this kind of searching and has any suggestions on how to better implement this? I haven't tried yet any of those approaches, I'm just gathering ideas at this point.
The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.
You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers'.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP or you could rewrite it in PHP---the recognize itself is about 220 lines of C, and the main program is another 135 lines.
All the proposed solutions, including Aho-Corasick, have these properties in common:
A preprocessing step that takes time and space proportional to the number of keywords in the database.
A search step that takes time and space proportional to the length of the text plus the number of keywords found.
Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest scrabble program will probably do the trick.
In general,
break the text into words
b. convert words back to canonical root form
c. drop common conjunction words
d. strip duplicates
insert the words into a temporary table then do an inner join against the keywords table,
or (as you suggested) build the keywords into a complex query criteria
It may be worthwhile to cache a 3- or 4-letter hash array with which to pre-filter potential keywords; you will have to experiment to find the best tradeoff between memory size and effectiveness.
I'm not 100% clear on what you're asking, but maybe what you're looking for is an inverted index?
Update:
You can use an inverted index to match multiple keywords at once.
Split up the new document into tokens, and insert the tokens paired with an identifier for the document into the inverted index table. A (rather denormalized) inverted index table:
inverted_index
-----
document_id keyword
If you're searching for 3 keywords manually:
select document_id, count(*) from inverted_index
where keyword in (keyword1, keyword2, keyword3)
group by document_id
having count(*) = 3
If you have a table of the keywords you care about, just use an inner join rather than an in() operation:
keyword_table
----
keyword othercols
select keyword_table.keyword, keyword_table.othercols from inverted_index
inner join keyword_table on keyword_table.keyword=inverted_index.keyword
where inverted_index.document_id=id_of_some_new_document
is any of this closer to what you want?
Have you considered graduating to a fulltext solution such as Sphinx?
I'm talking out of my hat here, because I haven't used it myself. But it's getting a lot of attention as a high-speed fulltext search solution. It will probably scale better than any relational solution you use.
Here's a blog about using Sphinx as a fulltext search solution in MySQL.
I would do 2 things here.
First (and this isn't directly related to the question) I'd break up and partition user keywords by users. Having more tables with fewer data, ideally on different servers for distributed lookups where slices or ranges of users exist on different slices. Aka, all of usera's data exists on slice one, userb on slice two, etc.
Second, I'd have some sort of in-memory hash table to determine existence of keywords. This would likely be federated as well to distribute the lookups. For n keyword-existence servers, hash the keyword and mod it by n then distribute ranges of those keys across all of the memcached servers. This quick way lets you say is keyword x being watched, hash it and determine what server it would live on. Then make the lookup and collect/aggregate keywords being tracked.
At that point you'll at least know which keywords are being tracked and you can take your user slices and perform subsequent lookups to determine which users are tracking which keywords.
In short: SQL is not an ideal solution here.
I hacked up some code for scanning for multiple keywords using a dawg (as suggested above referencing the Scrabble paper) although I wrote it from first principles and I don't know whether it is anything like the AHO algorithm or not.
http://www.gtoal.com/wordgames/spell/multiscan.c.html
A friend made some hacks to my code after I first posted it on the wordgame programmers mailing list, and his version is probably more efficient:
http://www.gtoal.com/wordgames/spell/multidawg.c.html
Scales fairly well...
G
Related
I have been having an issue with building a search function for a while now that I'm building for a cooking blog.
In Dutch (similar to German), one can add as many compound words together to create a new word. This has been giving me a headache when wanting to include search results that include a relevant singular word inside compound words. It's kind of like a reverse Scunthorpe problem, I actually want to include certain words inside other words, but only sometimes.
For example, the word rice in Dutch is rijst. Brown rice is zilvervliesrijst and pandan rice is pandanrijst. If I want these two to pop up in search results, I have to search whether words exist inside a word, rather than whether they are the word.
However, this immediately causes issues for smaller words that can exist inside other words accidentally. For example, the word for egg is ei, while leek is prei. Onion is ui, while Brussel sprouts are spruitjes. You can see that accepting subsections of strings being matching the search strings could cause major problems.
I initially tried to grade what percentage of a word contains the search string, but this also causes issues as prei is 50% ei, while zilvervliesrijst is only about 25% rijst. This also makes using a levenshtein distance to solve this very impractical.
My current solution is as follows: I have an SQL table list of ingredients that are being used to automatically calculate the price and calorie total for each recipe based on the ingredient list, and I have used this to add all relevant synonyms to the name column. Basically, zilvervliesrijst is listed as zilvervliesrijst|rijst. I also use this to add both the plural and singular version of a term such that I will not have to test those.
However, this excludes any compound words in any place other than the ingredient list. Things such as title, cuisine, cooking equipment, dietary preferences and so on are still having this problem.
My question is this, is there a non-library-esque method that addresses this within the field of computer science? Or will I be doomed to include every single possible searchable compound word and its singular components, every time I want to add in a new recipe? I just hope that's not the case, as that will massively increase the processing time required for each additional library entry.
I think it will be hard to do this well without using a library, and probably also a dictionary (which may be bundled as part of the library).
There are really two somewhat orthogonal problems:
Splitting compound words into their constituent parts.
Identifying the stem of a simple (non-compound) word. (For example, removing plural markers and inflections.) This is often called "stemming" but that's not really the best strategy; you'll also find the rather awkward term "lemmatization".
Both of these tasks are plagued with ambiguities in all the languages I know about. (A German example, taken from an Arxiv paper describing the German-language morphological analyser DEMorphy, is "Rohrohrzucker", which means "raw cane sugar" -- Roh Rohr Zucker -- but could equally be split into Rohr Ohr Zucker, pipe-ear sugar, if there were such a thing.)
The basic outline of how these tasks can be done in reasonable time (with lots of CPU power) is:
Using ngram analysis to figure out plausible word division points.
Lemmatize each candidate component word to get plausible POS (part-of-speech) markers.
Use a trained machine-learning model (or something of that form) to reject non-sensical (or at least highly improbable) divisions.
At each step, check possible corner cases in a dictionary (of corner cases).
That's just a rough outline, of course.
I was able to find, without too much trouble, a couple of fairly recent discussions of how to do this with Dutch words. I'm not even vaguely competent to discuss the validity of these papers, so I'll leave you to do the search yourself. (I used the search query "split compound words in Dutch".) But I can tell you two things:
The problem is being worked on, but not necessarily to produce freely-available products.
If you choose to tackle it yourself, you'll end up devoting quite a lot of time to the project, although you might find it interesting. If you do succeed, you'll end up with a useful product and the beginning of a thesis (perhaps useful if you have academic ambitions).
However you choose to do it, you're best off only doing it once for each new recipe. Analyse the contents of each recipe as it is entered, to build a list of search terms which you can store in your database along with the recipe. You will probably also want to split and lemmatize search queries, but those are generally short enough that the CPU time is reasonable. Even so, consider caching the analyses in order to save time on common queries.
I have a dictionary(in form of sql table) containing model numbers of mobile phones and an article(or just a line) about mobiles phones(in form of a string in php or C). I want to find out the mobile phone models discussed in that article but I don't want to do a brute force search i.e. search each and every model name in the text one by one.
Also I was thinking of maintain a hash table of the entire dictionary and then try to match then against the hashes of each and every work in the article and look for the collisions. But since the dictionary is very large, memory overhead in this approach is too much.
Also, If there is no database at all i.e. we have everything in the language scope only, dictionary in form of array and text in form of string.
You definitely need to use FULLTEXT index on your article field and perform searches with MATCH/AGAINST for performing searches.
SELECT * FROM your_table MATCH('phonemodel') AGAINST ('article');
Inverted index would help. Link: Inverted index
Split your articles into tokens, filter tokens of model name. So you can build an index, the key of the index is model name, and the value of the index is an article list.
Maybe you can add some additional information like the position of model name appears in the article.
If you thinking of using C and performance is what you wish for. I would suggest building a trie (http://en.wikipedia.org/wiki/Trie) to all the words in the articles. It's a little faster than hashing and consumes much less memory than Dictionary.
It's not easy to implement in c, but I'm sure you can find one ready some where.
Good Luck (:
If you have huge data then use one of them -
Sphinx
Lucene
Trie/DAWG(Directed Acyclic Word Graph) are elegant solutions but also hard to implement & maintain. And, MySQL FULLTEXT search is good but not for large data.
I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields looks like:
Field_id Field_type Field_name Field_Data
- 101 text Name Intel i7
- 102 integer Cores 4 physical, 4 virtual
- 103 select Vendor Intel
- 104 multitext Description The i7 is intel's next gen range of cpus.
The indexer would generate the following results/index:
Keyword Occurrences
- intel 101, 103, 104
- i7 101, 104
- physical 102
- virtual 102
- next 104
- gen 104
- range 104
- cpus 104 (*)
- cpu 104 (*)
So it somewhat looks all nice and fine, however, there are some issues which I'd like to sort out:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
There were other issues which I probably forgot about, if you spot any, you're welcome to yell at me ;-)
I do not need the search engine to crawl links, in fact, I specifically want it to not crawl links.
(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )
Grab a list of stop words(non-keywords) from here, the guy has even formatted them in php for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
What I've done in past is remove suffixes like 's', 'ed' etc with regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
If you are concerned about performance you might want to consider using a search engine like Lucine (solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
Stopwords are the words that will not
be indexed. Typically you'd put most
frequent words in the stopwords list
because they do not add much value to
search results but consume a lot of
resources to process.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
11.2.9. wordforms
Word forms are applied after
tokenizing the incoming text by
charset_table rules. They essentialy
let you replace one word with another.
Normally, that would be used to bring
different word forms to a single
normal form (eg. to normalize all the
variants such as "walks", "walked",
"walking" to the normal form "walk").
It can also be used to implement
stemming exceptions, because stemming
is not applied to words found in the
forms list.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or
‘Porter stemmer’) is a process for
removing the commoner morphological
and inflexional endings from words in
English. Its main use is as part of a
term normalisation process that is
usually done when setting up
Information Retrieval systems.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
A good example for attributes would be
a forum posts table. Assume that only
title and content fields need to be
full-text searchable - but that
sometimes it is also required to limit
search to a certain author or a
sub-forum (ie. search only those rows
that have some specific values of
author_id or forum_id columns in the
SQL table); or to sort matches by
post_date column; or to group matching
posts by month of the post_date and
calculate per-group match counts.
This can be achieved by specifying all
the mentioned columns (excluding title
and content, that are full-text
fields) as attributes, indexing them,
and then using API calls to setup
filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
#vendor intel
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. > The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:
"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.
"total_found":
Total amount of matching documents in index (that were found and procesed on server).
"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statitics ("docs", "hits").
"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.
"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
filtering out common words (as you
perhaps noticed, "the" "is" "of" and
"intel's" are missing from list)
Find (or create) a list of common words and filter user input.
With regards to "cpus" (plurals vs
singulars), would it be best to use a
particular type (singular or plural),
both or exact (ie, "cpus" is different
"cpu")?
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Continuing with previous item, how can
I determine a plural (different
flavors: test=>tests fish=>fish and
leaf=>leaves)
Create an Inflector method or class. ie: Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language, look them up.
I'm currently using MySql and I'm very
concerned with performance issues; we
have 500+ categories and we didn't
even launch the site
Having good schema and code design helps, but I can't really give you much advice on that one.
Let's say I wanted to use the search
term "vendor:intel", where vendor
specifies the field name (field_name),
do you think there would be a huge
impact on the sql server?
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
Search throttling; I don't like this
at all, but it's a possibility, and if
you know of any workarounds, make
yourself heard!
Not many options here. To help here and in performance, you should consider having some sort of caching.
I would heartily suggest you take a look at Solr. It's a Java based self contained Search and index system and probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many are suggesting to use an existing package, (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather wasted some time tweaking my system rather than fixing someone else's code, which I have to waste way more time to understand first.
Call me conservative, but I rarely stick to someone else's code/programs, and when I did, it was because of a desperate situation - and I usually end up somehow contributing to said project.
There's a PHP implementation of a Brill Part of Speech tagger on php/ir. This might provide a framework for identifying those words that should be discarded and those you want to index, while it also identifies plurals (and the root singular). It's not perfect, though a custom dictionary to handle technical terms, it could prove useful for resolving your first three questions.
When searching the db with terms that retrieve no results I want to allow "did you mean..." suggestion (like Google).
So for example if someone looks for "jquyer"
", it would output "did you mean jquery?"
Of course, suggestion results have to be matched against the values inside the db (i'm using mysql).
Do you know a library that can do this? I've googled this but haven't found any great results.
Or perhaps you have an idea how to construct this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell the SOUNDEX function was originally used to deal with common typos and alternate spellings for family names, and this function, encapsulates very well many common spelling mistakes (in the english language). Because of its focus on family names, the original soundex function may be limiting (for example encoding stops after the third or fourth non-repeating consonant letter), but it is easy to expend the algorithm.
The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.
A totally different approach, only applicable on user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; This [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closer to the allegedly misspelled word(s). The word frequency (within the results lists) can be used to both limit the number of words (only evaluate similarity for the words which are found more than x times), as well as to provide weight, to slightly skew the similarity measurements (i.e favoring words found "in quantity" in the database, even if their similarity measurement is slightly less).
How about the levenshtein function, or similar_text function?
Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement, which employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
Domain specificity helps avoid misleading probabilities from overtaking your implementation
Ex: "Memoize" may be spell-corrected to "Memorize" for most off-the-shelf, dictionaries, but that's a perfectly good search term for a computer science page.
Proper nouns that are available within your search index are now accounted for.
Ex: If you're Dell, and someone searches for 'inspiran', there's absolutely no chance the spell-correct function will know you mean 'inspiron'. It will probably spell-correct to 'inspiring' or something more common, and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.
If someone searched for jQuery, I would build a select-statement that went
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions and it worked really well.
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store the misspelling and the word it matches in a database. Then, when you have nothing matching any search results, you can check against the misspelling table, and use the suggested word.
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.
You can simply use an Api like this one https://www.mashape.com/marrouchi/did-you-mean
OK I have a mySQL Database that looks something like this
ID - an int and the unique ID of the recorded
Title - The name of the item
Description - The items description
I want to search both title and description of key words, currently I'm using.
SELECT * From ‘item’ where title LIKE %key%
And this works and as there’s not much in the database, as however searching for “this key” doesn’t find “this that key” I want to improve the search engine of the site, and may be even add some kind of ranking system to it (but that’s a long time away).
So to the question, I’ve heard about something called “Full text search” it is (as far as I can tell) a staple of database design, but being a Newby to this subject I know nothing about it so…
1) Do you think it would be useful?
And an additional questron…
2) What can I read about database design / search engine design that will point me in the right direction.
If it’s of relevance the site is currently written in stright PHP (I.E. without a framework) (thro the thought of converting it to Ruby on Rails has crossed my mind)
update
Thanks all, I'll go for Fulltext search.
And for any one finding this later, I found a good tutorial on fulltext search as well.
The problem with the '%keyword%' type search is that there is no way to efficiently search on it in a regular table, even if you create an index on that column. Think about how you would look that string up in the phone book. There is actually no way to optimize it - you have to scan the entire phone book - and that is what MySQL does, a full table scan.
If you change that search to 'keyword%' and use an index, you can get very fast searching. It sounds like this is not what you want, though.
So with that in mind, I have used fulltext indexing/searching quite a bit, and here are a few pros and cons:
Pros
Very fast
Returns results sorted by relevance (by default, although you can use any sorting)
Stop words can be used.
Cons
Only works with MyISAM tables
Words that are too short are ignored (default minimum is 4 letters)
Requires different SQL in where clause, so you will need to modify existing queries.
Does not match partial strings (for example, 'word' does not match 'keyword', only 'word')
Here is some good documentation on full-text searching.
Another option is to use a searching system such as Sphinx. It can be extremely fast and flexible. It is optimized for searching and integrates well with MySQL.
You might also consider Zend_Lucene. It's slightly easier to integrate than Sphinx, because it is pure PHP.
I would guess that MySQL fulltext is sufficient for your needs, but it's worth noting that the built in support doesn't scale very well. For average size documents it starts to become unusable for table sizes as small as a few hundred thousand rows. If you think that this might become a problem further on you should probably look into Sphinx already. It's becoming the defacto standard for MYSQL-users, even though I personally prefer to implement my own solution using java lucene. :)
Also, I'd like to mention that full text search is fundamentally different from the standard LIKE '%keyword%'-search. Unlike the LIKE-search full text indexing allows you to search for several keywords that doesn't have to appear right next to each other. Standard search engines such as google are full text search engines, for example.