I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields look like:

| Field_id | Field_type | Field_name  | Field_Data                                |
|----------|------------|-------------|-------------------------------------------|
| 101      | text       | Name        | Intel i7                                  |
| 102      | integer    | Cores       | 4 physical, 4 virtual                     |
| 103      | select     | Vendor      | Intel                                     |
| 104      | multitext  | Description | The i7 is intel's next gen range of cpus. |
The indexer would generate the following results/index:
| Keyword  | Occurrences   |
|----------|---------------|
| intel    | 101, 103, 104 |
| i7       | 101, 104      |
| physical | 102           |
| virtual  | 102           |
| next     | 104           |
| gen      | 104           |
| range    | 104           |
| cpus     | 104 (*)       |
| cpu      | 104 (*)       |
So far it looks nice and fine; however, there are some issues I'd like to sort out:

- Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list).
- With regards to "cpus" (plurals vs. singulars), would it be best to index a particular form (singular or plural), both, or match exactly (i.e., "cpus" is different from "cpu")?
- Continuing with the previous item, how can I determine a plural (different flavors: test => tests, fish => fish and leaf => leaves)?
- I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site yet.
- Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name); do you think there would be a huge impact on the SQL server?
- Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!

There were other issues which I've probably forgotten about; if you spot any, you're welcome to yell at me ;-)
I do not need the search engine to crawl links, in fact, I specifically want it to not crawl links.
(By the way, I'm not biased towards Intel; it simply happens that I own an i7-based PC ;-) )
Grab a list of stop words (non-keywords) from here; the guy has even formatted them in PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
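For illustration, here is a minimal sketch of that approach; the stop-word list below is a tiny stand-in for the full one from the link:

```php
<?php
// A minimal sketch of stop-word removal with preg_replace. The list
// here is a tiny stand-in for the full list from the link above.
$stopWords = array('the', 'is', 'of', 'a', 'an', 'and', 'or', 'to');

function removeStopWords($text, array $stopWords)
{
    // One alternation pattern; \b keeps "is" from eating part of "this".
    $pattern = '/\b(' . implode('|', array_map('preg_quote', $stopWords)) . ')\b/i';
    $text = preg_replace($pattern, '', $text);
    // Collapse the whitespace the removals leave behind.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo removeStopWords("The i7 is intel's next gen range of cpus.", $stopWords);
// => "i7 intel's next gen range cpus."
```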
What I've done in the past is remove suffixes like 's', 'ed' etc. with a regex, and apply the same regex to the search string. It's not ideal, though; this was for a basic website with only 200 pages.
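A hedged sketch of that regex idea (deliberately naive; a real stemmer such as the Porter stemmer mentioned elsewhere in this thread handles far more cases):

```php
<?php
// Crude suffix stripping. Run the same function over indexed text and
// over the search string so both sides normalize identically.
function stripSuffixes($word)
{
    // Require at least 3 leading characters so "Ted" stays "Ted".
    return preg_replace('/^(\w{3,}?)(es|ed|s)$/i', '$1', $word);
}

echo stripSuffixes('cats'), "\n";    // cat
echo stripSuffixes('indexed'), "\n"; // index
echo stripSuffixes('Ted'), "\n";     // Ted (too short to strip)
```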
If you are concerned about performance, you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list).
11.2.8. stopwords
Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
With regards to "cpus" (plurals vs. singulars), would it be best to index a particular form (singular or plural), both, or match exactly (i.e., "cpus" is different from "cpu")?
11.2.9. wordforms
Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
Continuing with the previous item, how can I determine a plural (different flavors: test => tests, fish => fish and leaf => leaves)?
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name); do you think there would be a huge impact on the SQL server?
3.2. Attributes
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts. This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
@vendor intel
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:

- "matches": Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
- "total": Total amount of matches retrieved on the server (ie. in the server-side result set) by this query. You can retrieve up to this amount of matches from the server for this query text with current query settings.
- "total_found": Total amount of matching documents in the index (that were found and processed on the server).
- "words": Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").
- "error": Query error message reported by searchd (string, human readable). Empty if there were no errors.
- "warning": Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
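Putting those pieces together, here is a rough sketch using the classic Sphinx PHP API (sphinxapi.php). The index name, the port, and the idea that field_id doubles as the document ID are assumptions for the example, not something from your setup:

```php
<?php
require 'sphinxapi.php'; // bundled with the Sphinx distribution

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2); // enables the @field syntax

// "vendor:intel" from the UI, translated to the field-search operator:
$result = $cl->Query('@vendor intel', 'fields_index');

if ($result === false) {
    echo 'Query failed: ' . $cl->GetLastError() . "\n";
} elseif (!empty($result['matches'])) {
    foreach ($result['matches'] as $docId => $match) {
        // $docId is your field_id if you indexed it as the document ID.
        echo $docId . ' (weight ' . $match['weight'] . ")\n";
    }
}
```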
Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list).
Find (or create) a list of common words and filter user input.
With regards to "cpus" (plurals vs. singulars), would it be best to index a particular form (singular or plural), both, or match exactly (i.e., "cpus" is different from "cpu")?
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Continuing with the previous item, how can I determine a plural (different flavors: test => tests, fish => fish and leaf => leaves)?
Create an Inflector method or class, e.g. Inflect::plural('fish') gives you 'fish'. There might already be classes like this for the English language; look them up.
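A sketch of what such a hypothetical Inflect class might look like, with a table of irregulars checked before a couple of generic rules:

```php
<?php
// Hypothetical Inflect class: irregulars first, then generic rules.
class Inflect
{
    private static $irregular = array(
        'fish'   => 'fish',
        'leaf'   => 'leaves',
        'person' => 'people',
    );

    public static function plural($word)
    {
        $lower = strtolower($word);
        if (isset(self::$irregular[$lower])) {
            return self::$irregular[$lower];
        }
        if (preg_match('/(s|x|z|ch|sh)$/i', $word)) {
            return $word . 'es';                 // box => boxes
        }
        if (preg_match('/[^aeiou]y$/i', $word)) {
            return substr($word, 0, -1) . 'ies'; // category => categories
        }
        return $word . 's';                      // test => tests
    }
}

echo Inflect::plural('test'), "\n"; // tests
echo Inflect::plural('fish'), "\n"; // fish
echo Inflect::plural('leaf'), "\n"; // leaves
```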
I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site yet.
Having good schema and code design helps, but I can't really give you much advice on that one.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name); do you think there would be a huge impact on the SQL server?
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
Not many options here. To help both with throttling and with performance, you should consider some sort of caching.
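As a sketch of the caching idea, even a dumb time-bounded cache keyed on the normalized query takes pressure off the database. APC is used purely as an example backend, and runExpensiveSearch() is a placeholder for your existing search code:

```php
<?php
// Illustrative query cache: normalize the query, hash it, and only
// hit MySQL on a cache miss. APC is just one possible backend.
function cachedSearch($query, $ttl = 300)
{
    $key = 'search_' . md5(strtolower(trim($query)));

    $hit = apc_fetch($key, $success);
    if ($success) {
        return $hit;
    }

    $results = runExpensiveSearch($query); // placeholder for your search code
    apc_store($key, $results, $ttl);
    return $results;
}
```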
I would heartily suggest you take a look at Solr. It's a Java-based, self-contained search and indexing system, and it probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many of you are suggesting an existing package (and I want to make it harder for you than just suggesting a package ;-) ), let's presume, over in this answer thread, that I will use such a package.
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is: how easy is it to make the search engine work the way I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather waste some time tweaking my own system than fixing someone else's code, which I'd have to waste far more time understanding first.
Call me conservative, but I rarely stick to someone else's code/programs, and when I have, it was because of a desperate situation; I usually end up somehow contributing to said project anyway.
There's a PHP implementation of a Brill part-of-speech tagger on php/ir. This might provide a framework for identifying which words should be discarded and which you want to index, and it also identifies plurals (and the root singular). It's not perfect, but with a custom dictionary to handle technical terms it could prove useful for resolving your first three questions.
Related
I have been having an issue for a while now with a search function I'm building for a cooking blog.
In Dutch (similar to German), one can chain compound words together to create a new word. This has been giving me a headache when I want search results to include a relevant singular word that appears inside compound words. It's kind of like a reverse Scunthorpe problem: I actually want to match certain words inside other words, but only sometimes.
For example, the word for rice in Dutch is rijst. Brown rice is zilvervliesrijst and pandan rice is pandanrijst. If I want these two to show up in search results for rijst, I have to check whether the search word exists inside a word, rather than whether it is the word.
However, this immediately causes issues for smaller words that can accidentally occur inside other words. For example, the word for egg is ei, while leek is prei. Onion is ui, while Brussels sprouts are spruitjes. You can see that accepting substring matches against the search string could cause major problems.
I initially tried to grade what percentage of a word the search string covers, but this also causes issues: prei is 50% ei, while zilvervliesrijst is only about 30% rijst. This also makes using a Levenshtein distance to solve this very impractical.
My current solution is as follows: I have an SQL table of ingredients, used to automatically calculate the price and calorie totals for each recipe from its ingredient list, and I have used it to add all relevant synonyms to the name column. Basically, zilvervliesrijst is listed as zilvervliesrijst|rijst. I also use this to add both the plural and singular versions of a term, so that I don't have to test for those.
However, this excludes compound words anywhere other than the ingredient list. Fields such as title, cuisine, cooking equipment and dietary preferences still have this problem.
My question is this: is there a non-library method within the field of computer science that addresses this? Or am I doomed to include every possible searchable compound word and its singular components every time I add a new recipe? I just hope that's not the case, as that would massively increase the processing time required for each additional entry.
I think it will be hard to do this well without using a library, and probably also a dictionary (which may be bundled as part of the library).
There are really two somewhat orthogonal problems:
Splitting compound words into their constituent parts.
Identifying the stem of a simple (non-compound) word. (For example, removing plural markers and inflections.) This is often called "stemming" but that's not really the best strategy; you'll also find the rather awkward term "lemmatization".
Both of these tasks are plagued with ambiguities in all the languages I know about. (A German example, taken from an arXiv paper describing the German-language morphological analyser DEMorphy, is "Rohrohrzucker", which means "raw cane sugar" (Roh + Rohr + Zucker) but could equally be split into Rohr + Ohr + Zucker, pipe-ear sugar, if there were such a thing.)
The basic outline of how these tasks can be done in reasonable time (with lots of CPU power) is:
Use n-gram analysis to figure out plausible word-division points.
Lemmatize each candidate component word to get plausible POS (part-of-speech) markers.
Use a trained machine-learning model (or something of that form) to reject non-sensical (or at least highly improbable) divisions.
At each step, check possible corner cases in a dictionary (of corner cases).
That's just a rough outline, of course.
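Just to make the shape of the problem concrete, here is a heavily simplified, dictionary-only splitter (no n-grams, no POS scoring; all names invented for the example). Real systems layer the statistical steps above on top of something like this, scoring candidate splits instead of taking the first one that works:

```php
<?php
// Naive recursive compound splitter: try every split point and keep
// the first decomposition where every part is a known dictionary word.
function splitCompound($word, array $dictionary, $minPart = 3)
{
    if (isset($dictionary[$word])) {
        return array($word);
    }
    $len = strlen($word);
    for ($i = $minPart; $i <= $len - $minPart; $i++) {
        $head = substr($word, 0, $i);
        if (!isset($dictionary[$head])) {
            continue;
        }
        $rest = splitCompound(substr($word, $i), $dictionary, $minPart);
        if ($rest !== null) {
            return array_merge(array($head), $rest);
        }
    }
    return null; // no decomposition found
}

// $minPart = 3 guards against tiny accidental parts like "ei" in "prei".
$dict = array_flip(array('zilver', 'vlies', 'rijst', 'pandan', 'prei', 'ei'));
print_r(splitCompound('zilvervliesrijst', $dict)); // [zilver, vlies, rijst]
print_r(splitCompound('pandanrijst', $dict));      // [pandan, rijst]
print_r(splitCompound('prei', $dict));             // [prei], not [pr, ei]
```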
I was able to find, without too much trouble, a couple of fairly recent discussions of how to do this with Dutch words. I'm not even vaguely competent to discuss the validity of these papers, so I'll leave you to do the search yourself. (I used the search query "split compound words in Dutch".) But I can tell you two things:
The problem is being worked on, but not necessarily to produce freely-available products.
If you choose to tackle it yourself, you'll end up devoting quite a lot of time to the project, although you might find it interesting. If you do succeed, you'll end up with a useful product and the beginning of a thesis (perhaps useful if you have academic ambitions).
However you choose to do it, you're best off only doing it once for each new recipe. Analyse the contents of each recipe as it is entered, to build a list of search terms which you can store in your database along with the recipe. You will probably also want to split and lemmatize search queries, but those are generally short enough that the CPU time is reasonable. Even so, consider caching the analyses in order to save time on common queries.
We are currently running our app on MySQL and are planning to move to MongoDB. We have moved some parts already, but we are having issues with MongoRegex performance.
We have an autocomplete search box that joins 6 tables (indexed and non-indexed fields) and returns results super fast on MySQL. The same thing on MongoDB performs really slowly: it takes about 2.3 seconds on just one collection, so the user has to wait a long time. The connection time is 0.064 seconds; the query time is 2.36 seconds. I did a bit of Googling and couldn't find a definitive answer. Everyone says MongoRegex is slow. If that's true, how are other companies overcoming this problem?
What is the best way to improve autocomplete performance and experience when running on MongoDB?
First of all, you will have to design your query carefully: select properly indexed fields and shape the query around them. Also, if you are using a regex, make sure you write it in a way that forces the query to use an indexed field; something like /^prefix/ will do. [See this link: http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use ]
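For example, with the legacy PHP Mongo driver (the one that provides MongoRegex); the collection and the pre-lowercased name_lower field here are assumptions for the sketch, not part of your schema:

```php
<?php
// Anchored prefix regex: MongoDB can walk the index on name_lower.
// An unanchored /term/ pattern would scan the whole collection instead.
$term = preg_quote(strtolower($userInput), '/'); // never trust raw input in a regex

$cursor = $collection->find(
    array('name_lower' => new MongoRegex('/^' . $term . '/')),
    array('name' => 1) // project only what the autocomplete box needs
)->limit(10);

// A case-insensitive /i pattern cannot use the index efficiently,
// which is why a pre-lowercased copy of the field is queried here.
```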
I have seen many implementations use MongoDB range queries, but I'm not sure that's the best option, since instantaneous results are a key requirement here.
Apart from that, I have seen someone recommend prefix trees, which effectively store each prefix in a field and then store all the words starting with that particular prefix in another field as an array. This solution sounds convincing and fast, since the prefix fields are meant to be indexed, but you will have to consider the storage cost as well; a sketch follows.
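Sketching that idea with invented field names: one document per prefix, with the completions stored as an array, so the lookup becomes an exact match on the indexed _id:

```php
<?php
// One document per prefix; "words" holds all completions for it.
$collection->insert(array(
    '_id'   => 'inte',
    'words' => array('intel', 'integer', 'internal'),
));

// Autocomplete lookup: a single exact, fully indexed match.
$doc = $collection->findOne(array('_id' => 'inte'));
$suggestions = ($doc !== null) ? $doc['words'] : array();
```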
It is hard to tell, as we don't have the search query, but it is worth mentioning that, after a look at the documentation, MongoDB appears to be able to use an index for a regex search only when the pattern is anchored and prefixed by a constant string:
http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use
So, to quote the doc:
A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
But /abc.*/ or /^.*abc/ will not use the index.
I am creating a simple search function for my website using MySQL and PHP. Right now, if I type the word "cat" into the search bar, I will NOT be able to retrieve articles with the word "cats", and vice versa. The same goes for the "ed" ending.
The only way I can think of to solve this is to remove the "s" or "ed" from the end of each word that is longer than a certain length (to avoid turning "Ted" into "T", etc.). However, this simple solution is nowhere near perfect, and I'm hoping someone can provide a better one.
The technique you are referring to is called stemming. Because of the many influences on a language, this is a difficult thing to handle on your own at the application level. If you'd rather not deal with it yourself, you can let MySQL do the heavy lifting, depending on which version you are running: as of 5.6.4 it is built into the full-text search mechanism for both MyISAM and InnoDB tables; in versions 5.5 through 5.6.3 it is available for MyISAM but not InnoDB tables; and for version 5.1 there is a plugin available from mnoGoSearch. Prior to 5.1 I think you need to handle it at the application level, but I have not confirmed that.
These links might help get you started.
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_stemming
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_full_text_search
http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_fulltext_index
http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
Be aware of the stopword list: a list of very common, often short words that are ignored in your search text when the query is processed. There are settings to control the stopword list if it is preventing you from getting expected results. You will likely want to set the minimum word length to 2 or 3 (the default is 4) and remove many of the words on the default list.
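As a quick sketch of what that looks like in practice (the table and column names are invented, and MySQL 5.6+ is assumed so InnoDB full-text is available):

```php
<?php
// One-time schema change, run once:
//   ALTER TABLE articles ADD FULLTEXT INDEX ft_title_body (title, body);

$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');

// Natural-language full-text search; stopword and minimum-word-length
// behaviour follow the server settings discussed above.
$stmt = $pdo->prepare(
    'SELECT id, title,
            MATCH(title, body) AGAINST(:q1 IN NATURAL LANGUAGE MODE) AS score
       FROM articles
      WHERE MATCH(title, body) AGAINST(:q2 IN NATURAL LANGUAGE MODE)
      ORDER BY score DESC'
);
$stmt->execute(array(':q1' => 'cats', ':q2' => 'cats'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```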
If you do want to handle stemming on your own or with PHP, there is a detailed technical discussion of the Porter stemming algorithm by Martin Porter, and there are at least two PHP implementations available: an older PHP4 one by Jon Abernathy that may have some flaws, and a newer PHP5 one by Richard Heyes.
I am assuming that you are primarily concerned with English but I believe that there is some support for other languages as well.
As mentioned by rnmccall if you need more advanced search capabilities you may need to go with Sphinx or Apache Lucene.
The strategy of removing suffixes described in the question is generally called stemming. If you are still interested in pursuing that strategy, you should check out http://tartarus.org/~martin/PorterStemmer/ for the background of stemming. That page also has a PHP implementation of the Porter stemmer and links to more modern algorithms.
This stemming search approach is used by Sphinx, which is used for pydoc among other things.
The main benefit of the stemming approach is that it is straightforward and can be lightweight.
But, if you want more sophisticated search capabilities, you probably should use something like Apache Lucene.
I'd recommend using Lucene. It will also put less stress on your db, since you aren't running complex queries, just looking up an index. You can also run fuzzy searches with Lucene.
You can simply use
SELECT * FROM topics WHERE Title LIKE '%cat%'
in your query to search for topics whose title contains cat or cats. You can use full-text search if you want to search large text content; in that case you have to use MyISAM tables only. You can read the FullTextSearch documentation here.
There is no need to strip "ed" or anything else you might want to remove. Since you are searching for a string within a paragraph, you just need to provide a keyword to search for; that keyword can be a full word or a substring (part of a word).
Example:
Your content is "You are in a black hole." and you want to find "black" by providing "bla" as the search string. The query would look like:
SELECT * FROM TABLE_NAME WHERE YOUR_FIELD_NAME LIKE '%BLA%'
Use the query above to match against your content; you can provide any substring of the passage you want to search for.
Hope it helps.
Possible solutions:
1. Simplest to implement: use the % operator, e.g. LIKE '%cats%'.
2. Use Solr for a fast implementation, since the optimal algorithms are already implemented there.
Note: you can also cache your results.
A simple query would be:
select * from table where item like '%name%'
To avoid the "t"/"Ted" problem, use the substr() function to cut the string down to a uniform length, and then use that string in the WHERE clause.
When searching the db with terms that retrieve no results, I want to offer a "did you mean..." suggestion (like Google does).
So, for example, if someone searches for "jquyer", it would output "did you mean jquery?"
Of course, the suggestions have to be matched against the values inside the db (I'm using MySQL).
Do you know a library that can do this? I've googled this but haven't found any great results.
Or perhaps you have an idea of how I could construct this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell, the SOUNDEX function was originally used to deal with common typos and alternate spellings of family names, and it encapsulates very well many common spelling mistakes (in English). Because of its focus on family names, the original SOUNDEX function may be limiting (for example, encoding stops after the third or fourth non-repeating consonant), but it is easy to extend the algorithm.
The appeal of this type of function is that it allows computing, ahead of time, a single value that can be associated with each word. This is unlike string-distance functions such as edit distances (Levenshtein, Hamming, or even Ratcliff/Obershelp), which produce a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run time, quickly search the dictionary/database using the SOUNDEX value calculated from the user-supplied search terms. This SOUNDEX search can be done systematically, as a complement to the plain keyword search, or performed only when the keyword search didn't yield a satisfactory number of records, which hints that the user-supplied keyword(s) may be misspelled.
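A minimal version of that precompute-and-index idea (the words table, the $dictionary word list, and an open $pdo connection are all assumed for the sketch):

```php
<?php
// Index time: store the SOUNDEX code alongside every dictionary word.
$insert = $pdo->prepare('INSERT INTO words (word, sdx) VALUES (?, ?)');
foreach ($dictionary as $word) {
    $insert->execute(array($word, soundex($word)));
}

// Query time: a misspelling usually yields the same code, so the
// lookup hits an ordinary index on the sdx column.
$find = $pdo->prepare('SELECT word FROM words WHERE sdx = ?');
$find->execute(array(soundex('jquyer')));
$suggestions = $find->fetchAll(PDO::FETCH_COLUMN);
// soundex('jquyer') === soundex('jquery') === 'J260'
```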
A totally different approach, applicable only to user queries that include several words, is based on running multiple queries against the dictionary/database, each excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; this [reduced] list is typically small enough that pair-based distance functions can be applied to select, within it, the words closest to the allegedly misspelled word(s). The word frequency (within the result lists) can be used both to limit the number of words considered (only evaluate similarity for words found more than x times) and to provide a weight that slightly skews the similarity measurements (i.e. favoring words found "in quantity" in the database, even if their similarity measurement is slightly lower).
How about the levenshtein function, or similar_text function?
Actually, I believe Google's "did you mean" feature is generated from what users type in after they've made a typo. However, that's obviously a lot easier for them, since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
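A sketch of the levenshtein() variant: given a candidate word list (however you source it, e.g. from your keywords table), pick the nearest neighbour and only suggest it when it's reasonably close:

```php
<?php
// Pick the dictionary word with the smallest edit distance to the
// user's term; refuse to guess when nothing is reasonably close.
function didYouMean($term, array $candidates, $maxDistance = 3)
{
    $best = null;
    $bestDist = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        $dist = levenshtein(strtolower($term), strtolower($candidate));
        if ($dist < $bestDist) {
            $bestDist = $dist;
            $best = $candidate;
        }
    }
    return ($bestDist > 0 && $bestDist <= $maxDistance) ? $best : null;
}

echo didYouMean('jquyer', array('jquery', 'java', 'mysql')); // jquery
```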
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement and that employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
- Domain specificity helps prevent misleading probabilities from overtaking your implementation. For example, "memoize" may be spell-corrected to "memorize" by most off-the-shelf dictionaries, but it's a perfectly good search term on a computer science page.
- Proper nouns that are available within your search index are now accounted for. For example, if you're Dell and someone searches for "inspiran", there's absolutely no chance a generic spell-correct function will know you mean "inspiron"; it will probably correct to "inspiring" or something more common and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom-built index of words that the search engine used. I studied what kinds of errors people made most often (based on logs) and sorted the suggestions based on how common each mistake was.
If someone searched for jQuery, I would build a SELECT statement like:
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions and it worked really well.
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store each misspelling together with the word it matches in a database. Then, when a search returns no results, you can check the misspelling table and offer the suggested word.
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.
You can simply use an API like this one: https://www.mashape.com/marrouchi/did-you-mean
OK, I have a MySQL database that looks something like this:
- ID: an int, the unique ID of the record
- Title: the name of the item
- Description: the item's description
I want to search both title and description for key words; currently I'm using:
SELECT * FROM `item` WHERE title LIKE '%key%'
And this works, since there's not much in the database yet. However, searching for "this key" doesn't find "this that key", so I want to improve the site's search engine, and maybe even add some kind of ranking system to it (but that's a long way off).
So, to the question: I've heard about something called "full-text search". As far as I can tell, it is a staple of database design, but being a newbie to this subject I know nothing about it, so...
1) Do you think it would be useful?
And an additional question...
2) What can I read about database design / search engine design that will point me in the right direction?
If it's of relevance, the site is currently written in straight PHP (i.e. without a framework), though the thought of converting it to Ruby on Rails has crossed my mind.
update
Thanks all, I'll go for Fulltext search.
And for anyone finding this later, I found a good tutorial on full-text search as well.
The problem with the '%keyword%' type search is that there is no way to efficiently search on it in a regular table, even if you create an index on that column. Think about how you would look that string up in the phone book. There is actually no way to optimize it - you have to scan the entire phone book - and that is what MySQL does, a full table scan.
If you change that search to 'keyword%' and use an index, you can get very fast searching. It sounds like this is not what you want, though.
So with that in mind, I have used fulltext indexing/searching quite a bit, and here are a few pros and cons:
Pros
Very fast
Returns results sorted by relevance (by default, although you can use any sorting)
Stop words can be used.
Cons
Only works with MyISAM tables
Words that are too short are ignored (default minimum is 4 letters)
Requires different SQL in where clause, so you will need to modify existing queries.
Does not match partial strings (for example, 'word' does not match 'keyword', only 'word')
Here is some good documentation on full-text searching.
Another option is to use a searching system such as Sphinx. It can be extremely fast and flexible. It is optimized for searching and integrates well with MySQL.
You might also consider Zend_Lucene. It's slightly easier to integrate than Sphinx, because it is pure PHP.
I would guess that MySQL full-text search is sufficient for your needs, but it's worth noting that the built-in support doesn't scale very well: with average-size documents it starts to become unusable at table sizes as small as a few hundred thousand rows. If you think this might become a problem further on, you should probably look into Sphinx already. It's becoming the de facto standard for MySQL users, even though I personally prefer to implement my own solution using Java Lucene. :)
Also, I'd like to mention that full-text search is fundamentally different from the standard LIKE '%keyword%' search. Unlike the LIKE search, full-text indexing allows you to search for several keywords that don't have to appear right next to each other. Standard search engines such as Google are full-text search engines, for example.