PHP/MySQL Search Engine Using Levenshtein Distance

I'm trying to create a simple search engine where users can query a database and get back results that either match or are close to their query. At first I was just using wildcards (%) to find results that were relevant to a user's search. The PHP for that looked something like this:
// Users search terms is saved in $_POST['q']
$q = $_POST['q'];
// Prepare statement
$search = $db->prepare("SELECT `id`, `name` FROM `users` WHERE `name` LIKE ?");
// Execute with wildcards
$search->execute(array("%$q%"));
// Echo results
foreach($search as $s) {
echo $s['name'];
}
The above code works fine, but it's rather limited. While it can fetch results that contain the user's query as a substring (because of the wildcards), it still doesn't return all relevant results; the query still has to match something in the database exactly. For example, if I had a row with the name "Tim", searching for "Timothy" wouldn't find it. So my new approach looks something like this:
// Users search terms is saved in $_POST['q']
$q = $_POST['q'];
// Create array for the names that are close to or match the search term
$results = array();
foreach($db->query('SELECT `id`, `name` FROM `users`') as $name) {
// Keep only relevant results
if (levenshtein($q, $name['name']) < 4) {
array_push($results,$name['name']);
}
}
// Echo out results
foreach ($results as $result) {
echo $result."\n";
}
This code technically works, but it's pretty inefficient, and I'm wondering how it can be improved. The biggest problem is that every row has to be fetched from the database and then filtered in PHP, so the query returns far more data than necessary, which is especially problematic because my database is large. I'd also like to know whether levenshtein alone is sufficient for finding relevant results, or whether there is a better way to filter out the irrelevant ones. Some other ways of filtering I came up with:
if (levenshtein(metaphone($q), metaphone($name['name'])) < 4) {
array_push($results,$name['name']);
}
or
if (similar_text(metaphone($q), metaphone($name['name'])) < 2) {
array_push($results,$name['name']);
}
or
if (similar_text($q, $name['name']) > 2) {
array_push($results,$name['name']);
}
I think using levenshtein with metaphone may actually work the best as it would better take into account simple spelling errors. But I'm not sure which would be the best to use, especially considering that the way I'm doing it now is already very expensive (the large SQL query + the expensive functions that take place in a loop don't bode well for performance).
Thanks in advance
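To make the comparison between these options concrete, here is a minimal sketch of the filter-and-rank step as a standalone function. The name rankByLevenshtein and the $maxDistance parameter are hypothetical, not part of the code above; the default of 3 mirrors the "< 4" threshold. Note that "Tim" and "Timothy" are 4 edits apart, so that threshold would not match them. The sketch assumes the candidate names have already been fetched, ideally narrowed down by some cheaper SQL condition first.

```php
<?php
// Hypothetical helper: keep names within $maxDistance edits of the query,
// ordered from closest to farthest.
function rankByLevenshtein(string $query, array $names, int $maxDistance = 3): array
{
    $q = strtolower($query);
    $scored = [];
    foreach ($names as $name) {
        // Compare case-insensitively so "tim" and "Tim" count as identical.
        $distance = levenshtein($q, strtolower($name));
        if ($distance <= $maxDistance) {
            $scored[] = ['name' => $name, 'distance' => $distance];
        }
    }
    // Smallest edit distance first.
    usort($scored, function ($a, $b) {
        return $a['distance'] <=> $b['distance'];
    });
    return array_column($scored, 'name');
}
```

Sorting by distance means the best matches come first, so the output can be truncated to the top few results instead of echoing everything under the threshold.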

Related

Optimization of search function MySQL or PHP wise

After running a few tests I realized that my search method does not perform very well if some words of the query are short (2~3 letters).
The way I built the search is to run a MySQL query for every word in the string the visitor entered, and then filter the results from each word to see whether every word returned that result. Once a result has been returned for all words it's a match, and I'll show that result to the visitor.
But I was wondering if that's an effective way to do it. Is there a better way that keeps the same functionality?
Currently my code spends about 0.7 s on the MySQL queries, and the rest of the work takes under 0.1 s.
Normally I would not care much about my search taking 0.7 s, but I'd like to create a "LiveSearch", and it is critical that it loads faster than that.
Here is my code:
public static function Search($Query){
$Querys = explode(' ',$Query);
foreach($Querys as $Query)
{
$MatchingRow = \Database\dbCon::$dbCon -> prepare("
SELECT
`Id`
FROM
`product_products` as pp
WHERE
CONCAT(
`Id`,
' ',
(SELECT `Name` FROM `product_brands` WHERE `Id` = pp.BrandId),
' ',
`ModelNumber`,
' ',
`Title`,
IF(`isFreeShipping` = 1 OR `isFreeShippingOnOrder` = 1, ' FreeShipping', '')
)
LIKE :Title;
");
$MatchingRow -> bindValue(':Title','%'.$Query.'%');
try{
$MatchingRow -> execute();
foreach($MatchingRow -> fetchAll(\PDO::FETCH_ASSOC) as $QueryInfo)
$Matchings[$Query][$QueryInfo['Id']] = $QueryInfo['Id'];
}catch(\PDOException $e){
echo 'Error MySQL: '.$e->getMessage();
}
}
$TmpMatch = $Matchings;
// reset() returns the first element; array_shift() on a function result raises a by-reference notice
$Matches = reset($TmpMatch);
foreach($TmpMatch as $Query)
{
$Matches = array_intersect($Matches,$Query);
}
foreach($Matches as $Match){
$Products[] = new Product($Match);
}
return $Products;
}

As others have already suggested, fulltext search is your friend.
The logic should go more or less like this.
Add a text column to your "product_products" table called, say, "FtSearch"
Write a small script that will run only once, in which you write, for each existing product, the text that has to be searched for into the "FtSearch" column. This is, of course, the text you compose in your query (id + brand name + title and so forth, including the FreeShipping part). You can probably do this with a single SQL statement, but I have no mysql at hand and can't provide you the code for that... you might be able to find it out by yourself.
Create a fulltext index on the "FtSearch" column (doing this after having populated the FtSearch column saves you a little execution time).
Now you have to add the code necessary to ensure that every time any of the fields involved in the search string is inserted/updated, you insert/update the search string as well. Pay attention here, since this includes not only the "Title", "ModelNumber" and "FreeShipping" of the "product_product", but as well the "Name" of the "product_brand". This means that if the name of a product_brand is updated, you will have to regenerate all search strings of all products having that brand. This might be a little slow, depending on how many products are bound to that brand. However I assume it does not happen too often that a brand changes its name, and if it does, it certainly happens in some sort of administration interface where such things are usually acceptable.
At this point you can query the table using the MATCH construct, which is far faster than anything you could get with your current approach.
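The steps above could be sketched roughly as follows. This is an untested outline, not the answerer's code: the index name ft_search_idx and the helper toBooleanQuery are hypothetical, and the setup statements assume a MySQL version and storage engine that support FULLTEXT indexes on this table. The helper turns the visitor's input into a boolean-mode expression in which every word is required (+) and matched as a prefix (*).

```php
<?php
// One-off setup statements (assumed names; run once, e.g. from a migration script).
$setupSql = [
    "ALTER TABLE `product_products` ADD COLUMN `FtSearch` TEXT",
    "UPDATE `product_products` pp
        LEFT JOIN `product_brands` pb ON pb.`Id` = pp.`BrandId`
        SET pp.`FtSearch` = CONCAT_WS(' ', pp.`Id`, pb.`Name`, pp.`ModelNumber`, pp.`Title`,
            IF(pp.`isFreeShipping` = 1 OR pp.`isFreeShippingOnOrder` = 1, 'FreeShipping', NULL))",
    "ALTER TABLE `product_products` ADD FULLTEXT INDEX `ft_search_idx` (`FtSearch`)",
];

// Hypothetical helper: "sony tv-55" becomes "+sony* +tv* +55*".
function toBooleanQuery(string $input): string
{
    // Split on anything that is not a letter, digit or underscore, which also
    // drops characters that act as operators in boolean mode.
    $words = preg_split('/[^\p{L}\p{N}_]+/u', $input, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $terms = [];
    foreach ($words as $word) {
        $terms[] = '+' . $word . '*';
    }
    return implode(' ', $terms);
}

// Query to prepare; bind toBooleanQuery($userInput) to :q.
$searchSql = "SELECT `Id` FROM `product_products`
              WHERE MATCH(`FtSearch`) AGAINST(:q IN BOOLEAN MODE)";
```

One caveat: fulltext indexes ignore words shorter than a configured minimum length (ft_min_word_len for MyISAM, innodb_ft_min_token_size for InnoDB), so the 2~3 letter words mentioned in the question may need that setting lowered and the index rebuilt. Prefix matching is also not the same as the substring matching that LIKE '%word%' performs.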

Is it worth saving the keyword <-> link relation into a "hashtable"-like structure in MySQL?

I'm working on a PHP + MySQL application that will crawl an HDD/shared drive and index all files and directories into a database, to provide "fulltext" search on it. So far it's going well, but I'm stuck on the question of whether I chose a good way to store the data in the database.
The picture below shows part of my database schema. The idea is that I save a domain (which represents the part of the disk I want to index), then there are some links (which represent files and folders, with content, file path, etc.), and then I have a table to store the unique keywords that I find in file/folder names or content.
Finally, I have 16 linkkeyword tables to store the relations between links and keywords. I have 16 of them because I thought it might be good to make something like a hashtable, since I'm expecting a high number of link <-> keyword relations (so far, for 15k links and 400k keywords, I have about 2.5 million linkkeyword records). So, to avoid storing that much data in one table (and later searching over it), I thought this hashtable could be faster. It works like this: when I want to search for a word, I compute its md5, look at the first character of the hash, and then I know which linkkeyword table to use. That way there are only about 150~200k records in each linkkeyword table (versus 2.5 million).
So I'm curious whether this approach is of any use, or whether it would be better to store all linkkeyword information in a single table and let MySQL take care of it (and how many link <-> keyword relations would that handle?).
So far this was a great solution for me, but I crashed hard when I tried to implement regular-expression search, so that the user can enter e.g. "tem*", which can match temp, temporary, temple, etc. In a normal keyword search I compute the word's md5 hash and then I know which linkkeyword table to look in. But for a regular expression I need to get all keywords from the keyword table that match the expression and then process them one by one.
I'm also attaching part of the code for the normal keyword search:
private function searchKeywords($selectedDomains) {
$searchValues = $this->searchValue;
$this->resultData = array();
foreach (explode(" ", $searchValues) as $keywordName) {
$keywordName = strtolower($keywordName);
$keywordMd5 = md5($keywordName);
$selection = $this->database->table('link');
$results = $selection->where('domain.id', $selectedDomains)->where('domain.searchable = ?', '1')->where(':linkkeyword' . $keywordMd5[0] . '.keyword.keyword LIKE ?', $keywordName)
->select('link.*,:linkkeyword' . $keywordMd5[0] . '.weight,:linkkeyword' . $keywordMd5[0] . '.keyword.keyword');
foreach ($results as $result) {
$keyExists = array_key_exists($result->linkId, $this->resultData);
if ($keyExists) {
$this->resultData[$result->linkId]->updateWeight($result->weight);
$this->resultData[$result->linkId]->addKeyword($result->keyword);
} else {
$domain = $result->ref('domain');
$linkClass = new search\linkClass($result, $domain);
$linkClass->updateWeight($result->weight);
$linkClass->addKeyword($result->keyword);
$this->resultData[$result->linkId] = $linkClass;
}
}
}
}
and regular expression search function
private function searchRegexp($selectedDomains) {
//get stored search value
$searchValues = $this->searchValue;
//replace asterisk and exclamation mark (the wildcard characters users type) with their MySQL LIKE equivalents
$searchValues = str_replace("*", "%", $searchValues);
$searchValues = str_replace("!", "_", $searchValues);
// empty result array to prevent previous results to interfere
$this->resultData = array();
//searched phrase can be multiple keywords, so split it by space and get results for each keyword
foreach (explode(" ", $searchValues) as $keywordName) {
//set default link result weight to -1 (default value)
$weight = -1;
//select all keywords, which match searched keyword (or its regular expression)
$keywords = $this->database->table('keyword')->where('keyword LIKE ?', $keywordName);
foreach ($keywords as $keyword) {
//count keyword md5 sum to determine which table should be use to match it links
$md5 = md5($keyword->keyword);
//get all link ids from linkkeyword relation table
$keywordJoinLink = $keyword->related('linkkeyword' . $md5[0])->where('link.domain.searchable','1');
//loop found links
foreach ($keywordJoinLink as $link) {
//store link weight, for later result sort
$weight = $link->weight;
//get link ID
$linkId = $link->linkId;
//check if link already exists in results, to prevent duplicity
$keyExists = array_key_exists($linkId, $this->resultData);
//if link already exists in result set, just update its weight and insert matching keyword for later keyword tag specification
if ($keyExists) {
$this->resultData[$linkId]->updateWeight($weight);
$this->resultData[$linkId]->addKeyword($keyword->keyword);
//if link isnt in result yet, insert it
} else {
//get link reference
$linkData = $link->ref('link', 'linkId');
//get information about domain, to which link belongs (location, flagPath,...)
$domainData = $linkData->ref('domain', 'domainId');
//if is domain searchable and was selected before search, add link to result set. Otherwise ignore it
if ($domainData->searchable == 1 && in_array($domainData->id, $selectedDomains)) {
//create new link instance
$linkClass = new search\linkClass($linkData, $domainData);
//insert matching keyword to links keyword set
$linkClass->addKeyword($keyword->keyword);
//set links weight
$linkClass->updateWeight($weight);
//insert link into result set
$this->resultData[$linkId] = $linkClass;
}
}
}
}
}
}

Your question is mostly one of opinion, so you may want to include the criteria that would allow us to answer "worth it" more objectively.
It appears you've re-invented the concept of database sharding (though without distributing your data across multiple servers).
I assume you are trying to optimize search time; if that's the case, I'd suggest that 2.5 million records on modern hardware is not a particularly big performance challenge, as long as your queries can use an index. If you can't use an index (e.g. because you're doing a regular expression search), sharding will probably not help at all.
My general recommendation with database performance tuning is to start with the simplest possible relational solution, keep tuning that until it breaks your performance goals, then add more hardware, and only once you've done that should you go for "exotic" solutions like sharding.
This doesn't mean using prayer as a strategy. For performance-critical applications, I typically build a test database, where I can experiment with solutions. In your case, I'd build a database with your schema without the "sharding" tables, and then populate it with test data (either write your own population routines, or use a tool like DBMonster). Typically, I'd go for at least double the size I expect in production. You can then run and tune queries to prove, one way or another, whether your schema is good enough. It sounds like a lot of work, but it's much less work than your sharding solution is likely to bring along.
There are (as @danFromGermany comments) solutions that are optimized for text search, and you could use MySQL's fulltext search features rather than regular expressions.
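On the point about queries being able to use an index: the "tem*" style search from the question is really a LIKE pattern, and a pattern with a fixed prefix such as 'tem%' can generally use an ordinary index on the keyword column, while one starting with a wildcard cannot. A small sketch, with hypothetical helper names wildcardToLike and likeCanUseIndex, that also escapes any literal % or _ the user typed (something the plain str_replace calls in the question do not do):

```php
<?php
// Hypothetical helper: translate the user's * and ! wildcards into LIKE
// syntax, escaping characters that are already special to LIKE.
function wildcardToLike(string $pattern): string
{
    // strtr() with an array makes a single pass, so replacements are
    // never themselves replaced again.
    return strtr($pattern, [
        '\\' => '\\\\',
        '%'  => '\\%',
        '_'  => '\\_',
        '*'  => '%',
        '!'  => '_',
    ]);
}

// Hypothetical helper: a LIKE pattern can only be served from a B-tree index
// when it starts with a literal character rather than a wildcard.
function likeCanUseIndex(string $likePattern): bool
{
    return $likePattern !== ''
        && $likePattern[0] !== '%'
        && $likePattern[0] !== '_';
}
```

With a single linkkeyword table, a prefix pattern like this could then be joined directly against the keyword table instead of looping over the matching keywords and 16 tables in PHP.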

How to filter by multiple fields in MySQL/PHP

I'm writing a filter/sorting feature for an application right now that will have text fields above each column. As the user types in each field, requests will be sent to the back-end for sorting. Since there are going to be around 6 text fields, I was wondering if there's a better way to sort instead of using if statements to check for each variable, and writing specific queries if say all fields were entered, just one, or just two fields, etc.
Seems like there would be a lot of if statements. Is there a more intuitive way of accomplishing this?
Thanks!

Any initial data manipulation, such as sorting, is usually done by the database engine.
Put an ORDER BY clause in there, unless you have a specific reason the sorting needs to be done in the application itself.
Edit: You now say that you want to filter the data instead. I would still do this at the database level. There is no sense in sending a huge dataset to PHP, just for PHP to have to wade through it and filter out data there. In most cases, doing this within MySQL will be far more efficient than what you can build in PHP.

Since there are going to be around 6 text fields, I was wondering if there's a better way to sort instead of using if statements to check for each variable
Definitely NO.
First, there is nothing wrong with using several ifs in a row.
Trust me, I'm a huge fan of reducing code repetition myself, but I consider these manually written blocks to be the best solution.
Next, although there may be a way to wrap these conditions in some loop, most of the time different conditions require different treatment.
However, in your next statements you are wrong:
and writing specific queries
You need only one query.
Seems like there would be a lot of if statements.
Why? No more than the number of fields you have.
Here is a complete example of custom search query building code:
$w = array();
$where = '';
if (!empty($_GET['rooms'])) $w[]="rooms='".mesc($_GET['rooms'])."'";
if (!empty($_GET['space'])) $w[]="space='".mesc($_GET['space'])."'";
if (!empty($_GET['max_price'])) $w[]="price < '".mesc($_GET['max_price'])."'";
if (count($w)) $where="WHERE ".implode(' AND ',$w);
$query="select * from table $where";
Only the fields filled in by the user go into the query.
The ordering can be built in much the same way.
mesc is an abbreviation for mysql_real_escape_string or any other applicable database-specific string escaping function.
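The same idea can be written with placeholders instead of string escaping. The following is only a sketch under assumptions: the function name buildFilterQuery, the whitelist parameter and the use of LIKE for every field are illustrative choices, not part of the answer above. Column names come from a fixed whitelist rather than from user input, and the values are returned separately so they can be passed to a prepared statement.

```php
<?php
// Hypothetical helper: build one query from whichever filters were filled in.
// Returns [sql, params] for use with a prepared statement.
function buildFilterQuery(array $input, array $allowedColumns, string $table): array
{
    $conditions = [];
    $params = [];
    foreach ($allowedColumns as $column) {
        // Only non-empty string values for whitelisted columns are used.
        if (isset($input[$column]) && is_string($input[$column]) && $input[$column] !== '') {
            $conditions[] = "`$column` LIKE ?";
            $params[] = '%' . $input[$column] . '%';
        }
    }
    $sql = "SELECT * FROM `$table`";
    if ($conditions) {
        $sql .= ' WHERE ' . implode(' AND ', $conditions);
    }
    return [$sql, $params];
}
```

With PDO this might be used as `[$sql, $params] = buildFilterQuery($_GET, ['rooms', 'space'], 'flats');` followed by `prepare($sql)` and `execute($params)`; the table and column names here are placeholders.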

select * from Users
order by Created desc, Name asc, LastName desc, Status asc
And your records will be sorted by order from query.
First by Created desc, then by Name asc and so on.
But from your question I can see that you are searching for filtering results.
So to filter by multiple fields just append to your where clause, or if you are using an ORM you can do it through object methods.
But if its simple you can do it this way
$query = "";
foreach($_POST['grid_fields'] as $key => $value)
{
if(strlen($query) > 0)
$query .= ' and ';
$query .= sprintf(" %s LIKE '%s' ", mysql_real_escape_string($key), '%' .mysql_real_escape_string($value) .'%');
}
if(strlen($query) > 0)
$original_query .= ' where ' . $query;
this could help you to achieve your result.

No. You cannot avoid the testing operations when sorting the set, as you have to compare the elements in the set in same way. The vehicle for this is an if statement.

Could you take a look at this?
WHERE (ifnull(@filter1, 1) = 1 or columnFilter1 = @filter1)
and (ifnull(@filter2, 1) = 1 or columnFilter2 = @filter2)
and (ifnull(@filter3, 1) = 1 or columnFilter3 = @filter3)
and (ifnull(@filter4, 1) = 1 or columnFilter4 = @filter4)
and (ifnull(@filter5, 1) = 1 or columnFilter5 = @filter5)
and (ifnull(@filter6, 1) = 1 or columnFilter6 = @filter6)
Please let me know if I'm misunderstanding your question.. It's not like an IF statement batch, and is pretty lengthy, but what do you think?

Extremely slow search page load (MySQL and PHP)

I made a simple search box on a page, where a user can type in keywords to look for photos of certain items, using PHP. I'm using a MySQL database. I trim the result and show only 10 to make the loading quicker, but certain sets of keywords cause the browser to hang in both IE and Firefox. When this happens in IE, I can see outlines of photos (just the silhouette) beyond the 10 results with an "X" mark at the top right corner, similar to when a page references a photo that doesn't exist, even though I wrote the code to show only 10 results. The database has over 10,000 entries, and I'm thinking maybe it's trying to display the entire set of photos in the database. Here is some of the code that I'm using.
I'm using the function below to create the query. $keywords is an array of the keywords that the user has typed in.
function create_multiword_query($keywords) {
// Creates multi-word text search query
$q = 'SELECT * FROM catalog WHERE ';
$num = 0;
foreach($keywords as $val) { // Multi-word search
$num++;
if ($num == 1) {
$q = $q . "name LIKE '%$val%'"; }
else {
$q = $q . " AND name LIKE '%$val%'";}
}
$q = $q . ' ORDER BY name';
return $q;
//$q = "SELECT * FROM catalog WHERE name LIKE \"%$trimmed%\" ORDER BY name";
}
And display the result. MAX_DISPLAY_NUM is 10.
$num = 0;
while (($row = mysqli_fetch_assoc($r)) && ($num < MAX_DISPLAY_NUM)) { // add max search result!
$num++;
print_images($row['img_url'], '/_', '.jpg'); // just prints photos
}
I'm very much a novice with PHP, but I can't seem to find anything wrong with my code. Maybe the way I wrote these algorithms is not quite right for PHP or MySQL? Can you guys help me out with this? I can post more code as necessary. TIA!!

Don't limit your search results in PHP, limit them in the SQL query with the LIMIT keyword.
As in:
select * from yourtable where ... order by ... limit 10;
BTW, those LIKE '%something%' can be expensive. Maybe you should look at Full text indexing and searching.
If you want to show a More... link or something like that, one way of doing it would be to limit your query to 11 and only show the first ten.
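Applied to the function from the question, the advice above might look like the sketch below. It is an illustration rather than a drop-in replacement: the name createMultiwordQuery and the $displayMax parameter are assumptions, and the keywords are bound as parameters instead of being interpolated into the SQL string. The limit is set to one more than the display count so the extra row can signal that a "More..." link is needed.

```php
<?php
// Hypothetical rewrite of create_multiword_query(): returns [sql, params].
function createMultiwordQuery(array $keywords, int $displayMax = 10): array
{
    $conditions = [];
    $params = [];
    foreach ($keywords as $word) {
        $word = trim($word);
        if ($word === '') {
            continue; // skip empty keywords left over from splitting the input
        }
        $conditions[] = 'name LIKE ?';
        $params[] = '%' . $word . '%';
    }
    $sql = 'SELECT * FROM catalog';
    if ($conditions) {
        $sql .= ' WHERE ' . implode(' AND ', $conditions);
    }
    // Ask for one extra row: if it comes back, there are more results to show.
    $sql .= ' ORDER BY name LIMIT ' . ($displayMax + 1);
    return [$sql, $params];
}
```

The caller would then display at most $displayMax rows and show the "More..." link only when the extra row is present.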

Apart from the LIMIT in your query, I would check out mysql full text search (if your tables have the MyISAM format).

Why don't you use MySQL to limit the number of search results returned?
http://dev.mysql.com/doc/refman/5.0/en/select.html

add LIMIT to your query.
you are retrieving all rows from DB (lot of bytes traveling from DB to server) and then you are filtering the first 10 rows.
try
$q = $q . ' ORDER BY name LIMIT 10';

LIKE is slow, also according to Flickr (slides 24-26). You should first try to use FULLTEXT indexes instead. If your site still seems slow, there are also some other really fast(er)/popular alternatives available:
sphinx
elasticsearch
solr
The only slightly annoying thing is that you need to learn/install these technologies, but they are well worth the investment when needed.

MySQL/PHP Search Efficiency

I'm trying to create a small search for my site. I've tried using full-text index search, but I could never get it to work. Here is what I've come up with:
if(isset($_GET['search'])) {
$search = str_replace('-', ' ', $_GET['search']);
$result = array();
$titles = mysql_query("SELECT title FROM Entries WHERE title LIKE '%$search%'");
while($row = mysql_fetch_assoc($titles)) {
$result[] = $row['title'];
}
$tags = mysql_query("SELECT title FROM Entries WHERE tags LIKE '%$search%'");
while($row = mysql_fetch_assoc($tags)) {
$result[] = $row['title'];
}
$text = mysql_query("SELECT title FROM Entries WHERE entry LIKE '%$search%'");
while($row = mysql_fetch_assoc($text)) {
$result[] = $row['title'];
}
$result = array_unique($result);
}
So basically, it searches through all the titles, body-text, and tags of all the entries in the DB. This works decently well, but I'm just wondering how efficient would it be? This would only be for a small blog, too. Either way I'm just wondering if this could be made any more efficient.

There's no way to make LIKE '%pattern%' queries efficient. Once you get a nontrivial amount of data, using those wildcard queries performs hundreds or thousands of times slower than using a fulltext indexing solution.
You should look at the presentation I did for MySQL University:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
Here's how to get it to work:
First make sure your table uses the MyISAM storage engine. MySQL FULLTEXT indexes support only MyISAM tables. (edit 11/1/2012: MySQL 5.6 is introducing a FULLTEXT index type for InnoDB tables.)
ALTER TABLE Entries ENGINE=MyISAM;
Create a fulltext index.
CREATE FULLTEXT INDEX searchindex ON Entries(title, tags, entry);
Search it!
$search = mysql_real_escape_string($search);
$titles = mysql_query("SELECT title FROM Entries
WHERE MATCH(title, tags, entry) AGAINST('$search')");
while($row = mysql_fetch_assoc($titles)) {
$result[] = $row['title'];
}
Note that the columns you name in the MATCH clause must be the same columns in the same order as those you declared in the fulltext index definition. Otherwise it won't work.
I've tried using full-text index search, but I could never get it to work... I'm just wondering if this could be made any more efficient.
This is exactly like saying, "I couldn't figure out how to use this chainsaw, so I decided to cut down this redwood tree with a pocketknife. How can I make that work as well as the chainsaw?"
Regarding your comment about searching for words that match more than 50% of the rows.
The MySQL manual says this:
Users who need to bypass the 50% limitation can use the boolean search mode; see Section 11.8.2, “Boolean Full-Text Searches”.
And this:
The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in storage/myisam/ftdefs.h:
#define GWS_IN_USE GWS_PROB
Change that line to this:
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in this case.
Also, you might be searching for stopwords. These are words that are ignored by the fulltext search because they're too common. Words like "the" and so on. See http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html

Using LIKE is NOT fulltext.
You need to use ... WHERE MATCH(column) AGAINST('the query') in order to access a fulltext search.

MySQL Full-text search works -- I would look into it and debug it rather than trying to do this. Doing 3 separate MySQL queries will not be anywhere near as efficient.
If you want to make that approach more efficient, you could combine the LIKE conditions into one query with OR between them.
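A minimal sketch of that single-query variant, assuming the Entries table and columns from the question; the function name buildEntrySearch is hypothetical. It returns the SQL and its parameters for a prepared statement, and uses DISTINCT in place of the array_unique() call.

```php
<?php
// Hypothetical helper: one query covering title, tags and entry.
function buildEntrySearch(string $search): array
{
    $like = '%' . $search . '%';
    $sql = 'SELECT DISTINCT title FROM Entries'
         . ' WHERE title LIKE ? OR tags LIKE ? OR entry LIKE ?';
    // The same pattern is bound once per column.
    return [$sql, [$like, $like, $like]];
}
```

This saves two round trips, but each LIKE '%...%' still scans the column, so it does not remove the underlying cost that the fulltext answer addresses.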
