MySQL Regexp vs regex after MySQL result - php

I need to find in a MySQL DB all rows that contain in a specified column a string matching a pattern.
I used Regexp but it's too slow. I know about full text indexing or third-party software, but that looks more complicated.
I was wondering if it would be faster to select the column from the DB and then perform a regex search using preg_match (or something similar) on the result.
From your experience, do you think it would be faster?
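One way to find out is to measure both variants on real data. Below is a minimal sketch of the client-side half, written in Python for brevity (the same idea applies to PHP's preg_match). The `rows` list and the pattern are made-up stand-ins for the result of the SELECT. This approach still has to transfer every row to the application, which can easily outweigh any gain from a faster regex engine.

```python
import re

def filter_rows(rows, pattern):
    """Return the (id, text) rows whose text matches the regular expression."""
    compiled = re.compile(pattern)  # compile once, not once per row
    return [(row_id, text) for row_id, text in rows if compiled.search(text)]

# Hypothetical stand-in for the rows returned by "SELECT id, col FROM t".
rows = [(1, "order-1234"), (2, "invoice"), (3, "order-77")]
matches = filter_rows(rows, r"^order-\d{4}$")
```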

Related

MySql can't decide between LIKE and MATCH for text search

I am building an application in Laravel, and I can't decide whether to go with MATCH() or LIKE for text searching.
I only want to do a text search on one column, that is a Varchar(42).
I will also filter out the query by some Where() statements, so it will not do a text search on all rows.
I am using MySQL 5.6+, so MATCH works with my InnoDB engine.
Does MATCH() perform well on a table that has about 30k rows?
Laravel's ORM doesn't support MATCH, so my query looks like this:
$q = Input::get('query');
Post::whereRaw("MATCH(title) AGAINST(? IN BOOLEAN MODE)", array($q))->get();
Do I need to sanitize the "$q" in order to be safe from SQL injections? Since I'm using whereRaw()
The two capabilities are quite different, so the choice should be easy. MATCH works on words within the text, so if you want to search by one or more words, MATCH should be faster. The flip side is that searching on numbers, stop words, and short words requires extra effort.
LIKE generally cannot make use of an index. This slows down such queries because every row needs to be processed. Of course, if the rest of the filtering reduces this to 100 rows, then it is not a big deal.
The exception is "prefix" searches -- that is, searches at the beginning of the string -- where LIKE can use an index. So, LIKE 'abc%' can use an index. LIKE '%abc%' cannot.
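To see why only the prefix form can use an index, it helps to think of the index as a sorted list of keys. The sketch below is a simplified Python model, not how MySQL is implemented: a sorted list stands in for the B-tree, and the sample keys are made up. A prefix lookup can jump to the first candidate and stop at the first non-match, whereas a substring lookup gets no help from the ordering.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Range scan: jump to the first candidate, stop at the first non-match."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]

def substring_scan(sorted_keys, needle):
    """Ordering does not help here: every key has to be inspected."""
    return [key for key in sorted_keys if needle in key]

keys = sorted(["abacus", "abc", "abcdef", "xabcx", "zebra"])
prefix_hits = prefix_scan(keys, "abc")        # like LIKE 'abc%'
substring_hits = substring_scan(keys, "abc")  # like LIKE '%abc%'
```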

Database/datasource optimized for string matching?

I want to store large amount (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additional data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
You might want to use a multicore approach to run that search in a fraction of the time. For search and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider this article on using CUDA, where such searches can reportedly run 16x faster than usual. On multicore CPU systems you can use POSIX or a cluster of computers (MPI, for example), or you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field twice: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with LEFT(main_field, n) and LEFT(reverse_field, n). It won't help when you have a wildcard in the middle of the string AND at the beginning (e.g. "*Folder1*Folder2"), but it will when you have a wildcard at the beginning or the end.
e.g. if you want to search */Folder1 then search where left(reverse_field, 8) = '1redloF/';
for Folder1/*/FolderX search where left(reverse_field, 8) = 'XredloF/' and left(main_field, 8) = 'Folder1/'
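The forward/reversed idea can be sketched outside the database as well. In the Python sketch below, two sorted lists stand in for the two indexed columns, and the sample keys are invented for illustration. The length check in `starts_and_ends_with` is an addition of this sketch: it stops a key such as "Folder1/FolderX" from matching "Folder1/*/FolderX" just because the prefix and suffix overlap on the same slash.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using the sort order."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]

keys = ["Folder1", "Folder1/Folder2", "Folder2/Folder1", "Folder3/sub/Folder1"]
forward = sorted(keys)                         # stands in for main_field
backward = sorted(key[::-1] for key in keys)   # stands in for reverse_field

def ends_with(suffix):
    """Suffix search becomes a prefix search on the reversed keys."""
    return sorted(k[::-1] for k in prefix_scan(backward, suffix[::-1]))

def starts_and_ends_with(prefix, suffix):
    """Pattern 'prefix*suffix': intersect both lookups, reject overlaps."""
    both = set(prefix_scan(forward, prefix)) & set(ends_with(suffix))
    return sorted(k for k in both if len(k) >= len(prefix) + len(suffix))
```

For example, `ends_with("/Folder1")` corresponds to the `*/Folder1` search above.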
If your strings represent some kind of hierarchical structure (as it looks like in your sample content), actually not "real" files, but you say you are open to alternative solutions - why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure a glob search should be much faster than searching up all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only be competitive on a huge data set (but not too huge, as @Kyle mentioned) with a hierarchical structure that is deep rather than wide.
EDIT
Sorry, this would only work if the wildcards are in your search terms, not in the stored strings themselves.
As the wildcards (*) are in your data and not in your queries I think you should start with breaking up your data into pieces. You should create an index-table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart", but reversed. To cite Missy Elliott: "Is it worth it, let me work it / I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
Of course, you could do the REVERSE in your application logic instead.
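The index-table lookup described so far can be tried out in a small runnable sketch. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL. SQLite has no LEFT() or REVERSE(), so substr() replaces LEFT() and the reversal happens in application code. The three sample rows are illustrative, and the multi-record case ("*/Fo*4") is deliberately left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indexTable ("
    " dataGroup INTEGER, exactString TEXT, wildcardEnd TEXT, wildcardStart TEXT)"
)
conn.executemany(
    "INSERT INTO indexTable VALUES (?, ?, ?, ?)",
    [
        (1, "Folder1/Folder2", None, None),  # exact key
        (2, None, "Folder1/", None),         # "Folder1/*"
        (3, None, None, "4redloF/"),         # "*/Folder4", stored reversed
    ],
)

def matching_groups(query):
    """Return the dataGroup ids whose stored pattern matches the query."""
    rows = conn.execute(
        "SELECT dataGroup FROM indexTable"
        " WHERE exactString = :q"
        " OR wildcardEnd = substr(:q, 1, length(wildcardEnd))"
        " OR wildcardStart = substr(:r, 1, length(wildcardStart))"
        " ORDER BY dataGroup",
        {"q": query, "r": query[::-1]},  # :r is the query reversed in the application
    )
    return [row[0] for row in rows]
```

Rows whose wildcardEnd or wildcardStart is NULL simply never satisfy the corresponding comparison, so each row is matched only through the column it actually uses.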
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something you have to take care that every index-record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also get solved in SQL but is beyond this question.
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no big deal), cache them, and run your search/match algorithm on them.
You will probably have to code the algorithm yourself, because the standard tools will be overkill for what you are trying to achieve, and there is no guarantee that they can do exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work to get the regex right, but it will be the fastest way to go.
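A minimal sketch of that regex approach, in Python (PHP's preg_quote and preg_match would play the same roles). It assumes that `*` may match any run of characters, including `/`; if a wildcard should stay within one path segment, `[^/]*` would replace `.*`. The stored keys are the sample content from the question.

```python
import re

def wildcard_to_regex(key):
    """Escape the literal parts of a stored key and turn each * into '.*'."""
    parts = [re.escape(part) for part in key.split("*")]
    return re.compile(".*".join(parts))

stored_keys = [
    "Folder1",
    "Folder1/Folder2",
    "Folder1/*",
    "Folder1/Folder2/Folder3",
    "Folder2/Folder*",
    "*/Folder4",
    "*/Fo*4",
]

# Compile once and cache; a few thousand patterns fit easily in memory.
compiled = [(key, wildcard_to_regex(key)) for key in stored_keys]

def matching_keys(query):
    """Return the stored keys whose pattern matches the whole query string."""
    return [key for key, regex in compiled if regex.fullmatch(query)]
```

With these keys, `matching_keys("Folder1/Folder2/Folder3")` returns the exact key plus `Folder1/*`, while `matching_keys("Folder3")` returns nothing.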
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped", you can avoid the (slight additional) overhead of building a balanced tree. You can also avoid any tree maintenance code because, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straightforward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that the wildcards in the tree lead to some shortcuts to prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table group by folder_col do you get duplicate folder_col values (ie count(*) greater than 1)?
If not, that means you can produce an SQL that would generate a valid sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend doing text search on a large collection of data in MySQL. You need a database to store the data, but that would be it. For searching, use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will let you do all sorts of funky text searches (including wildcards) in the blink of an eye ;-)

optimize tables for search using LIKE clause in MySQL

I am building a search feature for the messages part of my site, and have a messages database with a little over 9,000,000 rows and an index on the sender, subject, and message fields. I was hoping to use the MySQL LIKE clause in my query, such as:
SELECT sender, subject, message FROM Messages WHERE message LIKE '%EXAMPLE_QUERY%';
to retrieve results. Unfortunately, MySQL doesn't use indexes when a leading wildcard is present, and the leading wildcard is necessary because the search query could appear anywhere in the message (this is how the wildcards work, no?). Queries are very, very slow, and I cannot use a full-text index either, because of the annoying 50% rule (I just can't afford to rule that much out). Is there any way (or any alternative) to optimize a query using LIKE and two wildcards? Any help is appreciated.
You should either use full-text indexes (you said you can't), design a full-text search by yourself or offload the search from MySQL and use Sphinx/Lucene. For Lucene you can use Zend_Search_Lucene implementation from Zend Framework or use Solr.
Normal indexes in MySQL are B+Trees, and they can't be used if the start of the string is not known (which is the case when you have a wildcard at the beginning).
Another option is to implement search on your own, using a reference table. Split the text into words and create a table that contains word, record_id. Then, in the search, split the query into words and search for each of the words in the reference table. This way you are not limiting yourself to the beginning of the whole text, but only to the beginning of a given word (and you'll match the rest of the words anyway).
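The reference-table idea can be sketched in memory as follows. This is Python with a dict standing in for the (word, record_id) table; the sample messages and the tokenizer (lowercased runs of word characters) are assumptions of the sketch. In MySQL the inner loop would instead be an indexed `word LIKE 'term%'` lookup.

```python
import re
from collections import defaultdict

def build_word_index(messages):
    """Map each word to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for record_id, text in messages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain, for every query term,
    at least one word starting with that term."""
    result = None
    for term in re.findall(r"\w+", query.lower()):
        ids = set()
        for word, record_ids in index.items():  # in SQL: word LIKE 'term%'
            if word.startswith(term):
                ids |= record_ids
        result = ids if result is None else result & ids
    return sorted(result or [])

# Made-up sample data for illustration.
messages = {
    1: "Meeting moved to Friday",
    2: "Friday lunch example",
    3: "Another example query",
}
index = build_word_index(messages)
```

As the answer notes, this matches on word beginnings only: searching for "exam" finds "example", but "xampl" finds nothing.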
'%EXAMPLE_QUERY%' is a very, very bad idea. I am going to give you some suggestions:
A. Avoid wildcards at the start of LIKE queries; use 'EXAMPLE_QUERY%' instead.
B. Create keywords on which you can easily use MATCH.
If you want to stick with using MySQL, you should use FULL TEXT indexes. Full text indexes index words in a text block. You can then search on word stems and return the results in order of relevance. So you can find the word "example" within a block of text, but you still can't search efficiently on "xampl" to find "example".
MySQL's full text search is not great, but it is functional.
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
select * from emp where ename like '%e';
returns the names that end with the letter e.
select * from emp where ename like 'A%';
returns the names that begin with the letter A.
select * from emp where ename like '_a%';
returns the names whose second letter is a.
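These three patterns can be checked quickly with Python's built-in sqlite3 module, used here only as a convenient stand-in for MySQL. The `emp` rows are made up. Case handling of LIKE differs between engines and collations, so the sample names avoid relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?)",
    [("Alice",), ("Bob",), ("Dave",), ("Sam",), ("Anne",)],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT ename FROM emp WHERE ename LIKE ? ORDER BY ename", (pattern,)
    )
    return [row[0] for row in rows]

ends_with_e = names_like("%e")    # % matches any run of characters
starts_with_a = names_like("A%")
second_is_a = names_like("_a%")   # _ matches exactly one character
```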

Making search facility

I want to add a search facility to my website. I'm using PHP.
What criteria should be used for searching?
For example, if someone searches for
How to make soap
I can use many approaches for the search, like finding database entries that contain exactly the same search string,
or ranking the database entries by the order of the search keywords (i.e. an entry with search string "How" + "Soap" will have lower preference than an entry with search string "how soap make").
So what algorithm is generally used for searching?
Also, what is meant by full text search?
This is kind of a big subject for a simple answer, but I think what you mean is how to run complex fulltext searches on MySQL. In other words, this is really a MySQL question, not a PHP one.
Basically, you need to:
1. Create a fulltext index on a text field in your database.
2. Run queries on that database field using MySQL's fulltext syntax.
The basic syntax for querying a fulltext indexed table in MySQL is:
SELECT * FROM table
WHERE MATCH (fulltextfield)
AGAINST ('my search phrase');
There's a lot more to it than that, but the MySQL documentation is the place to go: http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
If you want to do really advanced fulltext searches, a good recommendation is Sphinx, but that's probably way more advanced than you need.

Relevancy search with PHP and Mysql

What's the best way to go about doing this?
I have columns: track_name, artist_name, album_name
I want all columns to be matched against the search query, with some flexibility while matching.
MySQL's LIKE is too strict, even with %XXX%. It matches the string as a whole, not the parts.
Your MySQL query could have several OR clauses, searching for each space-delimited word entered by the user. For example, a user search for "Queens of the Stoneage" may be represented in SQL as SELECT * FROM songs WHERE artist_name LIKE "%Queens%" OR artist_name LIKE "%Stoneage%".
However, that could be undesirable because LIKE searches which start with a % are inefficient and could be terribly slow on a large database.
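Building that OR query from user input is best done with bound parameters rather than string concatenation. A minimal Python sketch follows; the function name is hypothetical, and the `%s` placeholder style is an assumption that matches common Python MySQL drivers (PDO in PHP would use `?`). The column name must come from your own code, never from the user. LIKE metacharacters (`%`, `_`) inside the user's words are not escaped here.

```python
def build_like_query(user_input, column="artist_name"):
    """Build 'col LIKE %s OR col LIKE %s ...' plus the list of bound values."""
    words = user_input.split()
    if not words:
        raise ValueError("search string is empty")
    clauses = " OR ".join(f"{column} LIKE %s" for _ in words)
    params = [f"%{word}%" for word in words]  # wildcards go in the values
    return f"SELECT * FROM songs WHERE {clauses}", params

sql, params = build_like_query("Queens Stoneage")
# The pair would then be passed to the driver, e.g. cursor.execute(sql, params).
```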
Though I can't speak to the performance implications, you should have a look at natural language full-text searches. It's probably the most effective solution you'll find:
SELECT * FROM songs WHERE MATCH(track_name, artist_name, album_name) AGAINST('Queens of the Stoneage' IN NATURAL LANGUAGE MODE);
Some PHP functions do exist for determining the similarity of strings of text, but keeping this work in the database will probably be most efficient (and less frustrating):
levenshtein()
similar_text()
soundex()
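To illustrate what a function like levenshtein() computes, here is a small Python sketch of the standard edit-distance algorithm, followed by a helper that ranks candidates by distance to the query. The helper name and the sample artist names are made up, and comparing lowercased strings is a choice of this sketch. Running this over every row in application code will not scale to a large table.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]

def rank_by_distance(query, candidates):
    """Sort candidates from closest to farthest, ignoring case."""
    return sorted(candidates, key=lambda name: levenshtein(query.lower(), name.lower()))

ranked = rank_by_distance(
    "Queens of the Stonage",
    ["Kings of Leon", "Queen", "Queens of the Stone Age"],
)
```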
I think you need to rethink your application. What I understood from your comment is that you need to implement logic operators like "and", "or", and "not" in your program. It's not only about a fancy algorithm like a fulltext index or longest common substring, as in this MySQL MATCH query. But I could be wrong.
