MySQL fulltext/regexp/levenshtein search optimization - php

I have a database of 100+ million records (and rapidly growing) and I would like to implement a search feature that sorts by the closest results.
I did some research and found that fulltext searches are only prefixes, which isn't really what I want.
I got results that are close enough to call it good, but the problem is the query is very slow.
The levenshtein function is from here: http://www.artfulsoftware.com/infotree/qrytip.php?id=552
Here's the query:
SELECT `id`,
`word`,
MATCH (`word`) AGAINST ('+*search*') IN BOOLEAN MODE) AS `match`
FROM `words`
WHERE `word` REGEXP '^.*[search].*$'
AND levenshtein(`word`, 'search') <= 2
ORDER BY levenshtein(`word`, 'search'), `match` ASC
LIMIT 10;
So, overall the results are pretty close, but it takes minutes to actually complete the search, which is really not what I want since each time a key is pressed I send a jquery AJAX request.
Any help would be appreciated.

I wound up using sphinxsearch as #RobGudgeon suggested. Though not MySQL (and another database on its own) it's actually very quick on both updating from MySQL and searching within its own database. Recommended for fulltext searches on big databases since MySQL's native support is currently lacking.

Related

faster way for Search in multiple databases

I am working on big eCommerce shopping website. I have around 40 databases. i want to create search page which show 18 result after searching by title in all databases.
(SELECT id_no,offers,image,title,mrp,store from db1.table1 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db3.table3 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db2.table2 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
LIMIT 18
currently i am using the above query its working fine for 4 or more character keyword search like laptop nokia etc but takes 10-15 sec for processes but for query with keyword less than 3 characters it takes 30-40sec or i end up with 500 internal server error. Is there any optimized way for searching in multiple databases. I generated two index primary and full text index with title
Currently my search page is in php i am ready to code in python or any
other language if i gets good speed
You can use the sphixmachine:http://sphinxsearch.com/. This is powerfull search for database. IMHO Sphinx this best decision
for search in your site.
FULLTEXT is not configured (by default) for searching for words less than three characters in length. You can configure that to handle shorter words by setting a ...min_token_size parameter. Read this. https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html You can only do this if you control the MySQL server. It won't be possible on shared hosting. Try this.
FULLTEXT is designed to produce more false-positive matches than false-negative matches. It's generally most useful for populating dropdown picklists like the ones under the location field of a browser. That is, it requires some human interaction to choose the correct record. To expect FULLTEXT to be able to do absolutely correct searches is probably a bad idea.
You simply cannot use AND column LIKE '%whatever%' if you want any reasonable performance at all. You must get rid of that. You might be able to rewrite your python program to do something different when the search term is one or two letters, and thereby avoid many, but not all, LIKE '%a%' and LIKE '%ab%' operations. If you go this route, create ordinary indexes on your title columns. Whatever you do, don't combine the FULLTEXT and LIKE searches in a single query.
If this were my project I'd consider using a special table with columns like this to hold all the short words from the title column in every row of each table.
id_pk INT autoincrement
id_no INT
word VARCHAR(3)
Then you can use a query like this to look up short words
SELECT a.id_no,offers,image,title,mrp,store
FROM db1.table1 a
JOIN db1.table1_shortwords s ON a.id_no = s.id_no
WHERE s.word = '$searchkey'
To do this, you will have to preprocess the title columns of your other tables to populate the shortwords tables, and put an index on the word column. This will be fast, but it will require a special-purpose program to do the preprocessing.
Having to search multiple tables with your UNION ALL operation is a performance problem. You will be able to improve performance dramatically by redesigning your schema so you need search only one table.
Having to search databases on different server machines is a performance problem. You may be able to rig up your python program to search them in parallel: that is, to somehow use separate tasks to search each one, then aggregate the results. Each of those separate search tasks requires its own connection to the data base, so this is not a cheap or simple solution.
If this system faces the public web, you will have to redesign it sooner or later, because it will never perform well enough as it is now. (Sorry to be the bearer of bad news.) Many system designers like to avoid redesigning systems after they become enormous. So, if I were you I would get the redesign done.
If your focus is on searching, then bend the schema to facilitate searching rather than the other way around.
Collect all the strings to search for in a single table. Whereas a UNION of 40 tables does work, it will be ~40 times as slow as having the strings collected together.
Use FULLTEXT when the words are long enough, use some other technique when they are not. (This addresses your 3-char problem; see also the Answer discussing innodb_ft_min_token_size. You are using InnoDB, correct?)
Use + and boolean mode to say that a word is mandatory: MATCH(col) AGAINST("+term" IN BOOLEAN MODE)
Do not add on a LIKE clause unless there is a good reason.

MySQL+PHP: How to paginate data from complex query with ORDER BY on user-selected column

I have a table with currently ~1500 rows which is expected to grow over time (can't say how much, but still), the website is read-only and lets users do complex queries through the use of some forms, then the search query is completely URL-encoded since it's a public database. It's important to know that users can select what column data must be sorted by.
I'm not concerned about putting some indexes and slowing down INSERTs and UPDATEs (just performed occasionally by admins) since it's basically heavy-reading, but I need to paginate results as some popular queries can return 900+ results and that takes up too much space and RAM on client-side (results are further processed to create a quite rich <div> HTML element with an <img> for each result, btw).
I'm aware of the use of OFFSET {$m} LIMIT {$n} but would like to avoid it
I'm aware of the use of this
Query
SELECT *
FROM table
WHERE {$filters} AND id > {$last_id}
ORDER BY id ASC
LIMIT {$results_per_page}
and that's what I'd like to use, but that requires rows to be sorted only by their ID!
I've come up with (what I think is) a very similar query to custom sort results and allow efficient pagination.
Query:
SELECT *
FROM table
WHERE {$filters} AND {$column_id} > {$last_column_id}
ORDER BY {$column} ASC
LIMIT {$results_per_page}
but that unfortunately requires to have a {$last_column_id} value to pass between pages!
I know indexes (especially unique indexes) are basically automatically-updated integer-based columns that "rank" a table by values of a column (be it integer, varchar etc.), but I really don't know how to make MySQL return the needed $last_column_id for that query to work!
The only thing I can come up with is to put an additional "XYZ_id" integer column next to every "XYZ" column users can sort results by, then update values periodically through some scripts, but is it the only way to make it work? Please help.
(Too many comments to fit into a 'comment'.)
Is the query I/O bound? Or CPU bound? It seems like a mere 1500 rows would lead to being CPU-bound and fast enough.
What engine are you using? How much RAM? What are the settings of key_buffer_size and innodb_buffer_pool_size?
Let's see SHOW CREATE TABLE. If the table is full of big BLOBs or TEXT fields, we need to code the query to avoid fetching those bulky fields only to throw them away because of OFFSET. Hint: Fetch the LIMIT IDs, then reach back into the table to get the bulky columns.
The only way for this to be efficient:
SELECT ...
WHERE x = ...
ORDER BY y
LIMIT 100,20
is to have INDEX(x,y). But, even that, will still have to step over 100 cow paddies.
You have implied that there are many possible WHERE and ORDER BY clauses? That would imply that adding enough indexes to cover all cases is probably impractical?
"Remembering where you left off" is much better than using OFFSET, so try to do that. That avoids the already-discussed problem with OFFSET.
Do not use WHERE (a,b) > (x,y); that construct used not to be optimized well. (Perhaps 5.7 has fixed it, but I don't know.)
My blog on OFFSET discusses your problem. (However, it may or may not help your specific case.)

Searching a big mysql database with relevance

I'm building a rather large "search" engine for our company intranet, it has 1miljon plus entries
it's running on a rather fast server and yet it takes up to 1 min for some search queries.
This is how the table looks
I tried create an index for it, but it seems as if i'm missing something, this is how the show index is showing
and this is the query itself, it is the ordering that slows the query mostly but even a query without the sorting is somewhat slow.
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` = 'tanto'
THEN 0
WHEN `businessunit`.`Name` LIKE 'tanto %'
THEN 1
WHEN `businessunit`.`Name` LIKE 'tanto%'
THEN 2
ELSE 3
END , `businessunit`.`Name`
LIMIT 0 , 30
any help is very much appreciated
Edit:
What's choking this query 99% is ordering by relevance with the wildcharacter %
When i Do an explain it says using where; using fsort
You should try sphinx search solution which is full-text search engine will give you very good performance along with lots of options to set relevancy.
Click here for more details.
Seems like the index doesn't cover Premium, yet that is the first ORDER BY argument.
Use EXPLAIN your query here to figure out the query plan and change your index to remove any table scans as explained in http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
MySQL is good for storing data but not great when it comes down to fast text based search.
Apart from Sphinx which has been already suggested I recommend two fantastic search engines:
Solr with http://pecl.php.net/package/solr - very popular search engine. Used on massive services like NetFlix.
Elastic Search - relatively new software but with very active community and lots of respect
Both solution are based on the same library Apache Lucene
If the "ORDER BY" is really the bottleneck, the straight-forward solution would be to remove the "ORDER BY" logic from your query, and re-implement the sorting directly in your application's code using C# sorting. Unfortunately, this means you'd also have to move your pagination into your application, since you'd need to obtain the complete result set before you can sort & paginate it. I'm just mentioning this because no-one else so far appears to have thought of it.
Frankly (like others have pointed out), the query you showed at the top should not need full-text indexing. A single suffix wildcard (e.g., LIKE 'ABC%') should be very effective as long as a BTREE (and not a HASH) index is available on the column in question.
And, personally, I have no aversion to even double-wildcard (e.g., LIKE '%ABC%"), which of course can never make use of indexes, as long as a full table scan is cheap. Probably 250,000 rows is the point where I'll start to seriously consider full-text indexing. 100,000 is definitely no problem.
I always make sure my SELECT's are dirty-reads, though (no transactionality applied to the select).
It's dirty once it gets to the user's eyeballs in any case!
Most of the search engine oriended sites are use FULL-TEXT-SEARCH.
It will be very faster compare to select and LIKE...
I have added one example and some links ...
I think it will be useful for you...
In this full text search have some conditions also...
STEP:1
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
);
STEP:2
INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');
STEP:3
Natural Language Full-Text Searches:
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('database');
Boolean Full-Text Searches
SELECT * FROM articles WHERE MATCH (title,body)
AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
Go through this links
viralpatel.net,devzone.zend.com,sqlmag.com,colorado.edu,en.wikipedia.org
It's so strange query :)
Let's try to understand what it does.
The results are less than 30 rows from the table "businessunit" with some conditions.
The first condition is a foreign key of the "businessunit-postaddress" table.
Please check if you have an index on the column businessunit-postaddress.BusinessUnit.
The second one is a filter for returning rows only with businessunit.Name begining with 'tanto'.
If I didn't make a mistake you have a very complex index 'Business' consists of 11 fields!
And field 'Name' is not the first field in this index.
So this index is useless when you run "like tanto%"'s query.
I have strong doubt about necessity of this index at all.
By the way it demands quite big resources for its maintaining and slow down edit operations with this table.
You have to make an index with the only field 'Name'.
After filtering the query is sorting results and do it in some strange way too.
At first it sorts by field businessunit.Premium - it's normal.
However next statements with CASE are useless too.
That's why.
The zero are assigned to Name = 'tanto' (exactly).
The next rows with the one are rows with space after 'tanto' - these will be after 'tanto' in any cases (except special symbols) cause space is lower than any letter.
The next rows with the two are rows with some letters after 'tanto' (include space!). These rows will be in this order too by definition.
And the three is "reserved" for "other" rows but you won't get "other" rows - remeber about [WHERE businessunit.Name LIKE 'tanto%'] condition.
So this part of ORDER BY is meaningless.
And at the end of ORDER BY there is businessunit.Name again...
My advice: you need rebuild the query from scratch keeping in mind what you want to get.
Anyway I guess you can use
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC,
`businessunit`.`Name`
LIMIT 0 , 30
Don't forget about an index on field businessunit-postaddress.BusinessUnit!
And I have strong assumption about field Premium.
I guess it is designed for storing binary data (yes/no).
So an ordinary (BTREE) index doesn't match.
You have to use bitmap index.
P.S. I'm not sure that you really need to use SQL_CALC_FOUND_ROWS
MySQL: Pagination - SQL_CALC_FOUND_ROWS vs COUNT()-Query
Its either full-text(http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) or the pattern matching (http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html) from php and mysql side.
From experience and theory:
Advantages of full-text -
1) Results are very relevant and de-limit characters like spacing in the search query does not hinder the search.
Disadvantages of full-text -
1) There are stopwords that are used as restrictions by webhosters to prevent excess load of data.(E.g. search results containing the word 'one' or 'moz' are not displayed. And this can be avoided if you're running your own server by keeping no stopwords.
2) If I type 'ree' it only displays words containing exactly 'ree' not 'three' or 'reed'.
Advantages of pattern matching -
1) It does not have any stopwords as in full-text and if you search for 'ree', it displays any word containing 'ree' like 'reed' or 'three' unlike fulltext where only the exact word is retreived.
Disadvantages of pattern matching-
1) If delimiters like spaces are used in your search words and if these spaces are not there in the results, because each word is separate from any delimiters, then it returns no result.
If the argument of LIKE doesn't begin with a wildchard character, like in your example, LIKE operator should be able to take advantage of indexes.
In this case, LIKE operator should perform better than LOCATE or LEFT, so I suspect that changing the condition like this could make things worse, but I still think it's worth trying (who knows?):
WHERE LOCATE('tanto', `businessunit`.`Name`)=1
or:
WHERE LEFT(`businessunit`.`Name`,5)='tanto'
I would also change your order by clause:
ORDER BY
`businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` LIKE 'tanto %' THEN 1
WHEN `businessunit`.`Name` = 'tanto' THEN 0
ELSE 2 END,
`businessunit`.`Name`
Name has to be LIKE 'tanto%' already, so you can skip a condition (CASE will never return value 3). Of course, make sure that Premium field is indexed.
Hope this helps.
I think you need to collect the keys only, sort them, then join last
SELECT A.*,B.* FROM
(
SELECT * FROM (
SELECT id BusinessUnit,Premium
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END SortOrder
FROM businessunit Name LIKE 'tanto%'
) AA ORDER BY Premium,SortOrder LIMIT 0,30
) A LEFT JOIN `businessunit-postaddress` B USING (BusinessUnit);
This will still generate a filesort.
You may want to consider preloading the needed keys in a separate table you can index.
CREATE TABLE BusinessKeys
(
id int not null auto_increment,
BusinessUnit int not null,
Premium int not null,
SortOrder int not null,
PRIMARY KEY (id),
KEY OrderIndex (Premuim,SortOrder,BusinessUnit)
);
Populate all keys that match
INSERT INTO BusinessKeys (BusinessUnit,Premuim,SortOrder)
SELECT id,Premium
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END
FROM businessunit Name LIKE 'tanto%';
Then, to paginate, run LIMIT on the BusinessKeys only
SELECT A.*,B.*
FROM
(
SELECT FROM BusinessKeys
ORDER BY Premium,SortOrder
LIMIT 0,30
) BK
LEFT JOIN businessunit A ON BK.BusinessUnit = A.id
LEFT JOIN `businessunit-postaddress` B ON A.BusinessUnit = B.BusinessUnit
;
CAVEAT : I use LEFT JOIN instead of INNER JOIN because LEFT JOIN preserves the order of the keys from the left side of the query.
I've read the answer to use Sphinx to optimize the search. But regarding my experience I would advise a different solution. We used Sphinx for some years and had a few nasty problems with segmentation faults and corrupted indice. Perhaps Sphinx isn't as buggy as a few years before, but for a year now we are very happy with a different solution:
http://www.elasticsearch.org/
The great benefits:
Scalability - you can simply add another server with nearly zero configuration. If you know mysql replication, you'll love this feature
Speed - Even under heavy load you get good results in much less than a second
Easy to learn - Only by knowing HTTP and JSON you can use it. If you are a Web-Developer, you feel like home
Easy to install - it is useable without touching the configuration. You just need simple Java (no Tomcat or whatever) and a Firewall to block direct access from the public
Good Javascript integration - even a phpMyAdmin-like Tool is a simple HTML-Page using Javascript: https://github.com/mobz/elasticsearch-head
Good PHP Integration with https://github.com/ruflin/Elastica
Good community support
Good documentation (it is not eye friendly, but it covers nearly every function!)
If you need an additional storage solution, you can easily combine the search engine with http://couchdb.apache.org/

Boolean search on database with 5 million rows, very slow

I have a script that when a user searches for "Apples", there is a coloumn with related searches like "Shiny Apples", and "Rotten Apples", etc.
Problem is my searches database has 5,600,000 rows and mysql is lagging very badly.
Here is my code:
$get = mysql_query("SELECT keyword FROM searches WHERE MATCH (keyword) AGAINST ('".mysql_real_escape_string(preg_replace('#([\S]+)#', '$1', $title))."' IN BOOLEAN MODE) LIMIT 0,15");
This is taking a very long time to process and I have a very active website where searches are getting performed every second, can anybody recommend me some tips?
Thanks!
You need indexes. I have had some experiences in the past where just by creating indexes we got significant performance gain.
You have to index the columns you look into in order to speed things up. I quit using Mysql's native search just because it sucks. I moved to Lucene, and works like charm

Optimizing a PHP page: MySQL bottleneck

I have a page that is taking 37 seconds to load. While it is loading it pegs MySQL's CPU usage through the roof. I did not write the code for this page and it is rather convoluted so the reason for the bottleneck is not readily apparent to me.
I profiled it (using kcachegrind) and find that the bulk of the time on the page is spent doing MySQL queries (90% of the time is spent in 25 different mysql_query calls).
The queries take the form of the following with the tag_id changing on each of the 25 different calls:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
Each query is taking around 0.8 seconds to complete with a few longer delays thrown in for good measure... thus the 37 seconds to completely load the page.
My question is, is it the way the query is formatted with that nested select that is causing the problem? Or could it be any one of a million other things? Any advice on how to approach tackling this slowness is appreciated.
Running EXPLAIN on the query gives me this (but I'm not clear on the impact of these results... the NULL on primary key looks like it would be bad, yes? The number of results returned seems high to me as well as only a handful of results are returned in the end):
1 PRIMARY tbl_news ALL NULL NULL NULL NULL 1318 Using where
2 DEPENDENT SUBQUERY tbl_tag_relations ref FK_tbl_tag_tags_1 FK_tbl_tag_tags_1 4 const 179 Using where
I'e addressed this point in Database Development Mistakes Made by AppDevelopers. Basically, favour joins to aggregation. IN isn't aggregation as such but the same principle applies. A good optimize will make these two queries equivalent in performance:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
and
SELECT tn.*
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id = 20
as I believe Oracle and SQL Server both do but MySQL doesn't. The second version is basically instantaneous. With hundreds of thousands of rows I did a test on my machine and got the first version to sub-second performance by adding appropriate indexes. The join version with indexes is basically instantaneous but even without indexes performs OK.
By the way, the above syntax I use is the one you should prefer for doing joins. It's clearer than putting them in the WHERE clause (as others have suggested) and the above can do certain things in an ANSI SQL way with left outer joins that WHERE conditions can't.
So I would add indexes on the following:
tbl_news (news_id)
tbl_tag_relations (news_id)
tbl_tag_relations (tag_id)
and the query will execute almost instantaneously.
Lastly, don't use * to select all the columns you want. Name them explicitly. You'll get into less trouble as you add columns later.
The SQL Query itself is definitely your bottleneck. The query has a sub-query in it, which is the IN(...) portion of the code. This is essentially running two queries at once. You can likely halve (or more!) your SQL times with a JOIN (similar to what d03boy mentions above) or a more targeted SQL query. An example might be:
SELECT *
FROM tbl_news, tbl_tag_relations
WHERE tbl_tag_relations.tag_id = 20 AND
tbl_news.news_id = tbl_tag_relations.news_id
To help SQL run faster you also want to try to avoid using SELECT *, and only select the information you need; also put a limiting statement at the end. eg:
SELECT news_title, news_body
...
LIMIT 5;
You also will want to look into the database schema itself. Make sure you are indexing all of the commonly referred to columns so that the queries will run faster. In this case, you probably want to check your news_id and tag_id fields.
Finally, you will want to take a look at the PHP code and see if you can make one single all-encompassing SQL query instead of iterating through several seperate queries. If you post more code we can help with that, and it will probably be the single greatest time savings for your posted problem. :)
If I understand correctly, this is just listing the news stories for a specific set of tags.
First of all, you really shouldn't
ever SELECT *
Second, this can probably be
accomplished within a single query,
thus reducing the overhead cost of
multiple queries. It seems like it
is getting fairly trivial data so
it could be retrieved within a
single call instead of 20.
A better approach to using IN might be to use a JOIN with a WHERE condition instead. When using an IN it will basically be a lot of OR statements.
Your tbl_tag_relations should definitely have an index on tag_id
select *
from tbl_news, tbl_tag_relations
where
tbl_tag_relations.tag_id = 20 and
tbl_news.news_id = tbl_tag_relations.news_id
limit 20
I think this gives the same results, but I'm not 100% sure. Sometimes simply limiting the results helps.
Unfortunately MySQL doesn't do very well with uncorrelated subqueries like your case shows. The plan is basically saying that for every row on the outer query, the inner query will be performed. This will get out of hand quickly. Rewriting as a plain old join as others have mentioned will work around the problem but may then cause the undesired affect of duplicate rows.
For instance the original query would return 1 row for each qualifying row in the tbl_news table but this query:
SELECT news_id, name, blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22)
would return 1 row for each matching tag. You could stick DISTINCT on there which should only have a minimal performance impact depending on the size of the dataset.
Not to troll too badly, but most other databases (PostgreSQL, Firebird, Microsoft, Oracle, DB2, etc) would handle the original query as an efficient semi-join. Personally I find the subquery syntax to be much more readable and easier to write, especially for larger queries.

Categories