MySQL, PHP Relative search results / Optimization - php

I understand LIKE results with wildcards etc. What I need to know is a good way to get search results with the most relevant at the top.
For Example:
I search for "Front Brake CarModel" or something similar.
Currently I explode the string on spaces and add an OR condition to the WHERE clause for each word, so the query looks something like this:
SELECT * FROM `table` WHERE article_text LIKE '%Front%' OR article_text LIKE '%Brake%' OR article_text LIKE '%CarModel%'
Due to my novice searching skills, this is not great, as it gets results for every word in the search term. What I would like is to sort the results so that the articles with the most matched words come first, if that makes sense.
Advice?
EDIT: The table is InnoDB and I cannot change the engine because of foreign key constraints, which rules out FULLTEXT indexing for me :(

This can be done easily with a fulltext index (InnoDB tables support FULLTEXT indexes from MySQL 5.6 onward).
ALTER TABLE `table` ADD FULLTEXT INDEX `ft_search` (`article_text`);
SELECT *, MATCH(article_text) AGAINST('Front Brake CarModel') AS score
FROM `table`
WHERE MATCH(article_text) AGAINST('Front Brake CarModel') ORDER BY score DESC;
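If FULLTEXT really is unavailable, the "most matched words first" ordering asked for in the question can be approximated with plain LIKE. The sketch below is in Python rather than PHP, purely to show the query-building step; the function name, the default table name `articles` and the `%s` placeholder style (as used by common Python MySQL drivers) are assumptions, not part of the original question. It relies on MySQL evaluating a comparison to 1 or 0, so the LIKE terms can be summed into a score. It still scans the whole table, and it does not escape `%` or `_` inside the user's words.

```python
def build_ranked_like_query(search, table="articles", column="article_text"):
    """Return (sql, params) that rank rows by how many search words they contain."""
    words = search.split()
    if not words:
        raise ValueError("search string contains no words")
    # In MySQL a comparison yields 1 or 0, so the LIKE terms can be added up.
    score = " + ".join(f"({column} LIKE %s)" for _ in words)
    where = " OR ".join(f"{column} LIKE %s" for _ in words)
    sql = (
        f"SELECT *, ({score}) AS score FROM `{table}` "
        f"WHERE {where} ORDER BY score DESC"
    )
    patterns = [f"%{word}%" for word in words]
    # The placeholders appear twice: once in the SELECT list, once in WHERE.
    return sql, patterns + patterns
```

The returned SQL and parameter list would be passed together to the driver's execute call, so the search words never get concatenated into the SQL string.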

Related

How to search singular/plurals in php mysql

When I enter a word (e.g. freshers) I want both the fresher and the freshers records, but currently I only get the freshers records. My code is like this:
$search='freshers';
$qry=mysql_query("select count(*) from jobs where job_title like '%$search%' or MATCH(job_title)
AGAINST('$search' IN BOOLEAN MODE)");
When the search word is freshers I get a count of 1200. When the search word is fresher I get a count of 2000.
How can I get almost the same count whether I enter freshers or fresher?
You cannot get precisely the same matching-record count from any MySQL technology with a search term that's either singular or plural.
MySQL doesn't have the smarts to know that freshers is the plural of fresher or children is the plural of child. If you want to do this with MySQL you'll have to start your search with the singular form of the word for which you want the plural.
Neither does MySQL know that mice is the plural of mouse.
If you need automated plural/singular functionality you may want to investigate Lucene or some other natural language search tech. The name of the capability you seek is "stemming."
But you can use FULLTEXT search with terms with trailing asterisks. For example, 'fresher*' matches fresher, freshers, and even fresherola. This will extend a search from singular to plural. It will not work the other way around.
select count(*)
from jobs
where MATCH(job_title) AGAINST('fresher*' IN BOOLEAN MODE)
There are some other modifying characters for boolean mode search terms. They are mentioned here:
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
Pro tip: column LIKE '%searchterm%' is probably the slowest way MySQL offers to search a column. It is guaranteed to scan the whole table.
Pro tip: FULLTEXT search is inherently a bit fuzzy. Expecting crisp record counts from it is probably a path to confusion.
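To start from the singular form, as suggested above, the application can normalise the term before building the boolean-mode query. Here is a minimal sketch in Python; the function name and the "drop one trailing s" rule are assumptions for illustration only. It is not real stemming: irregular plurals such as children or mice pass through unchanged, and any boolean-mode operator characters in the input are not stripped.

```python
def to_prefix_term(word):
    """Turn 'freshers' or 'fresher' into the boolean-mode term 'fresher*'."""
    base = word.strip().lower()
    # Naive rule: drop one trailing "s" from words longer than three letters,
    # but leave words ending in "ss" (e.g. "class") alone.
    if len(base) > 3 and base.endswith("s") and not base.endswith("ss"):
        base = base[:-1]
    return base + "*"
```

Both spellings then produce the same AGAINST('fresher*' IN BOOLEAN MODE) term, so the two searches should return the same count.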
Try this:
$search='fresher';
$qry=mysql_query("select count(*) from jobs where job_title like '$search%'");

Searching a big mysql database with relevance

I'm building a rather large "search" engine for our company intranet; it has over 1 million entries.
It's running on a rather fast server, and yet some search queries take up to 1 minute.
This is how the table looks
I tried to create an index for it, but it seems I'm missing something; this is what SHOW INDEX shows
And this is the query itself. It is mostly the ordering that slows the query down, but even without the sorting the query is somewhat slow.
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` = 'tanto'
THEN 0
WHEN `businessunit`.`Name` LIKE 'tanto %'
THEN 1
WHEN `businessunit`.`Name` LIKE 'tanto%'
THEN 2
ELSE 3
END , `businessunit`.`Name`
LIMIT 0 , 30
Any help is very much appreciated.
Edit:
What's choking this query, 99% of it, is ordering by relevance with the wildcard character %.
When I run EXPLAIN it says Using where; Using filesort
You should try Sphinx, a full-text search engine that will give you very good performance along with lots of options for tuning relevance.
Seems like the index doesn't cover Premium, yet that is the first ORDER BY argument.
Run EXPLAIN on your query to see the query plan, and change your index to remove any table scans, as explained in http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
MySQL is good for storing data but not great when it comes to fast text-based search.
Apart from Sphinx, which has already been suggested, I recommend two fantastic search engines:
Solr with http://pecl.php.net/package/solr - very popular search engine. Used on massive services like Netflix.
Elastic Search - relatively new software but with a very active community and lots of respect
Both solutions are based on the same library, Apache Lucene.
If the "ORDER BY" is really the bottleneck, the straight-forward solution would be to remove the "ORDER BY" logic from your query, and re-implement the sorting directly in your application's code using C# sorting. Unfortunately, this means you'd also have to move your pagination into your application, since you'd need to obtain the complete result set before you can sort & paginate it. I'm just mentioning this because no-one else so far appears to have thought of it.
Frankly (as others have pointed out), the query you showed at the top should not need full-text indexing. A single trailing wildcard (e.g., LIKE 'ABC%') should be very efficient as long as a BTREE (and not a HASH) index is available on the column in question.
And, personally, I have no aversion even to a double wildcard (e.g., LIKE '%ABC%'), which of course can never make use of indexes, as long as a full table scan is cheap. Probably 250,000 rows is the point where I'll start to seriously consider full-text indexing. 100,000 is definitely no problem.
I always make sure my SELECTs are dirty reads, though (no transactionality applied to the select).
It's dirty once it gets to the user's eyeballs in any case!
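For what it's worth, the application-side sorting and pagination described above could look roughly like this. The answer mentions C#; this sketch uses Python just to show the idea, and the function names and the dict-shaped rows with "Name" and "Premium" keys are assumptions. The rank function mirrors the CASE expression from the question, with a case-insensitive comparison assumed to match typical MySQL collations.

```python
def relevance_rank(name, term):
    """0 for an exact match, 1 for 'term ' followed by more text, 2 otherwise."""
    lowered, term = name.lower(), term.lower()
    if lowered == term:
        return 0
    if lowered.startswith(term + " "):
        return 1
    return 2


def sort_and_paginate(rows, term, offset=0, limit=30):
    """Sort like ORDER BY Premium DESC, <rank>, Name, then slice out one page."""
    ordered = sorted(
        rows,
        key=lambda row: (-row["Premium"], relevance_rank(row["Name"], term), row["Name"]),
    )
    return ordered[offset:offset + limit]
```

As noted above, this only works if the complete result set for the LIKE 'tanto%' filter has been fetched first, so it trades database sorting for memory and transfer time in the application.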
Most search-oriented sites use full-text search.
It is much faster than SELECT with LIKE.
I have added an example and some links below.
I think they will be useful for you.
Full-text search also comes with some conditions of its own.
STEP:1
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
);
STEP:2
INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');
STEP:3
Natural Language Full-Text Searches:
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('database');
Boolean Full-Text Searches
SELECT * FROM articles WHERE MATCH (title,body)
AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
See these links:
viralpatel.net, devzone.zend.com, sqlmag.com, colorado.edu, en.wikipedia.org
It's a strange query :)
Let's try to understand what it does.
It returns at most 30 rows from the table "businessunit", subject to some conditions.
The first condition is the join on the foreign key of the "businessunit-postaddress" table.
Please check that you have an index on the column businessunit-postaddress.BusinessUnit.
The second is a filter that returns only rows whose businessunit.Name begins with 'tanto'.
If I'm not mistaken, you have a very complex index, 'Business', consisting of 11 fields!
And the field 'Name' is not the first field in this index.
So this index is useless for a "LIKE 'tanto%'" query.
I strongly doubt that this index is needed at all.
Besides, it takes considerable resources to maintain and slows down write operations on this table.
You should create an index on the field 'Name' alone.
After filtering, the query sorts the results, and it does so in a strange way too.
First it sorts by the field businessunit.Premium, which is normal.
However, the CASE expression that follows is useless.
Here is why.
Zero is assigned to rows where Name = 'tanto' (exactly).
One is assigned to rows with a space after 'tanto'. These will sort after 'tanto' in any case (special symbols aside), because a space sorts lower than any letter.
Two is assigned to rows with any characters after 'tanto' (including a space!). These rows will also end up in this order by definition.
And three is "reserved" for "other" rows, but you won't get any "other" rows: remember the [WHERE businessunit.Name LIKE 'tanto%'] condition.
So this part of the ORDER BY is meaningless.
And at the end of the ORDER BY there is businessunit.Name again...
My advice: rebuild the query from scratch, keeping in mind what you want to get.
Anyway I guess you can use
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC,
`businessunit`.`Name`
LIMIT 0 , 30
Don't forget the index on the field businessunit-postaddress.BusinessUnit!
I also have a strong suspicion about the field Premium.
I guess it is designed for storing binary data (yes/no).
If so, an ordinary (BTREE) index is a poor fit because of the low cardinality.
A bitmap index is the usual tool for such a column, but note that MySQL does not provide one.
P.S. I'm not sure that you really need to use SQL_CALC_FOUND_ROWS. See:
MySQL: Pagination - SQL_CALC_FOUND_ROWS vs COUNT()-Query
It's either full-text search (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) or pattern matching (http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html) on the PHP and MySQL side.
From experience and theory:
Advantages of full-text -
1) Results are very relevant, and delimiter characters such as spaces in the search query do not hinder the search.
Disadvantages of full-text -
1) There are stopwords, which web hosts use as a restriction to prevent excess load. (E.g. search results containing the word 'one' or 'moz' are not displayed.) This can be avoided if you're running your own server by configuring an empty stopword list.
2) If I type 'ree' it only finds the exact word 'ree', not 'three' or 'reed'.
Advantages of pattern matching -
1) It has no stopwords, and if you search for 'ree' it finds any word containing 'ree', like 'reed' or 'three', unlike full-text where only the exact word is retrieved.
Disadvantages of pattern matching -
1) If the search string contains delimiters such as spaces and the stored text does not contain the words in exactly that form, it returns no result.
If the argument of LIKE doesn't begin with a wildcard character, as in your example, the LIKE operator should be able to take advantage of indexes.
In this case, the LIKE operator should perform better than LOCATE or LEFT, so I suspect that changing the condition like this could make things worse, but I still think it's worth trying (who knows?):
WHERE LOCATE('tanto', `businessunit`.`Name`)=1
or:
WHERE LEFT(`businessunit`.`Name`,5)='tanto'
I would also change your order by clause:
ORDER BY
`businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` LIKE 'tanto %' THEN 1
WHEN `businessunit`.`Name` = 'tanto' THEN 0
ELSE 2 END,
`businessunit`.`Name`
Name has to be LIKE 'tanto%' already, so you can skip a condition (CASE will never return value 3). Of course, make sure that Premium field is indexed.
Hope this helps.
I think you need to collect the keys only, sort them, then join last
SELECT A.*,B.* FROM
(
SELECT * FROM (
SELECT id BusinessUnit,Premium,
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END SortOrder
FROM businessunit WHERE Name LIKE 'tanto%'
) AA ORDER BY Premium,SortOrder LIMIT 0,30
) A LEFT JOIN `businessunit-postaddress` B USING (BusinessUnit);
This will still generate a filesort.
You may want to consider preloading the needed keys in a separate table you can index.
CREATE TABLE BusinessKeys
(
id int not null auto_increment,
BusinessUnit int not null,
Premium int not null,
SortOrder int not null,
PRIMARY KEY (id),
KEY OrderIndex (Premium,SortOrder,BusinessUnit)
);
Populate all keys that match
INSERT INTO BusinessKeys (BusinessUnit,Premium,SortOrder)
SELECT id,Premium,
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END
FROM businessunit WHERE Name LIKE 'tanto%';
Then, to paginate, run LIMIT on the BusinessKeys only
SELECT A.*,B.*
FROM
(
SELECT BusinessUnit FROM BusinessKeys
ORDER BY Premium,SortOrder
LIMIT 0,30
) BK
LEFT JOIN businessunit A ON BK.BusinessUnit = A.id
LEFT JOIN `businessunit-postaddress` B ON A.id = B.BusinessUnit
;
CAVEAT : I use LEFT JOIN instead of INNER JOIN because LEFT JOIN preserves the order of the keys from the left side of the query.
I've read the answer suggesting Sphinx to optimize the search. But from my experience I would advise a different solution. We used Sphinx for some years and had a few nasty problems with segmentation faults and corrupted indices. Perhaps Sphinx isn't as buggy as it was a few years ago, but for a year now we have been very happy with a different solution:
http://www.elasticsearch.org/
The great benefits:
Scalability - you can simply add another server with nearly zero configuration. If you know mysql replication, you'll love this feature
Speed - Even under heavy load you get good results in much less than a second
Easy to learn - You only need to know HTTP and JSON to use it. If you are a web developer, you'll feel at home
Easy to install - it is useable without touching the configuration. You just need simple Java (no Tomcat or whatever) and a Firewall to block direct access from the public
Good Javascript integration - even a phpMyAdmin-like Tool is a simple HTML-Page using Javascript: https://github.com/mobz/elasticsearch-head
Good PHP Integration with https://github.com/ruflin/Elastica
Good community support
Good documentation (it is not eye friendly, but it covers nearly every function!)
If you need an additional storage solution, you can easily combine the search engine with http://couchdb.apache.org/

Searching a field for the whole word, and nothing but the word

I have a PHP interface with a keyword search, working off a MySQL DB which has a Keywords field.
The keywords field is set up as follows: it is a varchar with all the words formatted as shown below...
the, there, theyre, their, thermal etc...
If I want to return just the exact word 'the' from the search, how would this be achieved?
I have tried using 'the%' and '%the' in the PHP, but it doesn't return all of the rows in which the keyword appears.
Is there a better (more accurate) way to go about this?
Thanks
If you want to select the rows that have exactly the keyword the:
SELECT * FROM `table` WHERE keyword='the'
If you want to select the rows that have the keyword the anywhere in them:
SELECT * FROM `table` WHERE keyword LIKE '%the%'
If you want to select the rows that start with the keyword the:
SELECT * FROM `table` WHERE keyword LIKE 'the%'
If you want to select the rows that end with the keyword the:
SELECT * FROM `table` WHERE keyword LIKE '%the'
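Try this
SELECT * FROM tablename
WHERE fieldname REGEXP '[[:<:]]test[[:>:]]'
[[:<:]] and [[:>:]] are markers for word boundaries. (From MySQL 8.0.4 onward, which uses the ICU regex library, these markers are no longer supported; use \b instead, written as '\\btest\\b' in a string literal.)
See: MySQL Regular Expressions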
If you also search for the commas, you can be sure you are getting the whole word.
where keywordField like '%, the, %'
or keywordField like '%, the'
or keywordField like 'the, %'
or keywordField = 'the'
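The same whole-word check can be expressed on the application side, which is handy for testing which rows a query ought to return. A minimal sketch in Python, assuming the field is a comma-separated list as shown in the question; the function name is made up for illustration, and matching is made case-insensitive on the assumption that the column uses a case-insensitive collation.

```python
def has_keyword(keywords_field, word):
    """True if word is one of the comma-separated entries in keywords_field."""
    entries = [entry.strip().lower() for entry in keywords_field.split(",")]
    return word.strip().lower() in entries
```

For example, has_keyword("the, there, theyre, their, thermal", "the") is true, while a field holding only "there, their, thermal" does not match.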
Maybe I didn't understand the question properly, but if you want all the rows where 'the' appears, a LIKE '%word%' should work.
If the DB of words is HUGE, MySQL may fail to retrieve some of the words; that can be solved in 2 ways...
1- Use a DB that supports bigger sizes (not many people would choose this one, though). For example, SQL Server has a CONTAINS function that works better than LIKE '%word%'.
2- Use an external search tool based on an inverted index. I used Sphinx for a project and it works quite well. This is better if you rarely UPDATE the rows you want to search, which should be the case.
Sphinx, for example, generates a file from your MySQL table and uses this file to answer searches (it's very fast). The file has to be re-indexed every time you insert or update rows in the table, which makes it a much better solution if you rarely update or insert.
It looks like you have a one to many relationship going on within a column. It might be better to create a separate table for keywords with a row for each keyword and a foreign key to whatever it is you're searching on.
Doing LIKE '%???%' is generally a bad idea because the DB can't make use of an index, so it will scan the whole table. Whether this matters will depend on the size of the data you're working with, but it's worth considering up front. The single best way to help DB performance is in the initial table design. This can be tricky to change later.

Tag based searching with MySQL

I want to write a tag based search engine in MySQL, but I don't really know how to get to a pleasant result.
I used LIKE, but as I stored over 18k keywords in the database, it's pretty slow.
What I got is a table like this:
id(int, primary key) article_cloud(text) keyword(varchar(40), FULLTEXT INDEX)
So I store one keyword per row and save all the referring article numbers in article_cloud.
I tried the MATCH() AGAINST() stuff, which works fine as long as the user types in the whole keyword. But I also want a suggest search, so that there are relevant articles popping up, while the user is typing. So I still need a similar statement to LIKE, but faster. And I have no idea what I could do.
Maybe this is the wrong concept of tag based searching. If you know a better one, please let me know. I'm fighting with this for days and can't figure out a satisfying solution. Thanks for reading :)
MATCH() AGAINST() / FULLTEXT searching is a quick fix to a problem - but your schema makes no sense at all - surely there are multiple keywords in each article? And using a fulltext index on a column which only contains a single word is rather dumb.
and save all the refering article numbers in article_cloud
No! storing multiple values in a single column is VERY bad practice. When those values are keys to another table, it's a mortal sin!
It looks like you've got a long journey ahead of you to create something which will work efficiently; the quickest route to the goal is probably to use Google or Yahoo's indexing services on your own data. But if you want to fix it yourself....
See this answer on creating a search engine - the keywords should be in a separate table with a N:1 relationship to your articles, primary key on keyword and article id, e.g.
CREATE TABLE article (
id INTEGER NOT NULL AUTO_INCREMENT,
modified TIMESTAMP,
content TEXT,
/* ... */
PRIMARY KEY (id)
);
CREATE TABLE keyword (
word VARCHAR(20),
article_id INTEGER, /* references article.id */
relevance FLOAT DEFAULT 0.5, /* allow users to record relevance of keyword to article */
PRIMARY KEY (word, article_id)
);
CREATE TEMPORARY TABLE search (
word VARCHAR(20),
PRIMARY KEY (word)
);
Then split the words entered by the user, convert them to a consistent case (same as used for populating the keyword table) and populate the search table, then find matches using....
SELECT article.id, SUM(keyword.relevance)
FROM article, keyword, search
WHERE article.id=keyword.article_id
AND keyword.word=search.word
GROUP BY article_id
ORDER BY SUM(keyword.relevance) DESC
LIMIT 0,3
It'll be a lot more efficient if you can maintain a list of words or rules about words NOT to use as keywords (e.g. ignore any words of 3 chars or less in mixed or lower case will omit stuff like 'a', 'to', 'was', 'and', 'He'...).
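The splitting and filtering step described above might look something like this. It is only a sketch in Python with a made-up function name; the punctuation stripped and the length threshold are assumptions taken from the "3 chars or less in mixed or lower case" rule, so short all-caps words (likely acronyms) are kept. Duplicates are dropped because the search table has a primary key on word.

```python
def search_words(user_input, max_ignored_length=3):
    """Split user input into lower-cased, de-duplicated words for the search table."""
    words = []
    for raw in user_input.split():
        token = raw.strip(".,;:!?\"'()")
        if not token:
            continue
        # Short words are dropped unless fully upper case (e.g. acronyms).
        if len(token) <= max_ignored_length and not token.isupper():
            continue
        word = token.lower()
        if word not in words:
            words.append(word)
    return words
```

Each returned word would then be inserted into the temporary search table before running the relevance query.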
Have a look at Sphinx and Lucene
I tried the MATCH() AGAINST() stuff, which works fine as long as the user types in the whole keyword.
what do you think that FULLTEXT means?
I had 40,000 entries in my table, using no indexes (local use), and a search with LIKE '%SOMETHING%' took at most 0.1 s.
You may also LIMIT your query's output.

Keyword search using PHP MySql?

I have title (varchar), description (text), keywords (varchar) fields in my mysql table.
I kept keywords field as I thought I would be searching in this field only. But I now require to search among all three fields. so for keywords "word1 word2 word3", my query becomes
SELECT * FROM myTable
WHERE (
name LIKE '%word1%' OR description LIKE '%word1%' OR keywords LIKE '%word1%'
OR name LIKE '%word2%' OR description LIKE '%word2%' OR keywords LIKE '%word2%'
OR name LIKE '%word3%' OR description LIKE '%word3%' OR keywords LIKE '%word3%')
AND status = 'live'
Looks a bit messy but this works. But now I need to implement synonym search. so for a given word assuming there are a few synonyms available this query becomes more messy as I loop through all of the words. As the requirements are getting clearer, I will need to join this myTable to some other tables as well.
So
Do you think the above way is messy and will cause problems as the data grow?
How can I avoid above mess? Is there any cleaner solution I can go by? Any example will help me.
Is there any other method/technique you can recommend to me?
With thanks
EDIT
@Peter Stuifzand suggested that I could create one search_index table, store the contents of all 3 fields (title, keywords, description) in it and do a full-text search on that. I understand that this table will also include a reference to the myTable primary key.
But my advanced search may include joining mytable with Category table, geographic_location table (for searching within 10, 20 miles etc), filtering by someother criteria and of course, sorting of search results. Do you think using mysql fulltext will not slow it down?
When your queries are getting out of hand, it's sometimes better to write parts of it in SQL and other parts in your programming language of choice.
And you could also use fulltext search for searching. You can create a separate table with all the fields that you want to search and add a FULLTEXT index.
CREATE TABLE `search_index` (
`id` INT NOT NULL,
`data` TEXT,
FULLTEXT (`data`)
);
SELECT `id` FROM `search_index` WHERE MATCH(`data`) AGAINST('word1 word2 word3');
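As for writing part of the query in the application, the LIKE version from the question, including synonyms, can be generated rather than typed out by hand. A rough sketch in Python follows; the function name, the shape of the synonyms mapping and the `%s` placeholder style are assumptions, while the column names and the status filter are taken from the query in the question.

```python
def build_keyword_query(words, synonyms=None,
                        columns=("name", "description", "keywords")):
    """Return (sql, params) matching any word or synonym in any of the columns."""
    synonyms = synonyms or {}
    terms = []
    for word in words:
        # Each word is followed by its synonyms, skipping duplicates.
        for term in [word] + list(synonyms.get(word, [])):
            if term not in terms:
                terms.append(term)
    if not terms:
        raise ValueError("no search terms given")
    conditions = [f"{col} LIKE %s" for term in terms for col in columns]
    params = [f"%{term}%" for term in terms for _ in columns]
    sql = ("SELECT * FROM myTable WHERE (" + " OR ".join(conditions)
           + ") AND status = 'live'")
    return sql, params
```

This keeps the PHP (or any other) code readable, but the resulting query still scans every row, so it does not remove the performance concern raised in the question.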
One more way (sometimes it's better but it depends...)
SELECT
id, name, description, keywords
FROM
myTable
WHERE
name REGEXP 'word1|word2|word3' OR
description REGEXP 'word1|word2|word3' OR
keywords REGEXP 'word1|word2|word3'
;
PS: But MATCH(cols) AGAINST('expr') possibly is better for your case.
If at all possible, you should look into fulltext search.
Given the expanded requirements, you might want to consider using Apache Solr (see http://lucene.apache.org/solr/). It is a faceted search engine designed for full-text searching. It has a RESTful interface that can return XML or JSON. I am using it in a few projects and it works well.
The only area I see you hitting some problems is potentially with the proximity search, but with some additional logic for building the query it should work.
