MySQL Combining FULLTEXT with a LIKE Fallback - php

I'm building my app to use a single search table for searching all the different object types, i.e. posts, pages, products etc., based on this article.
My table layout looks like so:
CREATE TABLE IF NOT EXISTS myapp_search_index (
id int(11) unsigned NOT NULL,
language_id int(11) unsigned NOT NULL,
`type` varchar(24) COLLATE utf8_unicode_ci NOT NULL,
object_id int(11) unsigned NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (id,language_id),
FULLTEXT KEY `text.fdx` (`text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1;
My search query looks like so:
$items = $db->escape($query);
$query = $db->query("
SELECT *,
SUM(MATCH(text) AGAINST('+{$items}' IN BOOLEAN MODE)) as score
FROM {$db->prefix}search_index
WHERE MATCH(text) AGAINST('+{$items}' IN BOOLEAN MODE)
GROUP BY language_id, type, object_id
ORDER BY score DESC
LIMIT " . (int)$start . ", " . (int)$limit . "
");
This works great except where we run into fulltext limitations like stop words and min word length.
For instance I have 2 entries in the table for my About Us page, one holds the page title, and one holds the content of the page.
Running the query about us returns no results as about is a stop word, and us is less than the minimum 4 letters.
So, my thought was to create a conditional fallback query using a traditional LIKE parameter as such:
if (!$query->num_rows):
$query = $db->query("
SELECT *
FROM {$db->prefix}search_index
WHERE text LIKE '%{$items}%'
GROUP BY language_id, type, object_id
ORDER BY id DESC
LIMIT " . (int)$start . ", " . (int)$limit . "
");
endif;
And once again this works fine. My About Us page now comes up just fine in the results.
But what I'd like is to run this all in one query and maintain the score somehow.
Is this possible?
EDIT:
Ok so in response to Michael's answer and comments I've changed my query to this:
SELECT *,
SUM(MATCH(text) AGAINST('{$search}' IN BOOLEAN MODE)) as score
FROM {$db->prefix}test_index
WHERE (
MATCH(text) AGAINST('{$search}' IN BOOLEAN MODE)
AND text LIKE '%{$search}%')
OR text LIKE '%{$search}%'
GROUP BY language_id, type, object_id
ORDER BY score DESC
I set up a test table with 100K rows, 50K of which do contain my lorem ipsum search term.
This queries the entire table and returns results in 0.6379 microseconds without any query caching as of yet.
Can anyone tell me if this seems like a fair compromise?

Play around with natural language mode too, with multi-word searches. (The + and - operators only have meaning in boolean mode, so they are left out here.)
SELECT id, prod_name, MATCH(prod_name)
AGAINST ('harpoon article' IN NATURAL LANGUAGE MODE) AS relevance
FROM testproduct
ORDER BY relevance DESC
We often just go with Solr integration, throwing JSON, CSV and text files at it.

There is not a way to elegantly combine fulltext search and LIKE together to get more results.
This is because the two predicates would have to be combined with an OR, which would in turn mean a full table scan (or full index scan if a suitable BTREE exists) is required to test the LIKE expression. All rows would have to be evaluated, which would remove any optimization you're getting from the fulltext search.
In the opposite situation, combining MATCH and LIKE using AND instead of OR -- in cases where the fulltext match returns insufficiently precise matches -- works great because the optimizer uses the fulltext index to find all possible matching rows, then filters the identified rows against the LIKE expression. (Fulltext indexes are almost always preferred by the optimizer, when other possible query plans exist.) Unfortunately, that's the opposite of what you need.
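Since an OR between the two predicates gives up the fulltext index, one alternative is to keep the two queries separate and decide in application code which one is worth running. The following is only a minimal sketch in Python (the question uses PHP, but the logic carries over directly). It assumes the MyISAM default minimum word length of 4 mentioned in the question, and its stop word set is a tiny illustrative subset, not MySQL's real list. The function names are hypothetical.

```python
import re

# Assumption: 4 is the minimum indexed word length (MyISAM default).
FT_MIN_WORD_LEN = 4
# Assumption: a tiny illustrative subset of the built-in stop word list.
STOPWORDS = {"about", "after", "also", "the", "this", "with"}


def searchable_terms(query):
    """Return the words of the query that a FULLTEXT index could match."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words
            if len(w) >= FT_MIN_WORD_LEN and w not in STOPWORDS]


def choose_strategy(query):
    """Pick 'fulltext' if any term is searchable, otherwise 'like'."""
    return "fulltext" if searchable_terms(query) else "like"


print(choose_strategy("about us"))     # like
print(choose_strategy("lorem ipsum"))  # fulltext
```

With this check the LIKE scan only runs for queries the fulltext index could never answer, such as "about us", instead of on every search.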

Related

Caching big data, alternative query or other indexes?

I have a problem. I am working on highscores, and for those I need to build a ranking based on skill experience and latest update time (to see who reached the score first in case the skill experience is the same).
The problem is that with the query I wrote, it takes 28 (skills) x 0.7 seconds to create a personal highscore page that shows a player's rank on each list. Requesting this in the browser is just not doable: the page takes far too long to load, and I need a solution for this.
MySQL version: 5.5.47
The query I wrote:
SELECT rank FROM
(
SELECT highscore.playerID, (@rowID := @rowID + 1) AS rank
FROM
(
SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
) highscore,
(SELECT @rowID := 0) r
) data
WHERE data.playerID = ?
As you can see, I first have to build the whole result set, which gives me the full ranking for that game mode and skill, and only then can I select the rank for a given playerID. I cannot let the query run only until it finds the player, because MySQL doesn't offer such a feature. If I specified WHERE playerID = ? in the inner query, it would return a single row, meaning the rank would always be 1.
The highscores table has 550k rows
I have tried storing the result set for each skillID/game mode combination JSON-encoded in a temp table, and also in files, but that ended up being quite slow as well, because the files are huge and take time to process.
Highscores table:
CREATE TABLE `highscores` (
`playerID` INT(11) NOT NULL,
`skillID` INT(10) NOT NULL,
`skillLevel` INT(10) NOT NULL,
`skillExperience` INT(10) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
PRIMARY KEY (`playerID`, `skillID`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Overall table has got 351k rows
Overall table:
CREATE TABLE `overall` (
`playerID` INT(11) NOT NULL,
`playerName` VARCHAR(50) NOT NULL,
`totalLevel` INT(10) NOT NULL,
`totalExperience` BIGINT(20) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
`game_mode` ENUM('REGULAR','IRON_MAN','IRON_MAN_HARDCORE') NOT NULL DEFAULT 'REGULAR',
PRIMARY KEY (`playerID`, `playerName`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Explain Select result from the query:
Does anybody have a solution for me?
No useful index for WHERE
The last 2 lines of the EXPLAIN (#3 DERIVED):
WHERE hs.skillID = ?
AND o.game_mode = ?
Since neither table has a suitable index to use for the WHERE clause, the optimizer decided to do a table scan of one of them (overall), then reach into the other (highscores). Having one of these indexes would help, at least somewhat:
highscores: INDEX(skillID)
overall: INDEX(game_mode, ...) -- note that an index only on a low-cardinality ENUM is rarely useful.
(More in a minute.)
No useful index for ORDER BY
The optimizer sometimes decides to use an index for the ORDER BY instead of for the WHERE. But
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
cannot use an index, even though both are in the same table. This is because DESC and ASC are different. Changing ASC to DESC would have an impact on the resultset, but would allow
INDEX(skillExperience, updateTime)
to be used. Still, this may not be optimal. (More in a minute.)
Covering index
Another form of optimization is to build a "covering index". That is an index that has all the columns that the SELECT needs. Then the query can be performed entirely in the index, without reaching over to the data. The SELECT in question is the innermost:
( SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC, hs.updateTime ASC
) highscore,
For hs: INDEX(skillID, skillExperience, updateTime, playerID) is "covering" and has the most important item (skillID, from the WHERE) first.
For o: INDEX(game_mode, playerID) is "covering". Again, game_mode must be first.
If you change the ORDER BY to be DESC and DESC, then add another index for hs: INDEX(skillExperience, updateTime, skillID, playerID). Now the first 2 columns must be in that order.
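Whether a candidate index really is covering can be checked from the query plan. In MySQL, EXPLAIN shows "Using index" in the Extra column when a query is answered from the index alone. The sketch below demonstrates the same idea with Python's built-in sqlite3 module. SQLite is only a stand-in here (an assumption made for runnability); its planner is not MySQL's, and the index name hs_cover is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE highscores (
    playerID INT NOT NULL,
    skillID INT NOT NULL,
    skillLevel INT NOT NULL,
    skillExperience INT NOT NULL,
    updateTime INT NOT NULL,
    PRIMARY KEY (playerID, skillID)
);
-- Candidate covering index: the WHERE column first, then the ORDER BY
-- columns, then the selected column.
CREATE INDEX hs_cover
    ON highscores (skillID, skillExperience, updateTime, playerID);
""")

# Ask SQLite how it would execute the single-table part of the query.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT playerID
    FROM highscores
    WHERE skillID = ?
    ORDER BY skillExperience DESC, updateTime ASC
""", (7,)).fetchall()

# The human-readable description is the fourth column of each plan row.
details = [row[3] for row in plan]
for line in details:
    print(line)
```

If the plan mentions a covering index, every column the query touches is present in the index, so the table rows themselves are never read.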
Conclusion
It is not obvious which of those indexes the optimizer would prefer. I suggest you add both and let it choose.
I believe that (1) the innermost query is consuming the bulk of time, and (2) there is nothing to optimize in the outer SELECTs. So, I leave that as my recommendation.
Much of this is covered in my Indexing Cookbook.
An important sub-question: how often does the ranking of all players actually change? Do you really need real-time statistics? Probably not. Choose an update interval, e.g. 10 minutes, and have a cron job write fresh rank statistics into a separate table, like this:
/* lock */
TRUNCATE TABLE rank_stat; /* or mark the old rows as unused, to keep history, instead of truncating */
INSERT INTO rank_stat (a, b, c, d) <your query here>;
/* unlock */
Users (browsers) then read the precomputed statistics from this table, which can also be split into pages.
If the rankings do not change frequently, you can instead recalculate them only on the relevant game events or player achievements.
These are only recommendations, because you have not described your full environment, but they should help you find the right solution.
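The periodic rebuild could look roughly like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own. The rank_stat columns (skillID, playerID, player_rank) and the function names are assumptions, since the answer leaves them open, and the game_mode filter is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE highscores (
    playerID INT, skillID INT, skillExperience INT, updateTime INT,
    PRIMARY KEY (playerID, skillID)
);
-- Hypothetical layout for the precomputed table.
CREATE TABLE rank_stat (
    skillID INT, playerID INT, player_rank INT,
    PRIMARY KEY (skillID, playerID)
);
""")
conn.executemany(
    "INSERT INTO highscores VALUES (?, ?, ?, ?)",
    [(1, 7, 500, 100), (2, 7, 900, 300), (3, 7, 500, 50), (1, 8, 10, 5)],
)


def rebuild_rank_stat(conn):
    """Recompute every rank and swap the table contents in one transaction."""
    rows = conn.execute("""
        SELECT skillID, playerID
        FROM highscores
        ORDER BY skillID, skillExperience DESC, updateTime ASC
    """).fetchall()
    ranked, current_skill, position = [], None, 0
    for skill_id, player_id in rows:
        # Restart the counter whenever a new skill begins.
        position = position + 1 if skill_id == current_skill else 1
        current_skill = skill_id
        ranked.append((skill_id, player_id, position))
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM rank_stat")
        conn.executemany("INSERT INTO rank_stat VALUES (?, ?, ?)", ranked)


def cached_rank(conn, player_id, skill_id):
    """Cheap primary-key lookup that the page request would run."""
    row = conn.execute(
        "SELECT player_rank FROM rank_stat WHERE skillID = ? AND playerID = ?",
        (skill_id, player_id),
    ).fetchone()
    return row[0] if row else None


rebuild_rank_stat(conn)  # what the cron job would call
print(cached_rank(conn, 1, 7))
```

The page request then costs one primary-key lookup per skill, at the price of ranks being up to one interval out of date.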
It doesn't look like you really need to rank everyone; you just want to find out how many people are ahead of the current player. A simple count of the players who have a better score (or the same score with an earlier update time), plus one, gives the current player's rank. Since game_mode lives in the overall table, that table has to be joined in as well.
SELECT COUNT(*) + 1 AS rank
FROM highscores
JOIN highscores playerscore
ON playerscore.skillID = highscores.skillID
JOIN overall o
ON o.playerID = highscores.playerID
WHERE highscores.skillID = ?
AND o.game_mode = ?
AND playerscore.playerID = ?
AND (highscores.skillExperience > playerscore.skillExperience
OR (highscores.skillExperience = playerscore.skillExperience
AND highscores.updateTime < playerscore.updateTime));
(I joined the table to itself and aliased the second instance as playerscore so it was slightly less confusing)
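The counting idea can be tried out in isolation with the sketch below, which uses Python's built-in sqlite3 module as a stand-in for MySQL. It follows the schema from the question (game_mode is stored in overall) and its ORDER BY, so on equal experience the earlier updateTime ranks higher. The sample rows and the function name player_rank are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE highscores (
    playerID INT, skillID INT, skillExperience INT, updateTime INT,
    PRIMARY KEY (playerID, skillID)
);
CREATE TABLE overall (playerID INT PRIMARY KEY, game_mode TEXT);
""")
conn.executemany(
    "INSERT INTO overall VALUES (?, ?)",
    [(1, "REGULAR"), (2, "REGULAR"), (3, "REGULAR"), (4, "IRON_MAN")],
)
conn.executemany(
    "INSERT INTO highscores VALUES (?, ?, ?, ?)",
    [(1, 7, 500, 100), (2, 7, 900, 300), (3, 7, 500, 50), (4, 7, 999, 10)],
)


def player_rank(conn, player_id, skill_id, game_mode):
    """Rank = 1 + number of players in this game mode who are ahead."""
    row = conn.execute("""
        SELECT COUNT(*) + 1
        FROM highscores AS hs
        JOIN highscores AS ps ON ps.skillID = hs.skillID
        JOIN overall AS o ON o.playerID = hs.playerID
        WHERE ps.playerID = ?
          AND hs.skillID = ?
          AND o.game_mode = ?
          AND (hs.skillExperience > ps.skillExperience
               OR (hs.skillExperience = ps.skillExperience
                   AND hs.updateTime < ps.updateTime))
    """, (player_id, skill_id, game_mode)).fetchone()
    return row[0]


print(player_rank(conn, 1, 7, "REGULAR"))  # 3
```

Note that a player with no row for the skill would also come out as rank 1, so the caller should check that the player actually has a score before trusting the result.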
You could probably even simplify it to one query by grouping and parsing the results within your language of choice.
SELECT
highscores.skillID AS skillID,
COUNT(*) + 1 AS rank
FROM highscores
JOIN highscores playerscore
ON playerscore.skillID = highscores.skillID
JOIN overall o
ON o.playerID = highscores.playerID
WHERE playerscore.playerID = ?
AND o.game_mode = ?
AND (highscores.skillExperience > playerscore.skillExperience
OR (highscores.skillExperience = playerscore.skillExperience
AND highscores.updateTime < playerscore.updateTime))
GROUP BY highscores.skillID;
Not quite sure about the grouping bit though: a skill in which nobody is ahead of the player produces no row at all, so a missing skill has to be treated as rank 1.

What is the best way to create indexes for the query below

I have this query:
select * from `metro_stations`
where `is_active` = 1
and (`title` like '%search%' or `title_en` like '%search%')
How to create effective indexes if is_active is TINYINT field and titles are VARCHAR(255) ?
And what about this query:
select * from `metro_stations`
where `is_active` = 1
and (`title` like '%search%' or
`title_en` like '%search%' or
`description` like '%search%')
if description field is text?
Use a full-text index.
If the conditions in the query are combined with "or", use a separate full-text index per column; if they are combined with "and", use one mixed full-text index that spans several columns.
Full Text Index
FULLTEXT(title, title_en)
WHERE is_active = 1
AND MATCH(title, title_en) AGAINST ("+search" IN BOOLEAN MODE)
(This should work for either InnoDB (5.6+) or MyISAM.)
Keep in mind the limitations of "word length" and "stop words" in FULLTEXT.

Select from 3 possible columns, order by occurrences / relevance

I have a table that contains 3 text fields, and an ID one.
The table exists solely to get a collection of IDs of posts based on the relevance of a user search.
Problem is I lack the Einsteinian intellect necessary to warp the SQL continuum to get the desired results -
SELECT `id` FROM `wp_ss_images` WHERE `keywords` LIKE '%cute%' OR `title` LIKE '%cute%' OR `content` LIKE '%cute%'
Is this really enough to get a relevant-to-least-relevant list, or is there a better way?
Minding of course databases could be up to 20k rows, I want to keep it efficient.
Here is an update - I've gone the fulltext route -
EXAMPLE:
SELECT `id` FROM `wp_ss_images` WHERE MATCH (`keywords`,`title`,`content`) AGAINST ('+cute +dog' IN BOOLEAN MODE);
However, it seems to be just grabbing all entries with any of the words. How can I refine this to show relevance by occurrences?
To get a list of results based on the relevance of the number of occurrences of keywords in each field (meaning cute appears in all three fields first, then in 2 of the fields, etc.), you could do something like this:
SELECT id
FROM (
SELECT id,
(keywords LIKE '%cute%') + (title LIKE '%cute%') + (content LIKE '%cute%') total
FROM wp_ss_images
) t
WHERE total > 0
ORDER BY total DESC
SQL Fiddle Demo
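The same scoring query can be run end to end with the sketch below, which uses Python's built-in sqlite3 module in place of MySQL (both evaluate a LIKE comparison to 1 or 0, so the sum counts the matching fields). The sample rows and the function name search_ids are invented for illustration. The id tie-break in the ORDER BY is an addition to make the order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wp_ss_images (
        id INTEGER PRIMARY KEY, keywords TEXT, title TEXT, content TEXT
    )
""")
conn.executemany(
    "INSERT INTO wp_ss_images VALUES (?, ?, ?, ?)",
    [
        (1, "cute, dog", "A cute puppy", "Such a cute dog"),  # 3 fields
        (2, "cat", "Cute kitten", "Sleeping"),                # 1 field
        (3, "car", "Red car", "Fast"),                        # no match
        (4, "cute", "Kitten", "A cute cat"),                  # 2 fields
    ],
)


def search_ids(conn, term):
    """Return ids ordered by how many of the three fields contain term."""
    # Note: % and _ inside term are not escaped in this sketch.
    pattern = f"%{term}%"
    rows = conn.execute("""
        SELECT id
        FROM (
            SELECT id,
                   (keywords LIKE :p) + (title LIKE :p) + (content LIKE :p)
                       AS total
            FROM wp_ss_images
        ) AS t
        WHERE total > 0
        ORDER BY total DESC, id ASC
    """, {"p": pattern}).fetchall()
    return [r[0] for r in rows]


print(search_ids(conn, "cute"))  # [1, 4, 2]
```

Every row is still scanned, because a leading % wildcard cannot use a regular index, which should be acceptable at the 20k rows mentioned in the question.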
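You could concatenate the fields and search the combined string, which is more compact than searching them individually. Joining them with a separator keeps a single-word match from spanning two fields:
SELECT `id` FROM `wp_ss_images` WHERE CONCAT_WS(' ', `keywords`,`title`,`content`) LIKE '%cute%'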
This doesn't help with the 'greatest to least' part of your question though.

Search algorithms or tool for searching from database

I have this database table:
Column Type
source text
news_id int(12)
heading text
body text
source_url tinytext
time timestamp
news_pic char(100)
location char(128)
tags text
time_created timestamp
hits int(10)
Now I was searching for an algorithm or tool to perform a search for a keyword in this table which contains news data. Keyword should be searched in heading,body,tags and number of hits on the news to give best results.
MySQL already has the tool you need built-in: full-text search. I'm going to assume you know how to interact with MySQL using PHP. If not, look into that first. Anyway ...
1) Add a full-text index covering the fields you want to search:
alter table TABLE_NAME add fulltext(heading, body, tags);
2) Use a match ... against statement to perform a full-text search. The column list in match() must be the same as the column list of the full-text index:
select * from TABLE_NAME where match(heading, body, tags) against ('SEARCH_STRING');
Obviously, substitute your table's name for TABLE_NAME and your search string for SEARCH_STRING in these examples.
The number of hits cannot be part of the full-text search, as it's just an integer and full-text indexes only cover character columns. You could sort by number of hits, however, by adding an order clause to your query:
select * from TABLE_NAME where match(heading, body, tags) against ('SEARCH_STRING') order by hits desc;

Full text search - tag system problem

I store tags in a varchar(255) column, in this format:
",keyword1,keyword2,keyword3,key word 324,keyword1234,"
(every keyword must start and end with a comma: comma keyword123 comma)
-
I can find keyword3 with this SQL query:
select * from table1 where tags like '%,keyword3,%'
CREATE TABLE IF NOT EXISTS `table1` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`tags` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `tags` (`tags`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=2242 ;
INSERT INTO `table1` (`id`, `tags`) VALUES
(2222, ',keyword,'),
(2223, ',word is not big,'),
(2224, ',keyword3,'),
(2225, ',my keys,'),
(2226, ',hello,keyword3,thanks,'),
(2227, ',hello,thanks,keyword3,'),
(2228, ',keyword3,hello,thanks,'),
(2239, ',keyword3 but dont find,'),
(2240, ',dont find keyword3,'),
(2241, ',dont keyword3 find,');
(returns 2224,2226,2227,2228)
-
I must change this like command for FULL TEXT SEARCH.
select * from table1 where match (tags) against (",keyword3," in boolean mode)
This SQL command also finds 2239, 2240 and 2241, but I don't want to match %keyword3% or keyword3 as a bare word.
http://prntscr.com/137u9
ideas to find only ,keyword3, ?
,keyword3,
thank you
You can't use full text search alone for this - it searches only for words. Here are a few different alternatives you could use:
You can use a full text search to quickly find candidate rows and then afterwards use a LIKE, as you are already doing, to filter out any false matches from the full text search.
You can use FIND_IN_SET.
You can normalize your database - store only one keyword per row.
INSERT INTO `table1` (`id`, `tag`) VALUES
(2222, 'keyword'),
(2223, 'word is not big'),
(2224, 'keyword3'),
(2225, 'my keys'),
(2226, 'hello'), -- // 2226 has three rows with one keyword in each.
(2226, 'keyword3'),
(2226, 'thanks'),
(2227, 'hello'),
-- etc...
Of those I'd recommend normalizing your database if it is at all possible.
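The normalized layout can be tried with the sample data from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The helper names split_tags and ids_with_tag are hypothetical, and the composite primary key on (id, tag) is an assumption about how the one-keyword-per-row table would be keyed.

```python
import sqlite3

# Tag strings exactly as stored in the question's table.
RAW_ROWS = {
    2222: ",keyword,",
    2223: ",word is not big,",
    2224: ",keyword3,",
    2225: ",my keys,",
    2226: ",hello,keyword3,thanks,",
    2227: ",hello,thanks,keyword3,",
    2228: ",keyword3,hello,thanks,",
    2239: ",keyword3 but dont find,",
    2240: ",dont find keyword3,",
    2241: ",dont keyword3 find,",
}


def split_tags(tag_string):
    """Turn ',a,b c,' into ['a', 'b c'], dropping the empty edge pieces."""
    return [t for t in tag_string.split(",") if t]


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE table1 (
        id INTEGER NOT NULL,
        tag TEXT NOT NULL,
        PRIMARY KEY (id, tag)
    )
""")
for row_id, tag_string in RAW_ROWS.items():
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?)",
        [(row_id, tag) for tag in split_tags(tag_string)],
    )


def ids_with_tag(conn, tag):
    """Exact tag match: no LIKE, no full-text word splitting."""
    rows = conn.execute(
        "SELECT id FROM table1 WHERE tag = ? ORDER BY id", (tag,)
    )
    return [r[0] for r in rows]


print(ids_with_tag(conn, "keyword3"))  # [2224, 2226, 2227, 2228]
```

A plain equality test on tag returns the same four ids as the original LIKE query, and unlike the LIKE pattern it can be served by an ordinary index on the tag column.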
First of all FULL TEXT is intended to be used for text searches. So there are limitations to what you can do with it. To do what you want you need to check the Boolean Mode specifications and see if the " operator can help you, but even with this your searches may not be 100% accurate. You would need to impose a word format for your keywords (preferably no word delimiters inside them like ).
Is there a reason for storing all the tags in one row?
I would store each "tag" in a row then do as andreas suggests and do something like this:
SELECT * FROM table1 WHERE tag IN('keyword0', 'keyword1', 'etc.')
If you need, for some reason, to return all the tags in one row, you could store them individually and GROUP_CONCAT them together.
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat
