I have a database of 30k records, each a game name.
I'm currently using:
SELECT * FROM gamelist WHERE name LIKE '%$search%' OR aliases LIKE '%$search%' LIMIT 10
for the searchbar.
However, it can get really picky about things like:
'Animal Crossing: Wild World'
See, many users won't know that ':' is in that phrase, so when they start searching:
Animal Crossing Wild World
It won't show up.
How can I have the sql be more forgiving?
Replace the non-alphanumeric characters in the search parameter with wildcards, so Animal Crossing Wild World becomes %Animal%Crossing%Wild%World%, and filter on that.
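For example, a minimal sketch assuming MySQL 8.0+, where REGEXP_REPLACE() is available (on older versions, do the replacement in application code before building the query):

-- Collapse every run of non-alphanumeric characters into a % wildcard,
-- so 'Animal Crossing Wild World' also matches 'Animal Crossing: Wild World'.
SELECT * FROM gamelist
WHERE name    LIKE CONCAT('%', REGEXP_REPLACE('$search', '[^A-Za-z0-9]+', '%'), '%')
   OR aliases LIKE CONCAT('%', REGEXP_REPLACE('$search', '[^A-Za-z0-9]+', '%'), '%')
LIMIT 10;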
I would suggest you make another table which contains keywords, like:
+---------+------------+
| game_id | keyword_id |
+---------+------------+
|       1 |          1 |
|       1 |          2 |
|       1 |          3 |
|       1 |          4 |
+---------+------------+
+------------+--------------+
| keyword_id | keyword_name |
+------------+--------------+
|          1 | animal       |
|          2 | crossing     |
|          3 | wild         |
|          4 | world        |
+------------+--------------+
After that you can easily explode the user-given text into keywords and search for them in the database, which will give you the IDs of the possible games the user was looking for.
Oh, and remove special symbols, like ":" or "-", so you don't need multiple keywords for the same phrase.
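A sketch of the lookup, assuming the two tables above are named game_keywords and keywords (hypothetical names):

-- Rank games by how many of the user's keywords they match.
SELECT gk.game_id, COUNT(*) AS matched
FROM keywords k
JOIN game_keywords gk ON gk.keyword_id = k.keyword_id
WHERE k.keyword_name IN ('animal', 'crossing', 'wild', 'world')
GROUP BY gk.game_id
ORDER BY matched DESC
LIMIT 10;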
The following is from "MySQL LIKE %string% not quite forgiving enough. Anything else I can use?" by the user M_M:
If you're using MyISAM, you can use full-text indexing. See this tutorial.
If you're using a different storage engine, you could use a third-party full-text engine like Sphinx, which can act as a storage engine for MySQL or as a separate server that can be queried.
With MySQL full-text indexing, a search on A J Kelly would match AJ Kelly (not to confuse matters, but A, J and AJ would be ignored as they are too short by default, so it would match on Kelly). Generally, fulltext is much more forgiving (and usually faster than LIKE '%string%') because it allows partial matches, which can then be ranked on relevance.
You can also use SOUNDEX to make searches more forgiving by indexing the phonetic equivalents of words, then applying SOUNDEX to your search terms and using those to search the index. With soundex, mary, marie, and marry will all match, for example.
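For example, in MySQL:

SELECT SOUNDEX('mary'), SOUNDEX('marie'), SOUNDEX('marry');
-- all three return 'M600', so the three spellings compare as equal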
You can try MATCH() ... AGAINST() if your MySQL storage engine is MyISAM or InnoDB (InnoDB supports fulltext indexes as of MySQL 5.6):
http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
http://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html
Your resulting SQL will be like this:
SELECT * from gamelist WHERE MATCH (name, aliases) AGAINST ('$search' IN BOOLEAN MODE) LIMIT 10
The behavior of the search is more like the boolean search used in search engines.
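Note that MATCH() ... AGAINST() requires a FULLTEXT index covering the same columns; a minimal sketch for the table above (the index name is arbitrary):

ALTER TABLE gamelist ADD FULLTEXT INDEX ft_name_aliases (name, aliases);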
I want to set up a system that allows for, say, 200 different translations per post. However, most translations won't exist, so there'd be a lot of empty fields. How much of a performance and storage hit is it if I save every language (including empty ones) as a specific column? E.g.:
English | Arabic | Mandarin | Russian | French         | German
Potato  |        |          |         | Pomme de Terre |
Orange  |        |          |         | Orange         |
Peach   |        |          |         |                |
I wouldn't cycle through the whole list very often, I'd use a session variable or usersetting and then load directly from that column if it exists, with a fallback to a default language, and perhaps after that a full search.
if (exists(french))
    { echo french }
else if (exists(english))
    { echo english }
else
    { echo links to non-null languages }
I'd assume that, if I tell the server which column to go to, the processing overhead would be negligible? I also assume that an empty cell would be negligible in terms of storage? However, I don't know for sure, and it could potentially be a huge mistake.
The reason I'd want to work like this is so I could assign language codes, instead of every installed instance having a different order (e.g. english|french|german|mandarin versus english|mandarin|german|french).
To prevent XY-problems, here's a more global formulation:
I want to set up a system that allows for many languages, but I expect that in most cases only one or two are used. What would be an efficient way to store this?
Keyword: Relational database.
You will want to use multiple tables.
Let's say that the default language is English; then your "Words" table will implicitly contain the English words.
Words:
Id | Word
1 | Potato
2 | Orange
Languages:
Id | Name
1 | Norwegian
2 | Danish
Translations:
Word | Language | Translated
1 | 1 | Potet
2 | 1 | Oransje
1 | 2 | Kartoffel
2 | 2 | Appelsin
Then you can do (pseudo-SQL; you can look up the language and word IDs first, or use a more advanced query):
SELECT Translated FROM Translations WHERE Word = (the word id) and Language = (the language id)
This comes with the benefit that it's very simple to list all the languages you support, all the words you support, and all translated words for a specific language (or to find all non-translated words for a language; see the sketch after the next query).
A specific query for translating "Potato" into "Danish" would look like:
SELECT Translated FROM Translations
JOIN Words ON Words.Id = Translations.Word
JOIN Languages ON Languages.Id = Translations.Language
WHERE
Languages.Name = 'Danish' AND Words.Word = 'Potato'
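And, as mentioned, finding all non-translated words for a language is a similar join; a sketch, assuming the schema above:

-- All words that have no Danish translation yet
SELECT w.Word
FROM Words w
LEFT JOIN Translations t
    ON t.Word = w.Id
   AND t.Language = (SELECT Id FROM Languages WHERE Name = 'Danish')
WHERE t.Translated IS NULL;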
I'm building a REST API, so the answer can't include Google Maps or JavaScript stuff.
In our app, we have a table containing posts that looks like that :
ID | latitude    | longitude   | other_stuff
1  | 50.4371243  | 5.9681102   | ...
2  | 50.3305477  | 6.9420498   | ...
3  | -33.4510148 | 149.5519662 | ...
We have a view with a map that shows all the posts around the world.
We hope to have a lot of posts, and it would be ridiculous to show thousands and thousands of markers on the map. So we want to group them by proximity, ending up with something like 2-3 markers per continent.
To be clear, we need this:
Image from https://github.com/googlemaps/js-marker-clusterer
I've done some research and found that k-means seems to be part of the solution.
As I am really, really bad at math, I tried a couple of PHP libraries, like this one: https://github.com/bdelespierre/php-kmeans which seems to do a decent job.
However, there is a drawback: I have to scan the whole table each time the map is loaded. Performance-wise, it's awful.
So I would like to know if someone has already worked through this problem, or if there is a better solution.
I kept searching and found an alternative to k-means: GEOHASH.
Wikipedia will explain it better than me: Wiki geohash
To summarize: the world map is divided into a grid of 32 cells, and each cell is assigned an alphanumeric character. Each cell is in turn divided into 32 cells, and so on, for 12 levels.
So if I do a GROUP BY on the first letter of the hash, I get my clusters for the lowest zoom level; if I want more precision, I just need to group by the first N letters of the hash.
So all I've done is add one field to my table and generate the hash corresponding to my coordinates:
ID | latitude    | longitude   | geohash      | other_stuff
1  | 50.4371243  | 5.9681102   | csyqm73ymkh2 | ...
2  | 50.3305477  | 6.9420498   | p24k1mmh98eu | ...
3  | -33.4510148 | 149.5519662 | 8x2s9674nd57 | ...
Now, if I want to get my clusters, I just have to do a simple query :
SELECT count(*) as nb_markers FROM mtable GROUP BY SUBSTRING(geohash,1,2);
In the SUBSTRING, 2 is the level of precision, and it must be between 1 and 12.
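If you also want a position for each cluster marker, a sketch (same table, still grouping on the hash prefix) is to return the cell together with its average coordinates:

SELECT SUBSTRING(geohash, 1, 2) AS cell,
       COUNT(*)                 AS nb_markers,
       AVG(latitude)            AS lat,
       AVG(longitude)           AS lng
FROM mtable
GROUP BY cell;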
PS: Lib I used to generate my hash
Many articles will point you to fulltext indexing for a simple solution to MySQL searches. That may well be the case under the right circumstances, but I've yet to see a solution that comes close to fulltext when fulltext cannot be used (for instance, across tables). The solution I'm looking for would preferably be one that can match anything in the sentence.
So, searching James Woods or searching Woods James might both return the same row where the text James Woods exists. Basic search methods can't handle this kind of "mix-matching" of search words.
The likely answers are replacing fulltext with REGEXP or LIKE, then replacing the whitespace in the search term with | or %, so James Woods might become James|Woods, so that any combination of James and Woods will return results. Or it becomes '%James%Woods%', which will be less productive but still returns matches that aren't necessarily exact.
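For instance, the REGEXP variant of that idea might look like this (a sketch; it matches rows containing either word):

SELECT * FROM people
WHERE CONCAT_WS(' ', firstname, lastname) REGEXP 'James|Woods';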
Example SQL
SELECT * FROM people
LEFT JOIN
    (SELECT people_id, GROUP_CONCAT(other_data) AS people_data
     FROM people_details   -- assumed name for the table holding the extra data
     GROUP BY people_id)
    AS t2 ON (t2.people_id = people.id)
WHERE CONCAT_WS(' ', people.firstname, people.lastname, people_data) LIKE {$query}
Is this really the best way? Are there any tricks to make this method (or another method) work more efficiently? I'm really looking for a MySQL solution, so if your answer is to use another DB service, so be it and I'll accept that as an answer, but the real question is the best solution for MySQL. Thanks.
In MySQL, have you tried using a combination of the MATCH() and AGAINST() functions? They will yield the result that you are looking for, I guess.
E.g., for the following set of data:
mysql> select * from temp;
+----+---------------------------------------------+
| id | string                                      |
+----+---------------------------------------------+
|  1 | James Wood is the martyr.                   |
|  2 | Wood James is the saviour.                  |
|  3 | James thames are rhyming words.             |
|  4 | Wood is a natural product.                  |
|  5 | Don't you worry child - Swedish House Mafia |
+----+---------------------------------------------+
5 rows in set (0.00 sec)
This query would return the following results, if you require either james or wood to be present:
mysql> select string from temp where match (string) against ('james wood' in boolean mode);
+---------------------------------+
| string                          |
+---------------------------------+
| James Wood is the martyr.       |
| Wood James is the saviour.      |
| James thames are rhyming words. |
| Wood is a natural product.      |
+---------------------------------+
4 rows in set (0.00 sec)
If you require that both words, james and wood, be present, then this query would work. Note the '+' sign before the words; check the Boolean mode docs.
mysql> select string from temp where match (string) against ('+james +wood' in boolean mode);
+----------------------------+
| string                     |
+----------------------------+
| James Wood is the martyr.  |
| Wood James is the saviour. |
+----------------------------+
2 rows in set (0.00 sec)
To find a word with any suffix, it works in a similar way:
mysql> select string from temp where match (string) against ('Jame*' in boolean mode);
+---------------------------------+
| string                          |
+---------------------------------+
| James Wood is the martyr.       |
| Wood James is the saviour.      |
| James thames are rhyming words. |
+---------------------------------+
3 rows in set (0.02 sec)
But note that leading wildcards (matching a word by its suffix) are not supported in MySQL fulltext searches:
mysql> select string from temp where match (string) against ('*ame*' in boolean mode);
Empty set (0.00 sec)
I hope this helps. On a kind note: this reply is very late, but the question was interesting enough for me to reply.
To learn more, check this link: http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html
I'm a bit late to the party - so apologies for that.
You mentioned that you cannot use fulltext functionality because you're using joins - well, although that is kind of the case, there is a popular way to get around this.
Consider the following for using fulltext search with joins:
SELECT
article.articleID,
article.title,
topic.title
FROM articles AS article
-----
INNER JOIN (
SELECT articleID
FROM articles
WHERE MATCH (title, keywords) AGAINST ("cat" IN BOOLEAN MODE)
ORDER BY postDate DESC
LIMIT 0, 30
) AS ftResults ON article.articleID = ftResults.articleID
-----
LEFT JOIN topics AS topic ON article.topicID = topic.topicID
GROUP BY article.articleID
ORDER BY article.postDate DESC
Notice how I managed to keep my topics join intact by running my fulltext search in another query and joining/matching the results by ID.
If you're not on shared hosting, also consider using Sphinx or Lucene/Solr alongside MySQL for extra-fast fulltext searches. I've used Sphinx and highly recommend it.
I have two MySQL tables. One is a bad words list; the other is the table to compare against the bad words list. Essentially, I want to filter and return a list of rows with domains that do not have ANY occurrence of a word from the bad words table. A few sample tables:
bad words list
+----------+------------------+
| id       | words            |
+----------+------------------+
| 1        | porn             |
| 2        | sex              |
+----------+------------------+
table of domains to compare
+----------+------------------+
| id       | domain           |
+----------+------------------+
| 56       | google.com       |
| 57       | sex.com          |
+----------+------------------+
I want to return results such as
+----------+------------------+
| id       | domain           |
+----------+------------------+
| 56       | google.com       |
+----------+------------------+
A thing to note is that these tables have nothing in common, so I'm not even sure this is the best method. I was using a comparison function in PHP, but that was way too slow over hundreds of thousands of rows.
It is possible to do this in MySQL, like this:
SELECT
d.*
FROM
domains d
LEFT JOIN
    words w ON (d.domain LIKE CONCAT('%', w.words, '%'))
GROUP BY
d.domain
HAVING
COUNT(w.id) < 1
but it is not optimal, and it will get slower and slower as the number of records grows in both tables.
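An equivalent formulation uses NOT EXISTS, which reads a bit more directly but scans just the same (a sketch against the sample tables above):

SELECT d.*
FROM domains d
WHERE NOT EXISTS (
    SELECT 1 FROM words w
    WHERE d.domain LIKE CONCAT('%', w.words, '%')
);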
Data like this typically needs to be pre-calculated at insertion time rather than at fetch time. You should add a column to the domains table, something like bad_words BOOLEAN DEFAULT NULL.
NULL would mean "don't know", which in some contexts could be interpreted as "unsafe to show";
false means "no bad words" and true means "contains bad words".
Every time the list of bad words is updated, all columns are reset to NULL and some background job starts to process them again, probably in another language than SQL.
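A sketch of that setup, using the table and column names from the question's samples (the UPDATE is one possible shape for the background job):

ALTER TABLE domains ADD COLUMN bad_words BOOLEAN DEFAULT NULL;

-- Background job: (re)flag every domain against the current word list.
UPDATE domains d
SET d.bad_words = EXISTS (
    SELECT 1 FROM words w
    WHERE d.domain LIKE CONCAT('%', w.words, '%')
);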
I have a site that needs to search through about 20-30k records, which are mostly movie and TV show names. The site runs PHP/MySQL with memcache.
I'm looking to replace FULLTEXT with the soundex() searching that I currently have, which works... kind of, but isn't very good in many situations.
Are there any decent search scripts out there that are simple to implement and will provide a decent searching capability (of 3 columns in a table)?
ewemli's answer is in the right direction, but you should be combining FULLTEXT and soundex mapping, not replacing the fulltext; otherwise your LIKE queries are likely to be very slow.
create table with_soundex (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
original TEXT,
soundex TEXT,
FULLTEXT (soundex)
);
insert into with_soundex (original, soundex) values
('add some test cases', CONCAT_WS(' ', soundex('add'), soundex('some'), soundex('test'), soundex('cases'))),
('this is some text', CONCAT_WS(' ', soundex('this'), soundex('is'), soundex('some'), soundex('text'))),
('one more test case', CONCAT_WS(' ', soundex('one'), soundex('more'), soundex('test'), soundex('case'))),
('just filling the index', CONCAT_WS(' ', soundex('just'), soundex('filling'), soundex('the'), soundex('index'))),
('need one more example', CONCAT_WS(' ', soundex('need'), soundex('one'), soundex('more'), soundex('example'))),
('seems to need more', CONCAT_WS(' ', soundex('seems'), soundex('to'), soundex('need'), soundex('more'))),
('some helpful cases to consider', CONCAT_WS(' ', soundex('some'), soundex('helpful'), soundex('cases'), soundex('to'), soundex('consider')));
select * from with_soundex where match(soundex) against (soundex('test'));
+----+---------------------+---------------------+
| id | original            | soundex             |
+----+---------------------+---------------------+
|  1 | add some test cases | A300 S500 T230 C000 |
|  2 | this is some text   | T200 I200 S500 T230 |
|  3 | one more test case  | O500 M600 T230 C000 |
+----+---------------------+---------------------+
select * from with_soundex where match(soundex) against (CONCAT_WS(' ', soundex('test'), soundex('some')));
+----+--------------------------------+---------------------------+
| id | original                       | soundex                   |
+----+--------------------------------+---------------------------+
|  1 | add some test cases            | A300 S500 T230 C000       |
|  2 | this is some text              | T200 I200 S500 T230       |
|  3 | one more test case             | O500 M600 T230 C000       |
|  7 | some helpful cases to consider | S500 H414 C000 T000 C5236 |
+----+--------------------------------+---------------------------+
That gives pretty good results (within the limits of the soundex algorithm) while taking maximum advantage of an index (any query LIKE '%foo' has to scan every row in the table).
Note the importance of running soundex on each word, not on the entire phrase. You could also run your own version of soundex on each word rather than having SQL do it, but in that case make sure you do it both when storing and retrieving, in case there are differences between the algorithms (for instance, MySQL's algorithm doesn't limit itself to the standard 4 characters).
If you are looking for a simple existing solution instead of creating your own, check out:
Sphider.eu
PHPDig
There is a SOUNDEX() function in MySQL. If you want to search for a movie title:
select * from movie where soundex(title) = soundex('the title');
Of course, it doesn't work for searching inside longer text, such as a movie description or plot summary.
Soundex is a relatively simple algorithm. You can also decide to handle all of that at the application level; it may be easier:
when text is stored, tokenize it and apply soundex to all words
store the original text and the soundex version in two columns
when you search, compute the soundex at the application level, then use a regular LIKE at the db level (see the sketch below)
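At the database level, the lookup then becomes a plain pattern match on the soundex column; a sketch with hypothetical names (a movies table, a soundex_title column, and 'T234' standing in for the app-computed soundex of the search word):

SELECT original
FROM movies
WHERE soundex_title LIKE '%T234%';  -- 'T234' computed by the application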
Soundex has limitations when dealing with fuzzy search. A better function is edit distance, which can be integrated into MySQL as a UDF. Check http://flamingo.ics.uci.edu/toolkit/ for a C++ implementation for MySQL on Linux.