Accent insensitive search in InnoDB MySQL table!

I am working on a simple search script that looks through two columns of a specific table. Essentially I'm looking for a match between either a company's number or their name. I'm using the LIKE statement in SQL because I am using InnoDB tables (which means no fulltext searches).
The problem is that I am working in a bilingual environment (French and English) and some of the characters in French have accents. I would like accented characters to be considered the same as their non-accented counterparts, in other words é = e, e = é, à = a, etc. SO has a lot of questions pertaining to the issue but none seem to be working for me.
Here is my SQL statement:
SELECT id, name FROM clients WHERE id LIKE '%éc%' OR name LIKE '%éc%';
I would like that to find "école" and "ecole" but it only finds "école".
I would also like to note that my tables are all utf8_general_ci.
Help me StackOverflow, you're my only hope! :)

I am going to offer up another answer for you.
I just read that utf8_general_ci is accent-insensitive so you should be OK.
One solution is to use
mysql_query("SET NAMES 'utf8'");
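This tells the server which character set the client uses when it sends SQL statements. Note that the mysql_* functions were removed in PHP 7; with mysqli the equivalent call is mysqli_set_charset().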
Another solution seems to be to use MySQL's HEX() function to convert the accented chars into their Hex value. But I could not find any good examples of this working and after reading the MySQL docs for HEX() it looks like it probably will not work.

You should maybe consider converting the problem characters to their English counterparts and storing the result in a separate column, perhaps called searchable or similar. You would of course need to update this whenever your main column was updated.
You would then have two columns, one containing the accented characters and one containing the plain English searchable content.
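A minimal sketch of how the value for such a searchable column could be built in PHP. The function name to_searchable and the character map are assumptions for illustration; the map only covers common French letters and would need extending for other languages.

```php
<?php
// Hypothetical helper: returns an accent-free copy of $text, intended to be
// stored in a separate "searchable" column next to the original value.
function to_searchable(string $text): string
{
    // Illustrative map, limited to common French accented letters.
    $map = [
        'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'ç' => 'c',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i',
        'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'À' => 'A', 'Â' => 'A', 'Ç' => 'C',
        'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Î' => 'I', 'Ï' => 'I', 'Ô' => 'O',
        'Ù' => 'U', 'Û' => 'U', 'Ü' => 'U',
    ];

    // strtr() with an array replaces whole byte sequences, so it can be used
    // on UTF-8 strings without splitting multi-byte characters.
    return strtr($text, $map);
}
```

The same function would be applied to the search term before it is compared against the searchable column, so both sides are normalised the same way.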

Related

PHP / SQL: searching against html entities stored in database

I work a lot with languages that use accented characters, e.g. é. I store content in tables with the utf8_bin collation AND I convert accented characters to HTML entities too.
So, for example, "Términator" would be stored as "T & eacute ; rminator" in the database (I had to add spaces to that to stop it rendering online).
When a user searches for "términator" a match is found because the query is also converted to HTML entities and the SQL query "lowercases" both sides of the argument with "lcase".
The problem I am having now, is that the client wants to be able to search for "Terminator" (no accent on the "e") to get results matching "Términator".
I would prefer not to change the way I store my data, particularly because storing HTML entities solves a number of other problems. So I'm asking in case there's a simpler solution. Thanks!

You should use the correct collation in your query; in your case that is utf8_unicode_ci (this is without the HTML entities).
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html
The collation you use determines how characters are compared, and therefore which results you get back from your database.
SELECT * FROM some_table WHERE title LIKE "Terminator" COLLATE utf8_unicode_ci
This query will return records with the title términator, Terminator etc, note that it does a case insensitive comparison (the _ci part in the collation).
The utf8_unicode_ci is a bit slower but that's really minimal and you probably wouldn't even notice the difference.
There are more collations that can fit your needs, not sure if there is one which can be used for html entities. You could add your own collation to mysql database to create the htmlentities support yourself something like utf8_htmlentities_ci. https://dev.mysql.com/doc/refman/5.7/en/adding-collation.html
Here is a nice example with phone numbers: https://dev.mysql.com/doc/refman/5.7/en/ldml-collation-example.html

PHP search with Basic Latin, but return results with diacritics

I'm running into a complicated situation here, and I'm hoping for a push in the right direction.
I need to allow Basic Latin searches to bring back results with diacritics. This is further complicated by the fact that the data is stored with HTML instead of pure ASCII. I have been making some progress, but have come across two problems.
First: I'm able to do a partial conversion of the data into something marginally useful, using something like this:
$string = 'Véra';
$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
setlocale(LC_ALL, 'en_US.UTF8');
$translit = iconv('UTF-8', 'ASCII//TRANSLIT', $converted);
echo $translit;
This brings back this result: V'era. This is a start, but what I really need is Vera. I can do a preg_replace on the resulting string, but is there a way of just bringing it back without the apostrophe? This is only one example; there are a lot more diacritics in the database (e.g. ñ and more). I feel like this has been addressed before (e.g. iconv returns strange results), but there don't appear to be any solutions listed.
Bigger Problem: I need to take a string such as Vera and be able to bring back results with Véra as well as results with Vera. However, I believe I need to get problem 1 solved first before I can get to this point.
I'm thinking something like if ($translit) { return $string} but I'm a bit unsure of how to handle this.
All help appreciated.
Edit: I'm thinking this might be done more easily directly in the database, however I'm running into issues with DQL. I know that there are ways of doing it in SQL with a stored procedure, but with limited access to the database, I'm open to any suggestions for dealing with this in Doctrine.
Okay, so maybe I'm making this too difficult.
All I need is a way of finding entries that have been HTML encoded in the database without having to search with either the specific encoding or the diacritic itself. If I search for Jose, it should bring up anything in the database labeled as José.

Preface: It's not quite clear whether the data to search is already in the database or whether you're just taking advantage of the fact that the database has logic for character comparisons. I'm going to assume that the data source is the DB.
The fact that you're trying to search html raises the question of whether you really want to search HTML or in fact want to search the user-visible text in HTML and strip html tags (What if there is a diacritic in a tag attribute? What if a word is broken with an empty <span>? Should it match? What if it was broken with a <br>?)
MySQL has the notion of both character sets (how characters are encoded) and collations (how characters are compared).
Relevant Documentation:
https://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
Assuming your mysql client/terminal is correctly set for UTF8 encoding, the following demonstrates the effect of overriding the collation (using ß as a particularly interesting example):
> SET NAMES 'utf8';
> SELECT
'ß',
'ss',
'ß' = 'ss' COLLATE utf8_unicode_ci AS ss_unicode,
'ß' = 'ss' COLLATE utf8_general_ci AS ss_general,
'ß' = 's' COLLATE utf8_general_ci AS s_general;
+----+----+------------+------------+-----------+
| ß  | ss | ss_unicode | ss_general | s_general |
+----+----+------------+------------+-----------+
| ß  | ss |          1 |          0 |         1 |
+----+----+------------+------------+-----------+
1 row in set (0.00 sec)
Note: general is the faster but not-strictly-correct version of the unicode collation, and even the unicode collation is wrong if you speak Turkish (see: dotted uppercase i).
I would save decoded html in the database and search on this making sure that the collation is set correctly.
Confirm that the table/column collation is correct using SHOW CREATE TABLE xxx. Change it manually (ALTER TABLE ...), or use doctrine annotations as per this answer & use doctrine migrations to update (and confirm afterwards with SHOW CREATE TABLE that your version of doctrine respects collation)
Confirm that doctrine is configured to use utf8 encoding.
If you just need to override the collation for one particular query (eg you don't have permission to change the DB structure or it will break other code):
If you need to map to a doctrine ORM object, use NativeQuery and add COLLATE overrides as per the example above.
If you just want the record ID & field then you can use a direct query bypassing the ORM with a COLLATE override

You can use the REGEXP_REPLACE function to strip the entity-encoded diacritics in the database at query time. MySQL before 8.0 has no built-in REGEXP_REPLACE function, but you can use a user-defined function library or switch to MariaDB. MariaDB is based on MySQL, so migrating the data should be easy.
Then in MariaDB you can use queries like:
SELECT * FROM `test` WHERE 'jose' = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')
// another variant with PHP variable
SELECT `table`.name FROM `table` WHERE $search = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')
Even phpMyAdmin supports MariaDB. I tested my query on the demo page and it worked pretty well.
Or if you want to stay on MySql, add this UDFs:
https://github.com/mysqludf/lib_mysqludf_preg
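If changing the database is not an option, the same idea can be sketched on the PHP side. One thing to watch: a pattern that removes the whole entity turns Jos&eacute; into Jos rather than Jose, so the sketch below keeps the first letter of the entity name instead. The function name entities_to_ascii and the list of accent suffixes are assumptions for illustration, not a complete mapping (for example, &szlig; is not handled).

```php
<?php
// Hypothetical helper: replaces accent entities such as &eacute; or &ntilde;
// with their base letter, leaving other entities (e.g. &amp;) untouched.
function entities_to_ascii(string $html): string
{
    // Capture the base letter, then match one of the common accent names.
    $pattern = '/&([A-Za-z])(?:acute|grave|circ|uml|tilde|cedil|ring|slash|caron);/';

    // '$1' puts the captured base letter back in place of the whole entity.
    return preg_replace($pattern, '$1', $html);
}
```

With this, entities_to_ascii('Jos&eacute;') gives Jose, which could be compared with the user's search term or stored in a separate plain-text column.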

How to look for words before comma (,) in a string

I'm currently working on a language project with someone else, and I'm using a database for words in a language, but we also have translations and the best way to do it is by including the words in one column. So right now we have (language) (English) (German) (Dutch). The problem is that some words can be translated by multiple words, so for English, you get a translation like:
good, healthy
I want to avoid adding a new (English2) column for every additional translation of a word in one language, and instead keep them all in one field. But how can I make sure that when people search for something, the code can distinguish between the words before and after the comma? If you look for 'healthy', you should find the main word, and not only when you type 'good, healthy', which no one will do. I have some knowledge of PHP, but working with strings is quite difficult for me and I still don't get how to do this.

I don't know where you are storing this data, but I will assume a MySQL database or something you have similar control over.
You really should simply use two tables, one to store the words and an ID associated with them and another table to store the translations for those words.
words
ID INT
Word VARCHAR(32)
translations
ID INT
Lang ENUM('ENGLISH', 'GERMAN', 'FRENCH')
Translation VARCHAR(32)
In PHP you would make a query like this:
SELECT `Word`, `Translation`
FROM `words`
LEFT JOIN `translations` ON (`translations`.`ID` = `words`.`ID` AND `Lang`='FRENCH')
WHERE `Word` = 'Funky'
This query would return the word and a translation if available, or NULL if no translation was available.

Wouldn't it be nice to have a DB structure like this:
WORD    ENGLISH    GERMAN    DUTCH
-----------------------------------------------------------------------------
well    good       <any German/Dutch translation, or can be null>
well    healthy    <any German/Dutch translation, or can be null>
Then when you want to query the translations for well, you could just run
SELECT word, GROUP_CONCAT(english) as "English Translations" FROM myTable WHERE word='well' GROUP BY word
The result would be:
word    English Translations
----------------------------------------------
well    good,healthy

I have found the solution! Thanks to Aziz I found the function FIND_IN_SET in the question he referred to, and I solved the problem by letting MySQL look for values separated by commas.
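One caveat, as far as I know: FIND_IN_SET compares each list element exactly, so with a value like 'good, healthy' the space after the comma becomes part of the second element. A rough PHP-side sketch of the same lookup that trims each part first; the function name has_translation is made up for illustration.

```php
<?php
// Hypothetical helper: checks whether $needle is one of the comma-separated
// translations stored in $cell, ignoring surrounding spaces and ASCII case.
function has_translation(string $cell, string $needle): bool
{
    // Split on commas and trim the whitespace around every part.
    $parts = array_map('trim', explode(',', $cell));

    foreach ($parts as $part) {
        // strcasecmp() returns 0 when both strings match case-insensitively.
        if (strcasecmp($part, trim($needle)) === 0) {
            return true;
        }
    }
    return false;
}
```

For example, has_translation('good, healthy', 'healthy') is true, while a partial word such as 'heal' does not match.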

How to search for special characters, unicode characters and Emoji's in a mysql Database

I am saving emoji with the collation utf8mb4_general_ci. Storing and retrieving work fine, but when I try to search with a string containing emoji or special characters I get no result. The query always returns empty.
Can somebody help to solve this?
CODE
select * from table where Title LIKE '%Kanye West - \"Bound 2\" PARODY%'
UPDATE:
The search string are like
Kanye%20West%20-%20"Bound%202"%20PARODY
Stored in database like Kanye West - \"Bound 2\" PARODY
Family%20guy%20😎😔😁
Stored in database like Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01
Please accept my Apologies for not making it clear
The first string is what we send from the URL via HTTP POST
and the second is how the data is stored in my table.
The charset of the database table is
utf8mb4_general_ci

You need to change the collation (not the character set, which stays utf8mb4) to utf8mb4_unicode_520_ci to make emoji searchable. Otherwise they are all treated as the same character.
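Separately from the collation, the examples in the question suggest that the rows are stored in a JSON-escaped form (\" for quotes, \ud83d\ude0e for emoji) while the search term arrives URL-encoded. If that assumption holds, the term has to be converted to the stored form before searching. A rough sketch; the function name to_stored_form is hypothetical.

```php
<?php
// Hypothetical helper: converts a URL-encoded search term into the
// JSON-escaped form that the stored rows appear to use.
function to_stored_form(string $urlEncoded): string
{
    // %20 becomes a space; emoji stay as raw UTF-8 bytes.
    $text = rawurldecode($urlEncoded);

    // By default json_encode() escapes " as \" and non-ASCII characters as
    // \uXXXX sequences (surrogate pairs for emoji). It also escapes / as \/.
    $json = json_encode($text);
    if ($json === false) {
        throw new InvalidArgumentException('Search term is not valid UTF-8');
    }

    // Drop the surrounding double quotes that json_encode() adds.
    return substr($json, 1, -1);
}
```

The result should be passed to the query as a bound parameter. Because LIKE treats the backslash as an escape character, the backslashes in the term would also need to be doubled when it is used inside a LIKE pattern.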

how can i match two strings even if they are 1 character different?

I have a large database of sentences, and a problem where sentences like "i'm good" do not match "im good" and vice versa, or "is that mine?" does not match "is that mine" and vice versa, when I would want them to be detected as a match.
I have made complicated and messy functions trying to do this with wildcards, but it's just a big mess, and I'm sure there must be a way to search with this one-character leeway. If I can, I would like to control which characters get this leeway; in my examples the main culprits are the question mark and the apostrophe (? ').
I'm currently using a plain select query with PHP and MySQL to do the matching queries.
i would love some help to figure this out so i can clean up the big mess of code that is currently doing the job inconsistently.
in case anyone wants to see the code query checking for matches is like this:
$checkqwry = "select * from `eng-jap` where (eng = '$eng' or english = '$oldeng' or english = '$oldeng2') and (jap = '$jap' or japanese = '$oldjap' or japanese = '$oldjap2');";
The purpose of the query is just to check whether there is already a translation with the $eng and $jap in the DB. The reason you see $oldeng, $oldeng2, $oldeng3 and so on is, like I said, my messy attempts to match whether or not there is a question mark and so on: some of the $oldeng variables have question marks or apostrophes and the others don't. There is more code above that appends and removes question marks and so on. Yes, it's a big mess.

You want to use a String Metric algorithm as mentioned above, PHP has this function built in http://php.net/manual/en/function.levenshtein.php as well as http://www.php.net/manual/en/function.similar-text.php.
MySQL doesn't implement this (specific algorithm) natively, but some people have gone ahead and written stored procedures to accomplish the same: http://www.artfulsoftware.com/infotree/queries.php#552
In my opinion, using a String Metric that can handle arbitrary changes is better than stripping out punctuation, and it can also catch omissions, transpositions, etc.
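A minimal sketch of the PHP side using the built-in levenshtein() function. The helper name is_near_match and the default threshold of 1 are assumptions. Note that levenshtein() works on bytes, so it suits the English column better than the Japanese one.

```php
<?php
// Hypothetical helper: true when the two sentences differ by at most
// $maxDistance single-byte edits (insert, delete or replace).
function is_near_match(string $a, string $b, int $maxDistance = 1): bool
{
    $distance = levenshtein(strtolower($a), strtolower($b));

    // Before PHP 8.0, levenshtein() returned -1 for arguments longer than
    // 255 characters, so a negative value is treated as "no match".
    return $distance >= 0 && $distance <= $maxDistance;
}
```

With a threshold of 1, "i'm good" matches "im good" and "is that mine?" matches "is that mine", but so would any other single-character difference, so the candidate rows still have to be fetched and compared in PHP.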

Probably better to simply strip non-alphanumeric characters out before comparing the strings.

You can use the replace function in sql to replace "'" with "" and "?" with "".
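The same stripping can be sketched in PHP, so that both the stored sentence and the incoming sentence are normalised before they are compared. The function name normalize_sentence is made up, and the list of ignored characters is only the two from the question.

```php
<?php
// Hypothetical helper: removes the characters that should not affect
// matching, collapses whitespace and lowercases ASCII letters.
function normalize_sentence(string $sentence): string
{
    // Drop apostrophes and question marks, the two culprits named above.
    $sentence = str_replace(["'", '?'], '', $sentence);

    // Collapse runs of whitespace into a single space.
    $sentence = preg_replace('/\s+/', ' ', $sentence);

    return strtolower(trim($sentence));
}
```

Storing the normalised form in its own column would let the existing equality query stay simple, with a single comparison instead of several $oldeng variants.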

You might want to look at natural language full text searches in MySQL. Add a FULLTEXT index to the eng column.
ALTER TABLE `eng-jap` ADD FULLTEXT INDEX `full` (`eng`) ;
Then, use match function:
select * from `eng-jap` where match(eng) against ('Im happy');
This will return both I'm happy and Im happy
If you select the relevance score like:
select id, match(eng) against ('Im happy') from `eng-jap` where match(eng) against ('Im happy');
you can use it to further process the matches in PHP and filter.
[EDIT]: Just verified that the relevance score for yesterday and yesterday? are the same too:
select *, match(eng) against ('yesterday') as mc from `eng-jap`
Result is:
6, yesterday?, 0.9058732390403748
7, yesterday, 0.9058732390403748
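Note: On MySQL versions before 5.6, a FULLTEXT index requires the MyISAM engine; InnoDB supports FULLTEXT indexes from MySQL 5.6 onward. Also, words shorter than the minimum word length (ft_min_word_len, 4 by default for MyISAM) are not indexed, which is why the index doesn't match a word like 'yes'.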
