I work a lot with languages that use accented characters, e.g. é. I store content in tables with the "utf8_bin" collation AND I convert accented characters to HTML entities too.
So, for example, "Términator" would be stored as "T & eacute ; rminator" (I had to add spaces to stop it rendering online) in the database.
When a user searches for "términator" a match is found because the query is also converted to HTML entities and the SQL query "lowercases" both sides of the argument with "lcase".
The problem I am having now is that the client wants a search for "Terminator" (no accent on the "e") to return results matching "Términator".
I would prefer not to change the way I store my data, particularly because storing HTML entities solves a number of other problems. So I'm asking in case there's a simpler solution. Thanks!
You should use the correct collation in your query, in your case utf8_unicode_ci. (This works without the HTML entities.)
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html
The collation you use determines how characters are compared, and therefore which results you get back from your database.
SELECT * FROM some_table WHERE title LIKE "Terminator" COLLATE utf8_unicode_ci
This query will return records with the title términator, Terminator, etc. Note that it does a case-insensitive comparison (the _ci part of the collation name).
The utf8_unicode_ci is a bit slower but that's really minimal and you probably wouldn't even notice the difference.
There are more collations that may fit your needs, but I'm not sure there is one that can be used for HTML entities. You could add your own collation to MySQL to create the HTML entity support yourself, something like utf8_htmlentities_ci. https://dev.mysql.com/doc/refman/5.7/en/adding-collation.html
Here is a nice example with phone numbers: https://dev.mysql.com/doc/refman/5.7/en/ldml-collation-example.html
Related
I'm running into a complicated situation here, and I'm hoping for a push in the right direction.
I need to allow Basic Latin searches to bring back results with diacritics. This is further complicated by the fact that the data is stored with HTML instead of pure ASCII. I have been making some progress, but have come across two problems.
First: I'm able to do a partial conversion of the data into something marginally useful, using something like this:
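$string = 'V&eacute;ra';  // entity-encoded, as stored in the database
$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');  // "Véra"
setlocale(LC_ALL, 'en_US.UTF8');  // iconv transliteration depends on the locale
$translit = iconv('UTF-8', 'ASCII//TRANSLIT', $converted);
echo $translit;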
This brings back this result: V'era. This is a start, but what I really need is Vera. I can do a preg_replace on the resulting string, but is there a way of just bringing it back without the apostrophe? This is only one example; there are a lot more diacritics in the database (e.g. ñ and more). I feel like this has been addressed before (e.g. iconv returns strange results), but there don't appear to be any solutions listed.
Bigger problem: I need to take a string such as Vera and bring back results with Véra as well as results with Vera. However, I believe I need to get problem 1 solved first before I can get to this point.
I'm thinking something like if ($translit) { return $string} but I'm a bit unsure of how to handle this.
All help appreciated.
Edit: I'm thinking this might be easier to do directly in the database, however I'm running into issues with DQL. I know that there are ways of doing it in SQL with a stored procedure, but with limited access to the database, I'm open to any suggestions for dealing with this in Doctrine.
Okay, so maybe I'm making this too difficult
All I need is a way of finding entries that have been HTML-encoded in the database without having to search with either the specific encoding or the diacritic itself. If I search for Jose, it should bring up anything in the database labeled as José.
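For what it's worth, the stray apostrophe comes from iconv's transliteration table, so one way around it is to skip iconv entirely: decode the entities, decompose the text with Unicode normalization, and drop the combining marks. The sketch below is in Python rather than PHP, and the helper name fold_to_ascii is made up for illustration. It only handles letters that decompose into a base letter plus a mark, so characters such as ß or ø pass through unchanged.

```python
import html
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Decode HTML entities, then drop the combining marks left by NFD."""
    decoded = html.unescape(text)                        # "V&eacute;ra" -> "Véra"
    decomposed = unicodedata.normalize("NFD", decoded)   # "é" -> "e" + U+0301
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_to_ascii("V&eacute;ra"))
print(fold_to_ascii("Jos&eacute; Mu&ntilde;oz"))
```

Applying the same folding to both the stored value and the search term would let Jose match José, at the cost of doing the comparison outside the database or on a precomputed column.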
Preface: It's not quite clear whether the data to search is already in the database or whether you're just taking advantage of the fact that the database has logic for character comparisons. I'm going to assume that the data source is the DB.
The fact that you're trying to search html raises the question of whether you really want to search HTML or in fact want to search the user-visible text in HTML and strip html tags (What if there is a diacritic in a tag attribute? What if a word is broken with an empty <span>? Should it match? What if it was broken with a <br>?)
MySQL has the notion of both character sets (how characters are encoded) and collations (how characters are compared)
Relevant Documentation:
https://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
Assuming your mysql client/terminal is correctly set for UTF8 encoding, then the following demonstrates the effect of overriding the collation (using ß as particularly interesting example)
> SET NAMES 'utf8';
> SELECT
'ß',
'ss',
'ß' = 'ss' COLLATE utf8_unicode_ci AS ss_unicode,
'ß' = 'ss' COLLATE utf8_general_ci AS ss_general,
'ß' = 's' COLLATE utf8_general_ci AS s_general;
+----+----+------------+------------+-----------+
| ß  | ss | ss_unicode | ss_general | s_general |
+----+----+------------+------------+-----------+
| ß  | ss |          1 |          0 |         1 |
+----+----+------------+------------+-----------+
1 row in set (0.00 sec)
Note: general is the faster but not-strictly-correct version of the unicode collation, but even that is wrong if you speak Turkish (see: dotted uppercase i)
I would save decoded html in the database and search on this making sure that the collation is set correctly.
Confirm that the table/column collation is correct using SHOW CREATE TABLE xxx. Change it manually (ALTER TABLE ...), or use doctrine annotations as per this answer & use doctrine migrations to update (and confirm afterwards with SHOW CREATE TABLE that your version of doctrine respects collation)
Confirm that doctrine is configured to use utf8 encoding.
If you just need to override the collation for one particular query (eg you don't have permission to change the DB structure or it will break other code):
If you need to map to a doctrine ORM object, use NativeQuery and add COLLATE overrides as per the example above.
If you just want the record ID & field then you can use a direct query bypassing the ORM with a COLLATE override
You can use the REGEXP_REPLACE function to fold the HTML entities back to plain letters in the database at query time. MySQL before 8.0 has no built-in REGEXP_REPLACE function, but you can use a user-defined function library, or switch to MariaDB. MariaDB is based on MySQL, so migrating the data should be easy.
Then in MariaDB you can use queries like:
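-- Replace each named entity with the first letter of its name (&eacute; -> e)
SELECT * FROM `test` WHERE 'jose' = REGEXP_REPLACE(name, '&([A-Za-z])[A-Za-z]*;', '\\1');
-- Another variant with a bound parameter supplied from PHP
SELECT `table`.name FROM `table` WHERE ? = REGEXP_REPLACE(name, '&([A-Za-z])[A-Za-z]*;', '\\1');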
Even phpMyAdmin supports MariaDB. I tested my query on the demo page and it worked pretty well.
Or if you want to stay on MySQL, add these UDFs:
https://github.com/mysqludf/lib_mysqludf_preg
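To see what a regex-based fold does to entity-encoded names before running it against real data, here is a small Python sketch. The pattern keeps the first letter of each named entity, which happens to be the base letter for names like &eacute; or &ntilde;. The function name fold_entities is an assumption for illustration, and the second call shows the main limitation: entities that are not accented letters get mangled.

```python
import re

# Keep the first letter of the entity name: "&eacute;" -> "e", "&Ntilde;" -> "N".
ENTITY = re.compile(r"&([A-Za-z])[A-Za-z]*;")


def fold_entities(text: str) -> str:
    return ENTITY.sub(r"\1", text)


print(fold_entities("Jos&eacute;"))
print(fold_entities("Fish &amp; chips"))  # "&amp;" is folded to "a", which is wrong
```

If the data contains entities such as &amp; or &quot;, a lookup table of the accented-letter entities is probably safer than a blanket pattern.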
I need to find rows in a database by their title. The problem is with Czech diacritics.
In phpMyAdmin, the title column shows this:
Černé kouřovody 2MM
In PHP, where I run the search query, the same string looks like this:
ÄŒerné kouÅ™ovody 2MM
And I don't know how to find the rows with the title (Černé kouřovody 2MM) when searching with the string (ÄŒerné kouÅ™ovody 2MM).
Here is my query on database:
SELECT * FROM categories WHERE LOWER(title) LIKE LOWER("Černé kouřovody 2MM") COLLATE utf8_bin
Thank you very much for your help
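The garbled string looks like UTF-8 bytes that were read as Windows-1252 somewhere between PHP and the database, which usually points at a connection or page encoding mismatch rather than at the query itself. That is an assumption based on the byte patterns (Č is C4 8C in UTF-8, which Windows-1252 shows as ÄŒ). A minimal Python sketch that reproduces and reverses the effect:

```python
original = "Černé kouřovody 2MM"

# UTF-8 bytes misread as Windows-1252 produce the garbled form.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the mistake recovers the text, as long as no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If this matches what you see, the real fix is probably to make the connection character set and the script encoding agree on UTF-8, so the search string never gets garbled in the first place.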
There are multiple ways to solve this. Personally, I use base64 encoding: I encode the value before inserting it into the DB and decode it when reading it back. You can avoid that by setting an appropriate collation in your database, for example a Czech collation.
If you use base64, you can use any character without worrying about collation, but it results in some data overhead. I'd use it.
Those 2 should be ok, cause I gotta run now :P
I am saving emojis with the collation utf8mb4_general_ci. Storing and retrieving work fine, but when I try to search with a string containing emojis or special characters I get no results. The query always returns an empty set.
Can somebody help to solve this?
CODE
select * from table where Title LIKE '%Kanye West - \"Bound 2\" PARODY%'
UPDATE:
The search string are like
Kanye%20West%20-%20"Bound%202"%20PARODY
Stored in database like Kanye West - \"Bound 2\" PARODY
Family%20guy%20😎😔😁
Stored in database like Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01
Please accept my Apologies for not making it clear
The first string is What we sent from the url via HTTP POST
and the second is how the data is stored in my table.
The collation of the database table is utf8mb4_general_ci
You need to change your collation to utf8mb4_unicode_520_ci to make emojis searchable. Otherwise they are all treated as the same character.
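Separately from the collation, the two examples in the question are not in the same form: the request value is percent-encoded, while the stored value appears to contain JSON-style \uXXXX escapes instead of the emoji themselves. Assuming that is really what sits in the table, both sides would need decoding before any comparison can match. A rough Python sketch using the strings from the question:

```python
import json
from urllib.parse import unquote

# What arrives in the request (percent-encoded) and what sits in the table
# (assumed to be JSON-style \uXXXX escapes), using the example from the question.
from_request = "Family%20guy%20😎😔😁"
from_table = r"Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01"

query_text = unquote(from_request)                 # "%20" -> " "
stored_text = json.loads('"' + from_table + '"')   # surrogate escapes -> emoji

print(query_text == stored_text)
```

Storing the decoded text in the utf8mb4 column, rather than the escaped form, would let a plain LIKE see the same characters the user typed.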
I have made a dictionary of about 100k Punjabi words in Unicode. There is a letter ਸ਼, whose Unicode code point is U+0A36, and there are many similar letters such as ਖ਼ ਜ਼ ਗ਼ ਫ਼. In this language the dot you see under the letters can also be typed separately, although Unicode also has precomposed versions of these letters. In the db, the words are in the word column and the MD5 of each word is in word_hash. When I search the database from PHP with the statement SELECT * FROM db WHERE word_hash = md5('word');, no records are found for words containing these dotted letters. I found that the MD5 stored in the db and the MD5 generated by the search query are different. Why is that? I entered all the words through a textbox, and the stored MD5 was computed with MySQL's md5 function.
For example, the MD5 of the word ਸ਼ਰਬਤ is 45f756f02a28b5ec48ddf369db6ad7e6 when echoed by the MySQL query, but the value in the db is d6da1a44526c5ab1259dcc05404b1e8c
The two alternative encodings of ਸ਼ are the single code point U+0A36 and the sequence U+0A38 U+0A3C.
What you have here are the different Unicode normalization forms. There are combined characters, where a base character is combined with a diacritic or other character to form an alternate version, but sometimes this alternative version may also exist as a standalone character. E.g.:
ਸ਼ GURMUKHI LETTER SHA (U+0A36)
ਸ GURMUKHI LETTER SA (U+0A38)
਼ GURMUKHI SIGN NUKTA (U+0A3C)
ਸ + ਼ (U+0A38 + U+0A3C) equivalent to ਸ਼ U+0A36
(I'm not actually sure if the GURMUKHI SIGN NUKTA is the correct combining dot here, since I don't know Gurmukhi, but you get the idea.)
For storage and comparison, you should decide on one form or the other, since it's often impossible to predict which format the input will be in. You do this using the Unicode Normalization process, which converts between both forms. In PHP you do this with the Normalizer class.
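To make the effect on the hashes concrete, here is a minimal sketch in Python (PHP's Normalizer class plays the same role as unicodedata.normalize here). The helper names are made up for illustration. The point is only that the two spellings hash differently as raw bytes and identically once both are normalized to the same form.

```python
import hashlib
import unicodedata

precomposed = "\u0a36"       # GURMUKHI LETTER SHA
combined = "\u0a38\u0a3c"    # GURMUKHI LETTER SA + GURMUKHI SIGN NUKTA


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def normalized_md5(text: str) -> str:
    # Normalize first so both spellings hash to the same value.
    return md5_hex(unicodedata.normalize("NFC", text))


print(md5_hex(precomposed) == md5_hex(combined))
print(normalized_md5(precomposed) == normalized_md5(combined))
```

Whichever form you pick, it has to be applied both when the words are inserted and when the search term is hashed.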
i need to search with md5 because when i do it in a normalized form, it considers the letter with and without the dot same..
Your second problem is that you're inventing an overcomplicated solution to a simple problem: collations. The database uses collation rules for "fuzzy" matching, i.e. to treat "matinee" and "matinée" the same, or in your case "ਸ਼" and "ਸ". You set the default collation on the column, but you can influence it at query time as well:
SELECT ... WHERE foo = 'bar' COLLATE utf8_bin;
If you want absolute matches, use the utf8_bin collation or another equivalent _bin (binary) collation for your chosen encoding.
I am working on a simple search script that looks through two columns of a specific table. Essentially I'm looking for a match between either a company's number or their name. I'm using the LIKE statement in SQL because I am using InnoDB tables (which means no fulltext searches).
The problem is that I am working in a bilingual environment (French and English) and some of the characters in French have accents. I would like accented characters to be considered the same as their non-accented counterparts, in other words é = e, e = é, à = a, etc. SO has a lot of questions pertaining to the issue but none seem to be working for me.
Here is my SQL statement:
SELECT id, name FROM clients WHERE id LIKE '%éc%' OR name LIKE '%éc%';
I would like that to find "école" and "ecole" but it only finds "école".
I would also like to note that my tables are all utf8_general_ci.
Help me StackOverflow, you're my only hope! :)
I am going to offer up another answer for you.
I just read that utf8_general_ci is accent-insensitive so you should be OK.
One solution is to use
mysql_query("SET NAMES 'utf8'");
This tells the server which character set the client will use to send SQL statements.
Another solution seems to be to use MySQL's HEX() function to convert the accented chars into their Hex value. But I could not find any good examples of this working and after reading the MySQL docs for HEX() it looks like it probably will not work.
You should maybe consider converting the problem characters to their English counterparts, then storing them in a different column, perhaps called searchable or similar. You would of course need to update this whenever your main column was updated.
You would then have two columns, one containing the accented characters and one containing the plain English searchable content.
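Here is a rough sketch of that two-column idea, using Python and an in-memory SQLite table only so that it runs on its own. The searchable column, the strip_accents helper and the sample rows are assumptions for illustration, and the same approach should carry over to MySQL and PHP.

```python
import sqlite3
import unicodedata


def strip_accents(text: str) -> str:
    """Lowercase the text and drop combining marks (é -> e, à -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, searchable TEXT)")
for name in ("école", "ecole", "Montréal"):
    # Keep the original name for display and the folded copy for searching.
    conn.execute(
        "INSERT INTO clients (name, searchable) VALUES (?, ?)",
        (name, strip_accents(name)),
    )


def search(term: str) -> list:
    # Fold the search term the same way as the stored column.
    pattern = "%" + strip_accents(term) + "%"
    rows = conn.execute(
        "SELECT name FROM clients WHERE searchable LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


print(search("éc"))
print(search("ec"))
```

Both searches go through the folded column, so the accented and unaccented terms return the same rows.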