MySQL database collation with Czech characters - PHP

I need to find rows in a database by their title. The problem is with Czech diacritics:
in phpMyAdmin, the title of the row is shown like this:
Černé kouřovody 2MM
In PHP, where I run the search query, the same string is written like this:
ÄŒerné kouÅ™ovody 2MM
I don't know how to find rows with this title (Černé kouřovody 2MM) by searching for this one (ÄŒerné kouÅ™ovody 2MM).
Here is my query on database:
SELECT * FROM categories WHERE LOWER(title) LIKE LOWER("Černé kouřovody 2MM") COLLATE utf8_bin
Thank you very much for your help

There are multiple ways to solve this. Personally, I use base64 encoding: I encode the value before inserting it into the DB and decode it after reading it back. You can avoid that by setting an appropriate collation in your database, for example a Czech collation.
If you use base64, you can store any character without worrying about collation, but it adds some data overhead (roughly a third more bytes). I'd use it.
Those two should be OK; I've got to run now :P
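For what it's worth, the garbled title in the question looks like the usual result of UTF-8 bytes being read as a single-byte Windows-1252/Latin-1 string, which would point at the connection character set rather than at the collation. Here is a minimal Python sketch of that assumption. The helper names garble and repair are made up for illustration, and the same round trip can be done in PHP.

```python
def garble(text: str) -> str:
    """Simulate a client that reads UTF-8 bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo that misreading; raises an error if the text was not garbled this way."""
    return text.encode("cp1252").decode("utf-8")


title = "Černé kouřovody 2MM"
garbled = garble(title)
print(garbled)                   # the mojibake form of the title
print(repair(garbled) == title)  # the round trip restores the original
```

If this is indeed the cause, repairing strings after the fact is only a workaround. The more durable fix would be to make the PHP connection talk UTF-8 (for example with SET NAMES 'utf8') so the two sides agree on the encoding.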

Related

PHP / SQL: searching against html entities stored in database

I work a lot with languages that use accented characters, e.g. é. I store content in "utf8_bin" tables and I convert accented characters to HTML entities too.
So, for example, "Términator" would be stored as "T & eacute ; rminator" in the database (I had to add spaces there to stop it rendering online).
When a user searches for "términator" a match is found because the query is also converted to HTML entities and the SQL query "lowercases" both sides of the argument with "lcase".
The problem I am having now, is that the client wants to be able to search for "Terminator" (no accent on the "e") to get results matching "Términator".
I would prefer not to change the way I store my data, particularly because storing HTML entities solves a number of other problems. So I'm asking in case there's a simpler solution. Thanks!
You should use the correct collation in your query; in your case that is utf8_unicode_ci. (This is without the HTML entities.)
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html
The collation you use determines how characters are compared, and therefore which results you get back from your database.
SELECT * FROM some_table WHERE title LIKE "Terminator" COLLATE utf8_unicode_ci
This query will return records with the title términator, Terminator etc, note that it does a case insensitive comparison (the _ci part in the collation).
utf8_unicode_ci is a bit slower than utf8_general_ci, but the difference is minimal and you probably wouldn't even notice it.
There are more collations that might fit your needs, though I'm not sure there is one that handles HTML entities. You could add your own collation to MySQL to support them yourself, something like utf8_htmlentities_ci: https://dev.mysql.com/doc/refman/5.7/en/adding-collation.html
Here is a nice example with phone numbers: https://dev.mysql.com/doc/refman/5.7/en/ldml-collation-example.html
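To make the idea of "a collation decides which strings compare equal" concrete, here is a small Python sketch that registers a custom collation on an in-memory SQLite database. This is a stand-in, not MySQL: SQLite, the collation name htmlentities_ci, and the fold helper are all assumptions chosen for illustration, loosely mirroring the utf8_htmlentities_ci idea above.

```python
import html
import sqlite3
import unicodedata


def fold(text: str) -> str:
    """Comparison key: decode HTML entities, drop accents, ignore case."""
    decoded = html.unescape(text)
    decomposed = unicodedata.normalize("NFD", decoded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def htmlentities_ci(a: str, b: str) -> int:
    """Hypothetical collation: returns -1, 0 or 1 like any SQLite collation."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("htmlentities_ci", htmlentities_ci)
conn.execute("CREATE TABLE some_table (title TEXT)")
conn.executemany(
    "INSERT INTO some_table (title) VALUES (?)",
    [("T&eacute;rminator",), ("Términator",), ("Terminator",), ("Predator",)],
)

rows = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM some_table "
        "WHERE title = ? COLLATE htmlentities_ci ORDER BY rowid",
        ("terminator",),
    )
]
print(rows)  # the three Terminator spellings, but not Predator
```

In MySQL the comparison rules live in the server's collation definition rather than in application code, but the effect on the query is the same: the COLLATE clause picks the rule set used for the equality test.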

PHP search with Basic Latin, but return results with diacritics

I'm running into a complicated situation here, and I'm hoping for a push in the right direction.
I need to allow Basic Latin searches to bring back results with diacritics. This is further complicated by the fact that the data is stored with HTML instead of pure ASCII. I have been making some progress, but have come across two problems.
First: I'm able to do a partial conversion of the data into something marginally useful, using something like this:
$string = 'V&eacute;ra';
$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
setlocale(LC_ALL, 'en_US.UTF8');
$translit = iconv('UTF-8', 'ASCII//TRANSLIT', $converted);
echo $translit;
This brings back V'era. That is a start, but what I really need is Vera. I could run preg_replace on the resulting string, but is there a way to get it back without the apostrophe? This is only one example; there are a lot more diacritics in the database (e.g. ñ and more). I feel like this has been addressed before (e.g. iconv returns strange results), but there don't appear to be any solutions listed.
Bigger problem: I need to take a string such as Vera and bring back results with Véra as well as results with Vera. However, I believe I need to solve problem 1 before I can get to this point.
I'm thinking something like if ($translit) { return $string} but I'm a bit unsure of how to handle this.
All help appreciated.
Edit: I'm thinking this might be easier to do directly in the database, but I'm running into issues with DQL. I know there are ways of doing it in SQL with a stored procedure, but with limited access to the database, I'm open to any suggestions for dealing with this in Doctrine.
Okay, so maybe I'm making this too difficult
All I need is a way of finding entries that have been HTML-encoded in the database without having to search with either the specific encoding or the diacritic itself. If I search for Jose, it should bring up anything in the database labeled as José.
Preface: It's not quite clear whether the data to search is already in the database or whether you're just taking advantage of the fact that the database has logic for character comparisons. I'm going to assume that the data source is the DB.
The fact that you're trying to search html raises the question of whether you really want to search HTML or in fact want to search the user-visible text in HTML and strip html tags (What if there is a diacritic in a tag attribute? What if a word is broken with an empty <span>? Should it match? What if it was broken with a <br>?)
MySQL has the notion of both character sets (how characters are encoded) and collations (how characters are compared)
Relevant Documentation:
https://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
Assuming your mysql client/terminal is correctly set for UTF8 encoding, then the following demonstrates the effect of overriding the collation (using ß as particularly interesting example)
> SET NAMES 'utf8';
> SELECT
'ß',
'ss',
'ß' = 'ss' COLLATE utf8_unicode_ci AS ss_unicode,
'ß' = 'ss' COLLATE utf8_general_ci AS ss_general,
'ß' = 's' COLLATE utf8_general_ci AS s_general;
+----+----+------------+------------+-----------+
| ß  | ss | ss_unicode | ss_general | s_general |
+----+----+------------+------------+-----------+
| ß  | ss |          1 |          0 |         1 |
+----+----+------------+------------+-----------+
1 row in set (0.00 sec)
Note: general is the faster but not-strictly-correct version of the unicode collation -- but even that is wrong if you speak Turkish (see: dotted uppercase I).
I would save decoded html in the database and search on this making sure that the collation is set correctly.
Confirm that the table/column collation is correct using SHOW CREATE TABLE xxx. Change it manually (ALTER TABLE ...), or use Doctrine annotations as per this answer and Doctrine migrations to apply the change (and confirm afterwards with SHOW CREATE TABLE that your version of Doctrine respects the collation).
Confirm that Doctrine is configured to use utf8 encoding.
If you just need to override the collation for one particular query (e.g. you don't have permission to change the DB structure, or changing it would break other code):
If you need to map to a Doctrine ORM object, use NativeQuery and add COLLATE overrides as per the example above.
If you just want the record ID and field, you can use a direct query that bypasses the ORM, with a COLLATE override.
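The suggestion to store decoded, user-visible text can be sketched as follows. This is a Python illustration using only the standard library, not Doctrine or PHP code; the class VisibleText and the function visible_text are hypothetical names. It decodes entities, ignores tag attributes, joins words split by an empty span, and treats br as a space, which is one possible answer to the questions raised above.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects only the text a reader would see, with entities decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a line break separates words; other tags do not.
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(markup: str) -> str:
    parser = VisibleText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace so the result is easy to index and search.
    return " ".join("".join(parser.parts).split())


sample = '<p title="caf&eacute;">V&eacute;<span></span>ra<br>Nov&aacute;k</p>'
print(visible_text(sample))
```

The output of such a function is what would go into the column that carries the accent-insensitive collation, so the database compares plain text rather than markup.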
You can use the REGEXP_REPLACE function to strip the entities in the database at query time. MySQL before 8.0 has no built-in REGEXP_REPLACE function, but you can use a user-defined function library or switch to MariaDB. MariaDB is based on MySQL, so migrating the data should be easy.
Then in MariaDB you can use queries like:
-- replace each named entity with the first letter of its name (&eacute; becomes e)
SELECT * FROM `test` WHERE 'jose' = REGEXP_REPLACE(name, '&([A-Za-z])[A-Za-z]*;', '\\1')
-- another variant, with a bound parameter instead of a PHP variable pasted into the SQL
SELECT `table`.name FROM `table` WHERE ? = REGEXP_REPLACE(name, '&([A-Za-z])[A-Za-z]*;', '\\1')
Even phpMyAdmin supports MariaDB; I tried this approach on its demo page and it worked pretty well.
Or, if you want to stay on MySQL, add these UDFs:
https://github.com/mysqludf/lib_mysqludf_preg
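The effect of the replacement pattern is easy to check outside the database. The Python sketch below (function names are made up for illustration) compares two options: deleting every named entity, and keeping the first letter of the entity name. It assumes the stored data only contains accent-style entities such as &eacute; or &ntilde;, where that first letter is the base character.

```python
import re

# Keeps the first letter of the entity name: &eacute; -> e, &ntilde; -> n.
ENTITY_BASE_LETTER = re.compile(r"&([A-Za-z])[A-Za-z]*;")


def drop_entities(text: str) -> str:
    """Delete every named entity outright."""
    return re.sub(r"&[A-Za-z]*;", "", text)


def entity_to_base_letter(text: str) -> str:
    """Replace every named entity with the first letter of its name."""
    return ENTITY_BASE_LETTER.sub(r"\1", text)


print(drop_entities("Jos&eacute;"))          # the accented letter is lost entirely
print(entity_to_base_letter("Jos&eacute;"))  # the base letter survives
print(entity_to_base_letter("Tom &amp; Jerry"))  # caveat: &amp; turns into "a"
```

The last line shows the limit of the trick: entities that are not accented letters (&amp;, &nbsp;, &szlig; and so on) would need their own handling.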

How to search for special characters, Unicode characters and emojis in a MySQL database

I am saving emojis with the collation utf8mb4_general_ci. Storing and retrieving work fine, but when I search with a string containing emojis or special characters I get no results. The query always returns empty.
Can somebody help to solve this?
CODE
select * from table where Title LIKE '%Kanye West - \"Bound 2\" PARODY%'
UPDATE:
The search strings look like this:
Kanye%20West%20-%20"Bound%202"%20PARODY
stored in the database as Kanye West - \"Bound 2\" PARODY
Family%20guy%20😎😔😁
stored in the database as Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01
Please accept my apologies for not making this clear:
the first string is what we send from the URL via HTTP POST,
and the second is how the data is stored in my table.
The charset of the database table is
utf8mb4_general_ci
You need to change your collation to utf8mb4_unicode_520_ci to make emojis searchable. Otherwise they are all treated as the same character.
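Separately from the collation, the examples in the question suggest that the search string and the stored value are in two different escaped forms: URL percent-encoding on the way in, and what looks like JSON-style \uXXXX escapes in the table. If that reading is right (it is an assumption based only on the samples shown), both sides have to be brought to the same plain form before any comparison can match. A Python sketch with hypothetical helper names:

```python
import json
from urllib.parse import unquote


def from_url(value: str) -> str:
    """Decode percent-encoding such as %20 back to plain characters."""
    return unquote(value)


def from_json_escapes(value: str) -> str:
    """Decode JSON-style escapes (\\uXXXX pairs and backslash-quote) into real characters."""
    return json.loads('"' + value + '"')


incoming = "Family%20guy%20😎😔😁"
stored = r"Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01"

print(from_url(incoming))
print(from_url(incoming) == from_json_escapes(stored))
```

Storing the decoded text in a utf8mb4 column, rather than the escaped form, would let the LIKE query work on the characters themselves.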

Perform accent insensitive fulltext search MySQL

I'm currently developing search functionality for a website. Users search for other users by name. I'm having some trouble getting good results for users who have accents in their names.
I have a FULLTEXT index on the name column and the table's collation is utf8_general_ci.
Currently if somebody registers for the site, and has a name with accents (for example: Alberto Andrés), the name is stored in the DB as shown in the following image:
So if I run a query such as SELECT * MATCH(name) AGAINST('alberto andres'), I get lots of results with better match scores, like 'Alberto', 'Andres' and 'Andrés', and finally, with a low match score, the record the user is probably looking for: 'Alberto Andrés'.
What could I do to take into account the way accented records are currently stored in the DB?
Thanks!
It looks to me like the surname of el Señor Andrés is actually stored correctly. The rendering you showed us is the way some non-UTF apps mangle UTF8 text.
You might try this modification of your query if you don't yet have a whole bunch of records in your table. Fulltext (non-boolean) mode works weirdly on small data sets.
SELECT *
FROM TABLE
WHERE MATCH(name) AGAINST('alberto andres' IN BOOLEAN MODE)
You also might try
SELECT *
FROM TABLE
WHERE MATCH(name) AGAINST(CONVERT('alberto andres' USING utf8))
just to make sure your match string is in the same character set as your MySQL columns.

Accent insensitive search in InnoDB MySQL table!

I am working on a simple search script that looks through two columns of a specific table. Essentially I'm looking for a match between either a company's number or their name. I'm using the LIKE statement in SQL because I am using InnoDB tables (which means no fulltext searches).
The problem is that I am working in a bilingual environment (French and English) and some of the characters in French have accents. I would like accented characters to be considered the same as their non-accented counterparts, in other words é = e, e = é, à = a, etc. SO has a lot of questions pertaining to this issue, but none of the answers seem to work for me.
Here is my SQL statement:
SELECT id, name FROM clients WHERE id LIKE '%éc%' OR name LIKE '%éc%';
I would like that to find "école" and "ecole" but it only finds "école".
I would also like to note that my tables are all utf8_general_ci.
Help me StackOverflow, you're my only hope! :)
I am going to offer up another answer for you.
I just read that utf8_general_ci is accent-insensitive so you should be OK.
One solution is to use
mysql_query("SET NAMES 'utf8'");
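This tells the server which character set the client uses to send SQL statements (and expects results in). Note that the old mysql_* functions were removed in PHP 7; with mysqli the equivalent call is $mysqli->set_charset('utf8').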
Another solution seems to be to use MySQL's HEX() function to convert the accented chars into their Hex value. But I could not find any good examples of this working and after reading the MySQL docs for HEX() it looks like it probably will not work.
You should maybe consider converting the problem characters to their English counterparts and storing them in a separate column, perhaps called searchable or similar. You would of course need to update it whenever your main column was updated.
You would then have two columns, one containing the accented characters and one containing the plain English searchable content.
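A rough sketch of the two-column idea, written in Python against an in-memory SQLite database rather than PHP and MySQL (both are stand-ins chosen so the example is self-contained). The clients table and its name column come from the question; the searchable column follows the suggestion above, and to_searchable and search are hypothetical helpers.

```python
import sqlite3
import unicodedata


def to_searchable(text: str) -> str:
    """Plain-ASCII, lowercase key: accents are decomposed and then dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, searchable TEXT)"
)
for name in ["école", "ecole", "Hôtel"]:
    # Keep both columns in step whenever the main column is written.
    conn.execute(
        "INSERT INTO clients (name, searchable) VALUES (?, ?)",
        (name, to_searchable(name)),
    )


def search(term: str) -> list:
    # The search term goes through the same conversion as the stored key.
    pattern = "%" + to_searchable(term) + "%"
    cursor = conn.execute(
        "SELECT name FROM clients WHERE searchable LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in cursor]


print(search("éc"))
print(search("ec"))
```

Both searches return the accented and the unaccented spelling, because the comparison only ever sees the converted column. The cost is the extra storage and the need to keep the two columns in sync.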
