First, this is not the common UTF-8 problem. All parts of my application are set to UTF-8 and work fine.
I receive a mail over IMAP in PHP and fetch the title. The title contains a special character: an ö from the German language. Now I search my DB for an entry with this title, and I know there is one. The database uses utf8mb4_general_ci to be able to store 4-byte UTF-8 encoded special characters.
Title from Mail:
Fw: Auflösungsvertrag
Entry in Database:
Fw: Auflösungsvertrag
I put the cursor behind the ö and tried to delete it. First the ö switched to an o, and after the second press of the delete key it was fully gone. If I now type an ö on my keyboard, MySQL finds the entry.
If I put both into Notepad++ you see:
Fw: Auflösungsvertrag
FW: Auflösungsvertrag
If you switch the encoding to ASCII you get
Fw: AufloÌˆsungsvertrag
Fw: AuflÃ¶sungsvertrag
So you can see that the two ö are encoded differently, but both get displayed correctly. That is why my MySQL SELECT doesn't find the DB entry.
Can someone explain this to me and give me a hint for a PHP command to convert the first encoding into the second?
A bit longer description of how this problem occurs:
I am writing a ticketing system. Every mail I send out gets the ticket's ID added to the subject. When I send a mail, I write it to an outgoing table in the DB. A cron job then sends these mails out asynchronously. I use PHPMailer and send over SMTP.
I fetch incoming mails over IMAP with the PHP IMAP classes. If a mail comes in with a TID in the subject, I merge it into the ticket in the database. All ticket entries are grouped by the TID column.
Now, if you send a mail from the system to another mail address inside the same system, the incoming copy has to be merged into the existing ticket as well.
That's why, for every incoming mail, I look in the outgoing table by searching for the from address, the to address, and the title. If I find the mail, I know the system sent it out.
So when I send the mail out it has the first encoding, and when I get the same mail back in it has the other encoding. Both seem to be valid UTF-8. Everywhere on the website I get the right character, and the DB also displays it correctly. Only when I run an SQL query over PDO does MySQL treat them as two different characters.
Here's how I would solve this. In my opinion it has to be fixed once and for all with a one-shot instruction on the database side, not with a trick on the PHP side that you would have to repeat everywhere each time you face the issue.
First I copied your two strings:
Auflösungsvertrag
Auflösungsvertrag
into Notepad++, in which I have the (very handy) HEX plugin.
When I turn the text into hex, I get these values:
4175666c6fcc8873756e677376657274726167
4175666cc3b673756e677376657274726167
If we split that, we can easily see the hex of the two ö characters that cause the issue:
4175666c 6fcc88 73756e677376657274726167
4175666c c3b6 73756e677376657274726167
The trick now is to tell MySQL to replace every character sequence having the first hex value with the second one, i.e. 6fcc88 with c3b6.
You can do that with this statement, which uses the UNHEX() function:
UPDATE your_table
SET your_column=REPLACE(your_column, UNHEX('6fcc88'), UNHEX('c3b6'));
Example and reproduction below
Schema (MySQL v8.0)
/* Creating test data - Row 1 and 2 are identical */
create table test (id int, txt varchar(50), txthex varchar(100));
INSERT INTO test (id,txt,txthex) VALUES (1, 'Auflösungsvertrag', '4175666c6fcc8873756e677376657274726167');
INSERT INTO test (id,txt,txthex) VALUES (2, 'Auflösungsvertrag', '4175666c6fcc8873756e677376657274726167');
INSERT INTO test (id,txt,txthex) VALUES (3, 'Auflösungsvertrag','4175666cc3b673756e677376657274726167');
Applying the fix
/* Running oneshot fix on row 2 only */
UPDATE test
SET txt=REPLACE(txt, UNHEX('6fcc88'), UNHEX('c3b6'))
WHERE id=2;
Check Query
SELECT id, txt, txthex hex_original,
CAST(UNHEX(txthex) AS CHAR(30)) unexed_original ,
HEX(txt) hex_replaced
FROM test;
id | txt               | hex_original                           | unexed_original   | hex_replaced
---|-------------------|----------------------------------------|-------------------|---------------------------------------
1  | Auflösungsvertrag | 4175666c6fcc8873756e677376657274726167 | Auflösungsvertrag | 4175666C6FCC8873756E677376657274726167
2  | Auflösungsvertrag | 4175666c6fcc8873756e677376657274726167 | Auflösungsvertrag | 4175666CC3B673756E677376657274726167
3  | Auflösungsvertrag | 4175666cc3b673756e677376657274726167   | Auflösungsvertrag | 4175666CC3B673756E677376657274726167
I found the solution.
The topic is called Unicode equivalence, and there are normalization methods for it:
https://en.wikipedia.org/wiki/Unicode_equivalence
PHP also has a class for this:
https://www.php.net/manual/de/normalizer.normalize.php
I had to call
normalizer_normalize( $myString, Normalizer::NFKC );
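For the ticket lookup this means normalizing the decoded subject before querying. A minimal sketch, assuming a PDO connection $pdo and an outgoing table with from_addr, to_addr, and title columns (those names are made up; the intl extension must be installed):

// Bring the subject to NFKC so the decomposed form (6f cc 88) becomes
// the composed form (c3 b6) that is stored in the database.
$subject = normalizer_normalize($subject, Normalizer::NFKC);

$stmt = $pdo->prepare(
    'SELECT id FROM outgoing WHERE from_addr = ? AND to_addr = ? AND title = ?'
);
$stmt->execute([$from, $to, $subject]);
$sentByUs = (bool) $stmt->fetch();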
First time posting here.
I am facing a problem with differing behavior between my PROD server and my local environment.
Here is some background on the situation:
In my application (backend Laravel 7, frontend regular html/javascript) I need to search for entries in a particular table based on JSON data stored in one of the columns:
Table: flights
columns: id, date, passengers, ... pilot_id, second_pilot_id, flight_data, updated_at, created_at
There are flights that are directly linked to either a pilot or a second pilot via pilot_id or second_pilot_id. That is fine so far, because I can easily query them. However, there are also flight entries where no registered user makes the entry and the pilot is only represented by a name that is typed in. This works only if the name doesn't contain special characters, in particular the German umlauts (ö, ä, ü); it also fails for other specials like â, ß, é, è, etc. But ONLY ON PROD; locally everything works even with special characters.
flight_data has the data type "JSON" in my migration files.
$table->json('flight_data') ...
Now the problem:
On my local environment I can run the following and will get results returned:
... ->where(function($q) use ($r) {
$q->whereRaw("IF(payee = 2, JSON_CONTAINS(flight_data, '{\"second_pilotname\":\"$r\"}'), JSON_CONTAINS(flight_data, '{\"pilotname\":\"$r\"}'))");
})->...
This gets me my example results without issues, as expected. ($r is filled with a particular name of a pilot; in my example he is called "Jöhn Düe".)
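(A side note on the snippet: the name can also be passed as a binding instead of interpolating $r into the raw SQL; a sketch, untested against Laravel 7:)

->where(function ($q) use ($r) {
    // Build the JSON candidates in PHP and let the driver quote them.
    $first  = json_encode(['pilotname' => $r]);
    $second = json_encode(['second_pilotname' => $r]);
    $q->whereRaw(
        'IF(payee = 2, JSON_CONTAINS(flight_data, ?), JSON_CONTAINS(flight_data, ?))',
        [$second, $first]
    );
})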
If I run this on my PROD system I get no returns. I tracked it down to the JSON_CONTAINS() function preventing the results. I also tried playing around with "Joehn Duee", which is found correctly, so it basically comes down to the German umlauts (ö, ä, ü) not being handled correctly somehow.
I also tried some SQL statements in phpMyAdmin; these are the results:
LOCAL
select id, flight_data, comments, updated_at from logbook where JSON_CONTAINS(flight_data, '{"pilotname": "Juehn Duee"}')
1 result found
select id, flight_data, comments, updated_at from logbook where JSON_CONTAINS(flight_data, '{"pilotname": "Jühn Düe"}')
1 result found
PROD
select id, flight_data, comments, updated_at from logbook where JSON_CONTAINS(flight_data, '{"pilotname": "Juehn Duee"}')
1 result found
select id, flight_data, comments, updated_at from logbook where JSON_CONTAINS(flight_data, '{"pilotname": "Jühn Düe"}')
0 results found
I also checked the raw data that is stored:
PROD:

column      | data
flight_data | {"pilotname":"J\u00fchn D\u00fce"}

LOCAL:

column      | data
flight_data | {"pilotname":"J\u00fchn D\u00fce"}
So the data is transformed when stored. That is OK in itself, because the data is interpreted as UTF-8 when shown and is displayed correctly ("Jühn Düe").
The problem is that in the backend I need to compare this data.
The difference between the environments: on my local machine I am using MySQL 8.0 (it's a Homestead server, so select @@version; gives 8.0.23-0ubuntu0.20.04.1), and on PROD (the hosted server) I am seeing "10.3.28-MariaDB-log-cll-lve".
So the difference is clear: MariaDB vs. MySQL, and their handling of German umlauts.
I tried various things around changing the conversion/charset of the entries and of the database; none of them solved the problem. I searched for quite a while through similar problems, but most of them came down to the data not being stored as UTF-8; I checked, and my data is stored as UTF-8.
Even querying for the raw data doesn't work somehow. The following doesn't work on either PROD or LOCAL:
select id, flight_data, comments, updated_at from logbook where JSON_CONTAINS(flight_data, '{"pilotname": "J\u00fchn D\u00fce"}')
0 results found
Can you help me figure out what I am missing here?
Obviously it has something to do with the database; what else can I check, or what do I need to change?
Thanks a lot everybody for your help!
You should use the same software in development that you use in production. The same brand and the same version. Otherwise you risk encountering these incompatible features.
MariaDB started as a fork of the MySQL project in 2010, and both have been diverging gradually since then. MySQL implements new features, and MariaDB may or may not implement similar features, either by cherry-picking code from the MySQL project or by implementing their own original code. So over time these two projects grow more and more incompatible. At this point, over 10 years after the initial fork, you should consider MariaDB to be a different software product. Don't count on any part of it remaining compatible with MySQL.
In particular, the implementation of JSON in MariaDB versus MySQL is not entirely compatible. MariaDB wrote its own original code, with the JSON data type being an alias for LONGTEXT, so the internal implementation is quite different.
You asked if there's something you need to change.
Since you use MariaDB in production, not MySQL, you should use MariaDB 10.3.28 in your development environment, to ensure compatibility with the database brand and version you use in production.
I think the problem is a collation issue. Some unicode collations implement character expansions, so ue = ü would be true in the German collation.
Here's a test using MySQL 5.7 which is what I have handy (I don't use MariaDB):
mysql> select 'Juehn Duee' collate utf8mb4_unicode_520_ci = 'Jühn Düe' as same;
+------+
| same |
+------+
| 0 |
+------+
mysql> select 'Juehn Duee' collate utf8mb4_german2_ci = 'Jühn Düe' as same;
+------+
| same |
+------+
| 1 |
+------+
As you can see, this has nothing to do with JSON, but it's just related to string comparisons and which collation is used.
See the explanation in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html in the section "_general_ci Versus _unicode_ci Collations"
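If you cannot align the PROD server with your development setup, one workaround in this direction is to skip JSON_CONTAINS and compare the extracted value under an explicit collation that implements the expansions. A hedged sketch over PDO (the table and column come from the question; the collation choice and a utf8mb4 connection charset are assumptions):

$sql = "SELECT id FROM logbook
        WHERE JSON_UNQUOTE(JSON_EXTRACT(flight_data, '\$.pilotname'))
              COLLATE utf8mb4_german2_ci = ?";
$stmt = $pdo->prepare($sql);
$stmt->execute(['Juehn Duee']); // matches "Jühn Düe" under german2 expansions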
Thank you all for your inputs and response!
I figured out a different solution for the problem. Maybe it helps someone.
I went a step back and checked how I am storing the data. I was using json_encode() for that, which created the table contents shown above. By just passing a raw array and letting the framework save it, it worked:
$insert->pilotname = ['pilotname' => $request->pilotname];
Somehow the way the data was stored before was the actual issue.
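For anyone who keeps calling json_encode() themselves: the escaping shown above can be switched off with the JSON_UNESCAPED_UNICODE flag, so the umlauts are stored as real UTF-8 bytes:

// Default behaviour escapes non-ASCII characters:
json_encode(['pilotname' => 'Jühn Düe']);   // {"pilotname":"J\u00fchn D\u00fce"}

// With the flag, the raw UTF-8 characters are kept:
json_encode(['pilotname' => 'Jühn Düe'], JSON_UNESCAPED_UNICODE);   // {"pilotname":"Jühn Düe"}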
I'm running into a complicated situation here, and I'm hoping for a push in the right direction.
I need to allow Basic Latin searches to bring back results with diacritics. This is further complicated by the fact that the data is stored with HTML entities instead of pure ASCII. I have been making some progress, but have come across two problems.
First: I'm able to do a partial conversion of the data into something marginally useful, using something like this:
$string = 'V&eacute;ra';
$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
setlocale(LC_ALL, 'en_US.UTF8');
$translit = iconv('UTF-8', 'ASCII//TRANSLIT', $converted);
echo $translit;
This brings back this result: V'era. This is a start, but what I really need is Vera. I can do a preg_replace on the resulting string, but is there a way of just bringing it back without the apostrophe? This is only one example; there are a lot more diacritics in the database (e.g. ñ and more). I feel like this has been addressed before (e.g. iconv returns strange results), but there don't appear to be any solutions listed.
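One route I've been experimenting with is the intl extension's Normalizer instead of iconv: decompose to NFD, then strip the combining marks. A sketch (it assumes the intl extension is installed):

$converted = html_entity_decode('V&eacute;ra', ENT_COMPAT, 'UTF-8');
// Decompose "é" into "e" + U+0301, then remove all combining marks (\p{Mn}).
$nfd = Normalizer::normalize($converted, Normalizer::FORM_D);
echo preg_replace('/\p{Mn}/u', '', $nfd); // Vera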
Bigger problem: I need to take a string such as Vera and be able to bring back results with Véra as well as results with Vera. However, I believe I need to get problem 1 solved first before I can get to this point.
I'm thinking something like if ($translit) { return $string; } but I'm a bit unsure of how to handle this.
All help appreciated.
Edit: I'm thinking this might be done more easily directly in the database; however, I'm running into issues with DQL. I know there are ways of doing it in SQL with a stored procedure, but with limited access to the database I'm open to any suggestions for dealing with this in Doctrine.
Okay, so maybe I'm making this too difficult.
All I need is a way of finding entries that have been HTML-encoded in the database without having to search with the specific encoding, but also without the diacritic itself. If I search for Jose, it should bring up anything in the database labeled as José.
Preface: It's not quite clear whether the data to search is already in the database or whether you're just taking advantage of the fact that the database has logic for character comparisons. I'm going to assume that the data source is the DB.
The fact that you're trying to search HTML raises the question of whether you really want to search the raw HTML, or in fact want to search the user-visible text and strip the HTML tags. (What if there is a diacritic in a tag attribute? What if a word is broken by an empty <span>? Should it match? What if it was broken by a <br>?)
MySQL has the notion of both character sets (how characters are encoded) and collations (how characters are compared).
Relevant Documentation:
https://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
Assuming your MySQL client/terminal is correctly set for UTF-8 encoding, the following demonstrates the effect of overriding the collation (using ß as a particularly interesting example):
> SET NAMES 'utf8';
> SELECT
'ß',
'ss',
'ß' = 'ss' COLLATE utf8_unicode_ci AS ss_unicode,
'ß' = 'ss' COLLATE utf8_general_ci AS ss_general,
'ß' = 's' COLLATE utf8_general_ci AS s_general;
+----+----+------------+------------+-----------+
| ß | ss | ss_unicode | ss_general | s_general |
+----+----+------------+------------+-----------+
| ß | ss | 1 | 0 | 1 |
+----+----+------------+------------+-----------+
1 row in set (0.00 sec)
Note: general is the faster but not-strictly-correct version of the unicode collation; but even that is wrong if you speak Turkish (see: dotted uppercase İ).
I would save the decoded HTML in the database and search on that, making sure that the collation is set correctly.
Confirm that the table/column collation is correct using SHOW CREATE TABLE xxx. Change it manually (ALTER TABLE ...), or use Doctrine annotations as per this answer and use Doctrine migrations to update (and confirm afterwards with SHOW CREATE TABLE that your version of Doctrine respects collation).
Confirm that Doctrine is configured to use UTF-8 encoding.
If you just need to override the collation for one particular query (e.g. you don't have permission to change the DB structure, or it would break other code):
If you need to map to a Doctrine ORM object, use NativeQuery and add COLLATE overrides as per the example above (see the sketch after this list).
If you just want the record ID and field, you can use a direct query bypassing the ORM, with a COLLATE override.
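A rough sketch of the NativeQuery variant (the entity, table, and column names are made up; the collation is only an example and must share the column's character set):

use Doctrine\ORM\Query\ResultSetMappingBuilder;

// Map the raw SQL result back onto a hypothetical Client entity.
$rsm = new ResultSetMappingBuilder($em);
$rsm->addRootEntityFromClassMetadata(\App\Entity\Client::class, 'c');

$sql = 'SELECT ' . $rsm->generateSelectClause() . '
        FROM clients c
        WHERE c.name COLLATE utf8_unicode_ci LIKE :name';

$clients = $em->createNativeQuery($sql, $rsm)
              ->setParameter('name', '%Jose%')
              ->getResult();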
You can use the REGEXP_REPLACE function to strip the HTML entities in the database at query time. MySQL (before 8.0) has no built-in REGEXP_REPLACE function, but you can use a user-defined function (UDF) library, or switch to MariaDB. MariaDB is based on MySQL, so migrating the data to MariaDB is easy.
Then in MariaDB you can use queries like:
SELECT * FROM `test` WHERE 'jose' = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')

-- another variant, with a PHP variable interpolated
SELECT `table`.name FROM `table` WHERE '$search' = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')
Even phpMyAdmin supports MariaDB. I tested my query on the demo page and it worked pretty well.
Or, if you want to stay on MySQL, add this UDF:
https://github.com/mysqludf/lib_mysqludf_preg
I am saving emojis with the collation utf8mb4_general_ci. Storing and retrieving work fine, but when I try to search for the information with a string containing emojis or special characters I am not getting the result. It always returns empty.
Can somebody help to solve this?
CODE
select * from table where Title LIKE '%Kanye West - \"Bound 2\" PARODY%'
UPDATE:
The search strings are like:
Kanye%20West%20-%20"Bound%202"%20PARODY
stored in the database as: Kanye West - \"Bound 2\" PARODY
Family%20guy%20😎😔😁
stored in the database as: Family guy \ud83d\ude0e\ud83d\ude14\ud83d\ude01
Please accept my apologies for not making it clear: the first string of each pair is what we send from the URL via HTTP POST, and the second is how the data is stored in my table.
The collation of the database table is utf8mb4_general_ci.
You need to change your collation to utf8mb4_unicode_520_ci to make emojis searchable. With the older collations, all supplementary characters (which include emojis) are treated as the same character.
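A sketch of the change over PDO, assuming a videos table with a Title column (both names hypothetical):

// Switch the column to a collation that can tell supplementary-plane
// characters (such as emojis) apart; adjust the length to your schema.
$pdo->exec("ALTER TABLE videos
            MODIFY Title VARCHAR(255)
            CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci");

// An emoji in the pattern now matches only that emoji.
$stmt = $pdo->prepare('SELECT * FROM videos WHERE Title LIKE ?');
$stmt->execute(['%😎%']);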
Our client has sent us a CSV file of data that I need to import into a specific table in our PostgreSQL 8.3.9 database. The database uses UTF-8 character encoding; our CMS allows multiple languages, such as French, which are entered into the database via the CMS. One particular facility lets the client upload images to the server and then enter "alt" tags for them in French. However, due to a required bulk update, we have been sent a CSV to feed into a particular table: the image alt tags, in French.
The CSV has some special characters such as "é" - e.g.
"Bottes Adaptées Amora Cuir Faux-Croco Fauve Photo d'Ensemble"
The images themselves are hosted in two places: one is a CDN, and one is a local database backup and local (web server) file backup. I am using a PHP script to read the CSV file and do the needful so that the "alt" tags are updated in two places: our web database, and the CDN.
However, when I read the CSV (using PHP), the character does not "come out" as expected.
The data comes out as "Bottes Adapt�es Amora Cuir Faux-Croco Fauve Photo d'Ensemble".
I don't think this has anything to do with the database; it has something to do with my PHP file reading the CSV data. Even if I just print the data being read, the special character above does not print correctly; it prints as if the character is not recognised. Other characters print fine.
Here is the code I'm using (note: some special custom functions are used here to interact with the database, but they can be ignored). The CSV file is made up of {column 1} for the image name and {column 2} for the ALT tag.
$handle = fopen($conn->getIncludePath() . "cronjobs/GIB_img_alt_tags_fr.csv", "r");
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
    // normally I run a query here to check if the data exists, e.g.
    // $result = $conn->query("SELECT imageid, image_fileref FROM table1 WHERE image_fileref = '" . $data[0] . "'");
    if ($conn->Numrows($result)) { // if rows were found
        $row = $conn->fetchArray($result);
        // printing the data from $row here
    }
}
fclose($handle);
You've still omitted key information (when asking for help with an UPDATE, don't delete the UPDATE statement from the code), and your description of the problem is quite confused, but there's some hint of what's going on.
Mismatched encodings
It's highly likely that your PHP connection has a client_encoding set to something other than UTF-8. If you're sending UTF-8 data down the connection without conversion, the connection's client_encoding must be UTF-8.
To confirm, run SHOW client_encoding as a SQL statement from PHP and print the result. Add SET client_encoding = 'UTF-8' to your code before importing the CSV and see if that helps. Assuming, of course, that the CSV file is really UTF-8 encoded. If it isn't, you need to either transcode it to UTF-8 or find out what encoding it is in and SET client_encoding to that.
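With the raw pgsql extension that check looks roughly like this; your custom $conn wrapper presumably exposes an equivalent (the $pg connection handle here is an assumption):

// Print the encoding the connection currently assumes.
$res = pg_query($pg, 'SHOW client_encoding');
echo pg_fetch_result($res, 0, 0), "\n"; // should print UTF8

// Force UTF-8 for this session before reading the CSV data in.
pg_query($pg, "SET client_encoding = 'UTF8'");
// equivalently: pg_set_client_encoding($pg, 'UTF8');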
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and the PostgreSQL manual on character set support.
Better approach
The approach you're taking is unnecessarily slow and inefficient anyway. You should instead be (see the sketch after this list):
Opening a transaction
Creating a temporary table in the database with the same structure as the CSV file.
Use pg_copy_from to load the CSV into the temp table, with appropriate options to specify the CSV format.
Merging the contents of the temporary table into the destination table with an INSERT then an UPDATE, e.g.:
INSERT INTO table1 (image_fileref, ... other fields ...)
SELECT n.image_fileref, ... other fields ...
FROM the_temp_table n
WHERE NOT EXISTS (SELECT 1 from table1 o WHERE o.image_fileref = n.image_fileref);
UPDATE table1 o
SET .... data to update ....
FROM the_temp_table n
WHERE o.image_fileref = n.image_fileref;
Commit the transaction
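Sketched with the raw pgsql extension (the temp table and the alt_fr column are invented since the real schema wasn't posted, and values containing tabs or backslashes would need escaping before pg_copy_from):

pg_query($pg, 'BEGIN');
pg_query($pg, 'CREATE TEMP TABLE tmp_alt (image_fileref text, alt_fr text) ON COMMIT DROP');

// pg_copy_from() takes an array of delimiter-separated lines.
$rows = [];
$handle = fopen('GIB_img_alt_tags_fr.csv', 'r');
while (($data = fgetcsv($handle, 1000, ',')) !== false) {
    $rows[] = $data[0] . "\t" . $data[1];
}
fclose($handle);
pg_copy_from($pg, 'tmp_alt', $rows, "\t");

// INSERT the rows that don't exist yet, then UPDATE the ones that do.
pg_query($pg, 'INSERT INTO table1 (image_fileref, alt_fr)
               SELECT n.image_fileref, n.alt_fr FROM tmp_alt n
               WHERE NOT EXISTS (SELECT 1 FROM table1 o
                                 WHERE o.image_fileref = n.image_fileref)');
pg_query($pg, 'UPDATE table1 o SET alt_fr = n.alt_fr
               FROM tmp_alt n WHERE o.image_fileref = n.image_fileref');
pg_query($pg, 'COMMIT');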
The INSERT may be more efficiently written as a left outer join with an IS NULL filter to exclude matching rows. It depends on the data. Try it.
I probably could've written a faster CTE-based version, but you didn't say what version of Pg you were using, so I didn't know if your server supported CTEs.
Since you left out the UPDATE I can't be more specific about the UPDATE or INSERT statements. If you'd provided the schema for table1 or even just your INSERT or UPDATE I could've said more. Without sample data I haven't been able to run the statements to check them, and I didn't feel like making up some dummy data, so the above is untested. As it is, completing the code is left as a learning exercise. I will not be updating this answer with fully-written-out statements, you get to work that out.
I am working on a simple search script that looks through two columns of a specific table. Essentially I'm looking for a match between either a company's number or their name. I'm using the LIKE statement in SQL because I am using InnoDB tables (which means no fulltext searches).
The problem is that I am working in a bilingual environment (French and English) and some characters in French have accents. I would like accented characters to be considered the same as their non-accented counterparts, in other words é = e, e = é, à = a, etc. SO has a lot of questions pertaining to the issue, but none of the solutions seem to work for me.
Here is my SQL statement:
SELECT id, name FROM clients WHERE id LIKE '%éc%' OR name LIKE '%éc%';
I would like that to find "école" and "ecole" but it only finds "école".
I would also like to note that my tables are all utf8_general_ci.
Help me StackOverflow, you're my only hope! :)
I am going to offer up another answer for you.
I just read that utf8_general_ci is accent-insensitive so you should be OK.
One solution is to use
mysql_query("SET NAMES 'utf8'");
This tells MySQL which character set the client sends SQL statements in.
Another solution seems to be to use MySQL's HEX() function to convert the accented chars into their hex values. But I could not find any good examples of this working, and after reading the MySQL docs for HEX() it looks like it probably will not work.
You might also consider converting the problem characters to their plain-English counterparts and storing them in a different column, perhaps called searchable or similar. You would of course need to update it whenever your main column is updated.
You would then have two columns, one containing the accented characters and one containing the plain English searchable content.
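A sketch of filling such a column with PHP's intl Transliterator (PDO and the column names are illustrative, not from the question):

// "école" -> "ecole", "ñ" -> "n": transliterate to plain ASCII once, at write time.
$tr = Transliterator::create('Any-Latin; Latin-ASCII');

$stmt = $pdo->prepare('UPDATE clients SET searchable = ? WHERE id = ?');
$stmt->execute([$tr->transliterate($name), $id]);

// Searches can then hit either column:
// SELECT id, name FROM clients WHERE name LIKE '%éc%' OR searchable LIKE '%ec%';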