This question is not a duplicate of PHP string comparison between two different types of encoding because my question requires a SQL solution, not a PHP solution.
Context ► There's a museum with two databases with the same charset and collation (engine=INNODB charset=utf8 collate=utf8_unicode_ci) used by two different PHP systems. Each PHP system stores the same data in a different way (for example, "Niños" in one and "NiÃ±os" in the other).
There are tons of data already stored that way, and both systems are working fine, so I can't change the PHP encoding or the databases'. One system handles the sales from the box office; the other handles the sales from the website.
The problem ► I need to compare the right column (tipo_boleto_tipo) to the left column (tipo) in order to get the value of another column in the left table (not shown in the image), but I'm getting no results because the same values are stored differently. For example, when I search for "Niños" ("children" in Spanish), it is not found because it was stored as "NiÃ±os". I tried to do it in PHP using utf8_encode and utf8_decode, but it is unacceptably slow, so I think it's better to do it with SQL only. This data will be used for a unified sales report (box office and internet) over variable periods of time, and it has to compare hundreds of thousands of rows; that's why it is so slow in PHP.
The question ► Is there anything like utf8_encode or utf8_decode in MySQL that would let me match the equivalent values in both columns? Any other suggestion is welcome.
Here is my current code (which returns no results):
-- All names below are fully qualified as database.table.column:
SELECT boleteria.tipos_boletos.genero            -- the desired column
FROM boleteria.tipos_boletos                     -- the database with the weird chars
INNER JOIN venta_en_linea.ventas_detalle         -- the database with the proper chars
  ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo
WHERE venta_en_linea.ventas_detalle.evento_id = '1'
  AND venta_en_linea.ventas_detalle.tipo_boleto_tipo = 'Niños'
The clause ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo is never going to match, because the two stored values differ ("NiÃ±os" vs "Niños").
It appears the application which writes to the boleteria database is not storing correct UTF-8. The database column character set refers to how MySQL interprets strings, but your application can still write in other character sets.
I can't tell from your example exactly what the incorrect character set is, but assuming it's Latin-1, you can convert the value to latin1 (recovering the original bytes it was written with), then reinterpret those bytes as "actual" utf8:
SELECT 1
FROM tipos_boletos, ventas_detalle
WHERE CONVERT(CAST(CONVERT(tipo USING latin1) AS binary) USING utf8)
= tipo_boleto_tipo COLLATE utf8_unicode_ci
I've seen this all too often in PHP applications that weren't written carefully from the start to use UTF-8 strings. If you find the conversion too slow and you need it frequently, and you don't have an opportunity to fix the application that writes the data incorrectly, you can add a new column and a trigger to the tipos_boletos table and convert on the fly as records are added or edited.
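A rough sketch of that idea, assuming the Latin-1 mis-encoding as above; the column name tipo_fixed, the trigger names, and the VARCHAR length are placeholders, not from the original schema:

ALTER TABLE tipos_boletos
  ADD COLUMN tipo_fixed VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Backfill existing rows with the corrected text:
UPDATE tipos_boletos
SET tipo_fixed = CONVERT(CAST(CONVERT(tipo USING latin1) AS BINARY) USING utf8);

-- Keep the column current as rows are added or edited:
CREATE TRIGGER tipos_boletos_fix_insert BEFORE INSERT ON tipos_boletos
FOR EACH ROW
  SET NEW.tipo_fixed = CONVERT(CAST(CONVERT(NEW.tipo USING latin1) AS BINARY) USING utf8);

CREATE TRIGGER tipos_boletos_fix_update BEFORE UPDATE ON tipos_boletos
FOR EACH ROW
  SET NEW.tipo_fixed = CONVERT(CAST(CONVERT(NEW.tipo USING latin1) AS BINARY) USING utf8);

The join can then compare tipo_fixed = tipo_boleto_tipo directly, with no per-query conversion cost.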
Related
Here's my situation.
I'm migrating from one server to another. As part of this, I'm moving across database data.
The migration method involved running the same CREATE TABLE query on the new server, then using a series of INSERT commands to insert the data row by row. It's possible this resulted in different data; however, the CHECKSUM command was used to validate the contents. CHECKSUM was run on the whole table after the transfer, on a new table with the problem row isolated, and after truncating the string with the LEFT operator. Every time the result was identical between the old and new server, indicating the raw data should be exactly identical at the byte level.
CHECKSUM TABLE `test`
I've checked the structure and it's exactly the same as well.
SHOW CREATE TABLE `test`
Here is the structure:
CREATE TABLE test (
  item varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  amount mediumint(5) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The field is of type:
`item` varchar(32) COLLATE utf8_unicode_ci NOT NULL
Here is my connection code in PHP:
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
When I go to retrieve the data in PHP with a simple query:
SELECT * FROM `test`
The data displays like this:
Â§lO
On the old server/host, I get this sequence of raw bytes:
Decimal: -194-167-108-79-
HEX: -C2-A7-6C-4F-
And on the new server, I get a couple of extra bytes at the beginning:
Decimal: -195-130-194-167-108-79-
HEX: -C3-82-C2-A7-6C-4F-
Why might the exact same raw data, table structure, and query, return a different result between the two servers? What should I do to ensure that results are as consistent as possible in the future?
§lO is "Mojibake" for §lO. I presume the latter (3-character) is "correct"?
The statement "The raw data looks like this (in both cases when I display it)" is bogus, because the technique used for displaying it probably messed with the encoding.
Since the 3 characters became 4 bytes, and then 6 bytes, you probably have "double-encoding".
This discusses how "double encoding" can occur: Trouble with UTF-8 characters; what I see is not what I stored
If you provide some more info (CREATE TABLE, hex, method of migrating the data, etc.), we may be able to further unravel the mess you have.
More
When using mysqli, do $sql->set_charset('utf8');
(The HEX confirms my analysis.)
"The migration method involved running the same CREATE TABLE query on the new server"
Was it preceded by some character set settings, as in mysqldump?
"then using a series of INSERT commands to insert the data row by row."
Can you get the HEX of some accented character in that file?
"... CHECKSUM ..."
OK, the checksums being the same rules out one thing.
"CHECKSUM was done on ... a new table with that row isolated"
How did you do that? SELECTing the row could have modified the text, thereby invalidating the test.
"indicating the raw data should be exactly identical at the byte level."
For checking the data in the table, SELECT HEX(col) ... is the only way to bypass all possible character set conversions that could happen. Please provide the HEX for some column with a non-ASCII character (such as the example given), and do the CHECKSUM against the HEX output.
And provide SHOW VARIABLES LIKE 'char%';
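A minimal sketch of those checks, using the test table and item column from the question (the test_hex scratch table is just an illustrative name):

SHOW VARIABLES LIKE 'char%';

-- Look at the raw bytes, bypassing all character set conversion:
SELECT item, HEX(item) FROM test LIMIT 10;

-- Run the CHECKSUM against the hex text itself rather than the (convertible) column:
CREATE TABLE test_hex AS SELECT HEX(item) AS item_hex, amount FROM test;
CHECKSUM TABLE test_hex;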
When I am searching my MySQL database with some query like:
SELECT * FROM mytable WHERE mytable.title LIKE '%副教授%';
("副教授" are three Chinese characters, whose decimal numeric character reference, NCR, is "副教授"), I got no result.
By looking into the phpMyadmin and browsing "mytable", the should-be-found entry is shown as "副教授". I think that is the reason for the failure of search.
Not all the entries in that column are numeric character references; some of them are normal text. Here is a picture of the table column as shown in phpMyAdmin.
I wonder how I could search all entries in my table with one query, regardless of whether they are stored as NCRs or not. Or should I convert the NCR entries by running some script? Thanks.
Your database table encoding should be UTF-8, and when you insert new data you should run a SET NAMES 'utf8' query before the insertion; this will keep all your data consistent.
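A hedged sketch of that flow, reusing mytable and title from the question (note that MySQL spells the character set utf8, without a hyphen):

SET NAMES 'utf8';

INSERT INTO mytable (title) VALUES ('副教授');

SELECT * FROM mytable WHERE mytable.title LIKE '%副教授%';

As far as I know, MySQL has no built-in function that decodes &#...; entities, so converting the rows already stored as NCRs would indeed need a small one-off script.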
I have a word list stored in MySQL, around 10k words in size. The column is marked as unique. However, I cannot insert both the half-width and full-width forms of a punctuation mark.
Here are some examples:
(half-width, full-width)
('?', '？')
('/', '／')
The purpose is this: I have many articles containing both full-width and half-width characters, and I want to find out whether the articles contain these words. I use PHP to do the comparison, and it can tell that '?' is different from '？'. Is there any way to do the same in MySQL? Or is there some way to make PHP treat them as equal?
I use utf8_unicode_ci for the database encoding, and the column also uses utf8_unicode_ci. When I run these queries, both return the same record, '？測試':
SELECT word FROM word_list WHERE word='?測試'
SELECT word FROM word_list WHERE word='？測試'
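That happens because utf8_unicode_ci treats full-width and half-width forms as equivalent for comparison. A minimal sketch of one workaround, forcing a binary collation for the comparison (note this also makes the comparison case- and accent-sensitive):

SELECT word FROM word_list WHERE word = '？測試' COLLATE utf8_bin;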
The most likely explanation is a character set translation issue; for example, the column you are storing the value in is defined with the latin1 character set.
But it's not necessarily the character set of the column that's causing the issue; it's a character set conversion happening somewhere.
If you aren't familiar with character set encodings, I recommend consulting the source of all knowledge: Google.
I highly recommend the two top hits for this search:
what every programmer needs to know about character encoding
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
I'm working with a MySQL database (not one I created). In this database all the encodings are set to UTF-8, and I connect with charset UTF-8. But when I try to read from the database I get this:
×¢×?ק 1
בית ×ª×•×’× ×” העוסק במספר שפות ×ª×•×›× ×”
× × ×œ× ×œ×¤× ×•×ª ×חרי 12 בלילה ..
What I supposed to get:
עסק 1
בית תוגנה העוסק במספר שפות תוכנה
נא לא לפנות אחרי 12 בלילה ..
When I look in phpMyAdmin, I see the same thing (the phpMyAdmin connection is also UTF-8).
I know the data is supposed to be in Hebrew. Does anyone have an idea how to fix this?
You appear to have UTF-8 data that was treated as Windows-1252 and subsequently converted to UTF-8 (sometimes referred to as "double-encoding").
The first thing that you need to determine is at what stage the conversion took place: before the data was saved in the table, or upon your attempts to retrieve it? The easiest way is often to SELECT HEX(the_column) FROM the_table WHERE ... and manually inspect the byte-encoding as it is currently stored:
If, for the data above, you see C397C2A2... then the data is stored erroneously (an incorrect connection character set at the time of data insertion is the most common culprit); it can be corrected as follows (being careful to use data types of sufficient length in place of TEXT and BLOB as appropriate):
Undo the conversion from Windows-1252 to UTF-8 that caused the data corruption:
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET latin1;
Drop the erroneous encoding metadata:
ALTER TABLE the_table MODIFY the_column BLOB;
Add corrected encoding metadata:
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET utf8;
See it on sqlfiddle.
Take care to insert data correctly in the future, or else the table will be partly encoded one way and partly another (which can be a nightmare to try and fix).
If you're unable to modify the database schema, the records can be transcoded to the correct encoding on the fly with CONVERT(BINARY CONVERT(the_column USING latin1) USING utf8) (see it on sqlfiddle), but I strongly recommend fixing the database when possible instead of leaving it containing broken data.
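For example, applied in a query (using the same placeholder names as above):

SELECT CONVERT(BINARY CONVERT(the_column USING latin1) USING utf8) AS fixed_text
FROM the_table;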
However, if you see D7A2D7A1... then the data is stored correctly and the corruption is taking place upon data retrieval; you will have to perform further tests to identify the exact cause. See UTF-8 all the way through for guidance.
Using a PHP form to insert text strings from users into a table, and another form to retrieve them later on another page, what would be the best way to create the table in MySQL for strings of text, and what options would likely be necessary to handle text strings well?
The complicating factor, I suppose, is that the text that would exist in the table doesn't exist yet (it would need to be input through the form, etc.). I am unsure if this is why I've had trouble, along with my relative inexperience, so I don't know what would be an ideal table configuration.
Since I don't want to store any data beyond this user input (as I said, just strings of text, i.e. a sentence), I assumed I only needed one column when creating the table, but I was unsure of this as well; it's possible I am just overlooking something about how SQL works.
I'll put my comments into an answer now:
consider the estimated maximum length of such strings when deciding between VARCHAR fields and TEXT fields in MySQL.
Quoting from the MySQL-Manual (BTW a good read for your purpose):
Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 255 before MySQL 5.0.3, and 0 to 65,535 in 5.0.3 and later versions.
http://dev.mysql.com/doc/refman/5.0/en/char.html
VARCHAR is said to be faster; for a good summary, see MySQL: Large VARCHAR vs. TEXT?
consider having at least a second field called id (int, primary key, auto increment) for when you need to reference those strings later. Consider a field referencing the author of the string, and maybe a field storing the date and time when the string was added; a sketch of such a table follows at the end of this answer.
use mysqli or PDO instead of mysql, which is deprecated.
See here, there are links to good tutorials in the 1st answer: How do I migrate my site from mysql to mysqli?
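A minimal sketch of a table along those lines; all names here (user_strings, body, and so on) are illustrative, not from the question:

CREATE TABLE user_strings (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  author VARCHAR(64) NOT NULL,             -- who submitted the string
  body VARCHAR(1000) NOT NULL,             -- the text itself; use TEXT if it may be longer
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP  -- when it was submitted
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;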