Here's my situation.
I'm migrating from one server to another. As part of this, I'm moving across database data.
The migration method involved running the same CREATE TABLE query on the new server, then using a series of INSERT commands to insert the data row by row. It's possible this resulted in different data; however, the CHECKSUM command was used to validate the contents. CHECKSUM was run on the whole table after the transfer, on a new table with the problem row isolated, and after truncating the string with the LEFT operator. Every time, the result was identical between the old and new server, indicating the raw data should be exactly identical at the byte level.
CHECKSUM TABLE `test`
I've checked the structure and it's exactly the same as well.
SHOW CREATE TABLE `test`
Here is the structure:
CREATE TABLE test (
  item varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  amount mediumint(5) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The field is of type:
`item` varchar(32) COLLATE utf8_unicode_ci NOT NULL
Here is my connection code in PHP:
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
When I go to retrieve the data in PHP with a simple query:
SELECT * FROM `test`
The data displays like this:
Â§lO
On the old server/host, I get this sequence of raw bytes:
Decimal: -194-167-108-79-
HEX: -C2-A7-6C-4F-
And on the new server, I get a couple of extra bytes at the beginning:
Decimal: -195-130-194-167-108-79-
HEX: -C3-82-C2-A7-6C-4F-
Why might the exact same raw data, table structure, and query, return a different result between the two servers? What should I do to ensure that results are as consistent as possible in the future?
Â§lO is "Mojibake" for §lO. I presume the latter (3-character) is "correct"?
The raw data looks like this (in both cases when I display it)
is bogus because the technique used for displaying it probably messed with the encoding.
Since the 3 characters (4 bytes) became 6 bytes, you probably have "double-encoding".
This discusses how "double encoding" can occur: Trouble with UTF-8 characters; what I see is not what I stored
If you provide some more info (CREATE TABLE, hex, method of migrating the data, etc), we may be able to further unravel the mess you have.
More
When using mysqli, do $sql->set_charset('utf8');
(The HEX confirms my analysis.)
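For instance, a minimal sketch reusing the question's connection code, with the set_charset() call as the only addition:
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
    die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
// Declare the encoding this client actually sends/expects, so the server
// performs no surprise conversion on the way in or out.
$sql->set_charset('utf8');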
The migration method involved running the same CREATE TABLE query on the new server
Was it preceded by some character set settings, as in mysqldump?
then using a series of INSERT commands to insert the data row by row.
Can you get the HEX of some accented character in the file?
... CHECKSUM ...
OK, being the same rules out one thing.
CHECKSUM was done on ... a new table with that row isolated
How did you do that? SELECTing the row could have modified the text, thereby invalidating the test.
indicating the raw data should be exactly identical at the byte level.
For checking the data in the table, SELECT HEX(col)... is the only way to bypass all possible character set conversions that could happen. Please provide the HEX for some column with a non-ascii character (such as the example given). And do the CHECKSUM against the HEX output.
And provide SHOW VARIABLES LIKE 'char%';
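As a hedged sketch of both checks from PHP (reusing the question's $sql connection and its item column; adapt names as needed):
// HEX() returns the stored bytes untouched by any connection-level conversion.
$res = $sql->query("SELECT item, HEX(item) AS hx FROM test");
while ($row = $res->fetch_assoc()) {
    echo $row['hx'], "\n";  // compare this output line-by-line between old and new server
}
// Capture the character set configuration of each server for comparison.
$res = $sql->query("SHOW VARIABLES LIKE 'char%'");
while ($row = $res->fetch_assoc()) {
    echo $row['Variable_name'], ' = ', $row['Value'], "\n";
}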
Related
I have an interesting issue. The following two pieces of code do not produce the same output:
$result = $sql->QueryFetch("SELECT machinecodeSigned FROM ...");
echo bin2hex($result['machinecodeSigned']);
and
$result = $sql->QueryFetch("SELECT HEX(machinecodeSigned) FROM ...");
echo $result['machinecodeSigned'];
So, $sql is just some wrapper class, and the method QueryFetch internally just calls PHP's standard functions for query and fetch to obtain the values.
I get two different results, though. For example, for some arbitrary input in my database, I get:
08c3bd79c3a0c2a66fc2bb375b6370c399c3acc3ba7bc2b8c2b203c39d70
and
08FD79E0A66FBB375B6370D9ECFA7BB8B203DD70
Ignoring case, the first output is nonsense while the second one is correct.
machinecodeSigned is a char(255) field that is latin-1 encoded and has collation latin-1 (which should not play a role, I assume).
What could be the reason that I get two different results? This used to yield the same results for years, but suddenly I had to change the code from version 1 to version 2 in order for it to produce the correct result. It seems as if PHP does some arbitrary conversion of the bytes in the string.
Edit: It seems necessary to say that the field is not human-readable. In any case, since the second output is the correct one, feel free to convert the hexadecimal form to ASCII characters, if this helps you.
Edit:
SHOW CREATE TABLE yields:
CREATE TABLE `user` (
  `ID` int(9) NOT NULL AUTO_INCREMENT,
  `machinecodeSigned` char(255) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10092 DEFAULT CHARSET=latin1 COLLATE=latin1_german2_ci
char(255) CHARACTER SET latin1 COLLATE latin1_bin
will read/write bytes unchanged. It would be better to say BINARY(255), or perhaps something else.
If you tell the server that your client wants to talk in "utf8", and you SELECT that column, then MySQL will translate from latin1 (the charset of the data) to utf8 (the encoding you say the client wants). This leads to the longer hex string.
You say that phpmyadmin says "utf8" somewhere; that is probably the cause of the confusion.
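A hedged way to observe this, using plain mysqli in place of the QueryFetch wrapper (the WHERE clause is only illustrative):
// Client declared as utf8: the server transcodes the latin1 column on the
// way out, yielding the longer hex string.
$sql->set_charset('utf8');
$row = $sql->query("SELECT machinecodeSigned FROM user WHERE ID = 1")->fetch_assoc();
echo bin2hex($row['machinecodeSigned']), "\n";  // e.g. 08c3bd79... (transcoded)

// Client declared as latin1 (matching the column): the bytes arrive unchanged.
$sql->set_charset('latin1');
$row = $sql->query("SELECT machinecodeSigned FROM user WHERE ID = 1")->fetch_assoc();
echo bin2hex($row['machinecodeSigned']), "\n";  // e.g. 08fd79e0... (same as SELECT HEX(...))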
If it had been stored as base64, there would be no confusion because base64 uses very few different characters, and they are encoded identically in latin1 and utf8. Furthermore, latin1_bin would have been appropriate. So, another explanation of what went wrong is the unwanted reconversion from base64 to binary.
MySQL's implementation of latin1_bin is simple and permissive -- all 256 byte values are simply stored and loaded, unchecked. This makes it virtually identical to BLOB and BINARY.
This is probably the base64_encode that should have been stored:
MDhGRDc5RTBBNjZGQkIzNzVCNjM3MEQ5RUNGQTdCQjhCMjAzREQ3MA==
Datatypes starting with VAR or ending with BLOB or TEXT are implemented via a 'length' field plus the bytes needed to represent the value.
On the other hand, CHAR and BINARY are fixed length, and padded by spaces (CHAR) or \0 (BINARY).
So, writing binary info to CHAR(255) actually may modify the data due to spaces appended.
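A small sketch of that padding effect (the pad_demo temporary table is hypothetical, and the default sql_mode is assumed):
// CHAR pads with spaces on write and strips trailing spaces on read;
// BINARY pads with \0 bytes and keeps them, as HEX() makes visible.
$sql->query("CREATE TEMPORARY TABLE pad_demo (c CHAR(8), b BINARY(8))");
$sql->query("INSERT INTO pad_demo VALUES ('ab', 'ab')");
$row = $sql->query("SELECT HEX(c) AS c, HEX(b) AS b FROM pad_demo")->fetch_assoc();
echo $row['c'], "\n";  // 6162
echo $row['b'], "\n";  // 6162000000000000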
I have a table field of type varchar(36), and I want MySQL to generate its value dynamically, so I used this code:
$sql_code = "insert into table1 (id, text) values (uuid(), 'some text');";
mysql_query($sql_code);
How can I retrieve the generated UUID immediately after inserting the record?
char(36) is better, since UUID() always returns exactly 36 characters.
You cannot. The only solution is to perform two separate queries:
SELECT UUID()
INSERT INTO table1 (id, text) VALUES ($uuid, 'text')
where $uuid is the value retrieved on the 1st step.
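A minimal sketch of that two-step flow, keeping the legacy mysql_* API used in the question:
// Step 1: ask MySQL for the UUID first and hold on to it in PHP.
$result = mysql_query("SELECT UUID() AS id");
$row = mysql_fetch_assoc($result);
$uuid = $row['id'];

// Step 2: insert using the value we already know.
mysql_query("INSERT INTO table1 (id, text) VALUES ('$uuid', 'some text')");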
You can do everything you need to with SQL triggers. The following SQL adds a pair of triggers on tablename.table_id: a BEFORE INSERT trigger that automatically creates the primary key UUID when inserting (a NEW column can only be assigned in a BEFORE trigger), and an AFTER INSERT trigger that stores the newly created ID into an SQL variable for retrieval later:
DELIMITER $$
CREATE TRIGGER `tablename_newid`
BEFORE INSERT ON `tablename`
FOR EACH ROW
BEGIN
IF ASCII(NEW.table_id) = 0 THEN
SET NEW.table_id = UNHEX(REPLACE(UUID(),'-',''));
END IF;
END$$
CREATE TRIGGER `tablename_lastid`
AFTER INSERT ON `tablename`
FOR EACH ROW
SET @last_uuid = NEW.table_id$$
DELIMITER ;
As a bonus, it inserts the UUID in binary form to a binary(16) field to save storage space and greatly increase query speed.
Edit: the trigger should check for an existing column value before inserting its own UUID in order to mimic the ability to provide values for table primary keys in MySQL - without this, any values passed in will always be overridden by the trigger. The example has been updated to use ASCII() = 0 to check for the existence of the primary key value in the INSERT, which will detect empty string values for a binary field.
Edit 2: after a comment here, it was pointed out to me that setting @last_uuid in a BEFORE INSERT trigger would record it even if the row insert fails. I have updated my answer to set @last_uuid in a separate AFTER INSERT trigger instead - whilst I feel this is a totally fine approach under general circumstances, it may have issues with row replication under clustered or replicated databases. If anyone knows, I would love to know as well!
To read the new row's insert ID back out, just run SELECT @last_uuid.
When querying and reading such binary values, the MySQL functions HEX() and UNHEX() will be very helpful, as will writing your query values in hex notation (preceded by 0x). The php-side code for your original answer, given this type of trigger applied to table1, would be:
// insert row
$sql = "INSERT INTO table1(text) VALUES ('some text')";
mysql_query($sql);
// get last inserted UUID
$sql = "SELECT HEX(#last_uuid)";
$result = mysql_query($sql);
$row = mysql_fetch_row($result);
$id = $row[0];
// perform a query using said ID
mysql_query("SELECT FROM table1 WHERE id = 0x" . $id);
Following up in response to @ina's comment:
A UUID is not a string, even if MySQL chooses to represent it as such. It's binary data in its raw form, and those dashes are just MySQL's friendly way of representing it to you.
The most efficient storage for a UUID is to create it as UNHEX(REPLACE(UUID(),'-','')) - this will remove that formatting and convert it back to binary data. Those functions will make the original insertion slower, but all following comparisons you do on that key or column will be much faster on a 16-byte binary field than a 36-character string.
For one, character data requires parsing and localisation. Any strings coming in to the query engine are generally being collated automatically against the character set of the database, and some APIs (WordPress comes to mind) even run CONVERT() on all string data before querying. Binary data doesn't have this overhead. For another, your char(36) is actually allocating 36 characters, which means (if your database is UTF-8) each character could be as long as 3 or 4 bytes depending on the version of MySQL you are using. So a char(36) can range anywhere from 36 bytes (if it consists entirely of low-ASCII characters) to 144 bytes if consisting entirely of high-order UTF-8 characters. This is much larger than the 16 bytes we have allocated for our binary field.
Any logic performed on this data can be done with UNHEX(), but is better accomplished by simply escaping data in queries as hex, prefixed with 0x. This is just as fast as reading a string, gets converted to binary on the fly and directly assigned to the query or cell in question. Very fast.
Reading data out is slightly slower - you have to call HEX() on all binary data read out of a query to get it in a useful format if your client API doesn't deal well with binary data (PHP in particular will usually determine that binary strings === null and will break them if manipulated without first calling bin2hex(), base64_encode() or similar) - but this overhead is about as minimal as character collation, and more importantly it is only incurred on the actual cells SELECTed, not all cells involved in the internal computations of a query result.
So of course, all these small speed increases are very minimal and other areas result in small decreases - but when you add them all up binary still comes out on top, and when you consider use cases and the general 'reads > writes' principle it really shines.
... and that's why binary(16) is better than char(36).
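As a hedged illustration of the layout being recommended (the demo_bin table is hypothetical; legacy mysql_* API as in the code above):
// 16 fixed bytes per key, versus 36 characters (up to 144 bytes) for CHAR(36).
mysql_query("CREATE TABLE demo_bin (id BINARY(16) PRIMARY KEY)");
mysql_query("INSERT INTO demo_bin (id) VALUES (UNHEX(REPLACE(UUID(),'-','')))");

// Read the key back in hex, then reuse the hex form directly with 0x notation.
$result = mysql_query("SELECT HEX(id) AS id FROM demo_bin LIMIT 1");
$row = mysql_fetch_assoc($result);
mysql_query("SELECT * FROM demo_bin WHERE id = 0x" . $row['id']);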
It's pretty easy, actually.
You can pass this to MySQL and it will return the inserted ID:
set @id=UUID();
insert into <table>(<col1>,<col2>) values (@id,'another value');
select @id;
Depending on how the uuid() function is implemented, this is very bad programming practice - if you try to do this with binary logging enabled (e.g. in a cluster) then the insert will most likely fail. Ivan's suggestion looks like it might solve the immediate problem; however, I thought this only returned the value generated for an auto-increment field - indeed, that's what the manual says.
Also, what's the benefit of using uuid()? It's computationally expensive to generate, requires a lot of storage, increases the cost of querying the data, and is not cryptographically secure. Use a sequence generator or auto-increment instead.
Regardless of whether you use a sequence generator or uuid, if you must use this as the only unique key in the database, you'll need to assign the value first, read it back into PHP-land, and embed/bind the value as a literal in the subsequent insert query.
This question is not a duplicate of PHP string comparison between two different types of encoding because my question requires a SQL solution, not a PHP solution.
Context ► There's a museum with two databases with the same charset and collation (engine=INNODB charset=utf8 collate=utf8_unicode_ci) used by two different PHP systems. Each PHP system stores the same data in a different way (the original post includes a screenshot comparing the two; for example, "Niños" in one system is stored as "Niños" in the other).
There are tons of data already stored that way and both systems are working fine, so I can't change the PHP encoding or the databases'. One system handles the sales from the box office, the other handles the sales from the website.
The problem ► I need to compare the right column (tipo_boleto_tipo) to the left column (tipo) in order to get the value of another column of the left table (unseen in the image), but I'm getting no results because the same values are stored differently: for example, when I search for "Niños" it is not found, because it was stored as "Niños" ("children" in Spanish). I tried to do it via PHP by using utf8_encode and utf8_decode, but it is unacceptably slow, so I think it's better to do it with SQL only. This data will be used for a unified report of sales (box office and internet) over variable periods of time, and it has to compare hundreds of thousands of rows; that's why it is so slow with PHP.
The question ► Is there anything like utf8_encode or utf8_decode in MySQL that allows me to match the equivalent values of both columns? Any other suggestion will be welcome.
Next is my current code (with no results):
-- Qualified names below follow the pattern DATABASE.TABLE.COLUMN.
SELECT boleteria.tipos_boletos.genero             -- desired column
FROM boleteria.tipos_boletos                      -- database with weird chars
INNER JOIN venta_en_linea.ventas_detalle          -- database with proper chars
ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo
WHERE venta_en_linea.ventas_detalle.evento_id='1'
AND venta_en_linea.ventas_detalle.tipo_boleto_tipo = 'Niños'
The line ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo is never gonna work because both values are different ("Niños" vs "Niños").
It appears the application which writes to the boleteria database is not storing correct UTF-8. The database column character set refers to how MySQL interprets strings, but your application can still write in other character sets.
I can't tell from your example exactly what the incorrect character set is, but assuming it's Latin-1 you can convert it to latin1 (to make it "correct"), then convert it back to "actual" utf8:
SELECT 1
FROM tipos_boletos, ventas_detalle
WHERE CONVERT(CAST(CONVERT(tipo USING latin1) AS binary) USING utf8)
= tipo_boleto_tipo COLLATE utf8_unicode_ci
I've seen this all too often in PHP applications that weren't written carefully from the start to use UTF-8 strings. If you find the performance too slow and you need to convert frequently, and you don't have an opportunity to update the application writing the data incorrectly, you can add a new column and trigger to the tipos_boletos table and convert on the fly as records are added or edited.
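A hedged sketch of that column-plus-trigger idea (the tipo_fixed column name is hypothetical; a matching BEFORE UPDATE trigger would cover edits):
// One-time schema change plus trigger; shown through mysqli for concreteness,
// but these statements can be run from any MySQL client.
$sql->query("ALTER TABLE tipos_boletos ADD COLUMN tipo_fixed VARCHAR(100)");
$sql->query(
    "CREATE TRIGGER tipos_boletos_fix BEFORE INSERT ON tipos_boletos
     FOR EACH ROW
     SET NEW.tipo_fixed =
         CONVERT(CAST(CONVERT(NEW.tipo USING latin1) AS binary) USING utf8)"
);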
I have a MySQL database (not mine). In this database all the encodings are set to utf-8, and I connect with charset utf-8. But when I try to read from the database I get this:
×¢×?ק 1
בית ×ª×•×’× ×” העוסק במספר שפות ×ª×•×›× ×”
× × ×œ× ×œ×¤× ×•×ª ×חרי 12 בלילה ..
What I supposed to get:
עסק 1
בית תוגנה העוסק במספר שפות תוכנה
נא לא לפנות אחרי 12 בלילה ..
When I look in phpMyAdmin, I see the same thing (the connection in phpMyAdmin is utf-8 as well).
I know that the data is supposed to be in Hebrew. Does anyone have an idea how to fix this?
You appear to have UTF-8 data that was treated as Windows-1252 and subsequently converted to UTF-8 (sometimes referred to as "double-encoding").
The first thing that you need to determine is at what stage the conversion took place: before the data was saved in the table, or upon your attempts to retrieve it? The easiest way is often to SELECT HEX(the_column) FROM the_table WHERE ... and manually inspect the byte-encoding as it is currently stored:
If, for the data above, you see C397C2A9... then the data is stored erroneously (an incorrect connection character set at the time of data insertion is the most common culprit); it can be corrected as follows (being careful to use data types of sufficient length in place of TEXT and BLOB as appropriate):
1. Undo the conversion from Windows-1252 to UTF-8 that caused the data corruption:
   ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET latin1;
2. Drop the erroneous encoding metadata:
   ALTER TABLE the_table MODIFY the_column BLOB;
3. Add corrected encoding metadata:
   ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET utf8;
Take care to insert any data correctly in the future, or else the table will be partly encoded in one way and partly in another (which can be a nightmare to try and fix).
If you're unable to modify the database schema, the records can be transcoded to the correct encoding on-the-fly with CONVERT(BINARY CONVERT(the_column USING latin1) USING utf8), but I strongly recommend that you fix the database when possible instead of leaving it containing broken data.
However, if you see D7A2D73F... then the data is stored correctly and the corruption is taking place upon data retrieval; you will have to perform further tests to identify the exact cause. See UTF-8 all the way through for guidance.
I'm scraping data from multiple pages and inserting it into my MySQL database. There could be duplicates; I only want to store unique entries. Just in case my primary key isn't sufficient, I put in a test which is checked when I get a MySQL 1062 error* (duplicate entry on primary key**). The test checks that all of the pieces of the tuple to be inserted are identical to the stored tuple. What I found is that when I get the 1062 error, the stored tuple and the scraped tuple differ by only one element/field, a TEXT field.
First, I retrieved the already-stored entry and passed both strings into htmlspecialchars() to compare the output visually; they looked identical.
According to strlen(), the string retrieved from the DB was 304 characters in length but the newly scraped string was 305. similar_text() backed that up by returning 304***.
So then I looped through one string comparing character for character with the other string, stopping when there was a mismatch. The problem was the first character. In the string coming from the DB it was N yet both strings appear to start with I (even in their output from htmlspecialchars()). Plus the DB string was supposedly one character shorter, not longer.
I then checked the output (printing htmlspecialchars()) and the strlen() again, but this time before the original string (the one that ends up in the DB) was inserted, and before the duplicate was inserted. They looked the same as before, and strlen() returned 305 for both.
So this made me think there must be something happening between my PHP and my MySQL. So instead of comparing the newly scraped string to the string in the database with the same primary key (the ID), I tried to retrieve a tuple where every single field is equal to its respective part of the newly scraped data, like SELECT * FROM table WHERE value1='{$MYSQL_ESCAPED['value1']}' .... AND valueN='{$MYSQL_ESCAPED['valueN']}'; and the tuple is returned. Therefore they are identical in every way, including that problematic TEXT field.
What's going on here?
Straight away, when I see N in front of a string I think of NVARCHAR etc. from MSSQL, but as far as I know that's not a part of MySQL, so...
Could it have anything to do with the fact that "Each TEXT value is stored using a two-byte length prefix that indicates the number of bytes in the value."?
Or does this just point to a character encoding problem?
Edit:
There are no multi-byte characters stored in the database.
mb_strlen() returns the same results as strlen() where mentioned above.
Using utf8_encode() or mb_convert_encoding() before inserting to the DB makes no difference; an invisible N is still prefixing the string retrieved from the DB.
Notes:
Before inserting any string into my database I pass it through mysql_real_escape_string(trim(preg_replace('/\s\s+/', ' ', $str))), which replaces double spaces with single spaces, removes leading & trailing spaces and escapes it for MySQL insertion.
The page I print the output & testing to is UTF-8.
Upon creation, my DB has its character set set to utf8, its collation to utf8_general_ci and I use the SET NAMES 'utf8' COLLATE 'utf8_general_ci'; command too, as a precaution.
Footnotes:
* I force an exit from the scraping at that point, too.
** The primary key is just an ID (VARCHAR(10)) which I scrape from the pages.
*** Number of common characters
TEXT fields are subject to character set conversion as/when MySQL sees fit. However, MySQL will not randomly add/remove data without a reason. While text fields DO store the length of the data as 2 extra bytes at the head of the on-disk data blob containing the text field data, those 2 bytes are NEVER exposed to the end user. Assuming character set settings are the same throughout the client->database->on-disk->database->client pipeline, there should never be a change in string length anywhere.
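Building on that, a hedged diagnostic sketch for pinning down where the change happens (mysqli for brevity; the table, column and ID are placeholders):
// Compare what PHP receives against what MySQL stores; HEX() and LENGTH()
// are computed server-side and bypass any connection-level conversion.
$row = $sql->query(
    "SELECT the_text, HEX(the_text) AS hx, LENGTH(the_text) AS stored_bytes
     FROM the_table WHERE id = 'ABC123'"
)->fetch_assoc();

echo strlen($row['the_text']), "\n";  // bytes after any connection conversion
echo $row['stored_bytes'], "\n";      // bytes actually stored on disk
echo (bin2hex($row['the_text']) === strtolower($row['hx']))
    ? "no conversion in transit\n"
    : "conversion happened in transit\n";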