I have an interesting issue. The following two pieces of code do not produce the same output:
$result = $sql->QueryFetch("SELECT machinecodeSigned FROM ...");
echo bin2hex($result['machinecodeSigned']);
and
$result = $sql->QueryFetch("SELECT HEX(machinecodeSigned) FROM ...");
echo $result['machinecodeSigned'];
So, $sql is just a wrapper class, and its QueryFetch method internally just calls the standard PHP query and fetch functions to obtain the values.
I get two different results, though. For example, for some arbitrary input in my database, I get:
08c3bd79c3a0c2a66fc2bb375b6370c399c3acc3ba7bc2b8c2b203c39d70
and
08FD79E0A66FBB375B6370D9ECFA7BB8B203DD70
Ignoring case-sensitivity, the first output is nonsense while the other one is correct.
machinecodeSigned is a char(255) field with the latin1 character set and a latin1 collation (which should not play a role, I assume).
What could be the reason that I get two different results? This used to yield the same results for years, but suddenly I had to change the code from version 1 to version 2 in order for it to produce the correct result. It seems as if PHP does some arbitrary conversion of the bytes in the string.
Edit: It seems necessary to say that the field is not human-readable. In any case, since the second output is the correct one, feel free to convert the hexadecimal form to ASCII characters, if this helps you.
Edit:
SHOW CREATE TABLE yields:
CREATE TABLE `user` (
`ID` int(9) NOT NULL AUTO_INCREMENT,
`machinecodeSigned` char(255) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10092 DEFAULT CHARSET=latin1 COLLATE=latin1_german2_ci
char(255) CHARACTER SET latin1 COLLATE latin1_bin
will read/write bytes unchanged. It would be better to say BINARY(255), or perhaps something else.
If you tell the server that your client wants to talk in "utf8", and you SELECT that column, then MySQL will translate from latin1 (the charset of the data) to utf8 (the encoding you say the client wants). This leads to the longer hex string.
You say that phpmyadmin says "utf8" somewhere; that is probably the cause of the confusion.
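As a quick sanity check, here is a minimal sketch (plain mysqli rather than your wrapper; the connection details and the WHERE clause are placeholders) of two ways to read the raw bytes without the latin1-to-utf8 translation: either declare the client charset as latin1, or let the server hex-encode the value before it crosses the connection:
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
// Option 1: tell MySQL the client speaks latin1, so no charset translation happens.
$db->set_charset('latin1');
$row = $db->query("SELECT machinecodeSigned FROM user WHERE ID = 1")->fetch_assoc();
echo bin2hex($row['machinecodeSigned']);
// Option 2: have MySQL hex-encode the raw column bytes itself.
$row = $db->query("SELECT HEX(machinecodeSigned) AS mc FROM user WHERE ID = 1")->fetch_assoc();
echo $row['mc'];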
If it had been stored as base64, there would be no confusion because base64 uses very few different characters, and they are encoded identically in latin1 and utf8. Furthermore, latin1_bin would have been appropriate. So, another explanation of what went wrong is the unwanted reconversion from base64 to binary.
MySQL's implementation of latin1_bin is simple and permissive -- all 256 byte values are simply stored and loaded, unchecked. This makes it virtually identical to BLOB and BINARY.
This is probably the base64_encode that should have been stored:
MDhGRDc5RTBBNjZGQkIzNzVCNjM3MEQ5RUNGQTdCQjhCMjAzREQ3MA==
Datatypes starting with VAR or ending with BLOB or TEXT are implemented via a 'length' field plus the bytes needed to represent the value.
On the other hand, CHAR and BINARY are fixed length, and padded by spaces (CHAR) or \0 (BINARY).
So, writing binary info to CHAR(255) may actually modify the data, because of the spaces appended as padding.
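If you do decide to switch the column to a true byte type, the change is a one-line ALTER. This is only a sketch (take a backup and check for values that already picked up padding before running it); VARBINARY avoids the \0 padding that a fixed BINARY(255) would add:
ALTER TABLE `user`
  MODIFY `machinecodeSigned` VARBINARY(255) DEFAULT NULL;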
Related
Here's my situation.
I'm migrating from one server to another. As part of this, I'm moving across database data.
The migration method involved running the same CREATE TABLE query on the new server, then using a series of INSERT commands to insert the data row by row. It's possible this resulted in different data; however, the CHECKSUM command was used to validate the contents. CHECKSUM was run on the whole table after the transfer, on a new table with that row isolated, and after truncating the string with the LEFT operator. Every time, the result was identical between the old and new server, indicating the raw data should be exactly identical at the byte level.
CHECKSUM TABLE `test`
I've checked the structure and it's exactly the same as well.
SHOW CREATE TABLE `test`
Here is the structure:
CREATE TABLE test ( item varchar(32) COLLATE utf8_unicode_ci NOT NULL, amount mediumint(5) NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The field is of type:
`item` varchar(32) COLLATE utf8_unicode_ci NOT NULL
Here is my connection code in PHP:
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
When I go to retrieve the data in PHP with a simple query:
SELECT * FROM `test`
The data displays like this:
§lO
On the old server/host, I get this sequence of raw bytes:
Decimal: -194-167-108-79-
HEX: -C2-A7-6C-4F-
And on the new server, I get a couple of extra bytes at the beginning:
Decimal: -195-130-194-167-108-79-
HEX: -C3-82-C2-A7-6C-4F-
Why might the exact same raw data, table structure, and query, return a different result between the two servers? What should I do to ensure that results are as consistent as possible in the future?
Â§lO is "Mojibake" for §lO. I presume the latter (3-character form) is "correct"?
Your statement "The raw data looks like this (in both cases when I display it)" is bogus, because the technique used for displaying it probably messed with the encoding.
Since the 3 characters became 4 bytes and then 6 bytes, you probably have "double-encoding".
This discusses how "double encoding" can occur: Trouble with UTF-8 characters; what I see is not what I stored
If you provide some more info (CREATE TABLE, hex, method of migrating the data, etc), we may be able to further unravel the mess you have.
More
When using mysqli, do $sql->set_charset('utf8');
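Applied to the connection code from the question, that is one extra line right after the error check (a sketch; on a modern server 'utf8mb4' would also be a reasonable choice):
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
    die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
// Declare the client charset explicitly so MySQL does not assume latin1.
$sql->set_charset('utf8');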
(The HEX confirms my analysis.)
"The migration method involved running the same CREATE TABLE query on the new server"
Was it preceded by some character set settings, as in mysqldump?
"then using a series of INSERT commands to insert the data row by row."
Can you get the HEX of some accented character in the file?
"... CHECKSUM ..."
OK, being the same rules out one thing.
"CHECKSUM was done on ... a new table with that row isolated"
How did you do that? SELECTing the row could have modified the text, thereby invalidating the test.
"indicating the raw data should be exactly identical at the byte level."
For checking the data in the table, SELECT HEX(col)... is the only way to bypass all possible character set conversions that could happen. Please provide the HEX for some column with a non-ascii character (such as the example given). And do the CHECKSUM against the HEX output.
And provide SHOW VARIABLES LIKE 'char%';
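Something along these lines, run on both servers, would do it (a sketch; `test_hex` is just a scratch table name I made up):
-- Raw bytes of the suspect value, bypassing any connection-charset conversion:
SELECT item, HEX(item) FROM `test`;
-- Checksum the hex form rather than the column itself:
CREATE TABLE `test_hex` AS SELECT HEX(item) AS item_hex, amount FROM `test`;
CHECKSUM TABLE `test_hex`;
-- Connection and server character set settings:
SHOW VARIABLES LIKE 'char%';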
I'm trying to convert a database to use utf8mb4 instead of utf8. Everything is going fine except one table:
CREATE TABLE `search_terms` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`search_term` varchar(128) NOT NULL,
`time_added` timestamp NULL DEFAULT NULL,
`count` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `search_term` (`search_term`),
KEY `search_term_count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=198981 DEFAULT CHARSET=utf8;
Basically all it does is save an entry every time somebody searches something in a form so we can track the number of searches, very simple.
There's a unique index on search_term because we want to only have one row per search term and instead increment the count value.
However when converting to utf8mb4 I am getting duplicate entry errors. Here is the command I am running:
ALTER TABLE `search_terms` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Looking in the database I can see various examples like this:
ｆm2012
fm2012
fm2012
In its current utf8 character set, these are all being treated as unique and exist within the database without ever having an issue with the unique index on search_term.
But when converting to utf8mb4 they are now being considered equal and throwing an error due to that index.
I can figure out how to merge these together easily enough, but I'm concerned this may be a symptom of a greater underlying problem. I'm not really sure how this has happened or what the consequences may be, so my questions are a bit vague:
Why is utf8mb4 treating these differently to utf8?
What are the possible consequences?
Is there some way I can do a conversion so things like "ｆm2012" never appear in my database and I only have "fm2012"? (I am also using Laravel 5.1.)
Your problem is the change of collation: you're using general_ci and you're converting to unicode_ci: general_ci is quite a simple collation that doesn't know much about unicode, but unicode_ci does.
The first "f" in your example string is a "Fullwidth Latin Small Letter F" (U+FF46) which is considered equal to "Latin Small Letter F" (U+0066) by unicode_ci but not by general_ci.
Normally it's recommended to use unicode_ci exactly because of its unicode-awareness but you could convert to utf8mb4_general_ci to prevent this problem.
To prevent this problem in the future, you should normalize your input before saving it in the DB. Normally you'd use NFC, but your case seems to call for NFKC. This should bring all "equivalent" strings to the same form.
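In PHP that normalization can be done with the intl extension's Normalizer class; a minimal sketch, assuming ext-intl is installed (NFKC is the form that folds the fullwidth letters back onto their ASCII counterparts):
// Requires the intl extension (Normalizer).
$raw = 'ｆm2012';                                        // starts with a fullwidth "f" (U+FF46)
$normalized = Normalizer::normalize($raw, Normalizer::FORM_KC);
var_dump($normalized);                                   // string(6) "fm2012"
Running every search term through this before the INSERT (or before the uniqueness check) keeps the visually identical variants from accumulating in the first place.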
Despite what was said previously, it is not just about general_ci being more simplistic than unicode_ci. That may be true, but the real issue is that you need to keep the new collation matching the collation sub-type you already have.
For example, my database is utf8_bin. I cannot convert to utf8mb4_unicode_ci or to utf8mb4_general_ci; those commands throw a duplicate-key error. However, the corresponding collation utf8mb4_bin completes without issues.
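For a utf8_bin table, the conversion that keeps the binary comparison semantics would look like this (a sketch, using the table name from the question):
ALTER TABLE `search_terms` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;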
My system deals with spanish data. I am using laravel + mysql. My database collation is latin1 - default collation and my tables structure looks something like this:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) CHARACTER SET latin1 NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=298 DEFAULT CHARSET=utf8mb4;
Have a few questions:
I load data from a file into the db. Is it good practice to utf8_encode($name) before inserting into the db? I am currently doing so; otherwise some comparisons throw the error: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_unicode_ci,COERCIBLE) for operation '='
If using utf8_encode is the way to go, do I need to utf8_encode even the name I want to search for? i.e. select... where name = utf8_encode(name)?
Are there any flaws in this, or a better way to handle the above? I am working with Spanish (characters with accents) for the first time.
Your product.name column has the character set latin1. You know that. It also has the collation latin1_swedish_ci. That's the default. The original developers of MySQL are Swedish. Because you're working in Spanish, you probably want to use latin1_spanish_ci for your collation; it sorts Ñ after N. The other Latin-language collations sort them together.
Because your product.name column is stored in latin1, it is a bad, not a good, idea to use utf8_encode() on text before storing it to that column.
Your best course of action, especially if your application is new, is to make the character set for all columns utf8mb4. That means changing the defined character set of your name column. Then you can convert text strings to unicode before storing them.
You probably would be wise to make the default collation of each table utf8mb4_spanish_ci as well. Collations get baked into indexes for varchar() columns. (If you're working in traditional Spanish, in which ch is a distinct letter, use utf8mb4_spanish2_ci.)
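A hedged sketch of that change, using the table from the question (back up first, and drop the utf8_encode() calls on insert at the same time, otherwise you will double-encode):
ALTER TABLE `product` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_spanish_ci;
On the Laravel side, the matching connection settings are the 'charset' and 'collation' entries in config/database.php, e.g. 'charset' => 'utf8mb4' and 'collation' => 'utf8mb4_spanish_ci', so the client sends UTF-8 as-is.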
I'm trying to support UTF-8 characters in email addresses. If I understand correctly, email addresses are limited to 254 usable (ASCII) characters. Based on this, I would like to store email addresses in a VARCHAR(254) ASCII MySQL InnoDB column. One of the problems I'm encountering is validating such scenarios. I'm trying to convert UTF-8 to ASCII but getting mixed results as shown below (I know the example is not a valid email, but I could have used other characters - this is just to explain the problem):
<?php
$string = '🐼@🐼.🐼';
echo 'UTF-8 Value: ' . $string . '<br/>';
echo 'ASCII Length (from UTF-8 string):' . mb_strlen($string, 'ASCII') . '<br/>';
$stringAscii = mb_convert_encoding($string, 'ASCII', 'UTF-8');
echo 'ASCII Length:' . strlen($stringAscii) . '<br/>';
echo 'ASCII Value:' . $stringAscii . '<br/>';
The output is:
UTF-8 Value: 🐼@🐼.🐼
ASCII Length (from UTF-8 string):14
ASCII Length:5
ASCII Value:?@?.?
I would expect the length to be 14 characters in the ASCII string once it's converted? How can I convert the UTF-8 string to ASCII without losing its original length and value? Basically I'm looking for a way to store a UTF-8 string into its ASCII format while being able to convert it back to its original UTF-8 format.
I also tried other types of encoding output (e.g. byte outputs) but was unable to find any output matching the 14-character length. I also tried iconv, which throws exceptions for these characters. The idea of converting to ASCII is that I can support this value as the primary key of a MySQL table within my VARCHAR(254). I could always try to convert to HTML-ENTITIES, but it would be hard to predict the maximum size of the string to reflect it in the DB schema.
Another option is to use a utf8mb4-encoded VARCHAR(256) column in MySQL, but when used as a primary key this will go above the 767-byte index limit and require enabling large indexes in InnoDB, which I would prefer to avoid.
Is there a way to achieve what I'm trying to do without using innodb_large_prefix=on in MySQL?
Nicholas, you seem to have some fundamental confusion about ASCII vs UTF-8 character sets in your question and your comments to the answer(s).
UTF-8 Value: 🐼@🐼.🐼
ASCII Length (from UTF-8 string):14
ASCII Length:5
ASCII Value:?@?.?
I would expect the length to be 14 characters in the ASCII string once it's converted?
No. If the panda face UTF-8 character were represented in ASCII, how would it be represented? At best this would be subjective, such as with a <3 or a B-) etc.
There is no translation for the panda face, hence it is substituted with the placeholder ? in the output character set. It is somewhat like trying to spell "king" using only vowels. There are simply fewer ASCII options than UTF-8 ones.
So please take away that ASCII is, practically speaking, a subset of UTF-8, not vice versa.
MySQL Unique Storage Solution
MySQL unique indexes have a limit of 767 bytes in total. You can chain these indexes together, and for any table MySQL can provide a total unique index of 3072 bytes. For a single index column with the collation utf8mb4_unicode_ci (i.e. the one you should be using), the maximum unique index length is:
<max index size in bytes> / <max bytes per character in collation>
767 / 4 = 191 characters.
Therefore MySQL will only uniquely index the first 191 characters of any utf8mb4 string.
To sidestep this limiter, you would make a new table, with two columns, an Auto_increment integer column, and a varchar column:
CREATE TABLE `emails` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Then each time a new email address is added, you search this table if it already exists (the column is indexed but is not unique) and if not, then the email address is inserted and referenced by the id column.
The email column is utf8mb4 because that is full UTF-8, unlike MySQL's standard utf8 character set. MySQL can't uniquely index data larger than 767 bytes, as you have said, but if your various other tables reference the id of the email row, that column on the other tables can be unique.
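A rough sketch of that lookup-or-insert step with mysqli prepared statements (column and table names match the `emails` table above; error handling is omitted and get_result() assumes the mysqlnd driver):
// Assumes $db is a mysqli connection with $db->set_charset('utf8mb4') already applied.
function emailId(mysqli $db, string $email): int
{
    $stmt = $db->prepare('SELECT id FROM emails WHERE email = ? LIMIT 1');
    $stmt->bind_param('s', $email);
    $stmt->execute();
    $row = $stmt->get_result()->fetch_assoc();
    if ($row) {
        return (int) $row['id'];          // address already known, reuse its id
    }
    $stmt = $db->prepare('INSERT INTO emails (email) VALUES (?)');
    $stmt->bind_param('s', $email);
    $stmt->execute();
    return (int) $db->insert_id;          // id of the freshly inserted address
}
Because the email column is only indexed, not unique, two concurrent inserts could still race each other; a transaction or a retry loop around this function is left out of the sketch.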
Some Further thoughts
1 htmlentities is not an effective solution, because for any character the size of its entity is always bigger. Take the > character, whose entity &gt; is already 4 characters long in the best case; even if each of those could be stored in "1 byte", this would still take more storage than >, which as a general UTF-8 character would at worst be 4 bytes.
htmlentities will only affect characters that have a specified HTML alternative, and I'm unsure whether things like <PandaFace> or <shitpoo> have htmlentities(?).
2 What is the longest email address you have ever seen, or even ever used, that is a real, genuine address? The maximum size of an email address is 254 ASCII characters, that is:
thisisaverylongandtediousemailaddresswhichisprettyimpractical.
andonlyreallyworth.jacksquitintheamount.ofspacethiscantakeupinyourdatabase
@home.somewhere.overtherainbow.ornear.somepot.of.irishgold.thinkaboutthis.
thisemailisthemaximumlengthallowed.co.uk.com
Now look at that: it is the longest allowed ASCII email address, by definition. That's pretty long, and while not impossible, the number of users who have email addresses (in ASCII) of this length will be an extreme edge case.
Taking this a step further down this line, say that you have an email address of 64 4-byte UTF-8 characters, the upper limit you've set.
So, as ASCII, something of this length:
horsesandgoastandcatsanddogsandfleas@some.petting.zoo.org.uk.com
Even as 4-byte UTF-8 characters, say if the above email were translated into certain Chinese character ranges, an address of this length is still at the upper range of what is practical for humans to actually use and have as their address. It is not quite out of the park, but it is unlikely unless you're aiming at a specific market audience.
The MySQL unique indexing limit of 767 bytes would limit you to approximately 191 4-byte UTF-8 characters; you'd then be limited to 47 fully 4-byte UTF-8 characters in an email address featuring 2 (well, max 3) non-4-byte characters (such as @ and .).
Example:
thisIsAnEmailOfUTF8CharasandA@IntheRightPlace.com
Now remember that this email doesn't look that long; it's of a more realistic size than the others. But each and every character (except . and @) would need to use a 4-byte UTF-8 encoding for this to hit the MySQL unique index limit, for example if each of the characters in the email came from a non-Latin script such as Ethiopic or certain Chinese character ranges.
3
It is also worth noting that Chinese (and, I think, Japanese) characters are each words or syllables in their own right (therefore bigger units than single letters), so, I hazard, few Chinese users would have excessively long email addresses; instead you'd have:
猫@空间农场.com
This is donkey@spacefarm.com*, taking up 10 character spaces in Chinese, whereas the ASCII Latin form takes up 20 character spaces.
Further to this, there are some (sub)sets of Chinese and Japanese characters that are still not present in the UTF-8 standard (annoyingly, the example above is one of these).
*^ Google translation, so may be wrong!
Some Conclusion Options
Store your email in plain UTF-8 text in a dedicated table with a unique auto-increment (AI) column, as outlined above. Reference/cross-reference that AI id number to establish whether the email text is already present anywhere else in the database. Do not make the email column itself unique; simply index it, and make the columns that reference it unique instead.
Store the email address as a hash and check whether the hash is unique, for example with sha1 in PHP (a short sketch is given under Option 1 below). SHA-1 is better than MD5 because it is a longer hash, so it can accept more values without collisions (although collisions are still possible). SHA-1 hashes are always 160 bits, or 40 hex characters, and therefore fit comfortably within the MySQL unique index constraints.
Store your email address in a VARCHAR(190) column and expect that to cover 98%+ of your database's users.
The MySQL unique index limit is less likely to affect your emails than the criteria for valid email length are.
You may be able to get away with using email addresses that are technically of questionable validity, but whether these are accepted by mail routers and DNS servers is pretty much up to each server.
Email is an old and anachronistic way of transporting data. Consider that the future will look more like SnapChat [example] and other database-based authenticated communications, which have few of the curtailing limits that email inherits. Email is also very tedious to code around and prone to a wide variety of issues, errors and problems, as well as carrying extremely poor security overheads.
MySQL Storing The Email Address
Option 1 ) Hash the email address and store the hash in a unique column.
Positives:
This will mean you can store the email in the same column as you'd originally intended. The stored value is a fixed-length SHA hash, so the MySQL unique column constraint remains valid.
Negatives:
Hash collisions are possible, and the email address itself is not searchable or "de-codable".
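A minimal sketch of Option 1 (the column name email_hash is hypothetical, not something from the question; whether to lower-case the address before hashing is a design choice you'd have to make):
// A fixed-length SHA-1 fits comfortably in a CHAR(40) column with a UNIQUE index.
$email = 'horsesandgoastandcatsanddogsandfleas@some.petting.zoo.org.uk.com';
$hash  = sha1($email);      // always 40 hex characters, whatever the address length
// e.g. INSERT INTO users (email_hash, ...) VALUES ('<the 40-char hash>', ...)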
Option 2 ) Store the email address plaintext in the UTF-8 column and simply limit the email VARCHAR field size to 190 characters.
Positives:
This would probably cover all likely valid email addresses.
Negatives:
Longer email addresses would be invalid and truncated, meaning they would be saved without error but would not be the same text strings (due to truncation).
Option 3 ) Store the Email in a new MySQL table with an indexed VARCHAR column and an auto_increment numerical reference column as detailed above.
This would then mean that any occurrence of the email text is replaced by a numerical reference to that row in the database. The column that references the email row can then carry a unique index.
Positives:
This means you can store emails as unique entities and can carry out SQL checks for if they already appear.
Negatives:
This would mean changing your current coding and SQL commands slightly to accommodate this new table as a reference table.
Example
Email Reference Table:
CREATE TABLE `email_reference` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Users (example) table:
CREATE TABLE `userdata` (
`user_id` int(8) NOT NULL AUTO_INCREMENT,
`name` varchar(90) COLLATE utf8mb4_unicode_ci NOT NULL,
`email_ref` int(11) DEFAULT NULL,
`details` text COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`user_id`),
UNIQUE KEY `email_ref` (`email_ref`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The above userdata table will have a unique column for email ref which will reference the email table. This unique column means no two userdata rows can reference the same row in the email_reference table.
Because it is a UNIQUE column, it is a good idea to allow NULL values for cases where someone, for whatever reason, doesn't have an email, or for other such "uniqueness escape" situations.
The long and the short of my long post is that I think your concerns are mostly edge cases, or due to imperfect database structural design, rather than issues with character sets or unique keys themselves. If what you're envisaging for your system is not an edge case, then the MySQL AI int reference system I have outlined above should, with a little bit of foresight on your part, cover your needs.
I'm adding the missing details in my own answer (special thanks to Ignacio, andig, Martin and Markus Laire for helping me put the pieces of this puzzle together).
There are two problems in this question:
Encoding conversion from UTF-8 to ASCII
The MySQL index limit of 767 bytes when innodb_large_prefix is not enabled, for MySQL < 5.7.7 (it looks like this is now enabled by default).
Answer for "Encoding conversion from UTF-8 to ASCII"
ASCII is a subset of UTF-8, so not all characters can be converted. ASCII only uses 128 values per byte (the first 128), while UTF-8 bytes can use more. The ideal solution would be an encoding that supports all 256 possibilities of an 8-bit byte. Some encodings like cp1252 cover most of them, but even so, some characters are invisible, which could end up causing issues.
For a true byte-by-byte conversion the only reliable option is to use binary. For our use case, given we use MySQL, the best option would be a VARBINARY(254) column (binary fields don't have an encoding). After that it would be easy to simply:
INSERT into user_table set email_address='🐼@🐼.🐼';
SELECT * FROM user_table where email_address = '🐼@🐼.🐼';
To be safe, values can also be passed through HEX('🐼') on the client or application side if needed. This is truly the most efficient solution for this problem, given that you will only store the email address in a 254-byte column, which by the RFC standard is the maximum length regardless of encoding.
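As a sketch of how that looks from PHP with mysqli prepared statements (table and column names follow the SQL above; the VARBINARY column simply stores whatever bytes the client sends):
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8mb4');     // connection charset; the bytes land in VARBINARY unchanged

$email = '🐼@🐼.🐼';
$stmt = $db->prepare('INSERT INTO user_table SET email_address = ?');
$stmt->bind_param('s', $email);
$stmt->execute();

$stmt = $db->prepare('SELECT email_address FROM user_table WHERE email_address = ?');
$stmt->bind_param('s', $email);
$stmt->execute();
$row = $stmt->get_result()->fetch_assoc();
echo $row['email_address'];      // the original UTF-8 bytes come back untouched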
Answer for "MySQL index limit to 767 bytes"
It looks like InnoDB large prefixes are now the default configuration in MySQL >= 5.7.7, since this was a mostly backward-compatible setting. While one could implement a complex UTF-8 to HTML-ENTITIES conversion, it probably makes more sense to just upgrade MySQL when using a UTF-8 email address as a primary key. Or one could simply enable large prefixes in the MySQL configuration for MySQL < 5.7.7:
innodb_large_prefix=on
innodb_file_format=barracuda
Conclusion
Keep in mind that while some providers support UTF-8 in email addresses, it is still not mainstream in 2016. In the meantime there are a few options for storing the information, but fewer for making sure it will reach its destination.
You cannot "convert" a UTF8 string to ASCII at the same length if the characters do not have an ASCII representation as in your example.
What you could do is to create some kind of representation of the bytecodes that make up the UTF8 characters. I doubt that would be useful as email address though.
UPDATE
In UTF-8 each character can consume multiple bytes; how many varies by character. In ASCII, one character is one byte. So you could take each byte of the UTF-8 character and see what character that byte represents in ASCII. However, this will have absolutely nothing to do with the original UTF-8 character, except for those UTF-8 characters that are represented by a single byte; IMHO those will match their ASCII representation.
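If a reversible, pure-ASCII representation of those bytes is really wanted, hex encoding is the simplest sketch of that idea, at the cost of doubling the length (so a VARCHAR(254) ASCII column would only cover addresses up to 127 bytes):
$utf8  = '🐼@🐼.🐼';
$ascii = bin2hex($utf8);    // "f09f90bc40f09f90bc2ef09f90bc", ASCII-safe, 28 characters
$back  = hex2bin($ascii);   // byte-for-byte identical to the original UTF-8 string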
I currently have the following snippet of text in a text paragraph for my website:
let’s get to it
The apostrophe character is part of the UTF-8 charset, and it saves properly in a table column that is designated a VARCHAR column, in the form
let&rsquo;s get to it
Which is properly parsed by my client. If I put the same text into a TEXT column in MySQL, it's stored as the following:
letâ€™s get to it.
Is there any reason the two would differ, and if so, how can I change it?
letâ€™s is Mojibake. Latin1 is creeping in.
"text blob" -- which is it, TEXT or BLOB? They are different datatypes.
let&rsquo;s comes from htmlentities() or something equivalent. That can be stored and retrieved in VARCHAR, TEXT, or BLOB, regardless of CHARACTER SET. MySQL will not convert to that.
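For reference, a call like this produces that entity form (a sketch; the exact flags depend on the code that stored it):
echo htmlentities("let’s get to it", ENT_QUOTES, 'UTF-8');
// prints: let&rsquo;s get to it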
The Mojibake probably came from:
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.