Mysql where condition in different languages other than English [duplicate] - php

In a table x, there is a column with the values u and ü.
SELECT * FROM x WHERE column='u'.
This returns u AND ü, although I am only looking for the u.
The table's collation is utf8mb4_unicode_ci. Wherever I read about similar problems, everyone suggests using this collation because utf8mb4 supposedly covers ALL characters. With this collation, all character set and collation problems should be solved.
I can insert ü, è, é, à, Chinese characters, etc. When I make a SELECT *, they are also retrieved and displayed correctly.
The problem only occurs when I COMPARE two strings as in the example above (SELECT ... WHERE) or when I use a UNIQUE INDEX on the column. With the UNIQUE INDEX, a "ü" is not inserted when there is already a "u" in the column. So, when MySQL compares u and ü to decide whether the ü is unique, it considers it the same as the u and doesn't insert the ü.
I changed everything to utf8mb4 because I don't want to worry about character sets and collation anymore. However, it seems that utf8mb4 isn't the solution either when it comes to COMPARING strings.
I also tried this:
SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column.
This code is executable (looks pretty sophisticated). However, it also returns ü AND u.
I have talked to some people in India and here in China about this issue. We haven't found a solution yet.
If anyone could solve the mystery, it would be really great.
Add_On: After reading all the answers and comments below, here is a code sample which solves the problem:
SELECT * FROM x WHERE 'ü' COLLATE utf8mb4_bin = column
By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is invited to put the "binary glasses" (ending _bin) on when it looks at the characters in the column. With the binary glasses on, SQL sees now the binary code in the column. And the binary code is different for every letter and character and emoji which one can think of. So, SQL can now also see the difference between u and ü. Therefore, now it only returns the ü when the SELECT query looks for the ü and doesn't also return the u.
In this way, one can leave everything (database collation, table collation) the same, but only add "COLLATE utf8mb4_bin" to a query when exact differentiation is needed.
(Actually, SQL takes all other glasses off (utf8mb4_german_ci, _general_ci, _unicode_ci etc.) and only does what it does when it is not forced to do anything additional. It simply looks at the binary code and doesn't adjust its search to any special cultural background.)
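For reference, here is a minimal sketch of the whole behavior (table and column names invented; it assumes a utf8mb4 connection):
CREATE TABLE x (col VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci);
INSERT INTO x (col) VALUES ('u'), ('ü');
SELECT * FROM x WHERE col = 'ü';                      -- returns u AND ü (accent-insensitive)
SELECT * FROM x WHERE col = 'ü' COLLATE utf8mb4_bin;  -- returns only ü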
Thanks everybody for the support, especially to Pred.

Collation and character set are two different things.
Character set is just an 'unordered' list of characters and their representation.
utf8mb4 is a character set and covers a lot of characters.
Collation defines the order of characters (it determines the result of ORDER BY, for example) and defines other rules (such as which characters or character combinations should be treated as the same). Collations are derived from character sets; there can be more than one collation for the same character set. (It is an extension to the character set, sort of.)
In utf8mb4_unicode_ci all (most?) accented characters are treated as the same character, which is why you get both u and ü. In short, this collation is accent insensitive.
This is similar to the way German collations treat ss and ß as the same.
utf8mb4_bin is another collation and it treats all characters as different ones. You may or may not want to use it as default, this is up to you and your business rules.
You can also convert the collation in queries, but be aware that doing so prevents MySQL from using indexes.
Here is an example using a similar, but maybe a bit more familiar part of collations:
The ci at the end of a collation name means Case Insensitive, and almost every ci collation has a counterpart ending in cs, meaning Case Sensitive.
When your column is case insensitive, the where condition column = 'foo' will find all of these: foo, Foo, fOo, foO, FOo, FoO, fOO, FOO.
Now if you try to set the collation to case sensitive (utf8mb4_unicode_cs for example), all the above values are treated as different values.
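A quick way to see the difference with literals only; this is a sketch assuming a utf8mb4 connection, with utf8mb4_bin standing in for a case-sensitive collation:
SELECT 'foo' = 'FOO' COLLATE utf8mb4_unicode_ci AS ci_match,  -- 1: case-insensitive
       'foo' = 'FOO' COLLATE utf8mb4_bin        AS cs_match;  -- 0: case (and accent) sensitive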
The localized collations (German, UK, US, Hungarian, whatever) follow the rules of the named language. In German, ss and ß are the same, and this is stated in the rules of the German language. When a German user searches for the value Straße, they will expect that software supporting the German language (or written in Germany) will return both Straße and Strasse.
To go further, when it comes to ordering the two words are equal; their meaning is the same, so there is no particular order between them.
Don't forget that the UNIQUE constraint is just another way of ordering/filtering values. So if there is a unique key defined on a column with a German collation, it will not allow inserting both Straße and Strasse, since by the rules of the language they are treated as equal.
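A minimal sketch of this effect, assuming MySQL's German phone-book collation utf8mb4_german2_ci (in which ß = ss; table name invented):
CREATE TABLE words (w VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_german2_ci, UNIQUE KEY (w));
INSERT INTO words (w) VALUES ('Strasse');  -- OK
INSERT INTO words (w) VALUES ('Straße');   -- duplicate key error: ß = ss under this collation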
Now let's look at our original collation, utf8mb4_unicode_ci. This is a "universal" collation, which means it tries to simplify everything: since ü is not a really common character and most users have no idea how to type it, this collation makes it equal to u. This simplification helps support most languages, but as you already know, such simplifications have side effects (in ordering, filtering, unique constraints, etc.).
The utf8mb4_bin collation is the other end of the spectrum. It is designed to be as strict as it can be. To achieve this, it literally uses the character codes to distinguish characters. This means each and every form of a character is different; this collation is implicitly case sensitive and accent sensitive.
Both approaches have drawbacks: the localized and general collations are designed for one specific language or to provide a common compromise. (utf8mb4_unicode_ci is the 'extension' of the old utf8_general_ci collation.)
The binary collation requires extra caution when it comes to user interaction. Since it is case and accent sensitive, it can confuse users who are used to getting the value 'Foo' when they search for the value 'foo'. As a developer, you also have to be extra cautious with joins and other features: a join on 'foo' = 'Foo' will return nothing, since 'foo' is not equal to 'Foo'.
I hope these examples and explanations help a bit.

utf8_collations.html lists what letters are 'equal' in the various utf8 (or utf8mb4) collations. With rare exceptions, all accents are stripped before comparing in any ..._ci collation. Some of the exceptions are language-specific, not Unicode in general. Example: In Icelandic É > E.
..._bin is the only collation that treats accented letters as different letters. Ditto for case folding.
If you are doing a lot of comparing, you should change the collation of the column itself to ..._bin, because when the COLLATE clause is used in WHERE, an index cannot be used.
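A sketch of that column change (names invented); once the column itself is _bin, no COLLATE clause is needed in the WHERE, and an index on the column remains usable:
ALTER TABLE x MODIFY col VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;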
A note on ß: ss = ß in virtually all collations. The exception is utf8_general_ci (which used to be the default), which treats them as unequal; that one collation makes no effort to treat any 2-letter combination (ss) as a single 'letter'. Also, due to a mistake in 5.0, utf8_general_mysql500_ci treats them as unequal.
Going forward, utf8mb4_unicode_520_ci is the best choice through version 5.7. For 8.0, utf8mb4_0900_ai_ci is 'better'. The "520" and "900" refer to Unicode standard versions, so there may be even newer collations in the future.

You can try the utf8_bin collation, and you shouldn't face this issue, but comparisons will be case sensitive. The _bin collations compare strictly: the characters are only decoded according to the selected encoding, and once that's done, comparisons are done on a binary basis, much like many programming languages compare strings.

I'll just add to the other answers that a _bin collation has its peculiarities as well.
For example, after the following:
CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');
this will fail:
INSERT INTO `dummy` (`key`) VALUES ('one ');
This is described in The binary Collation Compared to _bin Collations.
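For contrast, here is a sketch of the same experiment using the binary character set (a VARBINARY column), where trailing spaces do count; table name invented:
CREATE TABLE dummy_bin (`key` VARBINARY(255) NOT NULL UNIQUE);
INSERT INTO dummy_bin (`key`) VALUES ('one');
INSERT INTO dummy_bin (`key`) VALUES ('one ');  -- succeeds: the trailing space makes the bytes differ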
Edit: I've posted a related question here.


why MySQL query shows results both for Ù† and Ú† when I call Ú†?

I have a database table with a column where I categorized Persian alphabetic letters to SELECT with a MySQL WHERE clause later. Everything works fine for all letters, but I have a problem when selecting the letter (چ), which is stored as (Ú†) in the database, and (ن), which is stored as (Ù†).
At first I thought the problem came from inserting the same letters, but when I checked the database, the letters were stored with different encodings, I mean (Ù†) and (Ú†).
When I zoom in on these letters, the accent over the U is different. Both letters are echoed correctly on the webpage, but when I SELECT letters WHERE letter = 'چ', it returns letters with (ن) too!!!
All of the webpages that insert and read data from the DB are in UTF-8, and the database collation is utf8_persian_ci.
I can't find where the problem is. Any help is appreciated.
Mojibake. (or not; see below) Probably:
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
For PHP:
⚈ mysqli interface: mysqli_set_charset('utf8') function.
⚈ PDO interface: set the charset attribute of the PDO dsn or via SET NAMES utf8.
The COLLATION (e.g., utf8_persian_ci) is not relevant to Mojibake. It is relevant to how characters are compared and ordered.
Edit
You say "is stored as (Ù†)" -- How do you know? Most attempts to see what is stored are subject to the client fiddling with the bytes. This is a sure way to see what is there:
SELECT col, HEX(col) FROM tbl ...
For چ, the HEX should be DA86 for proper utf8 (or utf8mb4) encoding. If you get C39AE280A0, then you have "double encoding". In general, Arabic/Persian/Farsi should be of the form Dxyy.
If you read چ while connected with latin1, you will get Ú†, which is DA86 in latin1 encoding (Ú = DA and † = 86).
ن encodes as D986.
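You can reproduce the misreading entirely in SQL; a sketch assuming a utf8mb4 connection (MySQL's latin1 is really cp1252, where 0x86 is †):
SELECT CONVERT(BINARY(CONVERT('چ' USING utf8mb4)) USING latin1);  -- Ú† (bytes DA 86)
SELECT CONVERT(BINARY(CONVERT('ن' USING utf8mb4)) USING latin1);  -- Ù† (bytes D9 86)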
Double Encoding
I used hex(col) to send query and got C399E280A0 for ن and C39AE280A0 for چ .
So, you have "double encoding", not "Mojibake".
C399 is utf8 for Ù; E280A0 is utf8 for †. Your character was changed from latin1 to utf8 twice. Usually the end result is invisible to the outside world, but messed up in the table. That is because the SELECT decodes twice. However, since you are seeing only one decode, things are not that simple.
Caveat: You have a situation where I have not experimented; the advice I give you could be wrong.
Here's what probably happened.
The client had characters encoded as utf8 (good) hex: D986;
When inserting, the application lied by claiming that the client had latin1 encoding. (This is the old default.); D9 converted to Ù and 86 converted to †;
The column in the table declared CHARACTER SET utf8 (good). But now the Ù is stored as C399 and the † is stored as E280A0, for a total of 5 bytes;
When reading the connection claimed utf8 (good) for the client, so those 5 bytes were turned back into Ù†;
The client dutifully said the utf8 data was Ù†.
Notice the imbalance between the INSERT and the SELECT. You tagged this PHP; did PHP both write and read the data? Did it have a different setting for the charset for writing and reading?
The problem seems to be only in setting the charset for writing. It needed to be explicitly utf8, not defaulting to latin1.
But what about the data? If everything I said (about double encoding) matches what you have, then an UPDATE can fix the data. See my blog for the details.
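The repair usually has the shape sketched below (an assumption based on the double-encoding description above, not a verified drop-in fix; test it on a copy of the table first):
-- un-apply one round of encoding: render the stored utf8 text back to latin1 bytes,
-- then relabel those bytes as the utf8 they originally were
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);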
This is a typical result of using a locale-specific Unicode collation, in your case utf8_persian_ci. I expect that if you switch your collation to utf8_unicode_ci, it will work as expected.
If by any chance you want to get rid of the case insensitivity, you could switch to utf8_bin.
For further reference see the MySQL documentation.

PHP - bad encoded turkish characters in MySQL database

I am working on a turkish website, which has stored many malformed turkish characters in a MySQL database, like:
- ş as þ
- ı as ý
- ğ as ð
- Ý as İ
I cannot change the data in the database, because the database is updated daily and the new data will contain the malformed characters again. So my idea was to change the data in PHP instead of changing it in the database. I have tried the steps from:
Turkish characters are not displayed correctly
Fix Turkish Charset Issue Html / PHP (iconv?)
PHP Turkish Language displaying issue
PHP MYSQL encoding issue ( Turkish Characters )
I am using the PHP-MySQLi-Database-Class available on GitHub with utf8 as charset.
I have even tried to replace the malformed characters with str_replace, like:
$newString = str_replace ( chr ( 253 ), "ı", $newString );
My question is: how can I solve the issue without changing the characters in the database? Are there any best practices? Is it a good option just to replace the characters?
EDIT:
solved it by using
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9" />
2022 update. I did a wide search and found this solution, and it's working.
let's say your db_connection is $mysqli:
$mysqli = mysqli_connect($hostname, $username, $password, $database) OR DIE ("Baglanti saglanamadi!");
Just add this line right after it. It works like magic with all languages, even Arabic:
mysqli_set_charset($mysqli, 'utf8');
Two solutions are good
PHP MYSQL encoding issue ( Turkish Characters )
PHP Turkish Language displaying issue
You can also change the setting in phpMyAdmin:
Operations > Table options > Collation > select utf8_general_ci
If you have already created the tables, change the collation in the structure as well.
SELECT CONVERT(CONVERT(UNHEX('d0dddef0fdfe') USING latin5) USING utf8);  -- latin5 / iso-8859-9 shows ĞİŞğış
SELECT CONVERT(CONVERT(UNHEX('d0dddef0fdfe') USING latin1) USING utf8);  -- latin1 / iso-8859-1 shows ÐÝÞðýþ
You are confusing two similar encodings; see the first paragraph in https://en.wikipedia.org/wiki/ISO/IEC_8859-9 .
"Collation" is only for sorting. But first you need to change the CHARACTER SET to latin5. Then change the collation to latin5_turkish_ci. (Since that is the default for latin5, no action need be taken.)
This may suffice to make the change in MySQL (EDIT 3: NO, this is probably wrong):
ALTER TABLE tbl CONVERT TO CHARACTER SET latin5;
After seeing more of the issue, this "2-step ALTER" is probably correct instead:
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET latin5 ...;
Do that for each table. Be sure to test this on a copy of your data first.
The 2-step ALTER is useful when the bytes are correct but the CHARACTER SET label is not.
CONVERT TO should be used when the characters are correct but you want a different encoding (and CHARACTER SET). See Case 5.
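A hypothetical worked example of the 2-step ALTER (table and column names invented; the stored bytes stay untouched, only the character-set label changes):
ALTER TABLE news MODIFY COLUMN title VARBINARY(255) NOT NULL;
ALTER TABLE news MODIFY COLUMN title VARCHAR(255) CHARACTER SET latin5 NOT NULL;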
Edit 1
E7 and FD are ç and ý in cp1250, dec8, latin1, and latin2. FD in latin5 is ı. I conclude that your encoding is latin1, not latin5.
You say you cannot change the "scripts". Let's look at your limitations. Are you restricted on the INSERT side? Or the SELECT side? Or both? What is rendering the text; HTML? MySQL is willing to convert between latin1 and latin5 as you insert/select (based on a few settings). And/or you could lie to HTML (via a meta tag) to get it to interpret the bytes differently. Please spell out the details of the data flow.
Edit 2
Given that the HEX in the table is E7FD6B6172FD6C6D6173FD6E61, and it should be rendered as çıkarılmasına... Note especially that the second letter needs to show as ı (Turkish dotless small i), not ý (small y with acute), correct?
Start by trying
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-9"/>
That should give you the latin5 rendering, as you already found out. IANA Reference.
As for "Best practice", that would involve changing the way text is inserted. You have stated this as off-limits.
Apparently you have latin5 characters stored in a latin1 column. Since latin1 does not involve any checking, you can insert and retrieve latin5 characters without any trouble.
This does not address the desire to have Turkish collation. If necessary, I can probably concoct a way to specify Turkish ordering on particular statements; please provide a sample statement.
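Such a per-statement fix could look like the following sketch, assuming latin5 bytes sitting in a latin1 column (untested against your schema; names invented):
SELECT col FROM tbl
ORDER BY CONVERT(CAST(col AS BINARY) USING latin5) COLLATE latin5_turkish_ci;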

Reading special characters from Latin 1 Microsoft SQL database with php

I have a predefined Microsoft SQL database. The database collation is specified as SQL_Latin1_General_CP1_CI_AS.
The database version is: SQL Server Express 10.0.1600.22
Some of the tables have values with special characters (I assume UTF-8). I am reading these tables with PHP's mssql extension, and I end up with question marks in the output: ????? ???? ?????
I have tried playing with ini_set('mssql.charset','utf8'), with different encoding values such as windows-cp1251, windows-cp1252 with no luck.
I am not sure how to proceed. I guess I need the equivalent of MySQL's SET NAMES UTF-8, but I am not sure how to do that in MSSQL. Any ideas?
Converting the tables to utf8 unfortunately is not an option. The field type is nvarchar(250)
A lot of information is missing here; for example, the type of the field in the database (varchar, nvarchar?).
Try converting your field to a Unicode field if it's stored as a varchar field; for example:
Select cast (field1 as nvarchar(200)) as field1, ...
Also, even if the database collation is SQL_Latin1_General_CP1_CI_AS, each field can have its own specific collation. The database collation is mainly the default collation to be used when you don't specify the collation for a field.
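For instance, a column-level collation can be set with something like this hypothetical T-SQL sketch (table and column names invented):
ALTER TABLE dbo.MyTable
    ALTER COLUMN field1 NVARCHAR(250) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL;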
Well, the error could be in a variety of places. The first thing to check is that your web pages can display extended characters correctly by writing some; for example:
<html>...<p>é</p>...
Of course, it is unnecessary to test with HTML entities such as &eacute; because the problem is not there; you must use real encoded extended characters.
After that, you can check that your extended characters have been correctly stored in your database by checking their numerical values with the Unicode() function. You can also check that you can write an extended character using the NChar() function; for example:
Select Unicode (substring (field1, 1, 1)), Unicode (N'é'), NChar(233), ...
The Unicode() and NChar() functions are standard Microsoft SQL functions.
It has been many years since I last coded in PHP; however, from what I recall, the default configuration values were sufficient for displaying any extended characters coming from a SQL Server database using an nvarchar or ntext field.
Use the connection option: "CharacterSet" => "UTF-8" as stated at https://msdn.microsoft.com/en-us/library/cc626307(v=sql.105).aspx
It worked for me. No further conversion needed.

Finding the perfect database collation [duplicate]

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.
My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.
Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
What type of data would UTF-8 Binary be applicable to?
In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.
Here is the difference:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
Quoted from:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
For more detailed explanation, please read the following post from MySQL forums:
http://forums.mysql.com/read.php?103,187048,188748
As for utf8_bin:
Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
You should also be aware of the fact that with utf8_general_ci, using a varchar field as a unique or primary index and inserting two values like 'a' and 'á' will give a duplicate-key error.
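A minimal sketch of that duplicate-key behavior (table name invented, using utf8/utf8_general_ci as discussed):
CREATE TABLE t (s VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_general_ci, UNIQUE KEY (s));
INSERT INTO t (s) VALUES ('a');  -- OK
INSERT INTO t (s) VALUES ('á');  -- duplicate key error: á = a under this collation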
utf8_bin compares the bits blindly. No case folding, no accent stripping.
utf8_general_ci compares one codepoint with one codepoint. It does case folding and accent stripping, but no 2-character comparisons; for example, ij is not equal to the ligature ĳ in this collation.
The language-specific utf8_..._ci collations are sets of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll.
utf8_unicode_ci follows an old Unicode standard for comparisons. ij = ĳ, but ae != æ.
utf8_unicode_520_ci follows a newer Unicode standard. ae = æ.
See collation chart for details on what is equal to what in various utf8 collations.
utf8, as defined by MySQL, is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some Chinese characters. So you should really switch to utf8mb4 if you want to go much beyond Europe.
The above points apply to utf8mb4 after the suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred, or (in 8.0) utf8mb4_0900_ai_ci.
utf16 and utf32 are other Unicode encodings; there is virtually no use for them here.
ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.
Accepted answer is outdated.
If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.
utf8mb4 supports emojis for example, whereas utf8 might give you hundreds of encoding-related bugs like:
Incorrect string value: ‘\xF0\x9F\x98\x81…’ for column ‘data’ at row 1
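A quick sketch of the difference (table and column names invented; the 4-byte emoji only fits in utf8mb4):
CREATE TABLE emoji_test (data VARCHAR(10) CHARACTER SET utf8mb4);
INSERT INTO emoji_test (data) VALUES ('😁');  -- OK with utf8mb4; with CHARACTER SET utf8 this INSERT fails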
Really, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error with both 'utf8_unicode_ci' and 'utf8_general_ci'. You can save them only in a 'utf8_bin' collated column.
And the MySQL docs (http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) suggest the 'utf8_general_ci' collation in their examples:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci

Mixed UTF-8 and latin1 tables with PDO

There is an existing database/tables where I cannot change the charset. These tables use the collation latin1_swedish_ci, but there is UTF-8 data stored inside. For example, the string "fußball" (German football) is saved as "fuÃŸball". That's the part I cannot change.
My whole script works just fine with UTF-8 and its own UTF-8 tables, and I use PDO (MySQL) with a UTF-8 connection to query. But sometimes I have to query some "old" latin1 tables. Is there any "cool" way of solving this other than sending SET NAMES?
This is my very first question at stackoverflow! :-)
It's actually very easy to think that data is encoded in one way when it is actually encoded in some other way: any attempt to retrieve the data directly will first convert it to the character set of your database connection and then to the character set of your output medium. Therefore you should first verify the actual encoding of your stored data through either SELECT BINARY myColumn FROM myTable WHERE ... or SELECT HEX(myColumn) FROM myTable WHERE ....
Once you are certain that you have UTF-8 encoded data stored within a Windows-1252 encoded column (i.e. you are seeing 0xc39f where the character ß is expected), what you really want is to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:
Warning 
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
Henceforth MySQL will correctly convert selected data to that of the connection's character set, as desired. That is, if a connection uses UTF-8, no conversion will be necessary; whereas a connection using Windows-1252 will receive strings converted to that character set.
Not only that, but string comparisons within MySQL will be correctly performed. For example, if you currently connect with the UTF-8 character set and search for 'fußball', you won't get any results; whereas you would after the modifications above.
The pitfall to which you allude, of having to change numerous legacy scripts, only applies insofar as those legacy scripts are using an incorrect connection character set (for example, are telling MySQL that they use Windows-1252 whereas they are in fact sending and expecting receipt of data in UTF-8). You really should fix this in any case, as it can lead to all sorts of horrors down the road.
I solved it by creating another database handle in my DB class that uses latin1, so whenever I need to query the "legacy tables" I can use
$pdo = Db::getInstance();
$pdo->legacyDbh->query("MY QUERY");
# instead of
$pdo->dbh->query("MY QUERY");
if anyone has a better solution that also does not touch the tables... :-)
