Prepare unique values for utf8_unicode_ci in PHP - php

I need to read values from an ISO-8859-1 encoded file in PHP and use PDO to write them to a database table that's encoded utf8_unicode_ci and has a unique index. Sometimes the data is missing special chars which leads to duplicate key errors. Example: the data contains "Entrainement" and "Entraînement". Is there a PHP string function I can use to avoid this?
Preferably a conversion function so I don't have to iterate over the whole array to check if a value was already inserted.
Here is an example of what I'm trying to do:
$values = array("Entraînement", "Entrainement");
$db = new PDO("mysql:dbname=mydb;host=localhost;charset=utf8", "user", "pw");
$db->exec("SET NAMES 'utf8'"); // MySQL expects the charset name 'utf8', not 'UTF-8'
$stmt = $db->prepare("INSERT INTO mytable(myvalue) VALUES(?)");
$already_inserted = array();
foreach($values as $v) {
    $v = $v_inserted = iconv('iso-8859-1', 'utf-8', $v);
    // Do magic string conversion here
    // $v_inserted = collation_convert($v_inserted)
    if(isset($already_inserted[$v_inserted])) {
        continue;
    }
    if($stmt->execute(array($v))) {
        $already_inserted[$v_inserted] = true;
    }
}
This example should only insert "Entraînement" and skip over "Entrainement".
In the original program I'm using Doctrine ORM instead of PDO, so I can't do much in SQL. Also, I have special chars from the whole Latin1 range - French, German, Spanish, etc.
I can't change the DB field definition to utf8_bin because it's part of an ecommerce package - all sorts of things might break.

Well, you should definitely convert the values to UTF-8 and use a UTF-8 connection encoding. Otherwise your application cannot take advantage of UTF-8 at all, because it will only be able to send and receive the characters that ISO-8859-1 contains - a tiny subset of Unicode ☹.
That is unrelated to your issue*, though: in the unicode_ci collation, î is considered the same character as i.
If you need to consider them as different characters use some other collation:
SELECT 'î' = 'i' COLLATE 'utf8_unicode_ci'
//1
SELECT 'î' = 'i' COLLATE 'utf8_bin'
//0
There is no German** collation so I guess utf8_bin is what you want here.
*There is only an issue when the declared connection encoding does not match the encoding of the physical bytes you send over. I.e., if you send ISO-8859-1 bytes over a connection declared as UTF-8, you will get garbage if not an error - and vice versa.
**I looked that up from your profile; if you in fact need some other language, there might be a collation for that.
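For the "magic string conversion" placeholder in the question, one rough approach (my own sketch, not something the collation exposes) is to build a separate comparison key per value: decompose the UTF-8 string, strip the combining accent marks and fold case, which approximates the accent- and case-insensitive behaviour of utf8_unicode_ci for Latin1-range text. A minimal sketch of a hypothetical collation_key() helper, assuming the intl and mbstring extensions are available:
// Sketch only: this approximates utf8_unicode_ci (accent- and case-insensitive
// for Latin1-range text), it is not the collation's real algorithm.
function collation_key($utf8_string) {
    // NFD decomposition splits "î" into "i" plus a combining circumflex ...
    $decomposed = Normalizer::normalize($utf8_string, Normalizer::FORM_D);
    // ... the combining marks (Unicode category Mn) are then stripped ...
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
    // ... and case is folded, so "Entraînement" and "Entrainement" collide.
    return mb_strtolower($stripped, 'UTF-8');
}

// Usage in the question's loop:
// $v_inserted = collation_key($v);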

Related

Query with email headers from special Latin characters rejected by PHP mysqli_query and MariaDB command line, works in HeidiSQL

I have encountered a scenario where a query built from an email from someone in Europe keeps failing to execute. After minimizing the query I've determined that once all special characters like å and é are removed, the query works fine in PHP / mysqli_query. The query also fails in MariaDB's command line, though it works in HeidiSQL; I imagine HeidiSQL internally adjusts the strings used in its Query tabs.
Let's get the following out of the way:
Database Character Set: utf8mb4.
Database Collation: utf8mb4_unicode_520_ci.
Database column collation: utf8mb4_unicode_520_ci.
The query SET CHARACTER SET 'utf8mb4' is being executed correctly for each request.
Here is the query:
INSERT INTO example_table (example_column) VALUES ('Håko');
I should note that I tried the following (which also failed), even though I firmly believe that this issue originates in and should be resolved via PHP:
INSERT INTO example_table (example_column) VALUES (CONVERT('Håko' USING utf8));
Here is the MariaDB error:
Incorrect string value: '\xE9rard ...'
Like I said, this string originates from an email message, so I'm pretty sure the issue is with PHP, not MariaDB. So let's go backwards to the code that otherwise seems to work. Please keep in mind that it took at least two days to put this together in the right order just to get the strings to appear correctly in the MariaDB query log without being wrongly converted to UTF-8 and corrupting the special Latin characters:
<?php
$s1 = '=?iso-8859-1?Q?=22G=E9rd_Tabt=22?= <berbs#example.com>'; // "Gérd Tabt" <berbs#example.com>

if (strlen($s1) > 0)
{
    if (substr_count($s1, '=?') && substr_count($s1, '?= '))
    {
        $p = explode('?= ', $s1);
        $p[0] = $p[0].'?=';
        $s2 = imap_mime_header_decode($p[0])[0]->text.' '.$p[1];
    }
    else {$s2 = imap_mime_header_decode($s1)[0]->text;}

    if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2, 'UTF-8', 'CP1252');}
    else if (mb_convert_encoding($s2, 'UTF-8') == substr_count($s1, '?')) {$s2 = mb_convert_encoding($s2, 'UTF-8');}
}
else {$s2 = $s1;}
?>
There isn't any other relevant code handling this header string.
Why are these strings, which I presume to be UTF-8 encoded, breaking this query in PHP's mysqli_query and in the MariaDB command line?
Where did the hex E9 come from? That is latin1-encoded. Yet your configuration claims that your client sends utf8mb4. The connection charset must match the actual encoding of the bytes the client sends. The database, the table, and the client can each use a different encoding; MariaDB is happy to convert on the fly when INSERTing or SELECTing.
For more analysis, see Trouble with UTF-8 characters; what I see is not what I stored
if (mb_convert_encoding($s2, 'UTF-8') == substr_count($s1, '?'))
This makes no sense: it compares a string (the text converted to UTF-8) against an integer (the number of matches). They can only ever be equal when the converted text is '0' and the count of '?' in $s1 is also 0, and only the type-unsafe == comparison allows even that edge case ('0' == 0).
So your text is never converted to UTF-8 and remains whatever it was (in this case ISO-8859-1).
mb_convert_encoding($s2, 'UTF-8')
Are you sure you want to convert to UTF-8 without specifying the source encoding? ISO-8859-1, as declared in the email header, isn't the only one to expect - why not extract that information and pass it to the function?
MariaDB is right: in that case you're handing over ISO-8859-1 encoded text while the DBMS expects UTF-8.
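To illustrate that last point, here is a minimal sketch (my own, not the asker's code) that decodes the header using the charset each MIME-encoded word declares instead of guessing; it assumes the imap and mbstring extensions are loaded, and the address is illustrative:
<?php
$header = '=?iso-8859-1?Q?=22G=E9rd_Tabt=22?= <someone@example.com>';

$utf8 = '';
foreach (imap_mime_header_decode($header) as $part) {
    // 'default' means plain ASCII, which is already valid UTF-8.
    $charset = ($part->charset === 'default') ? 'ASCII' : $part->charset;
    $utf8 .= mb_convert_encoding($part->text, 'UTF-8', $charset);
}

// $utf8 now really is UTF-8, so it matches the utf8mb4 connection charset
// and the INSERT no longer triggers "Incorrect string value".
?>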

Solving UTF8 & french accents incompatibility

I have a PHP script which saves user content into a mysql database (PHP 5.4, mysql 5.5.31)
All string-related fields in my database have utf8_unicode_ci as collation.
My (simplified) code looks like this:
$db_handle = mysql_connect('localhost', 'username', 'password');
mysql_select_db('my_db');
mysql_set_charset('utf8', $db_handle);
// ------ INSERT: First example -------
$s = "je viens de télécharger et installer le logiciel";
$sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (1, 'subject 1', '$s')";
mysql_query($sql, $db_handle);
// ------ INSERT: Second example -------
$s = "EPrints and العربية";
$sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (2, 'subject 2', '$s')";
mysql_query($sql, $db_handle);
// -------------
mysql_close($db_handle);
The problem is, the first insert (latin text with the é accents) fails unless I comment this line:
mysql_set_charset('utf8', $db_handle);
But the second query (mix of latin & arabic content) will fail unless I call mysql_set_charset('utf8', $db_handle);
I've been struggling with this for 2 days now. I thought UTF8 does support characters like the french accents, but obviously it doesn't!
How can I fix this?
mysql_set_charset('utf8', $db_handle) tells the database that the data you're going to send it will be encoded in UTF-8. If the result is messed up, that means you did not in fact send UTF-8 encoded text. Double check the encoding of what you're sending.
I thought UTF8 does support characters like the french accents, but obviously it doesn't!
It does just fine.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App.
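As a quick way to double-check, here is a minimal sketch (an assumption about the usual cause, not part of the original answer): verify that the bytes are valid UTF-8 before sending them, and convert from Latin-1 if they are not.
$s = "je viens de télécharger et installer le logiciel";

if (!mb_check_encoding($s, 'UTF-8')) {
    // The script file (or the input source) was probably saved as ISO-8859-1.
    $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Now mysql_set_charset('utf8', $db_handle) and the actual bytes agree,
// so both the French and the Arabic INSERT work.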
Is the PHP source text in UTF-8? This depends on the encoding your editor saves the file with. If so, the bytes in the string literal should already be fine.
That seems to be the case, since the Arabic text is written out too.
Use prepared statements for the SQL (a sketch follows at the end of this answer). This has several advantages: security (SQL injection), escaping of quotes and other special characters, and ... maybe ... encoding of the SQL string.
Unlikely: try
$s = utf8_encode("je viens de télécharger et installer le logiciel");
Though I can foresee another problem: utf8_encode expects an ISO-8859-1 string, which is feasible for French but not for Arabic. If this works, the encoding of the PHP file is wrong somehow.
(I find Java to be more consistent w.r.t. Unicode, so I am not entirely sure about PHP.)
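A minimal sketch of the prepared-statement suggestion above, using mysqli as an assumption (the question's mysql_* functions have no placeholder support):
$db = new mysqli('localhost', 'username', 'password', 'my_db');
$db->set_charset('utf8');

$id      = 1;
$subject = 'subject 1';
$text    = "je viens de télécharger et installer le logiciel";

// Placeholders take care of quoting, so accents and quotes cannot break the SQL.
$stmt = $db->prepare('INSERT INTO my_table (post_id, post_subject, post_text) VALUES (?, ?, ?)');
$stmt->bind_param('iss', $id, $subject, $text);
$stmt->execute();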
The issue of detecting the encoding and converting if necessary can be addressed with something like this, which makes sure the value ends up as CP1252. Reverse it to make sure the value is UTF-8 instead (see the sketch below).
function conv_text($value) {
    $result = mb_detect_encoding($value." ", "UTF-8,CP1252") == "UTF-8" ? iconv("UTF-8", "CP1252", $value) : $value;
    return $result;
}
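For completeness, a sketch of the reversed variant (my interpretation of "reverse this", not the original author's code), which guarantees UTF-8 before the value is sent to MySQL:
function conv_to_utf8($value) {
    return mb_detect_encoding($value . " ", "UTF-8,CP1252") == "UTF-8"
        ? $value
        : iconv("CP1252", "UTF-8", $value);
}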

Convert latin1 characters on a UTF8 table into UTF8

Only today I realized that I was missing this in my PHP scripts:
mysql_set_charset('utf8');
All my tables are InnoDB, collation "utf8_unicode_ci", and all my VARCHAR columns are "utf8_unicode_ci" as well. I have mb_internal_encoding('UTF-8'); on my PHP scripts, and all my PHP files are encoded as UTF-8.
So, until now, every time I "INSERT"ed something with diacritics, for example:
mysql_query('INSERT INTO `table` SET `name`="Jáuò Iñe"');
the 'name' contents would end up stored, in this case, as the mojibake JÃ¡uÃ² IÃ±e.
Since I fixed the charset between PHP and MySQL, new INSERTs are now storing correctly. However, I want to fix all the older rows that are "messed" at the moment. I tried many things already, but it always breaks the strings on the first "illegal" character. Here is my current code:
$m = mysql_real_escape_string('¿<?php echo "¬<b>\'PHP á (á)ţăriîş </b>"; ?> ă-ţi abcdd;//;ñç´พดแทฝใจคçăâξβψδπλξξςαยนñ ;');

mysql_set_charset('utf8');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');

mysql_set_charset('latin1');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');

mysql_set_charset('utf8');
$result = mysql_iquery('SELECT * FROM `table`');
while ($row = mysql_fetch_assoc($result)) {
    $message = $row['name'];
    $message = mb_convert_encoding($message, 'ISO-8859-15', 'UTF-8');
    //$message = iconv("UTF-8", "ISO-8859-1//IGNORE", $message);
    mysql_iquery('UPDATE `table` SET `name`="'.mysql_real_escape_string($message).'" WHERE `a1`="'.$row['a1'].'"');
}
It "UPDATE"s with the expected characters, except that the string gets truncated after the character "ă". I mean, that character and following chars are not included on the string.
Also, testing with the "iconv()" (that is commented on the code) does the same, even with //IGNORE and //TRANSLIT
I also tested several charsets, between ISO-8859-1 and ISO-8859-15.
From what you describe, it seems you have UTF-8 data that was originally stored as Latin-1 and then not converted correctly to UTF-8. The data is recoverable; you'll need a MySQL function like
convert(cast(convert(name using latin1) as binary) using utf8)
It's possible that you may need to omit the inner conversion, depending on how the data was altered during the encoding conversion.
After searching for an hour or two I landed on this answer; I needed to migrate an old tt_news db into a new TYPO3 version. I had already tried converting the charset in the export file and importing it back, but couldn't get it working.
Then I tried the answer above from ABS and started an update on the table:
UPDATE tt_news SET
title=convert(cast(convert(title using latin1) as binary) using utf8),
short=convert(cast(convert(short using latin1) as binary) using utf8),
bodytext=convert(cast(convert(bodytext using latin1) as binary) using utf8)
WHERE 1
You can also convert imagecaption, imagealttext, imagetitletext and keywords if needed.
Hope this helps somebody migrating tt_news to a new TYPO3 version.
A better way:
Connect to your database as usual.
Make sure your page encoding is UTF-8 via a meta tag in the HTML header (don't forget this).
Then use code like this:
$result = mysql_query('SELECT * FROM shops');
while ($row = mysql_fetch_assoc($result)) {
    $name = iconv("windows-1256", "UTF-8", $row['name']);
    mysql_query("SET NAMES 'utf8'");
    mysql_query("update `shops` SET `name`='".$name."' where ID='$row[ID]'");
}
I highly recommend using 'utf8mb4' instead of 'utf8', since MySQL's utf8 cannot store some Chinese characters or emoji.

How to extract a UTF-8 string (In Arabic) from a MySQL DB and echo to screen using PHP

I have a MySQL db, i've set collation = utf8_unicode_ci.
I'm trying to fetch the value through PHP but i'm getting "???" instead of the actual string.
I have read about this subject and tried using mb_convert_encoding but it didn't work, what am I missing?
Can someone please post a code snippet that actually pulls a value from a DB and echos the string to the screen?
Thanks,
I have a MySQL db, i've set collation = utf8_unicode_ci.
I'm trying to fetch the value through PHP but i'm getting "???" instead of the actual string.
Character sets are how characters are encoded.
Collations are how characters are sorted.
These are different things. Chances are that your tables or columns have the right collation, but the wrong character set. The Internationalization section of the MySQL manual has a great deal of information on how to set things up correctly.
Can someone please post a code snippet that actually pulls a value from a DB and echos the string to the screen?
Let's demonstrate how to use utf8 as a character set, and the utf8 "general case insensitive" collation. I'm using PDO in this example, but the same general idea should work with mysqli as well. I wouldn't advise using the old mysql extension.
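// Assumption (not part of the original snippet): $db is an existing PDO connection, e.g.
// $db = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');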
// Let's tell MySQL we're going to be working with utf8 data.
// http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
$db->query("SET NAMES 'utf8'");
// Create a table with our proper charset and collation.
// If we needed to, we could specify the charset and collation with
// each column.
// http://dev.mysql.com/doc/refman/5.1/en/charset-column.html
// We could also set the defaults at the database level.
// http://dev.mysql.com/doc/refman/5.1/en/charset-database.html
$db->query('
    CREATE TABLE foo(
        bar TEXT
    )
    DEFAULT CHARACTER SET utf8
    DEFAULT COLLATE utf8_general_ci
    ENGINE=InnoDB
');
// I don't know Arabic, so I'll type this in English. It should
// work fine in Arabic, as long as the string is encoded as utf8.
$sth = $db->prepare("INSERT INTO foo(bar) VALUES(?)");
$sth->execute(array("Hello, world!"));
$sth = $db->query("SELECT bar FROM foo LIMIT 1");
$row = $sth->fetch(PDO::FETCH_NUM);
echo $row[0]; // Will echo "Hello, world!", or whatever you inserted.
As tomp correctly points out in the comments, make sure to emit a proper character set with your Content-Type header. For example:
header('Content-type: text/html; charset=utf-8'); // Note the dash!

JSON specialchars JSON php 5.2.13

I'm going crazy over these encoding problems...
I use json_decode and json_encode to store and retrieve data. What I found out is that JSON always needs UTF-8. No problem there. I give json_encode 'hellö' in UTF-8; in my DB it looks like hellu00f6. OK, a codepoint escape. But when I use json_decode, it won't decode the codepoint back, so I still have hellu00f6.
Also, in PHP 5.2.13 it seems like there are still no optional flags for the JSON functions. How can I convert the codepoint escapes back to the correct special characters for display in the browser?
Greetz and thanks
Maenny
It could be because of the backslash preceding the codepoint in the JSON unicode string: ö is represented as \u00f6. When it is stored in your DB, the DBMS doesn't know how to interpret \u00f6, so I guess it reads (and stores) it as u00f6.
Are you using an escaping function?
Try adding a backslash on unicode-escaped chars:
$json = str_replace("\\u", "\\\\u", $json);
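For reference, a minimal sketch of the intended round trip (my own illustration; it assumes the backslash survives storage, e.g. because the value is written with a prepared statement rather than being mangled by an extra escaping step):
$data = array('word' => 'hellö');   // UTF-8 input

$json = json_encode($data);         // '{"word":"hell\u00f6"}' - escaped codepoint
$back = json_decode($json, true);   // json_decode turns \u00f6 back into ö

echo $back['word'];                 // prints "hellö" (serve the page as UTF-8)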
The preceding post already explains why your example did not work as expected.
However, there are some good coding practices when working with databases which are important for the security of your application (i.e. preventing SQL injection).
The following example intends to show some of these practices, and assumes PHP 5.2 and MySQL 5.1. (Note that all files and database entries are stored using UTF-8 encoding.)
The database used in this example is called test, and the table was created as follows:
CREATE TABLE `test`.`entries` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `data` VARCHAR( 100 ) NOT NULL
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin
(Note that the collation is set to utf8_bin.)
Below is the PHP code, which is used both for adding new entries and for creating the JSON:
<?php
$conn = new PDO('mysql:host=localhost;dbname=test', 'root', 'xxx');
$conn->exec("SET NAMES 'utf8'"); // Enable UTF-8 charset for db-communication ..

if(isset($_GET['add_entry'])) {
    header('Content-Type: text/plain; charset=UTF-8');

    // Add new DB-Entry:
    $data = $conn->quote($_GET['add_entry']);
    if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {
        $id = $conn->lastInsertId();
        echo 'Created entry '.$id.': '.$_GET['add_entry'];
    } else {
        $info = $conn->errorInfo();
        echo 'Unable to create entry: '.$info[2];
    }
} else {
    header('Content-Type: text/json; charset=UTF-8');

    // Output DB-Entries as JSON:
    $entries = array();
    if($res = $conn->query('SELECT * FROM `entries`')) {
        $res->setFetchMode(PDO::FETCH_ASSOC);
        foreach($res as $row) {
            $entries[] = $row;
        }
    }
    echo json_encode($entries);
}
?>
Note the use of the method $conn->quote(..) before passing data to the database. As mentioned in the preceding post, it would be even better to use prepared statements, since they handle the whole escaping themselves. Thus, we would rather write:
$prepStmt = $conn->prepare('INSERT INTO `entries` (`data`) VALUES (:data)');
if($prepStmt->execute(array('data'=>$_GET['add_entry']))) {...}
instead of
$data = $conn->quote($_GET['add_entry']);
if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {...}
Conclusion: Using UTF-8 for all character data stored in the database or transmitted to the user is reasonable, and it makes the development of internationalized web applications much easier. To make sure user input is sent to the database properly, using an escaping function is a good idea. Better yet, prepared statements make life and development even easier and furthermore improve your application's security, since SQL injection is prevented.
