Encoding SQL_Latin1_General_CP1_CI_AS into UTF-8 - php

I'm generating an XML file with PHP using DOMDocument and I need to handle Asian characters. I'm pulling data from an MS SQL 2008 server using the pdo_mssql driver, and I apply utf8_encode() to the XML attribute values. Everything works fine as long as there are no special characters.
The server is MS SQL Server 2008 SP3
The database, table and column collation are all SQL_Latin1_General_CP1_CI_AS
I'm using PHP 5.2.17
Here's my PDO object:
$pdo = new PDO("mssql:host=MyServer,1433;dbname=MyDatabase", "user123", "password123");
My query is a basic SELECT.
I know storing special characters in SQL_Latin1_General_CP1_CI_AS columns isn't great, but ideally it would be nice to make this work without changing the collation, because other non-PHP programs already use that column and work fine. In SQL Server Management Studio I can see the Asian characters correctly.
Considering all the details above, how should I process the data?

I found out how to solve it, so hopefully this will be helpful to someone.
First, SQL_Latin1_General_CP1_CI_AS is a strange mix of CP-1252 and UTF-8.
The basic characters are CP-1252, which is why everything worked once I applied utf8_encode(). The Asian and other multi-byte characters are encoded on 2 bytes, and the PHP pdo_mssql driver seems to dislike variable-length characters: it appears to CAST to varchar (instead of nvarchar), after which all the 2-byte characters become question marks ('?').
I fixed it by casting the column to binary and then rebuilding the text in PHP:
SELECT CAST(MY_COLUMN AS VARBINARY(MAX)) FROM MY_TABLE;
In php:
// Binary to hexadecimal
$hex = bin2hex($bin);

// And then from hex back to a byte string
$str = "";
for ($i = 0; $i < strlen($hex) - 1; $i += 2) {
    $str .= chr(hexdec($hex[$i] . $hex[$i + 1]));
}
// And then from UCS-2LE (the encoding behind the column's data) to UTF-8
$str = iconv('UCS-2LE', 'UTF-8', $str);
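As an aside, the bin2hex()/chr() round trip above rebuilds exactly the bytes it started from, so if $bin already holds the raw VARBINARY value, the loop can be skipped; a minimal equivalent:
$str = iconv('UCS-2LE', 'UTF-8', $bin);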

I know this post is old, but the only thing that worked for me was
iconv("CP850", "UTF-8//TRANSLIT", $var);
I had the same issue with SQL_Latin1_General_CP1_CI_AI; maybe it works for SQL_Latin1_General_CP1_CI_AS too.

You can try this:
header("Content-Type: text/html; charset=utf-8");

$dbhost = "hostname";
$db     = "database";
$query  = "SELECT *
           FROM Estado
           ORDER BY Nome";

$conn = new PDO("sqlsrv:server=$dbhost;Database=$db", "", "");
$stmt = $conn->prepare($query, array(
    PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL,
    PDO::SQLSRV_ATTR_CURSOR_SCROLL_TYPE => PDO::SQLSRV_CURSOR_BUFFERED,
    PDO::SQLSRV_ATTR_ENCODING => PDO::SQLSRV_ENCODING_SYSTEM,
));
$stmt->execute();

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // CP1252 == code page Latin1
    print iconv("CP1252", "ISO-8859-1", "$row[Nome] <br>");
}

For me, none of the above was the direct solution, though I did use parts of the solutions above. This worked for me with the Vietnamese alphabet. If you come across this post and none of the above work for you, try:
$req = "SELECT CAST(MY_COLUMN as VARBINARY(MAX)) as MY_COLUMN FROM MY_TABLE";
$stmt = $conn->prepare($req);
$stmt->execute();
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
$str = pack("H*",$row['MY_COLUMN']);
$str = mb_convert_encoding($z, 'HTML-ENTITIES','UCS-2LE');
print_r($str);
}
And a little bonus: I had to json_encode this data and was (of course) getting HTML entities instead of the special characters. To fix that, just use html_entity_decode() on the strings before passing them to json_encode().

No need for crazy stuff. The character encoding behind collation SQL_Latin1_General_CP1_CI_AS is Windows-1252.
This works perfectly for me: $str = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');

By default, PDO uses PDO::SQLSRV_ENCODING_UTF8 for sending/receiving data.
If your current collation is Latin1, have you tried specifying PDO::SQLSRV_ENCODING_SYSTEM to let PDO know that you want to use the current system encoding instead of UTF-8?
You could even use PDO::SQLSRV_ENCODING_BINARY, which returns data in binary form (no encoding or translation is done when transferring data). This way, you can handle character encoding on your side.
More documentation here: http://ca3.php.net/manual/en/ref.pdo-sqlsrv.php
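A minimal sketch of both options (connection details are placeholders; the binary encoding is set per statement, as it is not accepted at the connection level):

$pdo = new PDO("sqlsrv:Server=MyServer;Database=MyDatabase", "user", "pw");

// Option 1: use the system encoding for the whole connection
$pdo->setAttribute(PDO::SQLSRV_ATTR_ENCODING, PDO::SQLSRV_ENCODING_SYSTEM);

// Option 2: fetch raw bytes for one statement and convert on the PHP side
$stmt = $pdo->prepare("SELECT MY_COLUMN FROM MY_TABLE");
$stmt->setAttribute(PDO::SQLSRV_ATTR_ENCODING, PDO::SQLSRV_ENCODING_BINARY);
$stmt->execute();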

Thanks @SGr for the answer.
I found a better way to do it:
SELECT CAST(CAST(MY_COLUMN AS VARBINARY(MAX)) AS VARCHAR(MAX)) as MY_COLUMN FROM MY_TABLE;
You can also try:
SELECT CAST(MY_COLUMN AS VARBINARY(MAX)) as MY_COLUMN FROM MY_TABLE;
And in PHP you should just convert it to UTF-8:
$string = iconv('UCS-2LE', 'UTF-8', $row['MY_COLUMN']);

Related

Solving UTF-8 & French accents incompatibility

I have a PHP script which saves user content into a MySQL database (PHP 5.4, MySQL 5.5.31).
All string-related fields in my database have utf8_unicode_ci as collation.
My (simplified) code looks like this:
$db_handle = mysql_connect('localhost', 'username', 'password');
mysql_select_db('my_db');
mysql_set_charset('utf8', $db_handle);
// ------ INSERT: First example -------
$s = "je viens de télécharger et installer le logiciel";
$sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (1, 'subject 1', '$s')";
mysql_query($sql, $db_handle);
// ------ INSERT: Second example -------
$s = "EPrints and العربية";
$sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (2, 'subject 2', '$s')";
mysql_query($sql, $db_handle);
// -------------
mysql_close($db_handle);
The problem is, the first insert (Latin text with the é accents) fails unless I comment out this line:
mysql_set_charset('utf8', $db_handle);
But the second query (a mix of Latin & Arabic content) fails unless I call mysql_set_charset('utf8', $db_handle).
I've been struggling with this for 2 days now. I thought UTF-8 supported characters like the French accents, but obviously it doesn't!
How can I fix this?
mysql_set_charset('utf8', $db_handle) tells the database that the data you're going to send it will be encoded in UTF-8. If the result is messed up, that means you did not in fact send UTF-8 encoded text. Double-check the encoding of what you're sending.
I thought UTF-8 supported characters like the French accents, but obviously it doesn't!
It does just fine.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App.
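A quick way to do that double-check on the PHP side (a sketch; the ISO-8859-1 fallback is an assumption about the source data):

// Is $s really UTF-8? If not, assume Latin-1 and convert.
if (!mb_check_encoding($s, 'UTF-8')) {
    $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}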
Is the PHP text in UTF-8? This depends on the encoding of your editor. If it is, the bytes in the string literals should already be okay.
That seems to be the case, since the Arabic comes through too.
Use prepared statements for the SQL (see the sketch below). This has several advantages: security (SQL injection), escaping of quotes and other special characters, and ... maybe ... encoding of the SQL string.
Unlikely to help, but try:
$s = utf8_encode("je viens de télécharger et installer le logiciel");
Though I can foresee another problem: by definition, utf8_encode expects an ISO-8859-1 string, which is feasible for French but not for Arabic. If this works, the encoding of the PHP file is wrong somehow.
(I find Java to be more consistent w.r.t. Unicode, so I am not entirely sure for PHP.)
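Here is a sketch of that prepared-statement route, using PDO rather than the legacy mysql_* API from the question (credentials are placeholders):

$db = new PDO('mysql:host=localhost;dbname=my_db;charset=utf8', 'username', 'password');
$stmt = $db->prepare('INSERT INTO my_table (post_id, post_subject, post_text) VALUES (?, ?, ?)');
$stmt->execute(array(2, 'subject 2', 'EPrints and العربية'));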
The issue of knowing the encoding, and converting if necessary, can be addressed with something like this, which makes sure the encoding is CP1252. Reverse it to make sure it is UTF-8.

function conv_text($value) {
    $result = mb_detect_encoding($value . " ", "UTF-8,CP1252") == "UTF-8"
        ? iconv("UTF-8", "CP1252", $value)
        : $value;
    return $result;
}
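Hypothetical usage, with an illustrative input variable:

$title = conv_text($_GET['title']); // CP1252 out if the input was UTF-8, unchanged otherwise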

Prepare unique values for utf8_unicode_ci in PHP

I need to read values from an ISO-8859-1 encoded file in PHP and use PDO to write them to a database table that is encoded utf8_unicode_ci and has a unique index. Sometimes the data is missing special characters, which leads to duplicate-key errors. Example: the data contains both "Entrainement" and "Entraînement". Is there a PHP string function I can use to avoid this?
Preferably a conversion function, so I don't have to iterate over the whole array to check whether a value was already inserted.
Here is an example of what I'm trying to do:
$values = array("Entraînement", "Entrainement");
$db = new PDO("mysql:dbname=mydb;host=localhost;charset=utf8", "user", "pw");
$db->exec("SET NAMES 'utf8'");
$stmt = $db->prepare("INSERT INTO mytable(myvalue) VALUES(?)");

$already_inserted = array();
foreach ($values as $v) {
    $v = $v_inserted = iconv('iso-8859-1', 'utf-8', $v);
    // Do magic string conversion here
    // $v_inserted = collation_convert($v_inserted)
    if (isset($already_inserted[$v_inserted])) {
        continue;
    }
    if ($stmt->execute(array($v))) {
        $already_inserted[$v_inserted] = true;
    }
}
This example should only insert "Entraînement" and skip over "Entrainement".
In the original program I'm using Doctrine ORM instead of PDO, so I can't do much in SQL. Also, I have special characters from the whole Latin1 range: French, German, Spanish, etc.
I can't change the DB field definition to utf8_bin because it's part of an ecommerce package, and all sorts of things might break.
Well, you should definitely convert the values to UTF-8 and use a UTF-8 connection encoding. Otherwise your application cannot take advantage of UTF-8 at all, because it will only be able to send and receive characters that ISO-8859-1 contains. That is a very small amount compared to Unicode ☹.
That is unrelated to your issue*, though: in the unicode_ci collation, î is considered the same as i.
If you need to consider them as different characters, use some other collation:
SELECT 'î' = 'i' COLLATE 'utf8_unicode_ci'
//1
SELECT 'î' = 'i' COLLATE 'utf8_bin'
//0
There is no German** collation so I guess utf8_bin is what you want here.
*There is only an issue when the declared connection encoding does not match the encoding of the physical bytes you send over. I.e., if you send ISO-8859-1 bytes over a UTF-8 connection encoding, you will get garbage, if not an error. And vice versa.
**I looked that up from your profile; if you in fact need some other language, there might be a collation for that.
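Since the collation cannot be changed here, one hedged way to fill in the question's "magic string conversion" placeholder is to fold accents away when building the dedupe key, so PHP skips exactly the values the unicode_ci unique index would reject. A sketch, assuming iconv's ASCII//TRANSLIT (whose output is platform-dependent) approximates the collation's folding closely enough:

// 'Entraînement' and 'Entrainement' both yield the key 'Entrainement'
$v_inserted = iconv('UTF-8', 'ASCII//TRANSLIT', $v_inserted);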

Convert latin1 characters on a UTF8 table into UTF8

Only today I realized that I was missing this in my PHP scripts:
mysql_set_charset('utf8');
All my tables are InnoDB, collation "utf8_unicode_ci", and all my VARCHAR columns are "utf8_unicode_ci" as well. I have mb_internal_encoding('UTF-8'); in my PHP scripts, and all my PHP files are encoded as UTF-8.
So, until now, every time I INSERTed something with diacritics, for example:
mysql_query('INSERT INTO `table` SET `name`="Jáuò Iñe"');
the stored 'name' contents would be, in this case: Jáuò Iñe.
Since I fixed the charset between PHP and MySQL, new INSERTs are stored correctly. However, I want to fix all the older rows that are "messed up" at the moment. I tried many things already, but it always breaks the strings at the first "illegal" character. Here is my current code:
$m = mysql_real_escape_string('¿<?php echo "¬<b>\'PHP á (á)ţăriîş </b>"; ?> ă-ţi abcdd;//;ñç´พดแทฝใจคçăâξβψδπλξξςαยนñ ;');

mysql_set_charset('utf8');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');

mysql_set_charset('latin1');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');

mysql_set_charset('utf8');
$result = mysql_query('SELECT * FROM `table`');
while ($row = mysql_fetch_assoc($result)) {
    $message = $row['name'];
    $message = mb_convert_encoding($message, 'ISO-8859-15', 'UTF-8');
    //$message = iconv("UTF-8", "ISO-8859-1//IGNORE", $message);
    mysql_query('UPDATE `table` SET `name`="'.mysql_real_escape_string($message).'" WHERE `a1`="'.$row['a1'].'"');
}
It "UPDATE"s with the expected characters, except that the string gets truncated after the character "ă". I mean, that character and following chars are not included on the string.
Also, testing with the "iconv()" (that is commented on the code) does the same, even with //IGNORE and //TRANSLIT
I also tested several charsets, between ISO-8859-1 and ISO-8859-15.
From what you describe, it seems you have UTF-8 data that was originally stored as Latin-1 and then not converted correctly to UTF-8. The data is recoverable; you'll need a MySQL function like
convert(cast(convert(name using latin1) as binary) using utf8)
It's possible that you may need to omit the inner conversion, depending on how the data was altered during the encoding conversion.
After searching for an hour or two for this answer: I needed to migrate an old tt_news DB from TYPO3 into a new TYPO3 version. I had already tried converting the charset in the export file and importing it back, but didn't get it working.
Then I tried the answer above from ABS and ran an update on the table:
UPDATE tt_news SET
title=convert(cast(convert(title using latin1) as binary) using utf8),
short=convert(cast(convert(short using latin1) as binary) using utf8),
bodytext=convert(cast(convert(bodytext using latin1) as binary) using utf8)
WHERE 1
You can also convert imagecaption, imagealttext, imagetitletext and keywords if needed.
Hope this helps somebody migrating tt_news to a new TYPO3 version.
A better way: connect to your database as usual, then use the code below to do what you need. You must set your page encoding to UTF-8 with a meta tag in the HTML head (don't forget this):
$result = mysql_query('SELECT * FROM shops');
while ($row = mysql_fetch_assoc($result)) {
    $name = iconv("windows-1256", "UTF-8", $row['name']);
    mysql_query("SET NAMES 'utf8'");
    mysql_query("update `shops` SET `name`='".$name."' where ID='$row[ID]' ");
}
I highly recommend using 'utf8mb4' instead of 'utf8', since utf8 cannot store some Chinese characters or emoji.
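A hedged sketch of that switch, reusing the `shops` table from the answer above (utf8mb4 requires MySQL 5.5.3 or later, and mysql_set_charset('utf8mb4') needs a recent client library):

// Convert an existing table and the connection to utf8mb4
mysql_query("ALTER TABLE `shops` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
mysql_set_charset('utf8mb4');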

UTF-8 in SQL Server 2008 Database + PHP

I want to store data in an MS SQL 2008 database with PHP.
I've got problems with letters like ä ö ü ß: they are displayed incorrectly in the database and when I display them on the website.
It works when I utf8_encode() the data on input and utf8_decode() it on output with PHP.
Is there another, easier way to solve this?
I've solved this once. The problem is that PHP's mssql driver is broken (can't find the link to bugs.php.net, but it's there) and fails when it comes to nchar and nvarchar fields and UTF-8. You'll need to convert the data and queries a little bit:
SELECT some_nvarchar_field FROM some_table
First, you need to change the output to binary; that way it won't get corrupted:
SELECT CONVERT(varbinary(MAX), some_nvarchar_field) FROM some_table;
Then in PHP, you'll need to convert it back to UTF-8, for which you'll need the iconv extension:
iconv('UTF-16LE', 'UTF-8', $result['some_nvarchar_field']);
This fixes selecting data from the database. However, if you want to actually put some data INTO the database, or add a WHERE clause, you'll still get errors, so the fix for WHERE, UPDATE, INSERT and so on is to convert the string to its hexadecimal form:
Imagine you have this query:
INSERT INTO some_table (some_nvarchar_field) VALUES ('ŽČŘĚÝÁÖ');
Now we'll have to run some PHP:
$value = 'ŽČŘĚÝÁÖ';
$value = iconv('UTF-8', 'UTF-16LE', $value); // convert into the server's native encoding
$value = bin2hex($value);                    // convert into hexadecimal
$query = 'INSERT INTO some_table (some_nvarchar_field) VALUES (CONVERT(nvarchar(MAX), 0x'.$value.'))';
The query becomes this:
INSERT INTO some_table (some_nvarchar_field) VALUES (CONVERT(nvarchar(MAX), 0x7d010c0158011a01dd00c100d600));
And that will work!
I've tested this with MS SQL Server 2008 on Linux using FreeTDS and it works just fine; I've got some huge websites running on this with no issues whatsoever.
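A small helper that wraps the same trick for reuse; the function name is illustrative, not part of any API:

// Build an nvarchar literal from a UTF-8 string via the hex form shown above
function mssql_utf8_literal($utf8) {
    $ucs2 = iconv('UTF-8', 'UTF-16LE', $utf8);
    return 'CONVERT(nvarchar(MAX), 0x' . bin2hex($ucs2) . ')';
}

$query = 'INSERT INTO some_table (some_nvarchar_field) VALUES ('
       . mssql_utf8_literal('ŽČŘĚÝÁÖ') . ')';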
I searched for two days for how to insert UTF-8 data (from web forms) into MSSQL 2008 through PHP. I read everywhere that you can't, that you need to convert to UCS-2 first (like cypher's solution recommends).
On a Windows environment, SQLSRV is said to be a good solution, which I couldn't try, since I am developing on Mac OS X.
However, the FreeTDS manual (which PHP's mssql extension uses on OS X) says to add the letter "N" before the opening quote:
mssql_query("INSERT INTO table (nvarcharField) VALUES (N'űáúőűá球最大的采购批发平台')");
According to this discussion, the N character tells the server to convert the literal to Unicode:
https://softwareengineering.stackexchange.com/questions/155859/why-do-we-need-to-put-n-before-strings-in-microsoft-sql-server
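One caveat worth adding: the N prefix only marks the literal as Unicode; escaping the value is still your job. A sketch (T-SQL escapes a single quote by doubling it; the variable is illustrative):

$name = 'űáúőűá球最大的采购批发平台';
$sql  = "INSERT INTO table (nvarcharField) VALUES (N'" . str_replace("'", "''", $name) . "')";
mssql_query($sql);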
I can insert with this code:
$value = $_POST['first_name'];
$value = iconv("UTF-8", "UCS-2LE", $value);
$value2 = $_POST['last_name'];
$value2 = iconv("UTF-8", "UCS-2LE", $value2);
$query = "INSERT INTO tbl_sample(first_name, last_name) VALUES (CONVERT(VARBINARY(MAX), '".$value."'), CONVERT(VARBINARY(MAX), '".$value2."'))";
odbc_exec($connect, $query);
but then I cannot search that data.

JSON specialchars JSON php 5.2.13

I'm going crazy over these encoding problems...
I use json_decode and json_encode to store and retrieve data. What I found out is that JSON always needs UTF-8. No problem there. I give JSON 'hellö' in UTF-8; in my DB it looks like hellu00f6. OK, a codepoint escape. But when I use json_decode, it won't decode the codepoint back, so I still have hellu00f6.
Also, in PHP 5.2.13 there seem to be no optional flags for json_encode yet. How can I convert the codepoint characters back into the correct special characters for display in the browser?
Greetz and thanks,
Maenny
It could be because of the backslash preceding the codepoint in the JSON unicode string: ö is represented as \u00f6. When it is stored in your DB, the DBMS doesn't know how to interpret \u00f6, so I guess it reads (and stores) it as u00f6.
Are you using an escaping function?
Try adding a backslash on unicode-escaped chars:
$json = str_replace("\\u", "\\\\u", $json);
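A small demonstration of why that backslash matters (plain PHP, mirroring the strings from the question):

$json = json_encode(array('name' => 'hellö'));       // {"name":"hell\u00f6"}
var_dump(json_decode($json, true));                  // back to 'hellö'
var_dump(json_decode('{"name":"hellu00f6"}', true)); // 'hellu00f6' stays as-is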
The preceding post already explains why your example did not work as expected.
However, there are some good coding practices when working with databases which are important for the security of your application (i.e. preventing SQL injection).
The following example intends to show some of these practices, and assumes PHP 5.2 and MySQL 5.1. (Note that all files and database entries are stored using UTF-8 encoding.)
The database used in this example is called test, and the table was created as follows:
CREATE TABLE `test`.`entries` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`data` VARCHAR( 100 ) NOT NULL
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin
(Note that the collation is set to utf8_bin.)
Here follows the PHP code, which is used both for adding new entries and for creating JSON:
<?
$conn = new PDO('mysql:host=localhost;dbname=test', 'root', 'xxx');
$conn->exec("SET NAMES 'utf8'"); // Enable UTF-8 charset for db-communication ..

if (isset($_GET['add_entry'])) {
    header('Content-Type: text/plain; charset=UTF-8');

    // Add new DB-Entry:
    $data = $conn->quote($_GET['add_entry']);
    if ($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {
        $id = $conn->lastInsertId();
        echo 'Created entry '.$id.': '.$_GET['add_entry'];
    } else {
        $info = $conn->errorInfo();
        echo 'Unable to create entry: '.$info[2];
    }
} else {
    header('Content-Type: text/json; charset=UTF-8');

    // Output DB-Entries as JSON:
    $entries = array();
    if ($res = $conn->query('SELECT * FROM `entries`')) {
        $res->setFetchMode(PDO::FETCH_ASSOC);
        foreach ($res as $row) {
            $entries[] = $row;
        }
    }
    echo json_encode($entries);
}
?>
Note the usage of the method $conn->quote(..) before passing data to the database. As mentioned in the preceding post, it would be even better to use prepared statements, since they already do the whole escaping. Thus, it would be better to write:
$prepStmt = $conn->prepare('INSERT INTO `entries` (`data`) VALUES (:data)');
if($prepStmt->execute(array('data'=>$_GET['add_entry']))) {...}
instead of
$data = $conn->quote($_GET['add_entry']);
if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {...}
Conclusion: Using UTF-8 for all character data stored or transmitted to the user is reasonable. It makes the development of internationalized web applications much easier. To make sure user input is properly sent to the database, using an escape function is a good idea. Better yet, using prepared statements makes life and development even easier and furthermore improves your application's security, since SQL injection is prevented.
