I'm going crazy over these encoding problems...
I use json_decode and json_encode to store and retrieve data. What I found out is that JSON always needs UTF-8. No problem there. I give json_encode 'hellö' in UTF-8, and in my DB it looks like hellu00f6. OK, a codepoint. But when I use json_decode, it won't decode the codepoint back, so I still have hellu00f6.
Also, in PHP 5.2.13 it seems there are still no optional flags for JSON. How can I convert the codepoint characters back to the correct special characters for display in the browser?
Greetz and thanks
Maenny
It could be because of the backslash preceding the codepoint in the JSON unicode string: ö is represented as \u00f6. When it is stored in your DB, the DBMS doesn't know how to interpret \u00f6, so I guess it reads (and stores) it as u00f6.
Are you using an escaping function?
Try adding a backslash on unicode-escaped chars:
$json = str_replace("\\u", "\\\\u", $json);
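To see the whole round trip without a database, here is a minimal sketch of the effect (assuming a UTF-8 encoded script file; the DB step is only described in the comments):
<?php
$json = json_encode(array('hellö'));
echo $json, "\n";                       // ["hell\u00f6"]

// MySQL strips the backslash of the unknown escape \u on insert, so the
// stored value becomes ["hellu00f6"], which json_decode can no longer
// map back to 'ö'. Doubling the backslash lets the literal \u survive:
$escaped = str_replace('\\u', '\\\\u', $json);
echo $escaped, "\n";                    // ["hell\\u00f6"] -> stored as ["hell\u00f6"]

// With the escape intact in the DB, decoding restores the umlaut:
var_dump(json_decode($json, true));     // array(0 => "hellö")
?>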
The preceding post already explains why your example did not work as expected.
However, there are some good coding practices when working with databases which are important for the security of your application (e.g. preventing SQL injection).
The following example intends to show some of these practices, and assumes PHP 5.2 and MySQL 5.1. (Note that all files and database entries are stored using UTF-8 encoding.)
The database used in this example is called test, and the table was created as follows:
CREATE TABLE `test`.`entries` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`data` VARCHAR( 100 ) NOT NULL
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin
(Note that the character set is utf8 and the collation utf8_bin.)
The PHP code follows; it is used both for adding new entries and for creating the JSON:
<?php
$conn = new PDO('mysql:host=localhost;dbname=test', 'root', 'xxx');
$conn->exec("SET NAMES 'utf8'"); // Enable UTF-8 charset for db-communication ..

if (isset($_GET['add_entry'])) {
    header('Content-Type: text/plain; charset=UTF-8');

    // Add new DB-Entry:
    $data = $conn->quote($_GET['add_entry']);
    if ($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {
        $id = $conn->lastInsertId();
        echo 'Created entry '.$id.': '.$_GET['add_entry'];
    } else {
        $info = $conn->errorInfo();
        echo 'Unable to create entry: '.$info[2];
    }
} else {
    header('Content-Type: application/json; charset=UTF-8');

    // Output DB-Entries as JSON:
    $entries = array();
    if ($res = $conn->query('SELECT * FROM `entries`')) {
        $res->setFetchMode(PDO::FETCH_ASSOC);
        foreach ($res as $row) {
            $entries[] = $row;
        }
    }
    echo json_encode($entries);
}
?>
Note the usage of the method $conn->quote(..) before passing data to the database. As mentioned in the preceding post, it would be even better to use prepared statements, since they already do the whole escaping. Thus, it would be better if we write:
$prepStmt = $conn->prepare('INSERT INTO `entries` (`data`) VALUES (:data)');
if($prepStmt->execute(array('data'=>$_GET['add_entry']))) {...}
instead of
$data = $conn->quote($_GET['add_entry']);
if($conn->exec('INSERT INTO `entries` (`data`) VALUES ('.$data.')')) {...}
Conclusion: Using UTF-8 for all character data stored or transmitted to the user is reasonable, and it makes the development of internationalized web applications much easier. To make sure user input is properly sent to the database, using an escaping function is a good idea. Better still, prepared statements make life and development easier and further improve your application's security, since SQL injection is prevented.
I need to read values from an ISO-8859-1 encoded file in PHP and use PDO to write them to a database table that uses the utf8_unicode_ci collation and has a unique index. Sometimes the incoming data is missing the special chars, which leads to duplicate-key errors. Example: the data contains both "Entrainement" and "Entraînement". Is there a PHP string function I can use to avoid this?
Preferably a conversion function, so I don't have to iterate over the whole array to check whether a value was already inserted.
Here is an example of what I'm trying to do:
$values = array("Entraînement", "Entrainement");

$db = new PDO("mysql:dbname=mydb;host=localhost;charset=utf8", "user", "pw");
$db->exec("SET NAMES 'utf8'"); // MySQL knows the charset as 'utf8', not 'UTF-8'
$stmt = $db->prepare("INSERT INTO mytable(myvalue) VALUES(?)");

$already_inserted = array();
foreach ($values as $v) {
    $v = $v_inserted = iconv('iso-8859-1', 'utf-8', $v);
    // Do magic string conversion here
    // $v_inserted = collation_convert($v_inserted)
    if (isset($already_inserted[$v_inserted])) {
        continue;
    }
    if ($stmt->execute(array($v))) {
        $already_inserted[$v_inserted] = true;
    }
}
This example should only insert "Entraînement" and skip over "Entrainement".
In the original program I'm using Doctrine ORM instead of PDO, so I can't do much in SQL. Also, I have special chars from the whole Latin1 range: French, German, Spanish, etc.
I can't change the DB field definition to utf8_bin because it's part of an ecommerce package; all sorts of things might break.
Well, you should definitely convert the values to UTF-8 and use a UTF-8 connection encoding. Otherwise your application cannot take advantage of UTF-8 at all, because it will only be able to send and receive the characters that ISO-8859-1 contains. That is a very, very small amount compared to Unicode ☹.
That is unrelated to your issue*: in the unicode_ci collation, î is considered the same as i.
If you need them to be treated as different characters, use some other collation:
SELECT 'î' = 'i' COLLATE 'utf8_unicode_ci'
//1
SELECT 'î' = 'i' COLLATE 'utf8_bin'
//0
There is no German** collation, so I guess utf8_bin is what you want here.
*There is only an issue when the declared connection encoding does not match the encoding of the actual bytes you send over. I.e., if you send ISO-8859-1 bytes over a UTF-8 connection, you will get garbage, if not an error. And vice versa.
**I looked that up from your profile; if you in fact need some other language, there might be a collation for that.
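If changing the column really is off the table, a possible workaround is to force a binary comparison per query instead; a hedged sketch, reusing $db, $stmt and $v and the table/column names from the question:
// Hedged sketch: apply the binary collation in the comparison itself
// instead of changing the column definition:
$check = $db->prepare("SELECT COUNT(*) FROM mytable WHERE myvalue = ? COLLATE utf8_bin");
$check->execute(array($v));
if ((int)$check->fetchColumn() === 0) {
    $stmt->execute(array($v)); // no binary-identical row yet, safe to insert
}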
I have one problem. I have an Excel file saved as CSV, and I need to read that file with PHP and insert the rows into MySQL, but the problem is the character set, specifically čćšđž. I tried utf8_encode() and almost everything else I could think of.
Example:
It inserts "Petroviæ" but it should be "Petrović"
EDIT:
<?php
mysql_connect("localhost", "user", "pw");
mysql_select_db("database");

$fajl = "Prodajna mreza.csv";
$handle = @fopen($fajl, "r");
if ($handle) {
    $size = filesize($fajl);
    if (!$size) {
        echo "File is empty.\n";
        exit;
    }
    $csvcontent = fread($handle, $size);

    $red = 1;
    foreach (explode("\n", $csvcontent) as $line) {
        if (strlen($line) <= 20) {
            $red++;
            continue;
        }
        if ($red == 1) {
            $red++;
            continue;
        }
        $nesto = explode(",", $line);
        if ($nesto[0] == '') {
            continue;
        }
        mysql_query("INSERT INTO `table`(val1, val2, val3, val4, val5, val6, val7, val8) VALUES ('".$nesto[0]."','".$nesto[1]."','".$nesto[2]."','".$nesto[3]."','".$nesto[4]."','".$nesto[5]."','".$nesto[6]."','".$nesto[7]."')");
        $red++;
    }
    fclose($handle);
}
mysql_close();
?>
First off: using the mysql extension is discouraged, so you might want to switch to something else. Also notice that the way you compose your query by simply concatenating strings makes it vulnerable to SQL injection attacks. You should only do this if you are really, really sure that there won't be any ugly surprises in the content of the files you read.
It appears that neither your file reading nor the client-side MySQL code does anything related to charset conversion, so I'd assume both simply pass bytes on, without caring about their interpretation. So you only have to make sure that the server interprets those bytes correctly.
Judging from the example you gave, where a ć got turned into an æ, I'd say your file is in ISO-8859-2 but the database reads it differently, most probably as ISO-8859-1. You should ensure that your database can actually accept all ISO-8859-2 characters in its columns. Read the MySQL manual on character set support and set a suitable default character set (probably best at the database level), either utf8 (preferred) or latin2. You might have to recreate your tables for this change to apply.
Next, you should set the character set of the connection to match that of the file. So utf8 is definitely wrong here, and latin2 is the way to go.
Using your current API, mysql_set_charset("latin2") can be used to accomplish that.
That page also describes equivalent approaches for use with other frontends.
As an alternative, you can use a query to set this: mysql_query("SET NAMES 'latin2';");
After all this is done, you should also ensure that things are set up correctly for any script which reads from the database. In other words, the charset of the generated HTML must match the character_set_results of the MySQL session. Otherwise it might well be that things are stored correctly in your database but still appear broken when displayed to the user. If you have the choice, I'd say use utf8 in that case, as doing so makes it easier to include different data whenever the need arises.
If some problems remain, you should pinpoint whether they occur while reading the file into PHP, while exchanging data with MySQL, or while presenting the result in HTML. The string "Petrovi\xc4\x87" is the UTF-8 representation of your example, and "Petrovi\xe6" is the latin2 form. You can use these strings to explicitly pass on data with a known encoding, or to check an incoming value against one of them.
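As a sketch of that pinpointing step (assuming, hypothetically, that the surname sits in the second CSV column of the question's code):
// Probe one CSV field against the two known byte sequences:
$utf8   = "Petrovi\xc4\x87"; // "Petrović" as UTF-8 bytes
$latin2 = "Petrovi\xe6";     // "Petrović" as ISO-8859-2 bytes
if ($nesto[1] === $latin2) {
    echo "file is latin2 -> use mysql_set_charset('latin2')\n";
} elseif ($nesto[1] === $utf8) {
    echo "file is already utf8 -> use mysql_set_charset('utf8')\n";
}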
It shouldn't be a problem importing a CSV file into a database if both the file and the database collation are UTF-8.
<?php
$db = @mysql_connect('localhost', 'user', 'pass');
@mysql_select_db('my_database');

$CSVFile = "file.csv";
mysql_query('LOAD DATA LOCAL INFILE "' . $CSVFile . '" INTO TABLE my_table
    FIELDS TERMINATED BY "," LINES TERMINATED BY "\\r\\n";');

mysql_close($db);
?>
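If the file's encoding does not match the connection charset, LOAD DATA can also be told the file encoding explicitly; a hedged variant of the same call (assumes a reasonably recent MySQL and a UTF-8 file):
// Hedged sketch: the CHARACTER SET clause makes the server convert from
// the file's encoding to the column charset while loading:
mysql_query('LOAD DATA LOCAL INFILE "' . $CSVFile . '" INTO TABLE my_table
    CHARACTER SET utf8
    FIELDS TERMINATED BY "," LINES TERMINATED BY "\\r\\n";');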
You can also import your CSV in phpMyAdmin:
Import -> format = CSV, and click on "Import".
Or, if you don't want to use phpMyAdmin:
BULK INSERT csv_dump
FROM 'c:\file.csv'
WITH
(
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
I'm generating an XML file with PHP using DOMDocument, and I need to handle Asian characters. I'm pulling data from an MSSQL 2008 server using the pdo_mssql driver, and I apply utf8_encode() to the XML attribute values. Everything works fine as long as there are no special characters.
The server is MS SQL Server 2008 SP3
The database, table and column collations are all SQL_Latin1_General_CP1_CI_AS
I'm using PHP 5.2.17
Here's my PDO object:
$pdo = new PDO("mssql:host=MyServer,1433;dbname=MyDatabase", user123, password123);
My query is a basic SELECT.
I know storing special characters in SQL_Latin1_General_CP1_CI_AS columns isn't great, but ideally it would be nice to make it work without changing that, because other non-PHP programs already use the column and work fine. In SQL Server Management Studio I can see the Asian characters correctly.
Considering all the details above, how should I process the data?
I found out how to solve it, so hopefully this will be helpful to someone.
First, SQL_Latin1_General_CP1_CI_AS is a strange mix of CP-1252 and UCS-2.
The basic characters are CP-1252, which is why utf8_encode() was all I needed as long as there were no special characters. The Asian and other non-Latin characters are encoded on 2 bytes, and the PHP pdo_mssql driver seems to hate varying-length characters: it appears to CAST to varchar (instead of nvarchar), and then all the 2-byte characters become question marks ('?').
I fixed it by casting the column to binary and then rebuilding the text with PHP:
SELECT CAST(MY_COLUMN AS VARBINARY(MAX)) FROM MY_TABLE;
In php:
// Binary to hexadecimal
$hex = bin2hex($bin);

// And then from hex back to a byte string
$str = "";
for ($i = 0; $i < strlen($hex) - 1; $i += 2)
{
    $str .= chr(hexdec($hex[$i].$hex[$i+1]));
}

// And then from UCS-2LE/SQL_Latin1_General_CP1_CI_AS (that's the column
// format in the DB) to UTF-8
$str = iconv('UCS-2LE', 'UTF-8', $str);
I know this post is old, but the only thing that worked for me was:
iconv("CP850", "UTF-8//TRANSLIT", $var);
I had the same issue with SQL_Latin1_General_CP1_CI_AI; maybe it works for SQL_Latin1_General_CP1_CI_AS too.
You can try this:
header("Content-Type: text/html; charset=utf-8");
$dbhost = "hostname";
$db = "database";
$query = "SELECT *
FROM Estado
ORDER BY Nome";
$conn = new PDO( "sqlsrv:server=$dbhost ; Database = $db", "", "" );
$stmt = $conn->prepare( $query, array(PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL, PDO::SQLSRV_ATTR_CURSOR_SCROLL_TYPE => PDO::SQLSRV_CURSOR_BUFFERED, PDO::SQLSRV_ENCODING_SYSTEM) );
$stmt->execute();
while ( $row = $stmt->fetch( PDO::FETCH_ASSOC ) )
{
// CP1252 == code page Latin1
print iconv("CP1252", "ISO-8859-1", "$row[Nome] <br>");
}
For me, none of the above was the direct solution, though I did use parts of those solutions. This worked for me with the Vietnamese alphabet. If you come across this post and none of the above works for you, try:
$req = "SELECT CAST(MY_COLUMN as VARBINARY(MAX)) as MY_COLUMN FROM MY_TABLE";
$stmt = $conn->prepare($req);
$stmt->execute();
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
$str = pack("H*",$row['MY_COLUMN']);
$str = mb_convert_encoding($z, 'HTML-ENTITIES','UCS-2LE');
print_r($str);
}
And a little bonus: I had to json_encode this data and was (duh) getting HTML entity codes instead of the special characters. To fix it, just use html_entity_decode() on the strings before passing them to json_encode().
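A short sketch of that bonus step, continuing from the loop above:
// Undo the HTML-ENTITIES step before JSON-encoding, so the client
// receives real UTF-8 characters instead of &#...; codes:
$clean = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
echo json_encode(array('name' => $clean));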
No need for crazy stuff. The character encoding behind the SQL_Latin1_General_CP1_CI_AS collation is Windows-1252.
This works perfectly for me: $str = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
By default, PDO uses PDO::SQLSRV_ENCODING_UTF8 for sending/receiving data.
If your current collation is LATIN1, have you tried specifying PDO::SQLSRV_ENCODING_SYSTEM to let PDO know that you want to use the current system encoding instead of UTF-8?
You could even use PDO::SQLSRV_ENCODING_BINARY, which returns data in binary form (no encoding or translation is done when transferring data). This way, you can handle the character encoding on your side.
More documentation here: http://ca3.php.net/manual/en/ref.pdo-sqlsrv.php
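For instance, a minimal sketch of requesting the system encoding for a single statement (constant names as in the PDO_SQLSRV documentation linked above; table and column names are placeholders):
$stmt = $pdo->prepare(
    'SELECT MY_COLUMN FROM MY_TABLE',
    array(PDO::SQLSRV_ATTR_ENCODING => PDO::SQLSRV_ENCODING_SYSTEM)
);
$stmt->execute();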
Thanks @SGr for the answer.
I found a better way of doing it:
SELECT CAST(CAST(MY_COLUMN AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS MY_COLUMN FROM MY_TABLE;
And also try:
SELECT CAST(MY_COLUMN AS VARBINARY(MAX)) AS MY_COLUMN FROM MY_TABLE;
And in PHP you should just convert it to UTF-8:
$string = iconv('UCS-2LE', 'UTF-8', $row['MY_COLUMN']);
I am writing form data sent with jQuery to the database as JSON, using json_encode.
However, the data ends up corrupted in the database.
$db->query("SET NAMES utf8");
$kelime = array("Merhaba","Dünya");
$bilgi = json_encode($kelime);
$incelemeEkle = "
INSERT INTO incelemeRapor SET
bigData = '".$bilgi."'
";
$db->query($incelemeEkle);
Database table schema:
CREATE TABLE `incelemeRapor` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`bigData` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Example of the data as inserted into MySQL:
["Merhaba","Du00fcnya"]
Always escape your data before putting it in an SQL query:
$incelemeEkle = "
INSERT INTO incelemeRapor SET
bigData = '".mysql_real_escape_string($bilgi)."'
";
(added mysql_real_escape_string() call)
json_encode() encodes non-ASCII characters using the \u<code-point> notation, so json_encode(array("Merhaba","Dünya")); returns ["Merhaba","D\u00fcnya"].
Then this string is embedded in an SQL query:
INSERT INTO incelemeRapor SET
bigData = '["Merhaba","D\u00fcnya"]'
The escape sequence \u has no special meaning in MySQL, so MySQL just removes the \, and this results in ["Merhaba","Du00fcnya"] being stored in the database.
So if you escape the string, the query becomes:
$incelemeEkle = "
INSERT INTO incelemeRapor SET
bigData = '["Merhaba","D\\u00fcnya"]'
";
And ["Merhaba","D\u00fcnya"] is stored in the database.
I tried mysql_real_escape_string() but it did not work for me (it resulted in an empty field in the database).
So I looked here: http://php.net/manual/fr/json.constants.php and the JSON_UNESCAPED_UNICODE flag worked fine for me:
$json_data = json_encode($data,JSON_UNESCAPED_UNICODE);
Note that JSON_UNESCAPED_UNICODE is available only since PHP 5.4.0!
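A quick sketch of the difference (assumes PHP >= 5.4):
$data = array("Merhaba", "Dünya");
echo json_encode($data), "\n";                         // ["Merhaba","D\u00fcnya"]
echo json_encode($data, JSON_UNESCAPED_UNICODE), "\n"; // ["Merhaba","Dünya"]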
So in addition to ensuring that your database is using utf8_unicode_ci, you also want to make sure PHP is using the proper encoding. Typically I run the following two commands at the top of any function that might handle foreign characters. Even better is to run them among the first commands when your app starts:
mb_language('uni');
mb_internal_encoding('UTF-8');
Those two lines have saved me a ton of headaches!
Like user576875 says, you just need to treat your string correctly before inserting it into the database. mysql_real_escape_string() is one way to do that; prepared statements are another. They will also save you from the SQL injection security issue that you might be susceptible to if you write user input directly into SQL. Always use one of those two methods.
Also, note that this has little to do with UTF-8. JSON is ASCII-safe, so as long as you use an ASCII-compatible character set (utf8, iso-8859-1), the data will be inserted and stored correctly.
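A small demonstration of that last point:
// The default json_encode output is plain ASCII, so it survives any
// ASCII-compatible connection charset once properly escaped:
$json = json_encode(array('Dünya')); // ["D\u00fcnya"] - ASCII only
var_dump(json_decode($json, true));  // array(0 => "Dünya")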
I would apply Base64 encoding to the JSON string. This should work with nearly every PHP setting, database, database version and configuration:
$values = array("Test" => 1, "the" => 2, "West" => 3);
$encoded = base64_encode(json_encode($values));
$decoded = json_decode(base64_decode($encoded), true);
I am trying to store a list of countries in a MySQL database.
I am having problems storing non-English names like these:
São Tomé and Príncipe
República de El Salvador
They are stored with strange characters in the DB (and are therefore output strangely in my HTML pages).
I have tried different combinations of collations for the database and for the MySQL connection:
The "obvious" setting was to use utf8_unicode_ci for both the database and the connection. To my utter surprise, that did not solve the problem.
Does anyone know how to resolve this issue?
[Edit]
It turns out the problem is not to do with collation, but rather with encoding, as pointed out by the Col. I notice that at the command line, I can type two separate commands:
SET NAMES utf8
followed by
[QUERY]
where [QUERY] is my SQL statement to retrieve the names, and that works (the names are no longer mangled). However, when I do the same thing programmatically (i.e. through code), I still get the mangled names. I tried combining the two statements like this:
SET NAMES utf8; [QUERY]
at the command line; again, this returned the correct strings. Once again, when I tried the same statements through code, I got the wrong values.
This is a snippet of what my code looks like:
$mysqli = self::get_db_connection();
$mysqli->query('SET NAMES utf8');
$sql = 'SELECT id, name FROM country';
$results = self::fetch($sql);
The fetch method is:
private static function fetch($query)
{
    $rows = array();
    if (!empty($query))
    {
        $mysqli = self::get_db_connection();
        if ($mysqli->connect_errno)
        {
            self::logError($mysqli->connect_error);
        }
        else
        {
            if ($result = $mysqli->query($query))
            {
                if (is_object($result)) {
                    while ($row = $result->fetch_array(MYSQLI_ASSOC)) {
                        $rows[] = $row;
                    }
                    $result->close();
                }
            }
        }
    }
    return $rows;
}
Can anyone spot what I may be doing that's wrong?
Just to clarify, the HTTP headers in the page are set correctly
'Content-type': 'text/html; charset=utf-8'
so that's not the issue here.
As a matter of fact, collation affects nothing of the kind: it is used for ordering and comparison, not for recoding.
It is the encoding that is responsible for the characters themselves.
So your problem comes not from the table collation but from the connection encoding. A
SET NAMES utf8
query should solve the problem, at least for the newly inserted data.
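With mysqli specifically, there is also a dedicated call that, unlike a plain SET NAMES query, keeps the client library informed about the charset as well; a hedged sketch against the connection from your snippet:
$mysqli = self::get_db_connection();
$mysqli->set_charset('utf8'); // preferred over $mysqli->query('SET NAMES utf8')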
If you use UTF-8 everywhere*, it will work; it seems like you forgot something.
*Everywhere means: for your database collation and connection, for your (PHP?) script files, and for the pages that are sent to the browser (by setting a meta tag or, better, a UTF-8 header).