Reading JSON-encoded UTF-8 data from PHP service into Ruby app - php

My brain hurts, so I need help to solve this issue. I have read about many similar encoding problems but I can't find any info that helps me with this particular issue.
I have a PHP service which reads data from database. It sets the charset to:
mysql_set_charset('utf8', $con)
Basically it then executes a query and loops thru the db items like this:
while($row = mysql_fetch_object($result)) {
$row->MyField1 = utf8_encode($row->MyField1);
$row->MyField2 = utf8_encode($row->MyField2);
...
$res[] = $row;
}
and ends with:
print json_encode($res);
I then read the data from Ruby (a sinatra app) with:
uri = URI(str)
source = Net::HTTP.get(uri)
src = JSON.parse(source)
src.each do |s|
# Display s.MyField1 in HTML here.... HTML page is HTML5 and <meta charset="utf-8">
end
The problem is that I display strings like:
ALLMÃNHETENS ÃKNING
where 'Ã' is some unknown character to me. It should properly be 'Ä' in the first occurrence and 'Å' (Swedish A with 'ring diacritic' http://en.wikipedia.org/wiki/Ring_(diacritic)).
Is it the PHP code that is wrong? Or the Ruby code? If anyone could point me in the right direction I would be very grateful? Frankly I don't know where to start chasing bugs.

OK It seems the answer is very accurately described here:
http://www.i18nqa.com/debug/bug-utf-8-latin1.html
And the important thing is that my PHP-script must add:
Content-type: application/json; charset=utf-8
otherwise the UTF-8 will have their individual bytes interpreted as ISO-8859-1 or Windows-1252 causing exactly the described bevahiour above.
IMPORTANT ADDITION: I also had to remove the utf8_encode() call in the PHP-script since the data already was UTF-8 encoded (or maybe the mysql_set_charset("utf8", $con) did that thing, I don't know).

Related

Why does my output change?

I'm working with UTF-8 encoding in PHP and I keep managing to get the output just as I want it. And then without anything happening with the code, the output all of a sudden changes.
Previously I was getting hebrew output. Now I'm getting "&&&&&".
Any ideas what might be causing this?
These are most common problems:
Your editor that you’re creating the PHP/HTML files in
The web browser you are viewing your site through
Your PHP web application running on the web server
The MySQL database
Anywhere else external you’re reading/writing data from (memcached, APIs, RSS feeds, etc)
And few things you can try:
Configuring your editor
Ensure that your text editor, IDE or whatever you’re writing the PHP code in saves your files in UTF-8 format. Your FTP client, scp, SFTP client doesn’t need any special UTF-8 setting.
Making sure that web browsers know to use UTF-8
To make sure your users’ browsers all know to read/write all data as UTF-8 you can set this in two places.
The content-type tag
Ensure the content-type META header specifies UTF-8 as the character set like this:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
The HTTP response headers
Make sure that the Content-Type response header also specifies UTF-8 as the character-set like this:
ini_set('default_charset', 'utf-8')
Configuring the MySQL Connection
Now you know that all of the data you’re receiving from the users is in UTF-8 format we need to configure the client connection between the PHP and the MySQL database.
There’s a generic way of doing by simply executing the MySQL query:
SET NAMES utf8;
…and depending on which client/driver you’re using there are helper functions to do this more easily instead:
With the built in mysql functions
mysql_set_charset('utf8', $link);
With MySQLi
$mysqli->set_charset("utf8")
*With PDO_MySQL (as you connect)*
$pdo = new PDO(
'mysql:host=hostname;dbname=defaultDbName',
'username',
'password',
array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8")
);
The MySQL Database
We’re pretty much there now, you just need to make sure that MySQL knows to store the data in your tables as UTF-8. You can check their encoding by looking at the Collation value in the output of SHOW TABLE STATUS (in phpmyadmin this is shown in the list of tables).
If your tables are not already in UTF-8 (it’s likely they’re in latin1) then you’ll need to convert them by running the following command for each table:
ALTER TABLE myTable CHARACTER SET utf8 COLLATE utf8_general_ci;
One last thing to watch out for
With all of these steps complete now your application should be free of any character set problems.
There is one thing to watch out for, most of the PHP string functions are not unicode aware so for example if you run strlen() against a multi-byte character it’ll return the number of bytes in the input, not the number of characters. You can work round this by using the Multibyte String PHP extension though it’s not that common for these byte/character issues to cause problems.
Taken form here: http://webmonkeyuk.wordpress.com/2011/04/23/how-to-avoid-character-encoding-problems-in-php/
Try after setting the content type with header like this
header('Content-Type: text/html; charset=utf-8');
Try this function - >
$html = "Bla Bla Bla...";
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
for more - http://php.net/manual/en/function.mb-convert-encoding.php
I put together this method and called it in the file I'm working with, and that seemed to resolve the issue.
function setutf_8()
{
header('content-type: text/html; charset: utf-8');
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
}
Thank you for all your help! :)

Getting special characters out of a MySQL database with PHP [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 years ago.
I have a table that includes special characters such as ™.
This character can be entered and viewed using phpMyAdmin and other software, but when I use a SELECT statement in PHP to output to a browser, I get the diamond with question mark in it.
The table type is MyISAM. The encoding is UTF-8 Unicode. The collation is utf8_unicode_ci.
The first line of the html head is
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I tried using the htmlentities() function on the string before outputting it. No luck.
I also tried adding this to php before any output (no difference):
header('Content-type: text/html; charset=utf-8');
Lastly I tried adding this right below the initial mysql connection (this resulted in additional odd characters being displayed):
$db_charset = mysql_set_charset('utf8',$db);
What have I missed?
Below code works for me.
$sql = "SELECT * FROM chartest";
mysql_set_charset("UTF8");
$rs = mysql_query($sql);
header('Content-type: text/html; charset=utf-8');
while ($row = mysql_fetch_array($rs)) {
echo $row['name'];
}
There are a couple things that might help. First, even though you're setting the charset to UTF-8 in the header, that might not be enough. I've seen the browser ignore that before. Try forcing it by adding this in the head of your html:
<meta charset='utf-8'>
Next, as mentioned here, try doing this:
mysql_query ("set character_set_client='utf8'");
mysql_query ("set character_set_results='utf8'");
mysql_query ("set collation_connection='utf8_general_ci'");
EDIT
So I've just done some reading up an playing around a bit. First let me tell you, despite what I mentioned in the comments, utf8_encode() and utf8_decode() will not help you here. It helps to actually understand UTF-8 encoding. I found the Wikipedia page on UTF-8 very helpful. Assuming the value you are getting back from the database is in fact already UTF-8 encoded and you simply dump it out right after getting it then it should be fine.
If you are doing anything with the database result (manipulating the string in any way especially) and you don't use the unicode aware functions from the PHP mbstring library then it will probably mess it up since the standard PHP string functions are not unicode aware.
Once you understand how UTF-8 encoding works you can do something cool like this:
$test = "™";
for($i = 0; $i < strlen($test); $i++) {
echo sprintf("%b ", ord($test[$i]));
}
Which dumps out something like this:
11100010 10000100 10100010
That's a properly encoded UTF-8 '™' character. If you don't have a character like that in your data retrieved from the database then something is messed up.
To check, try searching for a special character that you know is in the result using mb_strpos():
var_dump(mb_strpos($db_result, '™'));
If that returns anything other than false then the data from the database is fine, otherwise we can at least establish that it's a problem between PHP and the database.
you need to execute the following query first.
mysql_query("SET NAMES utf8");

UTF-8 data received by php isn't decoded

I'm having some troubles with my $_POST/$_REQUEST datas, they appear to be utf8_encoded still.
I am sending conventional ajax post requests, in these conditions:
oXhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded; charset=utf-8");
js file saved under utf8-nobom format
meta-tags in html <header> tag setup
php files saved under utf-8-nobom format as well
encodeURIComponent is used but I tried without and it gives the same result
Ok, so everything is fine: the database is also in utf8, and receives it this way, pages show well.
But when I'm receiving the character "º" for example (through $_REQUEST or $_POST), its binary represention is 11000010 10111010, while "º" hardcoded in php (utf8...) binary representation is 10111010 only.
wtf? I just don't know whether it is a good thing or not... for instance if I use "#º#" as a delimiter of the explode php function, it won't get detected and this is actually the problem which lead me here.
Any help will be as usual greatly appreciated, thank you so much for your time.
Best rgds.
EDIT1: checking against mb_check_encoding
if (mb_check_encoding($_REQUEST[$i], 'UTF-8')) {
raise("$_REQUEST is encoded properly in utf8 at index " . $i);
} else {
raise(false);
}
The encoding got confirmed, I had the message raised up properly.
Single byte utf-8 characters do not have bit 7(the eight bit) set so 10111010 is not utf-8, your file is probably encoded in ISO-8859-1.

How to read Asiatic characters (Japanese, Chinese) after json_encode in PHP

I've read every post about the topic but I don't think I've found a reply to my question, that's driving me crazy.
I got a couple of php files, one stores data into mySQL db, another one read those data: I get data from all over the world and it seems that I succeed to store asiatic character in a right way, but when I try to read those data I can't get those characters back.
As many other users I got ?? instead of the correct chars.
Top of my PHP files I got:
header('Content-Type: application/json; charset=utf-8');
then
mysql_query("SET CHARACTER SET utf8", $link);
mysql_query("SET NAMES 'utf8'", $link);
then
$fab[] = array_map(utf8_encode,$array);
Here if I print_r ($fab) I lost asiatic chars :-(
Then when I do:
$json_string = json_encode($fab); //originale
What I get is "??".
How is the correct way to get the right chars back? The json string is then passed
to an iPhone client.
Any suggestion or help would be sooo appreciated.
Thank you anyway,
Fabrizio
Seems like you're double encoding it? If you get the data from mysql which is already utf8 encoded, what's the point of $fab[] = array_map(utf8_encode,$array); then?
Just had similar thing 2 days ago, when I was accepting utf8 data from an ExtJs form and it was messed up. It was cause I used utf8_encode on the data I received from the script (which was in utf8). So i broke it by double encoding. Maybe same in your case
The problem was what Tseng said: double encoding on the array: I thought I made the right test but simply I didn't.
So the only code I need is:
while($obj = mysql_fetch_object($rs)) {
$arr[] = $obj;
}
$json_string = json_encode($arr);
echo ($json_string);
Again Tseng, thanx.

Jquery ajax call and charset windows-1252

Dear stackoveflow, I have this problem. I'm working with an old version of mssql (2000) that has all the tables encoded in windows 1252 (and that's it). I can write and read succesfully with php using this line:
<?php header('Content-Type: text/html; charset=windows-1252'); ?>
If I make a normal post everything works as expected, If I do it ajax style I write messed characters in the table. I've also tried this:
contentType: "application/x-www-form-urlencoded;charset=windows-1252",
With no luck. Can anybody help me?
Thank you
I think it is possible to change the character set for incoming data from the Ajax request in Javascript somehow, butt IIRC, it's complex and is likely to have cross browser issues.
If you are querying a PHP script, the easiest way woudl be to convert the data to UTF-8 there:
$data = "Höllo, thüs üs windows-1252 encoded data";
$data_utf8 = iconv("windows-1252", "utf-8", $data);
echo $data;

Categories