Weird encoding issue in XML-RPC call - php

I'm retrieving a list of partners from Odoo 9 on Ubuntu 14.04 (English) via XML-RPC, using PHP and ripcord.
Some names contain one or more diacritics:
Pièr
Frère Pièr
All those names have been entered from a single computer running Windows 8.1 using one version of Chrome.
The strange fact is that I get a list where some diacritics are correct, while others have encoding problems, like:
Pi�r
Fr�re Pièr
The same diacritic in the same string may or may not be encoded correctly.
In subsequent calls the result is always the same.
If I edit the string, the results may change, giving:
Frère Pi�r
Frère Pièr
Fr�re Pi�r...
I need to output JSON, and thus I need to encode this in UTF-8: but that is currently impossible, since I don't have a clue what encoding the original text is in (and it seems to have no consistent encoding at all!)
Any idea?

I found out that the incoming array was in charset "Latin1" (ISO-8859-1).
I solved it by normalizing the array generated from the XML-RPC output, recursively applying a multibyte conversion function:
// given an XML-RPC output named $arr_output...
function descramble_diacritics(&$entry, $key) {
    if (is_string($entry)) { // skip non-string leaves (ints, booleans, null)
        $entry = mb_convert_encoding($entry, 'UTF-8', 'Latin1');
    }
}
array_walk_recursive($arr_output, 'descramble_diacritics');
header('Access-Control-Allow-Origin: *');
header('Content-Type: application/json');
echo json_encode($arr_output);
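For reference, a quick way to confirm what the XML-RPC layer is actually handing you is mb_detect_encoding in strict mode (the field name below is hypothetical):

// Strict mode returns false instead of guessing when no candidate matches;
// 'ISO-8859-1' matches any byte sequence, so this effectively answers
// "is it valid UTF-8, or something single-byte?"
$sample = $arr_output[0]['name']; // hypothetical field from the XML-RPC result
var_dump(mb_detect_encoding($sample, array('UTF-8', 'ISO-8859-1'), true));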

Related

encoding issues in drupal when importing from wordpress

I am currently moving blog posts from WordPress to Drupal. However, after moving them, some of the text is not displayed correctly.
WordPress is displaying:
When it hasn’t (html code is <h2>When it hasn’t</h2>)
Drupal is displaying:
When it hasn’t (html code is <h2>When it hasn’t</h2>)
In both the WordPress and the Drupal DB the value is correct. The source is the same:
<h2>When it hasn’t</h2>
I did a search and found many options. None of them helped.
Below are the ones I have done and checked.
1) I double-checked that UTF-8 is the character encoding in Drupal and WP.
I also made a simple test.php file to check that nothing else was getting in the way, and it still did not display correctly.
2) I made sure UTF-8 is used when we take a mysqldump and upload it to Drupal.
3) I also made sure the .php file is saved as UTF-8.
4) I changed the encoding type in Chrome to every option available, and nothing displayed it correctly.
5) I also used PHP functions to recode it, but they did not work:
$value2 = "<h2>When it hasn’t</h2>";
$out = recode_string('..utf-8', $value2);
// output - When it hasnt
$out2 = mb_convert_encoding($value2, 'UTF-8', 'UTF-8');
// output - When it hasn’t
$out3 = @iconv('UTF-8', 'utf-8', $value2);
// output - When it hasn’t
I have run out of options now and I am stuck. Please help.
You say the text in both databases is correct, but actually this doesn't mean much: to view the content of a record you must use some client, and quite a few transformations may happen depending on how the text is rendered so you can read it.
So only two things matter:
the encoding of the column
the encoding of the HTML page returned by Drupal
Since your page outputs ’ (the byte sequence 0xE2 0x80 0x99 read as CP1252) for ’ (Unicode U+2019, encoded in UTF-8 as 0xE2 0x80 0x99), I guess the column is indeed UTF-8; however, something between the database and the browser thinks the text is CP1252. This is what you have to check (a minimal sketch follows the list):
If using MySQL, the connection encoding must be UTF-8, so that what you have in your PHP script is UTF-8 text. You can use SET NAMES 'utf8'. Note that if you don't need the full Unicode set you can even use CP1252: the only important thing is that you know the encoding, since PHP strings are just byte arrays.
Explicitly define the response encoding in the HTTP Content-Type header, i.e. configure Drupal to call header('Content-Type: text/html; charset=utf-8');
If the HTTP response encoding is different from the one used for the text retrieved from the DB, transcode the query result accordingly.
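A minimal sketch of the first two checks, assuming mysqli and invented connection parameters:

$db = new mysqli('localhost', 'user', 'pass', 'drupal'); // hypothetical credentials

// 1) Make the MySQL connection speak UTF-8, so the PHP script receives
// UTF-8 bytes (the API-level equivalent of SET NAMES):
$db->set_charset('utf8');

// 2) Declare the same encoding in the HTTP response:
header('Content-Type: text/html; charset=utf-8');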

UTF-8 data received by php isn't decoded

I'm having some trouble with my $_POST/$_REQUEST data; it still appears to be utf8_encoded.
I am sending conventional ajax post requests, in these conditions:
oXhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded; charset=utf-8");
JS file saved in UTF-8 (no BOM) format
meta tags set up in the HTML <head>
PHP files saved in UTF-8 (no BOM) format as well
encodeURIComponent is used, but I tried without it and it gives the same result
OK, so everything is fine: the database is also in UTF-8 and receives it this way, and pages display well.
But when I receive the character "º" for example (through $_REQUEST or $_POST), its binary representation is 11000010 10111010, while "º" hardcoded in PHP (UTF-8...) has the binary representation 10111010 only.
I just don't know whether this is a good thing or not... for instance, if I use "#º#" as a delimiter for PHP's explode function, it won't get detected, and this is actually the problem which led me here.
Any help will be as usual greatly appreciated, thank you so much for your time.
Best regards.
EDIT 1: checking against mb_check_encoding:
if (mb_check_encoding($_REQUEST[$i], 'UTF-8')) {
    raise("$_REQUEST is encoded properly in utf8 at index " . $i);
} else {
    raise(false);
}
The encoding got confirmed; I had the message raised properly.
Single-byte UTF-8 characters do not have bit 7 (the eighth bit) set, so 10111010 is not UTF-8; your file is probably encoded in ISO-8859-1.
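A short demonstration of the byte difference (the literals are written as raw escaped bytes, so this runs the same regardless of the script file's own encoding):

// "º" is U+00BA: two bytes in UTF-8 (0xC2 0xBA = 11000010 10111010),
// but a single byte in ISO-8859-1 (0xBA = 10111010).
$utf8 = "\xC2\xBA";
$iso  = "\xBA";
echo bin2hex($utf8); // c2ba
echo bin2hex($iso);  // ba
// Transcoding the ISO-8859-1 byte makes the two compare equal:
var_dump($utf8 === mb_convert_encoding($iso, 'UTF-8', 'ISO-8859-1')); // bool(true)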

Ajax autocompletion lost in character encoding

The scheme is a text input field in an HTML form to be autocompleted using jQuery.autocomplete, getting an appropriate server response (e.g. a JSON list of city names). The whole package works well... except that the client does not get the data returned from the server when typing accented characters (éèà...). Like many others, it looks like I'm facing a character encoding issue, but I cannot figure out where and how to solve it, despite many tries (iconv, utf8_encode, urldecode...) and readings like this one, for example.
Therefore I need some help/hints to understand where to act (before reworking the jQuery autocomplete code...?).
EDIT: it might also be a jQuery accent-folding issue; I'll try that approach as well.
Configuration:
server: Apache 2.2 (Debian Lenny)
php: compiled 5.3.3 (so the JSON_UNESCAPED_UNICODE option is not available for json_encode)
mysql: 5.1.49, with MySQL charset: UTF-8 Unicode (utf8)
class: using a modified PFBC 2.x version for the PHP form building
meta: the website is mostly for French users, so it was all designed with ISO-8859-1 (a bad initial choice, I guess):
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
jQuery autocomplete code (applied to the city input field)
// DEBUG Testing (tested w/ and w/o the $charset_attr: no change)
$charset_attr = 'contentType: "application/x-www-form-urlencoded;charset=ISO-8859-1"';
echo 'jQuery("#' . $this->attributes["id"] . '").autocomplete({source:"' , $this->xhr_path . '", minLength:2, ' . $charset_attr .'});';
The generated code for that input field matches the above expectation.
Converting MySQL rows into UTF-8 using this function:
I convert the MySQL returned array into UTF-8 prior to sending the JSON back to the client (a sketch of such a helper follows the snippet). Actually, I tested and wrote other functions as well, but this does not change anything, so I guess the point is not there.
$encoded_arr = utf8json($returnData);
echo json_encode($encoded_arr);
flush();
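utf8json() above is the asker's own helper and is not shown in the question; a minimal recursive equivalent, assuming the rows come back as Latin-1, might look like this:

function utf8json($data) {
    if (is_array($data)) {
        return array_map('utf8json', $data); // keys are preserved for a single array
    }
    return is_string($data) ? mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1') : $data;
}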
Encoding control 1 (client side)
An embedded control in the HTML form, in order to check which character encoding is actually passed to jQuery.autocomplete:
jQuery(document).ready(function() {
    <?php
        $test_str = "foobar";
        $check_encoding = "'" . mb_detect_encoding($test_str) . "'";
    ?>
    alert('Check charset server encoding: ' + <?php echo $check_encoding; ?>); // output: ASCII
});
Encoding control 2 (server side)
$inputData = isset($_GET['term']) ? htmlspecialchars($_GET['term'], ENT_COMPAT, 'UTF-8') : NULL;
$encoding_get = mb_detect_encoding($_GET['term']);
$encoding_data = mb_detect_encoding($inputData);
$utf8converted = @iconv(strtolower($encoding_get), 'utf-8', $inputData);
$checkconversion = mb_detect_encoding($utf8converted);
Sending lowercase unaccented characters (ea...), I get everything as ASCII.
Sending lowercase accented characters (éèà...), I get everything as UTF-8.
So I'm lost: the server receives the proper character string and produces a JSON return (tested without Ajax), but it looks like the client does not receive or interpret it properly.
For those facing the same kind of ...%$# issue, here is what I've done to solve my case:
Checked the character encoding at each node (e.g. client, Apache server, MySQL server), using mb_detect_encoding on the server side.
Finally pinpointed the problem node: in my case, UTF-8 characters were being passed to the MySQL server instead of Latin ISO-8859-1, so the MySQL server did not return the expected answers; this I could not detect or debug by POSTing data directly to the server script via the URL, so I had to log the input and output in a file, checking the entry character encoding and the MySQL server output.
Changed the Ajax request to POST instead of GET.
Solved by converting the $_POST data to ISO-8859-1 prior to sending the MySQL server request, using mb_convert_encoding, as well described here (a sketch follows).
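A minimal sketch of that last step, with table, column, and parameter names invented; the UTF-8 value arriving in $_POST is transcoded to the Latin-1 the database expects before querying:

$term = mb_convert_encoding($_POST['term'], 'ISO-8859-1', 'UTF-8');
$stmt = $db->prepare("SELECT name FROM cities WHERE name LIKE CONCAT(?, '%')");
$stmt->bind_param('s', $term);
$stmt->execute();
// Remember the results come back as Latin-1 too: convert them back to
// UTF-8 before json_encode(), which requires UTF-8 input.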

Encoding problem in PHP while making a webservice call

I have the nth encoding-related problem with PHP!
So here is the story:
I read a URL from a file (ISO-8859-1). I can't change the encoding of this file, for various reasons I won't discuss here.
I use that URL to make a call to a REST web service.
The URL happens to contain the symbol "è", which is converted to � when it is loaded by the PHP engine.
As a result, the web service returns an unexpected result, because what it gets is actually the word "perch�" instead of "perchè".
I tried to force PHP to work with ISO-8859-1 by doing:
ini_set('default_charset', "ISO-8859");
The problem is that it still doesn't work and the web service doesn't answer properly. I am sure that the web service works, as I tried copy-pasting the URL by hand into a browser and received the expected data.
You can convert data from one character set into another using iconv().
Your REST web service is most likely expecting UTF-8 data, so you would have to do something like this:
$data = iconv("iso-8859-1", "utf-8", $data);
before sending the request.
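A slightly fuller sketch under the same assumption that the service expects UTF-8 (file name and query parameter are invented): read the ISO-8859-1 value, transcode it, and percent-encode it before the call:

$word = trim(file_get_contents('query-word.txt')); // ISO-8859-1 bytes, e.g. "perchè"
$word = iconv('ISO-8859-1', 'UTF-8', $word);       // now UTF-8
$url = 'http://example.com/rest/search?q=' . rawurlencode($word);
$response = file_get_contents($url);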

Is the PHP serialize function compatible with UTF-8?

I have a site I want to migrate from ISO-8859-1 to UTF-8.
I have a record in the database, indexed by the following primary key:
s:22:"Informations générales";
The problem is that now (with UTF-8), when I serialize the string, I get:
s:24:"Informations générales";
(notice the size of the string is now the number of bytes, not the character length)
So this is not compatible with previous non-UTF-8 records!
Did I do something wrong? How could I fix this?
Thanks
The behaviour is completely correct: two strings with different encodings will generate different byte streams, and thus different serialization strings.
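This is easy to see directly: serialize() records the byte length of a string, so the same text under two encodings produces different s:N: prefixes. A small demonstration (this script is assumed to be saved as UTF-8):

$utf8 = 'Informations générales';                          // each é is 2 bytes
$iso  = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // each é is 1 byte
echo serialize($utf8); // s:24:"Informations générales";
echo serialize($iso);  // s:22:"..." (same text, Latin-1 bytes)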
1) Dump the database in latin1.
2) On the command line:
sed -e 's/latin1/utf8/g' -i ./DBNAME.sql
3) Import the converted file into a new database in UTF-8.
4) Use a PHP script to update each field: make a query, loop through each field, and update the serialized string using this:
$str = preg_replace_callback('!s:(\d+):"(.*?)";!s', function ($m) {
    // recompute the byte length stored in each s:N:"..." token
    return 's:' . strlen($m[2]) . ':"' . $m[2] . '";';
}, $str);
After that, I was able to use unserialize() and everything worked with UTF-8 (a sketch of the whole update loop follows).
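A sketch of that update loop, with table and column names invented, assuming mysqli:

function fix_serialized_lengths($str) {
    // recompute the byte length stored in each s:N:"..." token
    return preg_replace_callback('!s:(\d+):"(.*?)";!s', function ($m) {
        return 's:' . strlen($m[2]) . ':"' . $m[2] . '";';
    }, $str);
}

$res = $db->query('SELECT id, payload FROM records'); // hypothetical table/columns
while ($row = $res->fetch_assoc()) {
    $fixed = fix_serialized_lengths($row['payload']);
    $upd = $db->prepare('UPDATE records SET payload = ? WHERE id = ?');
    $upd->bind_param('si', $fixed, $row['id']);
    $upd->execute();
}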
To unserialize a UTF-8 encoded serialized array:
$array = @unserialize($arrayFromDatabase);
if ($array === false) {
    $array = @unserialize(utf8_decode($arrayFromDatabase)); // decode first
    $array = array_map('utf8_encode', $array); // encode the array again
}
PHP 4 and 5 do not have built-in Unicode support; PHP 6 was meant to add more Unicode support, although it's unclear how complete that would have been.
You did nothing wrong. PHP prior to v6 just isn't Unicode-aware, and as such doesn't support it unless you add that support yourself (e.g. via the mbstring extension or other means).
We wrote our own wrapper around serialize() to remedy this. You could also move to other serialization techniques, like JSON (with json_encode() and json_decode(), available in PHP since 5.2.0).
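One such alternative, sketched with json_encode()/json_decode(): JSON is defined over Unicode, so UTF-8 strings round-trip with no byte-length bookkeeping:

$blob = json_encode(array('title' => 'Informations générales'));
$data = json_decode($blob, true); // true = return associative arrays
echo $data['title']; // Informations générales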
