Python xml-rpc with PHP client and unicode not working

I have an XML-RPC server written in Python. It takes some values and saves them in a MySQL database. The data is in UTF-8 and the whole process works fine.
I have no problem talking to it from other languages such as Python, ASP.NET and C#, but with PHP there is a problem: the characters are not saved in MySQL as they should be and come out scrambled.
I have followed all the usual recommendations, such as setting the header in the PHP file, and I have also set the MySQL collation to UTF-8, but the problem still exists.
The cURL library used is from GitHub: https://github.com/dongsheng/cURL
The source code is below:
<?php
error_reporting(E_ALL);
header('Content-Type: text/plain; charset=utf-8');
require_once('curl.php');
$rpc = "http://xmlrpc-webservice-address.com/";
$client = new xmlrpc_client($rpc, true);
$text="سلام";
$arr1=array("username", "password", array("111"), $text, "30002240123456", "ws", False);
$resp = $client->call('send', $arr1);
print_r($resp);
print_r("\n");
print_r($text);
?>

Most likely your problem is one of the following (in descending order of likelihood):
Your PHP source code isn't actually encoded as UTF-8, but as, say, CP1256. That means the non-ASCII literal string in the source is actually mojibake nonsense as far as the PHP interpreter, which reads it as UTF-8, is concerned. And those garbage bytes get passed through as-is all the way through the chain—to the XML-RPC service and back, to the browser, and to the user's screen.
Your PHP source code is encoded as UTF-8, but your PHP interpreter thinks it's, say, CP1256, because of the way it (or your server/module) is configured. So, once again, the literal string is mojibake (in the opposite direction), which again passes through the whole chain.
Your web service isn't returning UTF-8, but, say, Latin-1, and your other clients all treat it accordingly as Latin-1, but your PHP client code just assumes it's UTF-8, passes it to the browser as if it were UTF-8, and the user sees garbage.
If you're not absolutely, positively sure that your editor saved the source code as UTF-8, look at the source file in a hex editor. If it's UTF-8, the Arabic string should look like D8 B3 D9 84 D8 A7 D9 85. If it's anything different, like D3 E1 C7 E3 (CP1256) or D3 E4 C7 E5 (ISO-8859-6), that's your problem.
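As a cross-check on what a hex editor would show, here is a minimal Python sketch (Python only because the server side is Python; the variable names are illustrative) that prints the bytes of the same Arabic string under the three encodings mentioned. It assumes Python 3.8 or later for the separator argument of bytes.hex().

```python
# The literal from the question, as a Unicode string.
text = "سلام"

# Encode it with each candidate encoding and collect the bytes
# in the form a hex editor would display them.
encodings = ["utf-8", "cp1256", "iso-8859-6"]
hex_dumps = {enc: text.encode(enc).hex(" ").upper() for enc in encodings}

for enc, dump in hex_dumps.items():
    print(f"{enc:11} {dump}")
```

Comparing the file's actual bytes against these three lines shows which encoding the editor really used.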

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.

I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with the byte x80. Instead, you entered the € character, which is 20AC in Unicode and is expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding maps it to the single byte x80 in CP1252 (the length of 17 and the hex dump in your question both confirm this). That byte on its own is not a valid UTF-8 sequence, so whatever displays the output as UTF-8 shows the Unicode FFFD replacement character instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see â‚¬ or similar.) To say that again, your browser/console does this, PHP is not doing that.
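To make that concrete, here is a small Python sketch (Python rather than PHP, purely for illustration; the variable names are made up). It shows that the CP1252 bytes stay intact and that the replacement character only appears when something decodes those bytes as UTF-8, which is what a UTF-8 browser or console does.

```python
utf8_text = "The price is 15 €"

# In CP1252 the euro sign is the single byte 0x80.
cp1252_bytes = utf8_text.encode("cp1252")

# A viewer that assumes UTF-8 cannot decode 0x80 and substitutes U+FFFD.
shown_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# Decoding with the matching encoding recovers the original text.
shown_as_cp1252 = cp1252_bytes.decode("cp1252")

print(cp1252_bytes)
print(shown_as_utf8)
print(shown_as_cp1252)
```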
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
    $buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
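For comparison, the behaviour the question literally asks for (leave valid UTF-8 sequences alone and escape only the bytes that are not valid UTF-8) can be sketched in Python with the built-in backslashreplace error handler. This is an illustration in a different language, not a PHP solution, and the function name escape_invalid_utf8 is hypothetical.

```python
def escape_invalid_utf8(data: bytes) -> str:
    """Keep valid UTF-8 sequences and turn every other byte into \\xNN."""
    # On decoding, "backslashreplace" emits \xNN for each undecodable byte.
    return data.decode("utf-8", errors="backslashreplace")

# A stray CP1252 byte is escaped ...
escaped = escape_invalid_utf8(b"The price is 15 \x80")
# ... while a correctly encoded euro sign is left untouched.
untouched = escape_invalid_utf8("The price is 15 €".encode("utf-8"))

print(escaped)
print(untouched)
```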

Decoding ISO-8859-1 and Encoding to UTF-8 before MySQL query

I'm kinda stuck if I'm doing it right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db is in utf-8 encoding, which is why I want to convert the file to UTF-8 encoded characters before I send it as a query. First, I rewrite every line of file.txt into file_new.txt using:
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and create a cursor with the following query so that all the data is received as utf-8.
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the table in MySQL utf-8 encoding? Or am I missing a crucial part?
Now to receive this data, I use 'SET NAMES "utf8"' as well. But the received data shows question marks (�) when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
it works fine, but other utf-8 encoded data from the database gets scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to utf-8. Can anyone explain why?
PS: Before I read every line, I replace a character and save file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if that causes any problem with the original file encoding.
import codecs
f1 = codecs.open(tsvpath, 'rb', encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb', encoding='utf8')
for line in f1:
    f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the example below, I have utf-8 encoded Persian data, which is right, but the other non-English text comes out as "question marks". This is precisely my problem.
Example : Removed.

Welcome to the wonderful world of unicode and windows. I've found this site very helpful in understanding what is going wrong with my strings http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor - it may be trying to be helpful and is silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in Hxd and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, it's hard to say where the problem is. My guess is that replacing the double quote with a single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, unicode='utf-8-sig'):
    return open(dst, 'w').write(open(src, 'rb').read().decode(unicode, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
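A variant that names both the source and the target encoding explicitly may be easier to reason about. The following is only a sketch for Python 3, assuming the input really is ISO-8859-1 as the question states; the function name transcode_file and the sample content are made up for the demonstration.

```python
import os
import tempfile

def transcode_file(src, dst, src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding=dst_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)

# Demonstration with a temporary file holding one Latin-1 encoded line.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "file.txt")
dst_path = os.path.join(workdir, "file_new.txt")
with open(src_path, "wb") as f:
    f.write("café\n".encode("iso-8859-1"))

transcode_file(src_path, dst_path)
with open(dst_path, "rb") as f:
    result = f.read()
print(result)
```

Unlike the 'ignore' error handler, this raises an error instead of silently dropping bytes if the declared source encoding cannot decode the input.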

Alright guys, so my encoding was right. The file was getting encoded to utf-8 just as needed. All the queries were right. It turns out that the other dataset, the one in Arabic, was in ISO-8859-1. Therefore only one of them was working, no matter what I did.
The hex editors did help. But in the end I just used Sublime Text to recheck whether my encoded data was utf-8. It turns out the Python script and the Sublime editor did the same thing. So the code is fine. :)

You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the column's CHARACTER SET.

php filter_var FILTER_FLAG_ENCODE_HIGH

I have the following test case for the PHP function filter_var():
<?php
$inputvalue = "Ž"; //NUM = 142 on the ASCII extended list
$sanitized = filter_var($inputvalue, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
echo 'The sanitized output: '.$sanitized."\n"; // --> & #197;& #189; (Å ½)
?>
If you run the above snippet the output is not what I expect to be returned. The Ž is number 142 in the ASCII extended list (see: ascii-code[dot]com). So what I expect to get back is the '& #142;' (string, without the space).
I had help finding out what is going wrong; I just don't know how to solve it yet.
If you convert 'Ž' to Hex UTF-8 bytes you get: C5 BD. These hex bytes correspond with the ISO-8859 hex values: Å ½(see: http://cs.stanford.edu/~miles/iso8859.html). These 2 characters then get decoded by filter_var to '& #197;& #189;'.
See this onlineconverter!!!: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C5%BD&mode=char
So basically what happens: the UTF-8 bytes are interpreted as Latin-1 characters. The converter page says the following: "UTF-8 bytes as Latin-1 characters" is what you typically see when you display a UTF-8 file with a terminal or editor that only knows about 8-bit characters.
I don't think my editor is the problem. I am using a Mac with Coda 2 (UTF-8 as default). The test has also been run on an HTML5 page with the meta character set set to utf-8. Furthermore, I am using a default XAMPP localhost server. With Firebug in Firefox I also checked whether the file was served as UTF-8 (it is).
Does anyone have an idea how I can solve this encoding problem?
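The byte-level mix-up described in the question can be reproduced outside PHP. The following Python sketch is only an illustration with made-up variable names. It assumes that the "ASCII extended list" the question refers to is Windows-1252, where Ž sits at position 142, whereas its Unicode code point is U+017D (381).

```python
char = "Ž"  # U+017D LATIN CAPITAL LETTER Z WITH CARON

# What a UTF-8 encoded source file contains for this character.
utf8_bytes = char.encode("utf-8")

# The same two bytes read one by one as ISO-8859-1 (Latin-1) characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Numeric character references for each of those Latin-1 characters.
entities = "".join(f"&#{ord(c)};" for c in as_latin1)

code_point = ord(char)                  # Unicode code point of the character
cp1252_byte = char.encode("cp1252")[0]  # its position in Windows-1252

print(utf8_bytes.hex(" "), entities, code_point, cp1252_byte)
```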

I am going to drop this because I am not finding any solution. The email() function is also not safe, so I am going to use either phpmailer or swiftmailer (and I am leaning towards the latter).

PHP, IMAP and Outlook 2010 - folder names encoding is different?

I'm developing an e-mail client in PHP (with Symfony2) and I have a problem with folders that have non-ASCII characters in their names.
A folder created in the PHP app is displayed correctly in that app. The same goes for Outlook: a folder created in Outlook looks good in Outlook. In the other cases it does not: a folder created in Outlook is not displayed correctly in PHP, and vice versa.
I'm using UTF-7 to encode folder names in PHP. Which encoding does Outlook use?
Example: a folder named "Wysłąne" (a misspelled Polish word meaning "sent"); the first is encoded in UTF-7 by PHP, and the second was created in Outlook:
PHP:
Wys&xYLEhQ-ne
Outlook:
Wys&AUIBBQ-ne
Why do they differ? How can I get both into the same encoding?

There seems to be a mixup in your source character encoding. imap_utf7_encode (and similar) expect your string in ISO-8859-1 encoding.
AFAICT there is no way to represent Wysłąne in ISO-8859-1. "Wysłąne" represented as UTF-8 becomes (hex bytes)
byte value (hex):   57  79  73  C5 82  C4 85  6E  65
Unicode character:  W   y   s   ł      ą      n   e
The PHP result Wys&xYLEhQ-ne, when decoded, is "Wys얂쒅ne". The two special characters in it are Korean characters with code points U+C582 and U+C485 respectively. So it appears a character-by-character translation is being attempted, in which the UTF-8 bytes of the two characters are interpreted as Unicode code points instead.
The simplest way to fix this is to use the mbstring extension which has the mb_convert_encoding function.
$utf7encoded = mb_convert_encoding($utf8SourceString, "UTF7-IMAP", "UTF-8");
$decodedAsUTF8 = mb_convert_encoding($utf7String, "UTF-8", "UTF7-IMAP");
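To see where the two different folder names come from, here is a rough Python sketch of IMAP's modified UTF-7 (RFC 3501): runs of non-ASCII characters are encoded as UTF-16BE, then base64 with "," instead of "/" and without padding, wrapped in "&" and "-". The function name encode_imap_utf7 is hypothetical, and the last lines reproduce the faulty result by feeding UTF-8 bytes into the base64 step, which is an assumption about what went wrong in the PHP app.

```python
import base64

def _modified_b64(raw: bytes) -> str:
    """Base64 as used by modified UTF-7: no padding, ',' instead of '/'."""
    return base64.b64encode(raw).decode("ascii").rstrip("=").replace("/", ",")

def encode_imap_utf7(name: str) -> str:
    """Encode a mailbox name as IMAP modified UTF-7 (sketch)."""
    out = []
    pending = []  # consecutive characters that need base64 encoding

    def flush():
        if pending:
            out.append("&" + _modified_b64("".join(pending).encode("utf-16-be")) + "-")
            pending.clear()

    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append("&-" if ch == "&" else ch)  # a literal '&' is written as '&-'
        else:
            pending.append(ch)
    flush()
    return "".join(out)

correct = encode_imap_utf7("Wysłąne")

# Faulty variant: base64 over the UTF-8 bytes instead of UTF-16BE.
faulty = "Wys&" + _modified_b64("łą".encode("utf-8")) + "-ne"

print(correct)
print(faulty)
```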

problems with UTF-8 encoding in PHP

The characters I am getting from the URL, for example www.mydomain.com/?name=john, were fine as long as they were not in Russian.
If they were in Russian, I was getting '����'.
So I added $name= iconv("cp1251","utf-8" ,$name); and now it works fine for Russian and English characters, but screws up other languages. :)))
For example 'Jānis' ( Latvian ) that worked fine before iconv, now turns into 'jДЃnis'.
Any idea if there's some universal encoder that would work with both the Cyrillic languages and not screw up other languages?

Why don't you just use UTF-8 with all files and processes?

Actually this comes down to how the URL is encoded. If you click a link on a given page, the browser will use the page's encoding to send the request, but if you enter the URL directly into the address bar of your browser, the behavior is somewhat undefined, as there is no standardized encoding to use (Firefox provides an about:config switch to use UTF-8 encoded URLs).
Besides using some encoding detection there is no way to know the encoding used with the URL in the given request.
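One simple form of such detection is to try strict UTF-8 first and fall back to a legacy encoding only when that fails. The sketch below is in Python for illustration; the function name decode_query_value and the choice of CP1251 as the fallback are assumptions taken from the question, and the approach is a heuristic, since a legacy-encoded value can occasionally also be valid UTF-8.

```python
def decode_query_value(raw: bytes, fallback: str = "cp1251") -> str:
    """Decode raw query-string bytes: strict UTF-8 first, then the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

# Blindly treating UTF-8 bytes as CP1251 garbles Latvian text.
garbled = "Jānis".encode("utf-8").decode("cp1251")

# With the fallback approach both inputs survive.
latvian = decode_query_value("Jānis".encode("utf-8"))
russian = decode_query_value("Иван".encode("cp1251"))

print(garbled, latvian, russian)
```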
EDIT:
Just to back up what I said above, I wrote a small test script that shows the default behavior of the five major browsers (running on Mac OS X in my case, with Windows Vista via Parallels for IE):
$p = $_GET['p'];
for ($i = 0; $i < strlen($p); $i++) {
    // this displays the binary data received via the URL in hex format
    echo dechex(ord($p[$i])) . ' ';
}
Calling http://path/to/script.php?p=äöü leads to
Safari (4.0.5): c3 a4 c3 b6 c3 bc
Firefox (3.6.3): c3 a4 c3 b6 c3 bc
Google Chrome (5.0.375.38): c3 a4 c3 b6 c3 bc
Opera (10.10): e4 f6 fc
Internet Explorer (8.0.6001.18904): e4 f6 fc
So obviously the first three use UTF-8 encoded URLs while Opera and IE use ISO-8859-1 or some of its variants. Conclusion: you cannot be sure what's the encoding of textual data sent via an URL.
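The two byte patterns in the list above correspond to percent-encoding the same text with two different encodings. A short Python sketch using the standard urllib.parse module (variable names are illustrative) reproduces both:

```python
from urllib.parse import quote, unquote_to_bytes

value = "äöü"

# The same text percent-encoded as UTF-8 and as ISO-8859-1.
utf8_url = quote(value, encoding="utf-8")
latin1_url = quote(value, encoding="iso-8859-1")

# What the server receives after percent-decoding, as raw bytes.
utf8_raw = unquote_to_bytes(utf8_url)
latin1_raw = unquote_to_bytes(latin1_url)

print(utf8_url, utf8_raw.hex(" "))
print(latin1_url, latin1_raw.hex(" "))
```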

It seems like the issue is the file encoding. You should always use UTF-8 without a BOM as the preferred encoding for your .php files; code editors such as Intype let you easily specify this (UTF-8 Plain).
Also, add the following code to your files before any output:
header('Content-Type: text/html; charset=utf-8');
You should also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
