I have a Python program that runs very well. It connects to several websites and outputs the desired information. Since not all websites are encoded in UTF-8, I am requesting the charset from the headers and using the unicode(string, encoding) method to decode (I am not sure whether it's the appropriate way to do this, but it works pretty well). When I run the Python program from the terminal I receive no ??? marks and it works fine. But when I run the program using PHP's system function, I receive this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 41: ordinal not in range(128)
This is a Python-specific error, but what confuses me is that I don't receive it when I run the program from the terminal. I only receive it when I call the program from PHP using its system function. What could be the cause of this problem?
Here is a sample code:
PHP code that calls the Python program:
system("python somefile.py $search"); // where $search is the variable coming from an input
Python code:
encoding = "iso-8859-9"
l = "some string here with latin characters"
print unicode("<div class='line'>%s</div>" % l, encoding)
# when I run this code from the terminal it works perfectly and I receive no ??? marks
# when I run this code from PHP, I receive the error above
From the PrintFails wiki:
When Python finds its output attached to a terminal, it sets the
sys.stdout.encoding attribute to the terminal's encoding. The print
statement's handler will automatically encode unicode arguments into
str output.
This is why your program works when called from the terminal.
When Python does not detect the desired character set of the
output, it sets sys.stdout.encoding to None, and print will invoke the
"ascii" codec.
This is why your program fails when called from PHP.
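You can see the distinction directly from a shell (the exact value depends on your terminal's locale; this is a UTF-8 terminal, and Python 2 syntax is used to match your script):
$ python -c "import sys; print sys.stdout.encoding"
UTF-8
$ python -c "import sys; print sys.stdout.encoding" | cat
None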
To make it work when called from PHP, you need to make explicit what encoding print should use. For example, to state explicitly that you want the output encoded in UTF-8 when not attached to a terminal:
ENCODING = sys.stdout.encoding if sys.stdout.encoding else 'utf-8'
print unicode("<div class='line'>%s</div>" % l, encoding).encode(ENCODING)
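Putting that together, a minimal self-contained somefile.py could look like this (a sketch only; the sample string and charset are placeholders from the question, not the real scraping code):

import sys

# Use the terminal's encoding when there is one, otherwise fall back to UTF-8
# (there is no terminal when the script is called from PHP's system()).
ENCODING = sys.stdout.encoding if sys.stdout.encoding else 'utf-8'

encoding = "iso-8859-9"  # charset reported by the website's headers
l = "some string here with latin characters"

# Decode the fetched bytes to unicode, then encode explicitly for stdout.
print unicode("<div class='line'>%s</div>" % l, encoding).encode(ENCODING)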
Alternatively, you could set the PYTHONIOENCODING environment variable.
Then your code should work without changes (both from the terminal and when called from PHP).
When you run the Python script in your terminal, your terminal is most likely using UTF-8 (especially if you are using Linux or a Mac).
When you set the variable l to "some string with latin characters", that byte string ends up in your default encoding; if you are typing it in a UTF-8 terminal, l will be UTF-8 and the script won't crash.
A little tip: if you have a string encoded in latin1 and you want it as unicode, you can do:
variable.decode('latin1')
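For example, in Python 2 (the byte string here is just an illustration):
>>> s = 'caf\xe9'  # latin1-encoded bytes for "café"
>>> s.decode('latin1')
u'caf\xe9'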
Related
Using the following code (in PHP) I send a string to a Python program:
shell_exec("python3 /var/www/html/app.py \"$text\"");
The $text variable contains a non-English string. The problem is, when I print the arguments in Python with print(sys.argv), I get a result like this:
['/var/www/html/app.py', '\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab']
How do I convert this Unicode string back to the original form of the text in Python?
Python uses your locale's encoding to decode the bytes that it gets from the command line. The default C locale uses ASCII. $text, it seems, is in UTF-8. Therefore Python has to use the surrogateescape error handler to decode these bytes into the text sys.argv[1], which produces the lone surrogates such as '\udcd8' that you see in the output.
You could use a UTF-8 locale, e.g. LC_ALL=C.UTF-8, or re-encode the arguments manually: sys.argv[1].encode(locale.getpreferredencoding(True), 'surrogateescape').decode('utf-8'):
>>> s = u'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab'
>>> print(s.encode('ascii', 'surrogateescape').decode('utf-8'))
بتصشک خثهب تشصث
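A minimal app.py along those lines (a sketch; it assumes the text arrives as the first command-line argument):

import locale
import sys

raw = sys.argv[1]
# Undo the locale-based surrogateescape decoding of the command line,
# then decode the recovered bytes as UTF-8.
text = raw.encode(locale.getpreferredencoding(True), 'surrogateescape').decode('utf-8')

# Write UTF-8 bytes directly so the output does not depend on stdout's encoding.
sys.stdout.buffer.write(text.encode('utf-8') + b'\n')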
shell_exec("python3 /var/www/html/app.py \"$text\"");
(I hope $text is strongly sanitised, escaped, or static! If user input got in here you've got a horrible remote code execution vulnerability!)
'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8...
OK, what has happened here is that PHP has passed a UTF-8-encoded string to Python, but Python didn't know that the command-line input was UTF-8. (Often when you run Python as a command it can work that out from your terminal, but there's no terminal when it is PHP running Python in a web server.)
Not knowing what the input was, it defaulted to plain ASCII. The high bytes in the input aren't valid in ASCII, but Python 3 has a "surrogateescape" fallback handler for invalid bytes that is applied to the command line when decoding it to a Unicode string. This generates otherwise-invalid surrogate code points U+DC80–U+DCFF, but at least it allows the original high bytes to be recovered if you want to.
So either:
set a UTF-8 locale (e.g. LC_ALL=C.UTF-8) in the environment before executing Python, so it knows what the right encoding for its arguments is in the first place, or
change the Python script to pre-process its arguments and recover the proper text with sys.argv[1].encode('utf-8', 'surrogateescape').decode('utf-8')
My PHP script has all echoes commented out. But if I run it from the command line I receive 'О╩©' (without quotes) at the very beginning of the script execution.
This concerns me, as the script is intended to be run from crontab and each execution generates a new email with an empty message body (only two LFs after the message header).
How can I track the source of this unnecessary output?
(Sorry - the script is too large to be posted here)
It seems like your file has a Byte Order Mark (BOM) signature at the start; save your file as UTF-8 without BOM.
Byte Order Mark (BOM)
In Notepad++, try: Encoding -> Encode in UTF-8 without BOM
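If the script includes many files and it is not obvious which one carries the BOM, a quick scan can help track it down (a minimal Python sketch; the directory path is just an example):

import os

BOM = b'\xef\xbb\xbf'  # the UTF-8 byte order mark that shows up as stray bytes in the output

for root, dirs, files in os.walk('/var/www/myproject'):
    for name in files:
        if name.endswith('.php'):
            path = os.path.join(root, name)
            with open(path, 'rb') as f:
                if f.read(3) == BOM:
                    print(path)  # this file starts with a BOM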
I work with the PHP CLI, so on the command line, on a Linux computer. I want to type in a Unicode character. How do you do that?
Suppose the character is the euro sign.
In vim I do: ctrl-v shift-u 20ac Enter.
In bash I do: ctrl-shift-u 20ac Enter.
So how in the php cli?
You can do:
echo "\x20\xac";
That echoes the raw bytes 0x20 0xAC (the code point number, not its UTF-8 encoding), so what gets displayed will depend on your terminal settings. Things get... complicated.
I'm going to assume you're talking about the PHP Interactive Shell.
Unfortunately the interactive shell has no concept of unicode. You have two options:
Entering/pasting the characters directly (some European keyboard layouts allow you to enter the Euro sign directly)
Entering the UTF-8 bytes using escape sequences. For instance echo "\xE2\x82\xAC"; will get you a Euro sign
When I tested this little script:
$str = "apple";
echo md5($str);
The result matched the result of doing MD5 on a UTF-8 string (tested using C#).
Should I trust that this will always be the case in any other environment?
If I were to put this script on any web host, Windows or Linux, would it always behave the same, with UTF-8 encoding?
The encoding of a string literal is whatever encoding you saved the source file in. If you saved this PHP file in UTF-16, you would get a different result, that is, if the code even runs.
There is no unified or managed encoding in PHP. Strings in PHP can be in any encoding; in other words, they are equivalent to the byte arrays of languages that have a more abstract string type.
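To see why it matters that the hash only depends on the bytes, here is a small illustrative sketch in Python: the digest changes only when the underlying bytes change.

import hashlib

s = u'apple'
print(hashlib.md5(s.encode('utf-8')).hexdigest())      # ASCII-only text: same bytes as Latin-1
print(hashlib.md5(s.encode('latin-1')).hexdigest())    # identical digest to the UTF-8 one
print(hashlib.md5(s.encode('utf-16-le')).hexdigest())  # different bytes, so a different digest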
Put simply, md5() will always give the same result for the same input bytes.
If you cannot trust it, you can simply encode the data in the database itself.
I am using AJAX for data in Arabic characters and everything works well: I can store Arabic characters in the database, retrieve them from the database, and print them to the screen, and everything works fine. But my problem is that when I check the JavaScript console in Google Chrome to inspect the retrieved data, I can't see the Arabic characters; it prints like this (this is just an example, not all the data):
["\u0645\u062f\u064a\u0646\u0629","\u0645\u062f\u064a\u0646\u0629 \u062a\u0627\u0631\u064a\u062e\u064a\u0651\u0629","\u0634\u062e\u0635\u064a\u0651\]
I mean, like this.
When using JSON, strings are in UTF-8, and special characters are encoded as \u followed by 4 hexadecimal characters.
In your case, if you try to decode that string -- for example, with the first item of your array:
>>> str = "\u0645\u062f\u064a\u0646\u0629";
"مدينة"
I don't read Arabic, but this looks like Arabic to me :-)
Even if the JSON doesn't look good, that's not what matters: the important thing is that you get your original data back once the JSON is decoded; and, here, it seems you will.
To get the original, decoded string in the browser's console (for debugging purposes, I suppose), you should be able to use the same JS library you are using in your application (if any), or the JSON.parse() function (I just tested this in Firefox's console, actually):
>>> JSON.parse('"\u0645\u062f\u064a\u0646\u0629"');
"مدينة"
Of course, you'll have to write some code to actually output that decoded value to the browser's console (be it "by hand" or when getting the JSON back from your server); but since the browser's console is a debugging tool, that seems OK.
By default, the console, as a debugging tool, outputs the raw JSON string it gets from the server -- and, with JSON, special characters are encoded; there is nothing you can do about it (except decode the JSON string and display it yourself, if you need to).
If you want to output the decoded string to the console each time you get a result from your server, you'll have to call JSON.parse() on each result and then output it, probably using console.log().
Don't forget to remove that debugging code before distributing your application / uploading it to your production server, though.