How to decode unicode python arguments?

Using the following code (in PHP), I send a string to a Python program:
shell_exec("python3 /var/www/html/app.py \"$text\"");
The $text variable contains a non-English string. The problem is that when I print the arguments in Python with print(sys.argv), I get a result like this:
['/var/www/html/app.py', '\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab']
How do I convert this string back to the original text in Python?

Python uses your locale's encoding to decode the bytes that it gets from the command line. The default C locale uses ASCII, while $text appears to be encoded in UTF-8. Python therefore falls back to the surrogateescape error handler when it decodes these bytes into the text in sys.argv[1], which produces the lone surrogates such as '\udcd8' that you see in the output.
You could run Python under a UTF-8 locale (e.g., LC_ALL=C.UTF-8), or re-encode the arguments manually with sys.argv[1].encode(locale.getpreferredencoding(True), 'surrogateescape').decode('utf-8'):
>>> s = u'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab'
>>> print(s.encode('ascii', 'surrogateescape').decode('utf-8'))
بتصشک خثهب تشصث
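A minimal sketch of the manual re-encoding above, wrapped in a helper. The name recover_text and its parameters are illustrative and not part of any library; the sketch assumes each argument was mis-decoded as ASCII and really holds UTF-8 bytes.

```python
def recover_text(arg, decoded_as="ascii", real_encoding="utf-8"):
    """Undo a surrogateescape mis-decoding (hypothetical helper)."""
    # Turn the lone surrogates back into the original raw bytes...
    raw = arg.encode(decoded_as, "surrogateescape")
    # ...then decode those bytes with the encoding they were really in.
    return raw.decode(real_encoding)


# First two characters of the mangled argument from the question.
mangled = "\udcd8\udca8\udcd8\udcaa"
fixed = recover_text(mangled)
```

In a script this could be applied to every argument, for example with [recover_text(a) for a in sys.argv[1:]].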

shell_exec("python3 /var/www/html/app.py \"$text\"");
(I hope $text is strongly sanitised, escaped, or static! If user input got in here you've got a horrible remote code execution vulnerability!)
'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8...
OK what has happened here is that PHP has passed a UTF-8-encoded string to Python, but Python didn't know that the command line input was UTF-8. (Often when you run Python as a command, it can work that out from your terminal, but there's no terminal when it is PHP running Python in a web server.)
Not knowing what the input was, it defaulted to plain ASCII. The high bytes in the input aren't valid in ASCII, but Python 3 has a “surrogateescape” fallback handler for invalid bytes, which is applied to the command line when decoding it to a Unicode string. This maps each invalid byte to an otherwise-invalid lone surrogate in the range U+DC80 to U+DCFF, which at least allows the original high bytes to be recovered if you want them.
So either:
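set a UTF-8 locale in the environment before executing Python (for example LC_ALL=C.UTF-8, or PYTHONUTF8=1 on Python 3.7+), so it knows what the right encoding is in the first place (PYTHONIOENCODING alone only affects stdin/stdout/stderr, not sys.argv), or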
change the Python script to pre-process its input to recover the proper input with sys.argv[1].encode('utf-8', 'surrogateescape').decode('utf-8')
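To make the mechanism concrete, here is a small sketch of the round trip that surrogateescape performs. It simulates the mis-decoding in-process rather than through a real command line, so the variable names are purely illustrative.

```python
# One Arabic letter as PHP would pass it: two UTF-8 bytes.
raw = "\u0628".encode("utf-8")  # b'\xd8\xa8'

# What Python does under an ASCII locale: each invalid byte 0xNN
# becomes the lone surrogate U+DCNN instead of raising an error.
mangled = raw.decode("ascii", "surrogateescape")
code_points = [hex(ord(ch)) for ch in mangled]

# Encoding with the same error handler restores the original bytes,
# whether the codec named is ASCII or UTF-8.
restored_ascii = mangled.encode("ascii", "surrogateescape")
restored_utf8 = mangled.encode("utf-8", "surrogateescape")
text = restored_utf8.decode("utf-8")
```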

Related

Python equivalent of php FILTER_FLAG_STRIP_HIGH

I am parsing a large, poor-quality data set that was converted from physical form using OCR, and using PostgreSQL COPY to insert the .csv files into psql. Some records contain non-ASCII bytes that cause errors on import into Postgres, since I want the data in UTF-8 varchar() columns (I believe a TEXT column would not produce this error).
DataError: invalid byte sequence for encoding "UTF8": 0xd6 0x53
CONTEXT: COPY table_name, line 112809
I want to filter all these bytes before writing to the csv file.
I believe something like PHP's FILTER_FLAG_STRIP_HIGH (http://php.net/manual/en/filter.filters.sanitize.php) would work, since it removes all characters with a value above 127.
Is there such a function in python?

Encode your string to ASCII, ignoring errors, then decode that back to a string.
text = "ƒart"
text = text.encode("ascii", "ignore").decode()
print(text) # art
If you are starting with a byte string in UTF-8, then you just need to decode it:
bites = "ƒart".encode("utf8")
text = bites.decode("ascii", "ignore")
print(text) # art
This works specifically with UTF-8 because multi-byte characters always use values outside of the ASCII range, so partial characters are never stripped out. It mightn't work so well with other encodings.
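The two cases above could be combined into one helper, sketched below. The name strip_high is made up for this example (it only mirrors the PHP flag's name), and the sample bytes are modeled on the 0xd6 0x53 sequence from the error message.

```python
def strip_high(data):
    """Drop everything outside 7-bit ASCII (hypothetical helper)."""
    if isinstance(data, bytes):
        # Undecodable bytes (>= 0x80) are silently skipped.
        return data.decode("ascii", "ignore")
    # Characters that cannot be encoded as ASCII are silently skipped.
    return data.encode("ascii", "ignore").decode("ascii")


from_text = strip_high("ƒart")
from_bytes = strip_high(b"\xd6Sample")
```

Note that this discards data rather than repairing it, so it suits cases where the high bytes are known to be OCR noise.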

UTF-8 String decoding in Python

In a project I need a PHP module and a Python module (Python 3.5.2), as well as a config file that both modules use. The Python configparser has problems reading special characters from the config file, such as German umlauts (ä, ö, ü). On the PHP side I use UTF-8 encoding to work around the problem:
utf8_encode ("Köln") //result: KÃ¶ln
From the Python side I tried the decode function:
"KÃ¶ln".decode("utf-8", "strict")
I expected the result "Köln" but just got the result "KÃ¶ln" again.
What do I have to do to decode my string?

Try adding these lines at the top of your script:
# -*- coding: latin-1 -*-
# Encoding schema https://www.python.org/dev/peps/pep-0263
This may help you; PEP 263, linked above, has more documentation.

In cases like this you should add #-*- coding: UTF-8 -*- in the first line of your .py file.

In Python 3, all text is Unicode. So I would recommend, on your PHP side, converting the string to Unicode (outputting as u'K\xf6ln'). After you've done this, you can convert it back to (sort of) its original form in Python; however, the umlaut will be destroyed.
import unicodedata
unicodetext = u'K\xf6ln'
output = unicodedata.normalize('NFKD', unicodetext).encode('ascii', 'ignore')
This will output a lonesome Koln (as bytes, b'Koln'), without the rather pretty umlaut. From my research, I can't find any way around this, but if anybody finds a more apt solution, please comment.
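For the string shown in the question, one possible lossless route is to reverse the double encoding instead of stripping the accent. PHP's utf8_encode converts ISO-8859-1 to UTF-8, so applying it to text that is already UTF-8 yields "KÃ¶ln". The sketch below assumes that is exactly what happened and that Python read the file as UTF-8.

```python
# "KÃ¶ln" written with escapes so the example is independent of the
# encoding this snippet is saved in.
mojibake = "K\u00c3\u00b6ln"

# Map each character back to the single byte it came from (Latin-1),
# then decode those bytes as the UTF-8 they originally were.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

If the text was not double-encoded, the encode or decode step will typically raise an error, so this should only be applied where the mojibake pattern is known to occur.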

Thanks for all the helpful answers and comments. I finally ended up with the following solution:
On the PHP side I encode my string with the following command:
$str = "path/to/file/Köln.jpg";
json_encode ($str, JSON_UNESCAPED_SLASHES);
The result is the string "path/to/file/K\u00f6ln.jpg" which is then stored in my config file.
The Python module uses ConfigParser to read the file. The encoded string is then decoded with the following command:
str.encode('utf8').decode('utf8')
The result is again "path/to/file/Köln.jpg".
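One caveat worth checking: encoding to UTF-8 and decoding again returns a string unchanged, so it does not by itself turn a literal \u00f6 into ö. If the value read from the config file still contains the escape, json.loads is the natural inverse of PHP's json_encode. The sketch below assumes the stored value includes the surrounding double quotes that json_encode emits.

```python
import json

# The value as it would sit in the config file (assumed, quotes included).
stored = r'"path/to/file/K\u00f6ln.jpg"'

# An encode/decode round trip leaves the escape sequence untouched.
unchanged = stored.encode("utf8").decode("utf8")

# Parsing the JSON string literal resolves the \u escape.
decoded = json.loads(stored)
```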

Why do PHP and Obj-C encode strings differently?

I'm trying to convert a string to UTF8, on both obj-c and php.
I get different results:
"\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94" //Obj-C
"\u05d1\u05d3\u05d9\u05e7\u05d4" //PHP
Obj-C code:
const char *cData = [@"בדיקה" cStringUsingEncoding:NSUTF8StringEncoding];
PHP code:
utf8_encode('בדיקה')
This difference breaks my hash algorithm that follows.
How can I make the two strings encode the same way? Should I change the Obj-C or the PHP code?

Go to http://www.utf8-chartable.de/unicode-utf8-table.pl
In the combo box switch to “U+0590 … U+05FF Hebrew”
Scroll down to “U+05D1” which is the rightmost character of your input string.
The third column shows the two UTF-8 bytes: “d7 91”
If you keep looking you will see that the PHP and the Objective-C strings are actually the same. The “problem” you are seeing is that while PHP uses a Unicode escape (\u), Objective-C uses direct byte hexadecimal escapes (\x). Those are only visual representations of the strings; the bytes in memory are actually the same.
If your hash algorithm deals with bytes correctly, you should not see differences.
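The equivalence can be checked in a few lines. This sketch uses Python only as a neutral calculator for the two representations quoted in the question; MD5 stands in for whatever hash the asker actually uses.

```python
import hashlib

# The PHP-style representation: Unicode code point escapes.
from_escapes = "\u05d1\u05d3\u05d9\u05e7\u05d4"

# The Objective-C-style representation: raw UTF-8 bytes.
from_bytes = b"\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94"

# Encoding the code points as UTF-8 yields exactly those bytes,
# so a hash computed over the bytes is identical as well.
same_bytes = from_escapes.encode("utf-8") == from_bytes
digest_a = hashlib.md5(from_escapes.encode("utf-8")).hexdigest()
digest_b = hashlib.md5(from_bytes).hexdigest()
```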

What are you using to do the encoding on PHP? It looks like you're generating a UTF-16 string.
Try utf8_encode() and see if that gives better results.

What is the (default) encoding for the function md5() in PHP?

When I tested this little script:
$str = "apple";
echo md5($str);
The result matched the result of computing MD5 over the UTF-8 bytes (tested using C#).
Should I trust that this will always be the case in any other environment?
If I were to put this script on any web host, Windows or Linux, would it always behave the same as with UTF-8 encoding?

The encoding of a string literal is whatever encoding you saved the source file in. If you saved this php file in UTF-16, you would get a different result, that is, if the code even runs.
There is no unified or managed encoding in PHP. Strings in PHP can be in any encoding; in other words, they are equivalent to the byte arrays of languages that have a more abstract string type.
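A short sketch of why the bytes matter, using Python's hashlib as a stand-in for PHP's md5(). For a pure-ASCII string such as "apple", ASCII and UTF-8 produce the same bytes and therefore the same digest, while UTF-16 does not.

```python
import hashlib

text = "apple"

# ASCII and UTF-8 encode these characters to identical bytes.
md5_utf8 = hashlib.md5(text.encode("utf-8")).hexdigest()
md5_ascii = hashlib.md5(text.encode("ascii")).hexdigest()

# UTF-16 uses two bytes per character here, so the digest differs.
md5_utf16 = hashlib.md5(text.encode("utf-16-le")).hexdigest()
```

The hash function never sees "text", only bytes, so matching digests across languages comes down to agreeing on the encoding before hashing.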

md5() itself will always give the same result for the same input bytes.
If you cannot trust it, you can simply encode the data in the database itself.

php system, python and utf-8

I have a Python program that runs very well. It connects to several websites and outputs the desired information. Since not all websites are encoded in UTF-8, I request the charset from the headers and use the unicode(string, encoding) method to decode (I am not sure whether it's the appropriate way to do this, but it works pretty well). When I run the Python program I see no ??? marks and it works fine. But when I run the program using PHP's system function, I receive this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 41: ordinal not in range(128)
This is a Python-specific error, but what confuses me is that I don't receive it when I run the program from the terminal. I only receive it when I call the program from PHP using the system function. What may be the cause of this problem?
Here is a sample code:
php code that calls python program:
system("python somefile.py $search"); // where $search is the variable coming from an input
python code:
encoding = "iso-8859-9"
l = "some string here with latin characters"
print unicode("<div class='line'>%s</div>" % l, encoding)
# when I run this code from terminal it works perfect and I receive no ??? marks
# when I run this code from php, I receive the error above

From the PrintFails wiki:
When Python finds its output attached to a terminal, it sets the
sys.stdout.encoding attribute to the terminal's encoding. The print
statement's handler will automatically encode unicode arguments into
str output.
This is why your program works when called from the terminal.
When Python does not detect the desired character set of the
output, it sets sys.stdout.encoding to None, and print will invoke the
"ascii" codec.
This is why your program fails when called from php.
To make it work when called from php, you need to make explicit what encoding print should use. For example, to make explicit that you want the output encoded in utf-8 (when not attached to a terminal):
ENCODING = sys.stdout.encoding if sys.stdout.encoding else 'utf-8'
print unicode("<div class='line'>%s</div>" % l, encoding).encode(ENCODING)
Alternatively, you could set the PYTHONIOENCODING environment variable.
Then your code should work without changes (both from the terminal and when called from php).
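The snippets in this thread are Python 2. As a rough Python 3 counterpart of "make the output encoding explicit", one option is to wrap the binary output stream in a text wrapper with a fixed encoding. The helper name make_utf8_writer is invented for this sketch, and a BytesIO object stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# In a real script, binary_stream would be sys.stdout.buffer.
captured = io.BytesIO()
out = make_utf8_writer(captured)

# U+0131 is the character named in the UnicodeEncodeError above.
print("<div class='line'>%s</div>" % "\u0131", file=out)
out.flush()
written = captured.getvalue()
```

Because the encoding no longer depends on whether a terminal is attached, the same bytes are produced from the shell and from PHP's system().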

When you run the Python script in your terminal, the terminal is likely using UTF-8 (especially if you are on Linux or macOS).
When you set the l variable to "some string with latin characters", that string is encoded in the default encoding; if you are using a terminal, l will be UTF-8 and the script won't crash.
A little tip: if you have a string encoded in latin1 and you want it in unicode you can do:
variable.decode('latin1')
