UTF-8 String decoding in Python - php

In a project I need a PHP module and a Python module (Python 3.5.2), as well as a config file that both modules use. The Python configparser has problems reading special characters from the config file, such as German umlauts (ä, ö, ü, etc.). On the PHP side I use UTF-8 encoding to work around the problem:
utf8_encode ("Köln") //result: KÃ¶ln
On the Python side I tried the decode function:
"KÃ¶ln".decode("utf-8", "strict")
I expected the result "Köln" but just got the result "KÃ¶ln" again.
What do I have to do to decode my string?

Try adding these lines at the top of your document:
# -*- coding: latin-1 -*-
# Encoding schema https://www.python.org/dev/peps/pep-0263
This may help you; PEP 263 has more documentation.

In cases like this you should add #-*- coding: UTF-8 -*- in the first line of your .py file.

In Python 3, all text is Unicode. So I would recommend, on your PHP side, converting the string to Unicode (outputting it as u'K\xf6ln'). After you've done this, you can convert it back to (sort of) its original form in Python; however, the umlaut will be destroyed.
import unicodedata
unicodetext = u'K\xf6ln'
output = unicodedata.normalize('NFKD', unicodetext).encode('ascii', 'ignore')
This will output a lonesome Koln (as the bytes b'Koln'), without the rather pretty umlaut. From my research, I can't find any way around this, but if anybody finds a more apt solution, please comment.

Thanks for all the helpful answers and comments. I finally ended up with the following solution:
On the PHP side I encode my string with the following command:
$str = "path/to/file/Köln.jpg";
json_encode ($str, JSON_UNESCAPED_SLASHES);
The result is the string "path/to/file/K\u00f6ln.jpg" which is then stored in my config file.
The Python module uses ConfigParser to read the file. The encoded string is then decoded with the following command:
str.encode('utf8').decode('utf8')
The result is again "path/to/file/Köln.jpg".
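One caveat on the decoding step: in Python 3, encode('utf8').decode('utf8') hands back an equal string, so on its own it would not turn \u00f6 into ö. A minimal sketch of one way to decode the escape, assuming the config value is the complete JSON string literal that json_encode produced (quotes included). The section name "paths" and option name "image" are made up for illustration.

```python
import configparser
import json

# Hypothetical config content; the value is the JSON literal written by PHP's json_encode.
CONFIG_TEXT = '[paths]\nimage = "path/to/file/K\\u00f6ln.jpg"\n'

def read_image_path(config_text: str) -> str:
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser["paths"]["image"]   # still holds the quotes and the literal \u00f6 escape
    return json.loads(raw)           # JSON decoding turns \u00f6 into the real character

image_path = read_image_path(CONFIG_TEXT)
```

When reading from a real file, parser.read(path, encoding="utf-8") may also let configparser read raw umlauts directly, which would make the JSON detour unnecessary.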

Related

How to convert CSV's to UTF-8 with PHP

I have looked all over the internet and I cannot find an answer.
I am scraping thousands of CSVs from a source outside my control. The CSVs can be in ANY character encoding, so I need to convert them all to UTF-8.
I have read online that if you convert UTF-8 to UTF-8 the data gets scrambled, so what I am trying to do is detect the character encoding of the file and, if it's not UTF-8, convert it to UTF-8 (I plan to use iconv).
I have tried everything on Stack Overflow (and other sites) but I cannot seem to get the current encoding of the file.
If I use
mb_detect_encoding(file_get_contents($csvPath), mb_detect_order(), TRUE);
or
mb_detect_encoding(file_get_contents($csvPath),'auto');
Has anyone got any suggestions on how I can detect the encoding of the CSV, or a better way to convert files without knowing the original encoding?
I've figured it out after hours of trial and error. Forget mb_detect_encoding; it's useless.
Go to the shell instead and use iconv (installed by default on OS X and Linux).
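$csvPath = "GBP_AUD_Week1.csv";
$output = shell_exec("file --mime-encoding " . escapeshellarg($csvPath));
// file prints "<path>: <encoding>"; strip the path prefix and the trailing newline
$output = trim(str_replace("$csvPath: ", '', $output));
This gives the current file encoding.
shell_exec("iconv -f $output -t utf-8 " . escapeshellarg($csvPath) . " > GBP_AUD_Week1Converted.csv");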
Note:
I tried to overwrite the file instead of creating a new one, but when I did this the file was blank and the encoding was binary (the shell truncates the redirect target before iconv gets to read it).
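For comparison, here is a rough Python sketch of the same idea that also sidesteps the "UTF-8 to UTF-8" worry and the blank-file problem. It assumes that anything which is not valid UTF-8 is Windows-1252; that fallback is only a guess for illustration and would be wrong for UTF-16 or other multi-byte encodings. The function name convert_to_utf8 is hypothetical.

```python
import os
import tempfile

def convert_to_utf8(path: str, fallback_encoding: str = "cp1252") -> str:
    """Rewrite `path` as UTF-8 and return the encoding the input was read as."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")   # strict decoding fails on invalid UTF-8
        used = "utf-8"
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is in the fallback encoding.
        text = raw.decode(fallback_encoding, errors="replace")
        used = fallback_encoding
    # Write to a temporary file in the same directory, then swap it in,
    # so the original is never truncated before it has been read.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(text.encode("utf-8"))
    os.replace(tmp_path, path)
    return used
```

Running it twice on the same file is harmless: the second pass sees valid UTF-8 and writes the same bytes back.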

How to decode unicode python arguments?

Using the following code (in PHP) I send a string to a Python program:
shell_exec("python3 /var/www/html/app.py \"$text\"");
The $text variable contains a non-English string. The problem is that when I print the arguments in Python with print(sys.argv), I get a result like this:
['/var/www/html/app.py', '\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab']
How do I convert this Unicode string back to the original form of the text in Python?
Python uses your locale's encoding to decode the bytes that it gets from the command line. The default C locale uses ASCII, and $text appears to be UTF-8. Python therefore falls back to the surrogateescape error handler when decoding these bytes into the text in sys.argv[1], which produces the lone surrogates such as '\udcd8' that you see in the output.
You could use a UTF-8 locale, e.g. LC_ALL=C.UTF-8, or re-encode the arguments manually with sys.argv[1].encode(locale.getpreferredencoding(True), 'surrogateescape').decode('utf-8'):
>>> s = u'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8\udcb4\udcda\udca9 \udcd8\udcae\udcd8\udcab\udcd9\udc87\udcd8\udca8 \udcd8\udcaa\udcd8\udcb4\udcd8\udcb5\udcd8\udcab'
>>> print(s.encode('ascii', 'surrogateescape').decode('utf-8'))
بتصشک خثهب تشصث
shell_exec("python3 /var/www/html/app.py \"$text\"");
(I hope $text is strongly sanitised, escaped, or static! If user input got in here you've got a horrible remote code execution vulnerability!)
'\udcd8\udca8\udcd8\udcaa\udcd8\udcb5\udcd8...
OK what has happened here is that PHP has passed a UTF-8-encoded string to Python, but Python didn't know that the command line input was UTF-8. (Often when you run Python as a command, it can work that out from your terminal, but there's no terminal when it is PHP running Python in a web server.)
Not knowing what the input was it defaulted to plain ASCII. The high bytes in the input aren't valid in ASCII, but Python 3 has a “surrogateescape” fallback handler for invalid bytes, that is applied to the command line when decoding it to a Unicode string. This generates otherwise-invalid UTF-16 surrogate code units U+DC80–U+DCFF, but at least it allows the original high bytes to be recovered if you want to.
So either:
set a UTF-8 locale in the environment (for example LC_ALL=C.UTF-8) before executing Python, so it knows what the right encoding is in the first place (PYTHONIOENCODING on its own only changes the encoding of stdin, stdout and stderr, not of sys.argv), or
change the Python script to pre-process its input and recover the proper text with sys.argv[1].encode('utf-8', 'surrogateescape').decode('utf-8')
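The second option can be sketched as a small helper. This is only an illustration, and the name recover_utf8 is made up. It assumes the bytes PHP passed were UTF-8, and it leaves arguments that were already decoded correctly unchanged.

```python
def recover_utf8(arg: str) -> str:
    """Turn surrogate-escaped bytes back into the UTF-8 text they came from."""
    # Lone surrogates U+DC80..U+DCFF become the original bytes 0x80..0xFF again;
    # ordinary characters are encoded as normal UTF-8.
    raw = arg.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8")

# Simulate what Python does to UTF-8 bytes on the command line under an ASCII locale.
escaped = "Köln".encode("utf-8").decode("ascii", "surrogateescape")
restored = recover_utf8(escaped)
```

In the script itself this would be used as text = recover_utf8(sys.argv[1]) after import sys.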

PHP : csv file encoding?

I have a stupid problem. I use a piece of software to export .csv files, and the result is strangely formatted text. When I try to process it in PHP, everything goes wrong.
When I copy and paste the text into MS Word, there is a strange character between each letter.
In PHP I tried to convert it using utf8_decode/utf8_encode, iconv("ISO-8859-1", "WINDOWS-1252", $str)... in vain.
I guess it's UTF-16 encoded text, but I'm not sure. I tried some functions to decode UTF-16, also in vain.
Does anyone have a solution to fix this?
Your guess is correct:
file -i NL_JGFR_130326_bac.csv
NL_JGFR_130326_bac.csv: text/plain; charset=utf-16le
You can probably use the PHP MultiByte extension to work with UTF-16:
http://php.net/manual/en/ref.mbstring.php
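The "strange character between each letter" fits UTF-16LE, where every ASCII letter is followed by a zero byte. A small Python sketch of the conversion, assuming the whole file really is UTF-16LE as file reports; the function name utf16le_to_utf8 is made up for illustration.

```python
def utf16le_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16LE bytes and re-encode them as UTF-8."""
    text = raw.decode("utf-16-le")
    if text.startswith("\ufeff"):   # drop a leading byte order mark if present
        text = text[1:]
    return text.encode("utf-8")

# In UTF-16LE each ASCII character is followed by a zero byte.
sample = "a;b".encode("utf-16-le")
converted = utf16le_to_utf8(sample)
```

In PHP, the mbstring counterpart would be mb_convert_encoding($str, 'UTF-8', 'UTF-16LE').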

how to convert ISO 8859-1 Characters to UTF-8

I use cURL to get content from another site, but I don't know why it is automatically converted from UTF-8 to ISO 8859-1, as follows:
site: abc.com:
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
But when I use cURL to get content from that site, I get the following:
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
So how do I convert it to UTF-8?
I'd recommend using iconv.
iconv --list gives you a list of all known encodings, and you can then use iconv -f FROM_ENCODING -t TO_ENCODING to do your conversion. It can also read from stdin, so curl's output can be piped into it.
But regarding the comment you got for your question: It seems like the file author didn't care about using the correct encoding and decided to stick with (old-style?) &auml and stuff.
Put your string in a variable and use the following function.
$var = "";
echo utf8_encode($var);
Judging from the line you pasted, the problem appears to be with HTML entities, not with character encoding. The encoded chars look fine to me.
You need to translate those HTML entities to encoded chars. Which tool to use will depend on your environment or programming language. I don't think it can be done with cURL alone.
PHP has htmlspecialchars_decode(), although it only handles a few special characters; html_entity_decode() covers the named entities. Python has unescape() in the html module (HTMLParser in Python 2).
curl does not convert anything; it downloads things "as is".
What you see are character entities, which are valid HTML, and the browser does the conversion to a readable form.
You can check this by opening the file saved by curl in a browser. It will look like the live page.
You can try this:
html_entity_decode($string)
See more here: html_entity_decode
Your files aren’t being converted to another encoding. They’re using HTML character entities. You need to convert those entities, such as &eacute;, to UTF-8 characters, such as é. This takes one extra line of code after you convert to UTF-8, if you even need to do that.
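That one extra line looks roughly like this in Python 3, using html.unescape from the standard library. The sample string is made up to resemble the page title and is not taken from the actual cURL output.

```python
import html

def decode_entities(markup: str) -> str:
    """Replace named and numeric HTML character references with real characters."""
    return html.unescape(markup)

# Hypothetical input in the style of what cURL downloaded.
sample = "C&#7917;a H&agrave;ng Chip Chip"
decoded = decode_entities(sample)
```

The result is an ordinary Python str; encode it with .encode("utf-8") when writing it out.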

How to convert unknown/mixed encoding file to UTF-8

I am retrieving an XML file from a remote service which is supposed to be UTF-8, as the header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8, as when I load it into PHP's XMLReader extension, it throws some sort of "Not UTF-8 as expected" error when parsing over certain parts of the document (parts that look like they have been copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" any non-UTF-8 encodings in the file? A PHP solution or any other solution will do.
Depending on which encoding you are converting from, the utf8_encode function is a quick and easy way to get UTF-8-safe strings, but only for ISO-8859-1 input. Also, your text must not already be UTF-8, or you are likely to end up with garbled text.
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you may have to look into the iconv() function.
Good luck
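For the mixed case in the question (mostly UTF-8 with stray bytes that look pasted from MS Word), one possible approach is to decode as UTF-8 and treat only the invalid bytes as Windows-1252. The Python sketch below rests on that assumption, which the question does not confirm. The handler name cp1252_fallback and the function to_text are made up for illustration.

```python
import codecs

def cp1252_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Assumption: the stray bytes are Windows-1252 (e.g. Word's curly quotes).
    return bad.decode("cp1252", errors="replace"), err.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

def to_text(raw: bytes) -> str:
    """Decode valid UTF-8 normally and fall back per byte where it is invalid."""
    return raw.decode("utf-8", errors="cp1252_fallback")

# Valid UTF-8 ("é") mixed with Windows-1252 curly quotes (0x93, 0x94).
mixed = b"caf\xc3\xa9 \x93quoted\x94"
fixed = to_text(mixed)
```

Calling fixed.encode("utf-8") then gives bytes that match the XML header's declared encoding.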
