PHP unable to decode ASCII character

I am investigating an issue where the browser is sending data to Apache (2.4) / PHP (7.2, Mac) and PHP is unable to decode some bytes into a printable character. The character is '-' (the hexadecimal value 2D is given when the character is copied and pasted into https://www.online-toolz.com/tools/text-hex-convertor.php; the ASCII hex table is here: https://ascii.cl/), but it is displayed as ��� by PHP.
MariaDB displays the character fine and reports the length of the source column's value as 250 characters. The data is fetched by PHP PDO, passed to an HTML form, and used as the value of a text input. The character displays fine in the HTML DOM. However, when the POST data is submitted back through Apache to PHP, PHP says the string length is 251 characters, which then breaks my string-length sanitizer.
I found a short Python command to see the binary. I copied and pasted the character out of Sequel Pro and put it into this script:
import binascii
bin(int(binascii.hexlify('-'), 16))  # Python 2; in Python 3, hexlify() needs bytes, e.g. b'-'
'0b101101'
The history of the encoding: the text came from a Google Docs document, was downloaded as .txt, opened in Mac TextEdit and saved with 'UTF-8' encoding, then passed through Python into a MySQL database, back out through PHP to HTML, and submitted back to PHP.
I have replaced the character in the database with another character '–' (hex value e28093), whose binary output is below, and everything works fine:
bin(int(binascii.hexlify('–'), 16))
'0b111000101000000010010011'
Any ideas on why PHP fails to correctly recognize the original character and reports the string length as +1 compared to MySQL? I assume that PHP should be able to handle all ASCII characters properly.
UPDATE:
When I print the original (unprintable) string out in the HTML DOM (before posting back to PHP), the string length is reported as 249 characters and the '-' character is printable.
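For what it's worth, a +1 discrepancy like this is usually the byte-versus-character distinction: PHP's strlen() counts bytes, while MariaDB's CHAR_LENGTH() counts characters, so a single two-byte UTF-8 character inflates strlen() by exactly one. A minimal sketch, using U+00AD SOFT HYPHEN purely as a plausible stand-in for the invisible character (an assumption, not a diagnosis):
$s = "abc\u{00AD}"; // U+00AD SOFT HYPHEN: can render like '-', but is 2 bytes (C2 AD) in UTF-8
var_dump(strlen($s));             // int(5) -- bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(4) -- characters, matching CHAR_LENGTH() in MariaDB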

This '–' is an en dash, U+2013. If it is delivered as ASCII, then 3 bytes are sent: 0xE2 0x80 0x93. The first byte is â in 8-bit extended ASCII, but undefined in standard (7-bit) ASCII. The other two bytes are control characters in 8-bit extended ASCII. So three '?' characters are expected.
Anyway, you said that the standard minus sign is also delivered as three '?'. That is very unusual. Please verify this again.
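One way to see what PHP actually received, rather than what the browser renders, is to dump the raw bytes of the POSTed value; a sketch, with caption as a hypothetical field name:
// Two hex digits per byte: a plain ASCII '-' shows up as "2d", an en dash U+2013 as "e28093".
echo bin2hex($_POST['caption']);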

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8, 'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered by var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes, so the character encoding is known. In practice it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    return '\x' . rtrim(chunk_split(strtoupper(bin2hex($str)), 2, '\x'), '\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I convert all non-UTF-8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF-8 characters as they are?
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
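Concretely, that means one of the following (a minimal sketch; the header and the meta tag serve the same purpose):
// HTTP header: must be sent before any output.
header('Content-Type: text/html; charset=utf-8');
// Or declare it in the markup itself: <meta charset="utf-8">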
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
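If you do let PHP guess, mbstring is the usual tool; a sketch, where the candidate list and file name are assumptions to tune to your data:
$raw = file_get_contents('import.txt'); // hypothetical import file

// Returns the first candidate encoding the bytes are valid in -- a guess, not a
// guarantee, since single-byte encodings accept nearly any byte sequence.
$enc = mb_detect_encoding($raw, ['UTF-8', 'Windows-1252', 'ISO-8859-1'], true);
if ($enc !== false && $enc !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $enc);
}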
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
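Checking for that case is cheap and worth doing first; a minimal sketch:
$bytes = file_get_contents('import.txt'); // hypothetical input

// true only if every byte sequence in the string is well-formed UTF-8
if (!mb_check_encoding($bytes, 'UTF-8')) {
    // not valid UTF-8 (or a corrupted file): re-encode or reject here
}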
How can I change all non-UTF-8 characters in a string to the PHP hex representation "\xnn" and leave correct UTF-8 characters?
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
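A sketch of that manual approach with a deliberately tiny map; a real one covers every CP1252 byte from \x80 to \x9F:
$cp1252 = "The price is 15 \x80"; // sample single-byte input

// Partial CP1252-to-UTF-8 map (illustrative; extend for the full range).
$map = [
    "\x80" => "\u{20AC}", // EURO SIGN
    "\x93" => "\u{201C}", // LEFT DOUBLE QUOTATION MARK
    "\x94" => "\u{201D}", // RIGHT DOUBLE QUOTATION MARK
];
$utf8 = strtr($cp1252, $map); // longest-match byte replacement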
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"

Python equivalent of PHP's FILTER_FLAG_STRIP_HIGH

I am parsing a large set of poor-quality data converted from physical form using OCR, and using PostgreSQL COPY to insert .csv files into psql. Some records contain high (non-ASCII) bytes that cause errors on import into Postgres, since I want the data in UTF-8 varchar() columns, and I believe that using a TEXT-type column would not produce this error.
DataError: invalid byte sequence for encoding "UTF8": 0xd6 0x53
CONTEXT: COPY table_name, line 112809
I want to filter all these bytes before writing to the csv file.
I believe something like PHP's FILTER_FLAG_STRIP_HIGH (http://php.net/manual/en/filter.filters.sanitize.php) would work, since it can strip all high characters with values > 127.
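For reference, the PHP call being alluded to looks roughly like this (a sketch):
$dirty = "na\xefve"; // 'naïve' in a single-byte encoding; \xEF is above 0x7F

// FILTER_FLAG_STRIP_HIGH removes every byte above 0x7F, leaving 7-bit ASCII.
var_dump(filter_var($dirty, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH)); // string(4) "nave"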
Is there such a function in Python?
Encode your string to ASCII, ignoring errors, then decode that back to a string.
text = "ƒart"
text = text.encode("ascii", "ignore").decode()
print(text) # art
If you are starting with a byte string in UTF-8, then you just need to decode it:
bites = "ƒart".encode("utf8")
text = bites.decode("ascii", "ignore")
print(text) # art
This works specifically with UTF-8 because multibyte characters always use byte values outside of the ASCII range, so partial characters are never stripped out. It might not work so well with other encodings.

UTF8 null character & normalizing whitespace characters

I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from the Facebook Open Graph API. The strings are supposed to be all UTF-8, according to Facebook. So I import the captions into the database and store them with the utf8_unicode_ci collation (I also tried utf8_bin).
But I always get the same error when trying to display the output XML feed, because one of the captions has a weird whitespace character:
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpMyAdmin) and in the page source code (viewed in Chrome), the problematic characters appear as an empty square symbol.
If I copy and paste the problematic character into a converter, it gives me hexadecimal 000B.
What's the easiest way to fix this?
I'd also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it's not supposed to.
Failed attempts:
utf8_encode() isn't working, because the rest of the strings are already valid UTF-8.
I also tried multiple different ways of stripping out all non-UTF-8 characters, but they don't filter out this specific character. Same when trying to filter out all non-Latin characters.
htmlentities() and htmlspecialchars() don't encode the problematic characters.
iconv() with mb_detect_encoding() will not detect the string as invalid UTF-8.
str_replace() or preg_replace() is of no help; if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even whitespace.
str_replace("\0", "", ) ...nope
Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf-8 is not a proper implementation of utf-8 - utf8mb4 is;
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical-tab characters (0x0B) creeping into the text fields. A good way to remove them is to run $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. Be careful about replacing \v instead: in a regex, \v matches all vertical whitespace, not just the vertical tab, which might not be what you want.
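If other control characters could creep in too, a broader sketch strips everything XML 1.0 forbids (below 0x20, only tab, LF and CR are legal):
// Remove control characters that XML 1.0 does not allow.
// Bytes >= 0x80 are untouched, so multibyte UTF-8 sequences survive.
$str = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $str);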
If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded (see Wikipedia on BOM).
At least come back with the various steps the data takes, so we can help with deducing the source of the problem.
Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.
(Using str_replace is just a band-aid.)

Decoding ISO-8859-1 and Encoding to UTF-8 before MySQL query

I'm kinda stuck on whether I'm doing this right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db uses the utf-8 encoding, which is why I want to convert the file to UTF-8-encoded characters before I can send it as a query. First, I rewrite every line of file.txt into file_new.txt using:
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and a cursor, and execute the following query so that all the data is received as utf-8:
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and insert each line into MySQL. Is this the right approach to get the table in MySQL into utf-8 encoding? Or am I missing a crucial part?
Now, to receive this data, I use 'SET NAMES "utf8"' as well. But the received data gives me question marks � when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but then other UTF-8-encoded data from the database gets scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to UTF-8. Can anyone explain why?
PS: Before I read every line, I replace a character and save file.txt to file.txt.tmp. I then read this file to produce file_new.txt. I don't know if it causes any problem to the original file's encoding.
import codecs

f1 = codecs.open(tsvpath, 'rb', encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb', encoding='utf8')
for line in f1:
    f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the example below, I have UTF-8-encoded Persian data, which is right, but the other non-English text is coming out as question marks. This is precisely my problem.
Example : Removed.
Welcome to the wonderful world of Unicode and Windows. I've found this site very helpful for understanding what is going wrong with my strings: http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor, it may be trying to be helpful and silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in HxD and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, it's hard to say where the problem is. My guess is that replacing the double quote with a single quote in binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it's the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it's impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that's not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn't allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it's rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, encoding='utf-8-sig'):
    # Read raw bytes, decode (dropping the BOM and ignoring bad sequences), write as text.
    return open(dst, 'w').write(open(src, 'rb').read().decode(encoding, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
Alright guys, so my encoding was right. The file was getting encoded to utf-8 just as needed. All the queries were right. It turns out that the other dataset, which was in Arabic, was in ISO-8859-1. Therefore only one of them was working, no matter what I did.
The hex editors did help, but in the end I just used Sublime Text to recheck whether my encoded data was utf-8. It turns out the Python script and the Sublime editor agreed. So the code is fine. :)
You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the column's CHARACTER SET.
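With PHP's PDO, for example, the client encoding can be pinned in the DSN itself, which makes a separate SET NAMES query unnecessary; a sketch with placeholder credentials:
// charset=utf8mb4 has PDO negotiate the client encoding on connect,
// equivalent to running SET NAMES utf8mb4 yourself.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'password');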

Outputting non-ASCII codes to a file

I have a problem writing non-ASCII codes to a file with PHP.
For example, when I press ALT + 20 on my keyboard I get a ¶ character.
But when I write chr(20) to a file and open the file in Notepad++, it reads DC4; and if I write it to a .csv and then open it with Excel, I get a ? surrounded by a square.
You are mainly misunderstanding a feature of your operating system. As commented, pressing that keyboard combo (ALT + numpad 20) does not enter US-ASCII character decimal 20. From the documentation of your operating system:
If the first digit you type is any number from 1 through 9, the value is recognized as a code point in the system's OEM code page. The result differs depending on the Windows system language specified in Regional and Language Options in Control Panel. For example, if your system language is English (US), the code page is 437 (MS-DOS Latin US), so pressing ALT and then typing 163 on the numeric keypad produces ú (U+00FA, Latin lowercase letter U with acute). If your system language is Greek (OEM code page 737 MS-DOS Greek), the same sequence produces the Greek lowercase letter MU (U+03BC).
Taken from: To input characters that are not on your keyboard (Windows XP Professional Product Documentation)
From your description you've got OEM code page 437 (see Wikipedia: Code page 437), so the character you're looking for is the pilcrow, which in Unicode is 'PILCROW SIGN' (U+00B6).
So wherever you want to output it, you need to find out the target file's character encoding and encode that character accordingly, and that's all. No more magic, nothing.
As Jeff says, control characters (with ASCII code < 32) are always interpreted differently. For showing a paragraph sign, try sending either chr(182) or utf8_encode(chr(182)), depending on the charset of your target file:
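A sketch of both variants; PHP 7's \u{} escape makes the UTF-8 case explicit:
file_put_contents('latin1.txt', chr(182));   // ISO-8859-1 target: pilcrow is the single byte 0xB6
file_put_contents('utf8.txt', "\u{00B6}");   // UTF-8 target: the two-byte sequence 0xC2 0xB6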
