UTF8 null character & normalizing whitespace characters

UTF8 null character & normalizing whitespace characters - php

I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from Facebook Open Graph API. The strings are supposed to be all UTF8 according to facebook. So i import the captions into the database and store them as utf8-unicode (i also tried utf8-bin)
But i always have the same error when trying to display the output XML feed, because one of the caption have a weird whitespace character
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpmyadmin) and in the page source code (using chrome), the problematic characters appear as empty square symbol.
Now if i copy and paste the problematic character in an converter it gives me Hexadecimal 000B
What's the easiest way to fix this ?
I'd also like to understand in the first place, why Facebook Graph API is giving me non-utf8 characters when it's not supposed to
Failed attemps:
utf8_encode() isn't working because the rest of the strings are UTF8 valid.
I also tried multiple different ways of stripping out all non-utf8 characters, but it doesn't filter out this specific character. Same when trying to filter out all non-latin.
htmlentities() htmlspecialchars() or the same isn't encoding the problematic characters
charactericonv(mb_detect_encoding()) will not detect the string as invalid utf8
str_replace() or preg_replace() is of no help, if i try to copy and paste the character in Visual Studio Code, nothing is pasted, not even a whitespace
str_replace("\0", "", ) ...nope

Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf-8 is not a proper implementation of utf-8 - utf8mb4 is;
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical tabulation symbols creeping into the text fields. A good way to remove said symbols is by running $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. It's important to not replace \v, as that might not be desired.

If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded. Wikipedia on BOM .
At least come back with the various steps the data takes, so we can help with deducing the source of the problem.
Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.
(using str_replace is just a bandaid)

Related

Decoding ISO-8859-1 and Encoding to UTF-8 before MySQL query

I'm kinda stuck if I'm doing it right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db is in utf-8 encoding. Which is why I want to convert the file to UTF-8 encoded characters before I can send it as a query. For instance, First I rewrite every line of the file.txt into file_new.txt using.
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and create a cursor with the following query so that all the data is received as utf-8.
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the table in MySQL utf-8 encoding? Or Am I missing any crucial part?
Now to receive this data. I use 'SET NAMES "utf8"" as well. But the received data is giving me question marks � when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other utf-8 encoded data from the database is getting scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to utf-8. Can any one explain why?
PS: Before I read everyline, I replace a character and save the file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if it causes any problem to the original file encoding.
f1 = codecs.open(tsvpath, 'rb',encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb',encoding='utf8')
for line in f1:
f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the below example, I've utf-8 encoded persian data which is right but the other non-enlgish text is coming out to be in "question marks". This is precisely my problem.
Example : Removed.

Welcome to the wonderful world of unicode and windows. I've found this site very helpful in understanding what is going wrong with my strings http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor - it may be trying to be helpful and is silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in Hxd and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, its hard to say where the problem is. My guess is your replacing the double quote with single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, unicode='utf-8-sig'):
return open(dst, 'w').write(open(src, 'rb').read().decode(unicode, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')

Alright guys, so my encoding was right. The file was getting encoding to utf-8 just as needed. All the queries were right. It turns out that the other dataset that was in Arabic was in ISO-8859-1. Therefore, only 1 of them was working. No matter what I did.
The Hexeditors did help. But in the end I just used sublime text to recheck if my encoded data was utf-8. It turns out the python script and the sublime editor did the same. So the code is fine. :)

You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the columns's CHARACTER SET.

Half character? - Accents encoding issue

I'm currently facing a very strange encoding issue when dealing with an html source code.
I got the following line:
"requête présentée par..."
When an extern library does an utf8_decode I got:
"reque^te présente´e par..."
So accents are placed right to the accented characters. If I do an utf8_encode from that result, I don't get the original "requête présentée par..." but I keep having "reque^te présente´e par..."
Even stranger: If I open the original html in Notepad++, encoding is utf8 without BOM (so far, so good) but I can actually select half of the character with the text selection (keyboard or mouse). Yes, half of it. As if the real code was "e^" but it was displayed as "ê". When I try to copy it to my IDE it copies "ê" but pastes "e^".
I have come up with a basic replacement function:
"e^" => "ê",
"e´" => "é",
...
and some other french cases, and it's working properly for now.
But as the HTML comes in differents languages, I'm pretty sure I won't be able to successfully replace every character under this encoding issue.
Has anybody face this issue before and (hopefully) has a more general solution?
Thanks in advance.

It sounds like your HTML source is using Combining characters. That is, instead of using a single unicode character to represent the ê, it's using first a regular e and then a combining character to add the diacritic ^. You can verify this with a hex editor to see the character codes, in this case the combining circumflex is hex code 0302.
See also Unicode equivalence.

PHP: html_entity_decode removing/not showing character

I am having a problem with Â character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is ran through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the Â character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.

Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode( preg_replace($match, $replace, utf8_decode($utf8_text));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

w3c validation error with utf-8

When I try to validate a certain page I get the below error:
Sorry, I am unable to validate this document because on line 136 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xFF" does not map to Unicode
What exactly does this mean and how can I find out what character is causing the problem?
The page is generated dynamically in PHP and a bit large and I am not sure what to look for.
EDIT:
I get missing character symbols for umlauts and french/spanish accented vowels.

Does your text editor save the file with BOM? If so unset that setting and resave it. I think this is it.
Otherwise try going to line 136, possibly with a different editor and delete any weird square symbols, or whole lines.

I fixed this with htmlentities() in PHP to make sure the umlaut and accented characters are displayed correctly in html.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.