w3c validation error with utf-8 - php

When I try to validate a certain page I get the below error:
Sorry, I am unable to validate this document because on line 136 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xFF" does not map to Unicode
What exactly does this mean and how can I find out what character is causing the problem?
The page is generated dynamically in PHP and is fairly large, so I am not sure what to look for.
EDIT:
I get missing-character symbols for umlauts and French/Spanish accented vowels.

Does your text editor save the file with a BOM? If so, disable that setting and resave the file; I think that is the cause.
Otherwise, go to line 136, possibly in a different editor, and delete any odd square symbols (or the whole line).
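If you'd rather locate the offending byte programmatically, here is a minimal sketch (assuming the mbstring extension is available; 'page.html' stands in for a saved copy of the generated output):
// Sketch: report every line of the saved output that is not valid UTF-8.
$lines = file('page.html');  // placeholder: a saved copy of the generated page
foreach ($lines as $i => $line) {
    if (!mb_check_encoding($line, 'UTF-8')) {
        echo 'Invalid UTF-8 on line ' . ($i + 1) . "\n";
    }
}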

I fixed this with htmlentities() in PHP to make sure the umlaut and accented characters are displayed correctly in HTML.
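For reference, a minimal sketch of that call ($str is a placeholder); passing the encoding explicitly matters, since htmlentities() defaulted to ISO-8859-1 before PHP 5.4 and will mangle UTF-8 input without it:
// Sketch: convert accented characters to HTML entities.
$safe = htmlentities($str, ENT_QUOTES, 'UTF-8');
echo $safe;  // e.g. "ü" becomes "&uuml;"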

Related

UTF8 null character & normalizing whitespace characters

I'm working on a script that builds an XML feed from strings in the database. The strings are user-entered image captions from the Facebook Open Graph API, and they are supposed to be all UTF-8 according to Facebook. So I import the captions into the database and store them with the utf8_unicode_ci collation (I also tried utf8_bin).
But I always get the same error when trying to display the output XML feed, because one of the captions contains a weird whitespace character:
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpMyAdmin) and in the page source (viewed in Chrome), the problematic character appears as an empty square symbol.
If I copy and paste the problematic character into a converter, it gives me hexadecimal 000B.
What's the easiest way to fix this?
I'd also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it's not supposed to.
Failed attempts:
utf8_encode() isn't working, because the rest of the string is already valid UTF-8.
I also tried several ways of stripping out all non-UTF-8 characters, but none of them filter out this specific character. The same happens when trying to filter out all non-Latin characters.
htmlentities() and htmlspecialchars() do not encode the problematic character.
iconv() with mb_detect_encoding() will not detect the string as invalid UTF-8.
str_replace() and preg_replace() are of no help; if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even a whitespace.
str_replace("\0", "", $str) ...nope.
Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf8 character set is not a full implementation of UTF-8 (it stores at most three bytes per character); utf8mb4 is;
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical-tabulation symbols creeping into the text fields. A good way to remove them is $str = str_replace("\x0b", "", $str);, where $str is the string that is about to be inserted into the text field. Note that reaching for the regex escape \v instead would be too broad: in PCRE, \v matches any vertical whitespace, including newlines, which is probably not what you want.
If the 0B is always at the beginning of a string, then trace the strings back to their source and check whether they are BOM-encoded; see Wikipedia on BOM.
At least come back with the various steps the data takes, so we can help deduce the source of the problem.
Note: although it is needed for emoji and Chinese, switching to utf8mb4 will not deal with a BOM if that is the 'real' problem.
(Using str_replace() is just a band-aid.)
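A slightly broader variant of the str_replace() fix, as a sketch rather than part of the original answer: strip every C0 control character except tab, line feed, and carriage return before building the feed, since those three are the only ones XML 1.0 allows in character data:
// Sketch: remove the control characters that XML 1.0 forbids,
// i.e. everything below \x20 except \t (\x09), \n (\x0A) and \r (\x0D).
$clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $str);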

Decoding ISO-8859-1 and Encoding to UTF-8 before MySQL query

I'm not sure whether I'm doing this right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL database uses UTF-8 encoding, which is why I want to convert the file's contents to UTF-8 before sending them in a query. First, I rewrite every line of file.txt into file_new.txt using:
line = line.decode('ISO-8859-1').encode('utf-8')
Then I save it. Next, I create a MySQL connection and a cursor, and execute the following query so that all the data is exchanged as UTF-8:
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and insert each line into MySQL. Is this the right approach to get the table into UTF-8 encoding, or am I missing a crucial part?
Now, to receive this data, I use 'SET NAMES "utf8"' as well. But the received data shows question marks (�) when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other UTF-8 encoded data from the database gets scrambled instead. So I'm guessing the data from file.txt is still not being encoded to UTF-8. Can anyone explain why?
PS: Before I read each line, I replace one character and save file.txt to file.txt.tmp; I then read that file to produce file_new.txt. I don't know whether this affects the original file's encoding.
import codecs

f1 = codecs.open(tsvpath, 'rb', encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb', encoding='utf8')
for line in f1:
    f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the example below, I have UTF-8 encoded Persian data, which comes out right, but the other non-English text comes out as question marks. This is precisely my problem.
Example : Removed.
Welcome to the wonderful world of Unicode and Windows. I've found this site very helpful for understanding what is going wrong with my strings: http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor such as HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor, it may be trying to be helpful and silently changing your encoding.
Start with your original data and view it in HxD to see what the encoding is. Then view your results in HxD and check whether the changes you expect are being made. Repeat for each step in your process.
Without your full code and sample data, it's hard to say where the problem is. My guess is that replacing the double quote with a single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, encoding='utf-8-sig'):
    # Read raw bytes, decode with the given codec (ignoring bad bytes),
    # and re-save the result as UTF-8.
    with open(src, 'rb') as fin, open(dst, 'w', encoding='utf-8') as fout:
        fout.write(fin.read().decode(encoding, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
Alright guys, so my encoding was right. The file was being encoded to UTF-8 just as needed, and all the queries were right. It turns out that the other dataset, the Arabic one, was in ISO-8859-1, so only one of them was working, no matter what I did.
The hex editors did help, but in the end I just used Sublime Text to recheck whether my encoded data was UTF-8. It turns out the Python script and the Sublime editor did the same thing, so the code is fine. :)
You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the column's CHARACTER SET.
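On the PHP side of the pipeline, a minimal sketch (connection details are placeholders): letting the driver set the charset is usually safer than issuing SET NAMES by hand, because the client library then knows the encoding too.
// Sketch: placeholder credentials; utf8mb4 assumed as the target charset.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8mb4');  // unlike a raw SET NAMES query, this also
                              // keeps mysqli_real_escape_string() encoding-aware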

Half character? - Accents encoding issue

I'm currently facing a very strange encoding issue when dealing with HTML source code.
I got the following line:
"requête présentée par..."
When an external library does a utf8_decode() I get:
"reque^te présente´e par..."
So the accents are placed to the right of the accented characters. If I do a utf8_encode() on that result, I don't get the original "requête présentée par..."; I keep getting "reque^te présente´e par...".
Even stranger: if I open the original HTML in Notepad++, the encoding is UTF-8 without BOM (so far, so good), but I can actually select half of the character with the text selection (keyboard or mouse). Yes, half of it. As if the real content were "e^" but it were displayed as "ê". When I copy it to my IDE, it copies "ê" but pastes "e^".
I have come up with a basic replacement function:
"e^" => "ê",
"e´" => "é",
...
and some other French cases, and it's working properly for now.
But as the HTML comes in different languages, I'm pretty sure I won't be able to successfully replace every character affected by this encoding issue.
Has anybody faced this issue before and (hopefully) found a more general solution?
Thanks in advance.
It sounds like your HTML source is using combining characters. That is, instead of using a single Unicode character to represent the ê, it first uses a regular e and then a combining character to add the diacritic ^. You can verify this with a hex editor by inspecting the character codes; in this case the combining circumflex is U+0302.
See also Unicode equivalence.
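A more general fix than a hand-written replacement table, sketched below on the assumption that PHP's intl extension is available: normalize the string to NFC, which composes each base letter plus combining mark into the single precomposed character.
// Sketch (requires ext/intl): NFC composes "e" + U+0302 (combining
// circumflex) into the single precomposed character "ê".
$fixed = Normalizer::normalize($input, Normalizer::FORM_C);
Note that utf8_encode()/utf8_decode() cannot help here: they only convert between ISO-8859-1 and UTF-8 and know nothing about combining marks.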

"php include" strange characters in the generator xml "'╗ ┐' ╗ ┐"

The structure of this XML is corrupted by the database-connection "include".
As you can see, there are strange characters on the first line of the file ('╗ ┐' ╗ ┐).
However, they do not appear in the browser; they only show up when I type out the file in cmd.exe. Here is a screenshot of the offending file:
Here's the URL of the file:
http://web.wipix.com.br/aniversariantes.xml
In my PHP file, I have two "includes": connection.php (the database connection) and serialize.php (which generates the XML).
This only works if I remove the "includes" and put everything in one file. How can I fix this?
That is a byte order mark (Unicode character U+FEFF), but displayed in an incorrect encoding. Since your document claims to be encoded as ISO-8859-1, there should not be a byte order mark.
Probably your XML file is in UTF-8 format with a BOM.
http://en.wikipedia.org/wiki/Byte_order_mark
Remove the offending bytes, or save your XML without a BOM using a text editor.
If the XML is dynamically generated, you have to modify the generation code.
Moreover, the BOM bytes seem to be badly encoded: probably the XML was converted the wrong way and the BOM bytes were mangled.
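If re-saving every file without a BOM is not practical, one possible workaround, sketched here (the file names follow the question; the output-buffering approach is an assumption, not the original poster's code), is to buffer the includes and strip the BOM bytes before sending the XML:
// Sketch: capture everything the includes emit and drop any UTF-8
// BOM bytes (0xEF 0xBB 0xBF) before the XML reaches the client.
ob_start();
include 'connection.php';
include 'serialize.php';
echo preg_replace('/\xEF\xBB\xBF/', '', ob_get_clean());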
The odd stuff at the beginning could be a byte-order mark, but I'm not sure.
A byte-order mark is a byte sequence inserted at the beginning of a file to indicate its endianness, i.e. whether the most significant byte comes first.
From your output, there are other weird characters (not text) in the file, so it is possible that a program inserted them.

PHP concatenate Characters with HTML encoding of the Unicode characters

PROBLEM
I am trying to build a string whose values are a mix of literal characters and HTML entities.
How can I create a string that is part Character and part Encoding?
FOR EXAMPLE
I want to make an array of the cards A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K of hearts, using &#9829; (the heart entity) for the suit.
I have tried the following in various forms to no avail...
$hearts = array("A&#9829;", "2&#9829;", /* etc. */);
I have also tried using the HTML entities for the letters themselves, but that returns a parse error.
RESOLVED
The code as written above works. The error was due to incorrect quotation-mark characters in the original PHP. But see the selected answer and comments for information on UTF-8 usage in PHP.
Just include the UTF-8 characters directly, for example ❤.
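For illustration, a sketch of the full array with a literal heart; ♥ is U+2665, the same character that the &#9829; entity encodes, and this assumes the .php source file itself is saved as UTF-8:
// Sketch: build "A♥" ... "K♥" with a literal UTF-8 heart character.
$ranks = array('A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K');
$hearts = array_map(function ($r) { return $r . '♥'; }, $ranks);
echo implode(' ', $hearts);  // A♥ 2♥ 3♥ ... 10♥ J♥ Q♥ K♥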
