The structure of this XML is corrupted because of "include" connection database.
As you can see, there are strange characters in the first line of the file ('╗ ┐' ╗ ┐).
However, they do not appear on the web, since they only appear when I use cmd.exe to type the file. Here is a screenshot of the offending file:
Here's the URL of the file:
http://web.wipix.com.br/aniversariantes.xml
In my PHP file, I have two "includes" in the files connection.php (connection to database) AND "serialize.php" to generate the XML.
This only works if I take out the "includes" and use everything on one page only. How can I fix this?
That is a byte order mark (Unicode character U+FEFF) but it being displayed in an incorrect encoding. Since your document claims to be encoded as ISO-8859-1 there should not be a byte order mark.
Probably your xml file is in UTF-8 format with BOM.
http://en.wikipedia.org/wiki/Byte_order_mark
Remove offending 8 bytes or save your xml without BOM using a text editor.
If xml is dinamically generated, you have to modify the generation code.
Moreover, the BOM bytes seems encoded badly. Probably the xml was converted in a wrong way and BOM bytes were screwed up.
The odd stuff at the beginning could be a byte-order mark, but I'm not sure.
A byte-order mark is a byte sequence inserted at the beginning of a file used to indicate the endianness of it, or whether the most significant byte comes first.
From your output, there are other weird characters (not text) in the file, so it is possible that the program inserted them in.
Related
I’m having issues with reading Czech characters from a txt file.
I want to read .txt files containing categories line by line. With general languages I have no issue. I can read the txt file line by line and copy the categories that I want in an array.
But as soon as I want to read a txt file that contains categories in the Czech language I get problems processing the output of my code. The Czech specific characters are coming out rubbish even though the text file is showing the characters correctly.
As an example:
The letters ě, č, ů or ř are all outputed as a square or as st\u001b or other rubish, depending on the way I read the file.
Origionally I use the fgets function to read a line from the text file.
But as this didn’t return the correct characters I started testing with adding utf8_encode but whilst that changed some characters it still didn’t restore all the characters.
Then I started experimenting with mb_detect_encoding combined with mb_convert_encoding and later read somewhere that fgets could sometimes return incorrect characters so I started testing with file_get_contents. This also didn’t solve the issue.
I assume the main issue is with the way I’m reading the txt file as the output from the fgets and file_get_contents functions are garbled from the start.
Can anyone tell me how to read a text file with Czech characters correctly?
Thanks In advance.
Oké I found the solution myself. Just for the case someone else runs into this issue, the txt file was in the wrong coding. The file was in the "UCS-2 Little Endian" coding. After loading the file in Notepad++ I could encode it to the UTF-8 format and that solved the problem.
I'm kinda stuck if I'm doing it right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db is in utf-8 encoding. Which is why I want to convert the file to UTF-8 encoded characters before I can send it as a query. For instance, First I rewrite every line of the file.txt into file_new.txt using.
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and create a cursor with the following query so that all the data is received as utf-8.
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the table in MySQL utf-8 encoding? Or Am I missing any crucial part?
Now to receive this data. I use 'SET NAMES "utf8"" as well. But the received data is giving me question marks � when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other utf-8 encoded data from the database is getting scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to utf-8. Can any one explain why?
PS: Before I read everyline, I replace a character and save the file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if it causes any problem to the original file encoding.
f1 = codecs.open(tsvpath, 'rb',encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb',encoding='utf8')
for line in f1:
f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the below example, I've utf-8 encoded persian data which is right but the other non-enlgish text is coming out to be in "question marks". This is precisely my problem.
Example : Removed.
Welcome to the wonderful world of unicode and windows. I've found this site very helpful in understanding what is going wrong with my strings http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor - it may be trying to be helpful and is silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in Hxd and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, its hard to say where the problem is. My guess is your replacing the double quote with single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, unicode='utf-8-sig'):
return open(dst, 'w').write(open(src, 'rb').read().decode(unicode, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
Alright guys, so my encoding was right. The file was getting encoding to utf-8 just as needed. All the queries were right. It turns out that the other dataset that was in Arabic was in ISO-8859-1. Therefore, only 1 of them was working. No matter what I did.
The Hexeditors did help. But in the end I just used sublime text to recheck if my encoded data was utf-8. It turns out the python script and the sublime editor did the same. So the code is fine. :)
You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the columns's CHARACTER SET.
I have a few php scripts files encoded in ANSI. Now that I converted my website to html5, I need everything in UTF-8, so that accents in these file are displayed correctly without any php conversion through iconv(). I used Notepad++ to set the encoding of my scripts on UTF-8 and save the files, and most are fine, accents are displayed correctly, only the main script now blocks everything, and the server only returns a white page, without any error message, even with ini_set('error_reporting', 'E_ALL') !
When I change the encoding back to ANSI in Notepad++, and save the file without any other change, it works again (except the accents are not displayed correctly without iconv() ).
I did also try to use a php script to change the encoding with ...$file = iconv('ISO-8859-1','UTF-8', $file);... but the result is exactly the same !
I wrote a short php script to look for high char() values, but the highest values seems to be usual French accents like é, è, etc which are also present on other files and pose no problem. I did remove other special chars, without any effect...
The problem is that the file is large, more than 4500 lines and I'm not sure how to proceed to correct this ? Anyone has had this problem, or has any idea ?
The issue was with the "£" (pound) character, I used it a lot as delimiter in preg_match("£(...)£", "...", $string) and preg_replace conditions.
For some reason these characters were not accepted after conversion. I had to replace all of them, then only it worked fine in utf-8... Apparently they are not a problem now that the file is converted, I can use them again.
I have a lot Persian text and I want explode it, I store my text in a file.txt. (So i have a file.text containing Persian text). Now my problem is charset. When i save the text into file.text, it give me a error:
This file contains characters in Unicode format which will be lost if you save this file as a ANSI encoded text file. To keep the Unicode information, click cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?
I continue. Now when I open file.text all characters are fine, and when explode it, all characters crash.
Note: when I put text in a php variable, all thing is fine, in fact my problem is with file.text.
What should I do ?
My code: (for explode)
$text=file_get_contents('file.txt');
$var = explode("\n", $text);
foreach ($var as $sentence) {
echo $sentence.'<br>'; // or save into databse
}
Make sure to save the text file in the UTF-8 encoding. (Use UTF-8 for your HTML output and database connection as well, to match.)
If you save a file as the encoding that Microsoft misleadingly call “Unicode” you will actually get UTF-16LE, a two-byte, non-ASCII-compatible encoding that is generally a bad idea.
PHP's basic string ops like explode operate on a byte basis, so if you split a UTF-16 on a single \n byte you will end up splitting up a two-byte character in the middle and messing up the byte order of the following string (and every alternate string).
Use a decent text editor that gives you the possibility to save as UTF-8 without BOM, because Notepad will give you a UTF-8-faux-BOM at the start of the file, meaning that when you read it in PHP your first line (but none of the other lines) will have a U+FEFF Byte Order Mark character at the start of the string, causing widespread conclusion.
Prefer a text editor that saves in BOM-free-UTF-8 by default. Notepad's preference for ANSI, UTF-16LE, and faux-BOMs makes it a pretty terrible choice of editor for the web.
When I try to validate a certain page I get the below error:
Sorry, I am unable to validate this document because on line 136 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xFF" does not map to Unicode
What exactly does this mean and how can I find out what character is causing the problem?
The page is generated dynamically in PHP and a bit large and I am not sure what to look for.
EDIT:
I get missing character symbols for umlauts and french/spanish accented vowels.
Does your text editor save the file with BOM? If so unset that setting and resave it. I think this is it.
Otherwise try going to line 136, possibly with a different editor and delete any weird square symbols, or whole lines.
I fixed this with htmlentities() in PHP to make sure the umlaut and accented characters are displayed correctly in html.