I have a Unicode string received over HTTP POST or fetched from a DB (it does not matter which).
In PHP I checked the encoding of the string using "mb_detect_encoding" and got UTF-8 as the result.
So the string is in Unicode.
But how do I write the string from PHP to an output file with the proper encoding? This is what I have:
$fd = fopen('myfile.php', "wb");
fwrite($fd, $msg."\n");
What I see is "टेसà¥à¤Ÿ" instead of the actual string, which is
टेस्ट्
Pasting the 'junk' into Notepad++ and then selecting UTF-8 from the Encoding menu shows the proper text.
EDIT
*SOLUTION*
Sorry for posting the question and then figuring out the answer myself.
I found the solution at the following site
http://www.codingforums.com/showthread.php?t=129270
function writeUTF8File($filename, $content) {
    $f = fopen($filename, "w");
    # Now UTF-8 - add the byte order mark
    fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
    fwrite($f, $content);
    fclose($f);
}
PHP does not change the encoding of the string or does anything with it when you write to a file. It simply dumps the bytes of the string (PHP strings are really byte arrays) into the file, period. If you actually receive the string as UTF-8 and do not do anything with it except write it to a file, the content of the file will be UTF-8 encoded. Your problem is most likely that whatever application you're using to view the file does not properly read it as UTF-8 encoded.
That BOM solution is not necessarily the best. A BOM is not necessary for UTF-8 and many applications have problems with it. It only helps applications that are otherwise unable (too stupid) to detect that a file is UTF-8 encoded. The better solution may be to simply explicitly tell the application in question that it needs to treat the file as UTF-8 encoded when opening the file. Or use a better application.
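A minimal sketch to convince yourself of this (the file name is arbitrary, and it assumes the script itself is saved as UTF-8):
$msg = "टेस्ट्"; // already UTF-8
file_put_contents('out.txt', $msg . "\n");
var_dump(mb_check_encoding(file_get_contents('out.txt'), 'UTF-8')); // bool(true)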
You have to specify the strict parameter of mb_detect_encoding, or you'll get many false positives.
Also, while the output may be UTF-8, you will have to specify the right headers (content-encoding) and/or the charset meta tag (if it's HTML).
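For example, a small sketch ($msg stands in for your input string):
$enc = mb_detect_encoding($msg, array('UTF-8', 'ISO-8859-1'), true); // strict = true
header('Content-Type: text/html; charset=utf-8'); // if serving the string as HTML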
Related
I'm not sure whether I'm doing this right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL DB uses UTF-8 encoding, which is why I want to convert the file to UTF-8 encoded characters before I can send it as a query. First I rewrite every line of file.txt into file_new.txt using:
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and create a cursor with the following query so that all the data is received as utf-8.
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the table in MySQL into UTF-8 encoding, or am I missing a crucial part?
Now to receive this data I use 'SET NAMES "utf8"' as well, but the received data gives me question marks (�) when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other UTF-8 encoded data from the database gets scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to UTF-8. Can anyone explain why?
PS: Before I read every line, I replace a character and save file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if it causes any problem to the original file encoding.
f1 = codecs.open(tsvpath, 'rb', encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb', encoding='utf8')
for line in f1:
    f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the example below, I have UTF-8 encoded Persian data which comes out right, but the other non-English text comes out as question marks. This is precisely my problem.
Example : Removed.
Welcome to the wonderful world of Unicode and Windows. I've found this site very helpful in understanding what is going wrong with my strings: http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor, it may be trying to be helpful and silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in HxD and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, it's hard to say where the problem is. My guess is that replacing the double quote with a single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, encoding='utf-8-sig'):
    data = open(src, 'rb').read().decode(encoding, 'ignore')
    open(dst, 'wb').write(data.encode('utf-8'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')
Alright guys, so my encoding was right. The file was getting encoded to UTF-8 just as needed, and all the queries were right. It turns out that the other dataset, which was in Arabic, was in ISO-8859-1. Therefore only one of them was working, no matter what I did.
The hex editors did help, but in the end I just used Sublime Text to recheck whether my encoded data was UTF-8. It turns out the Python script and the Sublime editor agreed, so the code is fine. :)
You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the column's CHARACTER SET.
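If the PHP side uses mysqli, a minimal sketch (the connection details are placeholders):
$mysqli = new mysqli('localhost', 'user', 'pass', 'db');
$mysqli->set_charset('utf8'); // same effect as SET NAMES utf8, and keeps escaping correct
header('Content-Type: text/html; charset=utf-8');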
The structure of this XML is corrupted because of the "include" of the database connection.
As you can see, there are strange characters in the first line of the file (╗┐ ╗┐).
However, they do not appear on the web, since they only appear when I use cmd.exe to type the file. Here is a screenshot of the offending file:
Here's the URL of the file:
http://web.wipix.com.br/aniversariantes.xml
In my PHP file, I have two "includes": connection.php (the connection to the database) and serialize.php (which generates the XML).
It only works if I take out the "includes" and put everything on one page. How can I fix this?
That is a byte order mark (Unicode character U+FEFF), but it is being displayed in an incorrect encoding. Since your document claims to be encoded as ISO-8859-1, there should not be a byte order mark.
Probably your XML file is in UTF-8 format with a BOM.
http://en.wikipedia.org/wiki/Byte_order_mark
Remove the offending bytes, or save your XML without a BOM using a text editor.
If the XML is dynamically generated, you have to modify the generation code.
Moreover, the BOM bytes seem badly encoded; probably the XML was converted the wrong way and the BOM bytes got mangled.
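A small helper can tell you which of the included files carries a BOM (the file names are the ones from the question):
function has_utf8_bom($path) {
    return substr(file_get_contents($path), 0, 3) === "\xEF\xBB\xBF";
}
var_dump(has_utf8_bom('connection.php'), has_utf8_bom('serialize.php'));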
The odd stuff at the beginning could be a byte-order mark, but I'm not sure.
A byte-order mark is a byte sequence inserted at the beginning of a file to indicate its endianness, i.e. whether the most significant byte comes first.
From your output, there are other weird characters (not text) in the file, so it is possible that the program inserted them.
I just started dabbling in PHP and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.
I'm working on Ubuntu 11.10 x86, PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim :set encoding confirms this) which I then proceed to read using
$file = fopen("file.txt", "r");
while (!feof($file)) {
    $line = fgets($file);
    //...
}
fclose($file);
Using mb_detect_encoding($line) reports UTF-8.
If I do echo $line, I can see the line properly (no mangled characters) in the browser,
so I guess everything is fine with the browser and Apache. Though I did search my Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for character encoding (just in case).
When I try to split the string using $arr = mb_split(';', $line), the fields of the resulting array contain mangled UTF-8 characters (mb_detect_encoding($arr[0]) reports UTF-8 as well).
So echo $arr[0] will result in something like this: ΑΘΗÎÎ.
I have tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried to manually detect UTF-8 using this W3 Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same as well.
So my question is: how can I properly split the string? Is going down the mb_ path the wrong way? What am I missing?
Thank you for your help!
UPDATE: I'm adding sample strings and base64 equivalents (thanks to @chris for his suggestion)
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
OK, so after doing this there seems to be a 77u/ difference between 3 and 5, which according to this is a UTF-8 BOM. So how can I avoid it?
UPDATE 2: I woke up refreshed today and, with your tips in mind, tried it again. It seems that $line = fgets($file) reads the first line correctly (no mangled chars) and fails for each subsequent line. I then base64_encode'd the first and second lines, and the 77u/ BOM appeared in the base64'd string of the first line only. I then opened the offending file in vim and entered :set nobomb :w to save the file without the BOM. Firing up PHP again showed that the first line was now also mangled. Based on @hakre's remove_utf8_bom, I added its complementary function
function add_utf8_bom($str) {
    $bom = "\xEF\xBB\xBF";
    return substr($str, 0, 3) === $bom ? $str : $bom . $str;
}
and voilà, each line is read correctly now.
I don't much like this solution, as it seems very hackish (I can't believe that an entire framework/language does not provide a way to deal with BOM-less strings). So do you know of an alternate approach? Otherwise I'll proceed with the above.
Thanks to @chris, @hakre and @jacob for their time!
UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />. The output also had to be properly enclosed inside an <html><body> section or the browser would not interpret the encoding correctly. Thanks to @jake for his suggestion.
Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.
UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that:
ASCII characters stay the same when encoded to UTF-8
no other characters will be encoded to ASCII characters
This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.
In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
PS: Since the UTF-8 encoding is prefix-free, you can actually use explode() with any UTF-8 encoded separator.
PPS: It seems like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
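A sketch of both suggestions, reusing the sample line from the question (the separator ; is ASCII, so both are byte-safe):
// Plain explode() works because UTF-8 is ASCII-compatible:
$arr = explode(';', $line);

// Or let fgetcsv() split while reading:
$fh = fopen('file.txt', 'r');
while (($fields = fgetcsv($fh, 0, ';')) !== false) {
    var_dump($fields); // each field is still a valid UTF-8 byte string
}
fclose($fh);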
When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test page for project XY</title>
</head>
<body>
    <h1>Test Page</h1>
    <pre><?php
    print_r($_GET);
    ?></pre>
</body>
</html>
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
Edit: I just read your post more closely. You're suggesting this should output false, because you suggest a BOM was introduced by mb_split().
header('content-type: text/plain;charset=utf-8');

$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

$pieces = mb_split(';', $str);
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);
Does it? It works as expected for me (bool true, and the strings in the array are correct).
The mb_split function should be fine, but you should also define the charset it uses, with mb_regex_encoding:
mb_regex_encoding('UTF-8');
About mb_detect_encoding: it can fail, but that is simply because you can never truly detect an encoding. You either know it or you can try to guess, and that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.
How to remove the BOM:
You can filter the string input and remove a UTF-8 BOM with a small helper function:
/**
 * Remove the UTF-8 BOM if the string has it at the beginning.
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
    if (substr($str, 0, 3) === "\xEF\xBB\xBF") {
        $str = substr($str, 3);
    }
    return $str;
}
Usage:
$line = remove_utf8_bom($line);
There are probably better ways to do it, but this should work.
I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However, when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about preserving the original encoding (which is UTF-8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there are a bunch of encoding-specific sequences (byte sequences that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
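For example, a sketch in which the source charset is an assumption you would have to verify first:
// 'ISO-8859-1' is a guess here - replace it with the real source encoding
$text = iconv('ISO-8859-1', 'UTF-8', $text);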
It doesn't work like that. Even if you utf8_encode($theString), you will not CREATE a UTF-8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
To understand the issue, see:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is '\xef\xbb\xbf', we should add it at the very start of the file.
<?php
function writeStringToFile($file, $string) {
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM
    fputs($f, $string);
    fclose($f);
}
?>
The $file could be any text or XML file, and $string is your UTF-8 encoded string.
Try it now and it will write a UTF-8 encoded file with your UTF-8 content (string):
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it to the file, and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
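That round trip would look roughly like this (a sketch reusing MY_LOCATION and $fetchedData from above):
file_put_contents(MY_LOCATION, htmlentities($text, ENT_QUOTES, 'UTF-8'));
echo html_entity_decode($fetchedData, ENT_QUOTES, 'UTF-8');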
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
// Note: utf8_encode() assumes its input is ISO-8859-1 (Latin-1).
// If $s were already UTF-8, utf8_encode() would double-encode it.
$s = "This is a Latin-1 string éèàç";
$f = fopen('myFile', "w");
fwrite($f, utf8_encode($s));
fclose($f);
?>
I am not that good with encodings, and I am falling over the basics here.
I am trying to create a file that is recognised as UTF-8
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();
I also tried:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();
I then open the file with Notepad++ and it says its current encoding is ANSI, not UTF-8. What am I missing? How should I be outputting this file?
I will eventually be outputting an XML file of products for the Affiliate Window program.
Also, if it helps: my web server runs CentOS, Apache 2 and PHP 5.2.8.
Thanks in advance for any help!
As Filip said, encoding is not an intrinsic attribute of a file; it's implicit. This means that unless you know what encoding a file is to be interpreted in, there is no way to determine it. The best you can do is make a guess. This is presumably what programs such as Notepad++ do. Since the actual data that you have sent can be interpreted in many different encodings, it just picks the candidate that it likes best. For Notepad++ this appears to be ANSI (which in itself is a rather inaccurate classification), while other programs might default to something else.
The reason why you have to specify the charset in an HTTP header is exactly that the file itself doesn't contain this information, so the browser needs to be informed of it. Once you have saved the file to disk, this information is thus unavailable.
If the file you're going to serve is an XML document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. E.g. if you are using UTF-8, you should put this at the top of your document:
<?xml version="1.0" encoding="utf-8" ?>
Note that apart from getting the meta information about the charset across, you also need to make sure that the data you are serving actually is UTF-8 encoded. This is much the same scenario: you need to know implicitly what encoding your data is in. The function utf8_encode is (despite the name) explicitly meant for converting ISO-8859-1 into UTF-8. Thus, if you use it on already UTF-8 encoded data, you'll get it double-encoded, with garbled data as the result.
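You can see the double-encoding effect directly (assuming the snippet below is itself saved as UTF-8):
$s = "é";                      // two bytes in UTF-8: 0xC3 0xA9
echo bin2hex(utf8_encode($s)); // c383c2a9 - each byte was re-encoded as if it were Latin-1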
Charsets aren't that complicated in themselves. The problem is that if you aren't careful about keeping things straight, you'll mess it up. Whenever you have a string, you should be absolutely certain that you know which encoding it is in. Otherwise it's not a string; it's just a blob of binary data.
"test" is all ASCII, so there is no need to use UTF-8 for that.
In fact, the first 128 characters of the Unicode charset are the same as ASCII's, and UTF-8 uses the same codes for those characters as ASCII does. See Wikipedia's description of UTF-8 for further information.
Once you download the file, it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There's a thing called a byte order mark (BOM) which allows specifying the UTF encodings by a prefix in the contents.
See question "When a BOM is used, is it only in 16-bit Unicode text?".
I would imagine using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognize the file correctly.
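Putting that together with the headers from the question (a sketch):
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "\xEF\xBB\xBF"; // UTF-8 BOM so editors like Notepad++ detect the encoding
echo "test";
exit();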
There is no such thing as headers for a downloaded txt file once it is saved. As you are trying to create XML files in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and saving/opening that; then it should work, as long as the OS has UTF-8 support, which any modern Linux distribution should have.
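For the XML feed, something like this sketch should do (the element names are made up):
header('Content-Type: text/xml; charset=utf-8');
header('Content-disposition: attachment; filename=products.xml');
echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
echo "<products><product>test</product></products>\n";
exit();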
I refer you to Joel's Absolute minimum every software developer should know about Unicode
I refer you to What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text