I'm trying to output an XML file using PHP, and everything is right except that the created file isn't UTF-8 encoded, it's ANSI. (I see that when I open the file and do Save as....)
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I figured out that non-English characters don't appear in the output.
I searched for a solution and first tried adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the PHP script, but it says:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tried some other suggestions, like not including 'UTF-8' when creating the document but setting it separately:
$doc->encoding = 'UTF-8';
but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tried changing it to
$doc->saveXML();
but the non-English characters still didn't appear.
Any ideas?
ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on a default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
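As a minimal sketch of that (the element name, sample text and file name are just illustrative), with the PHP source itself saved as UTF-8 this produces a UTF-8 encoded file with the non-English characters intact:
$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->createElement('greeting');
$root->appendChild($dom->createTextNode('Äöü')); // non-English sample characters, typed in a UTF-8 source file
$dom->appendChild($root);
$dom->save('filename.xml'); // written as UTF-8, matching the declared encoding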
In any case, using "Save as" in an undisclosed piece of software is not a reliable way to determine a file's encoding.
Related
In a webapp I place a <div id="xxx" contentEditable=true > for editing purposes. The encodeURIComponent(xxx.innerHTML) result is sent via an Ajax POST to a server, where a PHP script creates a simple txt file from it, which the user can then download to store locally or print on screen. It works perfectly so far, but … yes, but, character encoding is a mess. All special characters, like the German Ä, are interpreted wrongly; in this case as ä.
I have googled for some days and studied PHP methods like iconv(), and I know how to set a browser's character encoding and how to configure a text editor for the corresponding decoding. But nothing helps; it's still a mess, or becomes even weirder.
So my question is: where in this encoding/decoding roundtrip from the browser to the server and back to the browser do I have to do what, to ensure that an Ä is still an Ä?
I'll answer my own question, because it turned out to be a different problem than stated above. The contenteditable element is actually part of a section of HTML code. On the server side I need to filter out the contenteditable text with PHP, which I do via a DOMDocument like this:
$doc = new DOMDocument();
$doc->loadHTML($_POST["data"]);
Then I access the elements and their textual content as usual.
Finally I save the text with
file_put_contents($txtFile, $plainText, LOCK_EX);
The saved text was then a mess, as described above. It turns out that you need to tell the DOMDocument which character set loadHTML() should use to interpret the input; in this case UTF-8.
First I did it as recommended in PHP, this way:
$doc = new DOMDocument('1.0', 'UTF-8');
But that didn't help (I wonder why). Then I found this answer on SO, and the final solution is this:
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Though it works, it is a trick. So the question remains: how do you do this the right way? If somebody has the definitive answer, they are very welcome.
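For completeness, another workaround that circulates for this problem is a sketch along these lines, assuming the POSTed markup really is UTF-8 (note that newer PHP versions deprecate the 'HTML-ENTITIES' target): convert the multibyte characters to HTML entities before parsing, so DOMDocument's default ISO-8859-1 assumption cannot mangle them.
$doc = new DOMDocument();
// entities are plain ASCII, so the parser's default encoding guess does no harm
$doc->loadHTML(mb_convert_encoding($_POST["data"], 'HTML-ENTITIES', 'UTF-8'));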
You need to make sure that the content is encoded consistently throughout its roundtrip from user input to server-side storage and back to the browser again.
I would recommend using UTF-8. Check that your HTML document (which includes the contenteditable zone) is UTF-8 encoded, and that the XMLHttpRequest/Ajax request does not specify a different encoding when it sends the content to the server.
Check that your server-side application encodes the text file as UTF-8 also. And check that the HTTP response headers declare the file's encoding as UTF-8 when the file is requested and downloaded in the browser.
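A sketch of the download side under those assumptions (the file name is illustrative; $txtFile is the path from the question's code):
header('Content-Type: text/plain; charset=utf-8'); // declare the encoding explicitly
header('Content-Disposition: attachment; filename="message.txt"');
readfile($txtFile); // stream the UTF-8 encoded file as-is
exit();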
Somewhere along this path, the encoding differs, and that is what is causing the error. iconv converts between different encodings, which should not be necessary if everything is consistent.
Good luck!
I use aria2 for downloads via XML_RPC, and when I want to start a download like this in PHP:
$client->aria2_addUri( array($url), array("dir"=>'/home/amir/دانلود') );
it creates a folder named شسÛب instead of دانلود. I made a related post on the aria2 forums, and they said aria2 has no problem if that string is sent to it as UTF-8.
So I used a UTF-8 header and converted the string to UTF-8, but it doesn't work:
header('Content-type:application/json; charset=utf-8');
$dir_on_server = mb_convert_encoding($dir_on_server, 'UTF-8');
What do you think?
Try accessing the file or folder via the browser.
You can do that by writing an .htaccess file with the content "Options Indexes" so that your folders are listed. (I can even access them via HTTP.)
I created multiple files and folders with a script where a GET value determines the name of the file or folder, and I tried it with Japanese and Arabic characters. Although they aren't shown correctly over FTP (in my case only file names like "?????"), they display correctly when you read them by script.
The problem might be with the program you're using to access your FTP server. WinSCP, for example, normally has UTF-8 set to "auto" by default, so forcing it on might work. (Although I have to admit it's not working on my side; maybe my Linux server doesn't support UTF-8 file names, which could also be a problem for you.)
PS:
Also make sure your PHP file is encoded (saved) as UTF-8 without a BOM, since you're using a constant UTF-8 string.
EDIT:
Also, if you still intend to use mb_convert_encoding, you had better add the optional "from_encoding" parameter.
I tested this with Japanese in a Shift-JIS encoded file:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8');
and it doesn't display correctly although my browser has UTF-8 activated, so the function is apparently not always right when it tries to detect the encoding.
So this for example works for me then:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8', 'SJIS'); //from SJIS(SHIFT-JIS)
This little script is nice for finding out the optional parameter you want for your Arabic characters:
http://www.php.net/manual/de/function.mb-convert-encoding.php#97902
But converting won't be necessary if the string is already UTF-8; it only makes sense if it's in some Arabic encoding, so I don't think this really brings you any closer to the solution.
EDIT2:
I tried a different FTP program: FileZilla displays my files and folders with Japanese names, and the Arabic one, correctly. (I was using WinSCP 4.3.4 before.)
I am retrieving an XML file from a remote service which is supposed to be UTF-8, as its header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8: when I load it into PHP's XMLReader extension, it throws a "Not UTF-8 as expected" error when parsing certain parts of the document (parts that look like they were copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" the file by removing or converting any non-UTF-8 sequences? A PHP solution or any other solution will do.
Depending on what encoding you are converting from, the utf8_encode function is your friend for quick and easy UTF-8-safe strings, but it only works for ISO-8859-1 input. Also, your text must not already be UTF-8, or you have a good chance of ending up with garbled text.
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you may have to look into the iconv() function.
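As a hedged sketch of that route (file names are illustrative): the //IGNORE flag drops any byte sequence that is not valid UTF-8, which is a blunt but effective way to make the parser happy. If the stray parts are really Windows-1252 (typical for text pasted from Word), converting them with iconv('Windows-1252', 'UTF-8', ...) would preserve them instead of discarding them.
$raw = file_get_contents('feed.xml'); // the supposedly-UTF-8 document
$clean = iconv('UTF-8', 'UTF-8//IGNORE', $raw); // strip invalid sequences
file_put_contents('feed-clean.xml', $clean);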
Good luck!
I have made a form where a user writes a message in Arabic and submits it with a submit button. The message is saved in the database, and I need to create a .txt file on the server for some other application, which shows something like this:
د پوليسو Ù¾Ø
I successfully used the fopen, fwrite functions to create my txt files.
When I open the file in Notepad, the Arabic text is shown correctly,
but when I open it in Eclipse I get something like this:
د پوليسو پر روزنيز مرکز توغندويي بريد وشو
Afterwards, when I re-save the txt file in Notepad with UTF-8 encoding, the garbage above changes back to Arabic.
But I can't do that manually for every message.
I searched a lot on the internet and tried these:
I saved the script in UTF-8
I used the utf8_encode function
I also set ini_set('default_charset', 'UTF-8');
and this too: <meta http-equiv="Content-Type" content="text/html; charset=utf-8; encoding=utf-8" />
I changed the mode parameter in fopen to "wb", where b is for binary
I'll be very glad for any solution to this problem; I have worked continuously on this issue for the last week. I know the problem is in the encoding, so how can I write UTF-8 encoded files using PHP?
If the text displays fine in one program but not another, that just means one program interprets the file correctly while the other doesn't. Most likely Notepad sets a UTF-8 BOM on the file when you save it again, so Eclipse now automatically recognizes that it's UTF-8 encoded. Without that, Eclipse assumes latin-1 or some other encoding as the default.
Two options:
change your Eclipse preferences to open files as UTF-8 by default
set a BOM on the file when writing it, see Encoding a string as UTF-8 with BOM in PHP and the sketch below
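A minimal sketch of the second option, assuming $text already holds UTF-8 data (the variable and file name are illustrative):
$fh = fopen('message.txt', 'wb');
fwrite($fh, "\xEF\xBB\xBF"); // UTF-8 BOM: bytes 0xEF 0xBB 0xBF
fwrite($fh, $text);
fclose($fh);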
A BOM can be helpful for making programs recognize UTF-8 but can also cause problems in other programs that don't expect or want BOMs. Whether to use a BOM or not depends on your intended use and target audience.
In Eclipse you need to set your encoding via the menu Edit > Set Encoding...
I am not that good with encodings, but I am falling over even the basics here.
I am trying to create a file that is recognised as UTF-8:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();
I also tried:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();
I then open the file with Notepad++ and it says its current encoding is ANSI, not UTF-8. What am I missing? How should I be outputting this file?
I will eventually be outputting an XML file of products for the Affiliate Window program.
Also, if it helps, my web server is CentOS, Apache2, PHP 5.2.8.
Thanks in advance for any help!
As Filip said, encoding is not an intrinsic attribute of a file; it's implicit. This means that unless you know what encoding a file is to be interpreted in, there is no way to determine it. The best you can do is make a guess. This is presumably what programs such as Notepad++ do. Since the actual data you have sent can be interpreted in many different encodings, it just picks the candidate it likes best. For Notepad++ this appears to be ANSI (which in itself is a rather inaccurate classification), while other programs might default to something else.
The reason why you have to specify the charset in an HTTP header is exactly because the file itself doesn't contain this information, so the browser needs to be informed about it. Once you have saved the file to disk, this information is thus unavailable.
If the file you're going to serve is an XML-document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. Eg. if you are using utf-8, you should put this at the top of your document:
<?xml version="1.0" encoding="utf-8" ?>
Note that apart from getting the meta-information about the charset across, you also need to make sure that the data you are serving is actually utf-8 encoded. This is much the same scenario: You need to know implicitly what encoding your data are in. The function utf8_encode is (despite the name) explicitly meant for converting iso-8859-1 into utf-8. Thus, if you use it on already utf-8 encoded data, you'll get it double-encoded, with the result of garbled data.
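For illustration of that double-encoding pitfall (assuming the source file itself is saved as UTF-8):
$s = "Ä"; // already UTF-8 here: the two bytes 0xC3 0x84
$bad = utf8_encode($s); // re-encodes those bytes as if they were iso-8859-1
// $bad is now four bytes: two mojibake characters instead of one Ä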
Charsets aren't that complicated in themselves. The problem is that if you aren't careful about keeping things straight, you'll mess it up. Whenever you have a string, you should be absolutely certain you know which encoding it is in. Otherwise it's not a string; it's just a blob of binary data.
"test" is all ASCII, so there is no need to use UTF-8 for that.
But in fact, the first 128 characters of the Unicode charset are the same as ASCII's, and UTF-8 uses the same codes for those characters as ASCII does. See Wikipedia's description of UTF-8 for further information.
Once you download the file it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There's a thing called a byte-order mark (BOM) which allows the UTF encodings to be specified by a prefix in the contents.
See question "When a BOM is used, is it only in 16-bit Unicode text?".
I would imagine using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognize the file correctly.
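A sketch combining this with the code from the question; the three BOM bytes simply go out before the body:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "\xEF\xBB\xBF"; // UTF-8 byte-order mark
echo "test";
exit();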
There are no headers for downloaded txt files; the HTTP headers are gone once the file is saved. As you are trying to create an XML file in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and saving/opening that; then it should work, as long as the OS has UTF-8 support, which any modern Linux distribution should have.
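A hedged sketch of that approach (the element and file names are illustrative, not the actual Affiliate Window format): the XML declaration itself carries the charset, so it survives the download.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild($dom->createElement('products')); // placeholder root element
header('Content-Type: application/xml; charset=utf-8');
header('Content-Disposition: attachment; filename=products.xml');
echo $dom->saveXML(); // output starts with <?xml version="1.0" encoding="UTF-8"?>
exit();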
I refer you to Joel's Absolute minimum every software developer should know about Unicode
I refer you to What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text