Character encoding while using DOMDocument for parsing an XML file - php

I have problems with wrong character encoding while reading an XML file.
While this snippet shows the complete content of the file correctly...
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
echo $reader->saveXML();
...this one gives me strange output (German umlauts, em dashes, µ and similar characters aren't shown correctly):
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
$elements = $reader->getElementsByTagName('text');
$content = '';
foreach ($elements as $node) {
    foreach ($node->childNodes as $child) {
        $content .= $child->nodeValue;
    }
}
echo $content;
I don't know why this is the case. Hope someone can explain it to me.

DOMDocument::saveXML()
This method returns the whole XML document as a string. As with any XML document, the encoding is given in the XML declaration, or it falls back to the default encoding, which is UTF-8.
DOMNode::$nodeValue
Contains the value of a node, most often text. All text strings the DOM extension returns (of which DOMNode is a part) are UTF-8 encoded, regardless of the encoding of the XML document.
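To illustrate the difference, here is a minimal sketch (the ISO-8859-1 sample document is made up for demonstration):
// A made-up ISO-8859-1 encoded document ("Grüße" as Latin-1 bytes).
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0" encoding="ISO-8859-1"?><p>' . "Gr\xFC\xDFe" . '</p>');

echo $doc->saveXML();                    // serialized in ISO-8859-1, as declared
echo $doc->documentElement->nodeValue;   // the same text, but as a UTF-8 string
echo mb_detect_encoding($doc->documentElement->nodeValue, 'UTF-8, ISO-8859-1'); // UTF-8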
As you write that displaying the first snippet:
echo $reader->saveXML();
shows all umlauts correctly, the XML itself most likely ships with an encoding other than UTF-8, because the latter
$content .= $child->nodeValue;
...
echo $content;
does not.
As you don't share how and with which application you're reading and displaying the output, not much more can be said.
You most likely need to hint the character encoding to the displaying application in the latter case. For example, if you display the text in a browser, you should send the appropriate Content-Type header at the very beginning:
header("Content-Type: text/plain; charset=utf-8");
Compare with How to set UTF-8 encoding for a PHP file.
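If you cannot change what the displaying application expects, the other direction is to convert the collected node values from UTF-8 to that expected encoding before echoing them. A sketch, assuming the viewer treats the output as ISO-8859-1:
// Assumption: the consumer of this output expects ISO-8859-1 (Latin-1).
echo mb_convert_encoding($content, 'ISO-8859-1', 'UTF-8');
// or, with transliteration of characters Latin-1 cannot represent:
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $content);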

Related

How to convert this UTF-8 escaped string from an Amazon MWS response to proper UTF-8?

In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The diacritic character í is UTF-8 character U+00ED (\xc3\xad in literal; see this chart for reference).
However PHP's SimpleXML extension mangles this string (which you can see because I simply pasted the result into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP), transforming it into
RamÃ­rez Jones
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-­rez Jones
For some reason a hyphen is inserted there, although, believe it or not, if you select the above bold text and paste it back into a Stack Overflow editor window, it will simply appear as RamÃ­rez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ramírez Jones</Name></Address>";
$elem = new SimpleXMLAddressent($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃ­rez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off of the í).
REVISED FINAL SOLUTION:
It's different from the accepted answer below, but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because the "&#xC3;&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;
The received string does not come out right because it is encoded twice. The character í has the code point U+00ED and should be encoded in the XML as &#xED;; instead, each of its two UTF-8 bytes (0xC3 0xAD) is escaped as a separate entity, &#xC3;&#xAD;.
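A short illustration of that double encoding, using only the byte values already named in the question:
// í (U+00ED) encoded as UTF-8 is the two bytes 0xC3 0xAD.
$bytes = "\xC3\xAD";

// Escaping each byte as its own reference gives what the response contains:
printf("&#x%X;&#x%X;\n", ord($bytes[0]), ord($bytes[1])); // &#xC3;&#xAD;

// Correct escaping refers to the code point itself:
echo "&#xED;\n";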
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both ways suggested above work to fix the encoding (they actually convert the encoding of the string from UTF-8 to ISO-8859-1, which incidentally fixes the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect whether the received data is wrongly encoded and to apply these fixes only in that situation, so that the code keeps working correctly once Amazon fixes the encoding issue. I hope they will.
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.
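Keep in mind that //TRANSLIT behaviour depends on the iconv implementation and on the process locale; a quick check could look like this (the locale name is an assumption about the system):
// With glibc's iconv, the transliteration table follows LC_CTYPE.
setlocale(LC_CTYPE, 'en_US.UTF-8'); // assumed to be available on the machine
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Ramírez Jones'); // Ramirez Jones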
SimpleXML does not decode the hex entities and then reinterpret the resulting bytes as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing the string as XML.
function decode_hexentities($xml) {
    // Replace each hex character reference with the raw byte it names,
    // turning the per-byte escapes back into valid UTF-8 sequences.
    return preg_replace_callback(
        '~&#x([0-9a-fA-F]+);~i',
        function ($matches) { return chr(hexdec($matches[1])); },
        $xml
    );
}
$xml = "<Address><Name>Ramírez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones

Outputting UTF-8 with PHP SimpleXML

I'm trying to parse an XML file generated from Wordpress' export function. I've grabbed the text from the block but when I echo the text it gets malformed, into ASCII I think.
<?php
header("Content-Type: text/plain; charset: UTF-8;");
$source = file_get_contents("blog.wordpress.2013-10-31.xml");
$xml = simplexml_load_string($source);
$items = $xml->channel->item;
foreach ($items as $item) {
    $namespaces = $item->getNamespaces(true);
    $content = $item->children($namespaces['content']);
    if ($content != '') {
        echo '#' . $item->title . "#\n";
        echo $content->encoded;
        echo "\n\n\n";
    }
}
So As the BBC’s would become As the BBCâ€™s. Any way I can stop this?
Edit: I've appended echo '“Test”'; just after the header and I'm seeing â€œTestâ€ in my browser, so this doesn't appear to be a SimpleXML issue.
As the UTF-8 ’ (0xE2 0x80 0x99) is â€™ in WINDOWS-1252, and that is exactly what you describe, it seems that you load UTF-8 encoded strings as WINDOWS-1252.
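You can reproduce that mapping directly; this is only a demonstration of the effect, not a fix:
// Take the UTF-8 bytes of ’ (U+2019) and reinterpret them as WINDOWS-1252.
echo mb_convert_encoding("\xE2\x80\x99", 'UTF-8', 'WINDOWS-1252'); // prints â€™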
The output of SimpleXML when you read from elements or attributes is always UTF-8 encoded, therefore about that part I see no problem with your code.
So it's more likely that the XML file has the wrong encoding hinted. Fix that and you should be fine (as you have not shown that file, it's hard to say what exactly needs to be changed and why the encoding got mixed-up in the first place, perhaps some transfer issue).
You perhaps need to re-encode the XML file before you send it to the parser. If so, XMLRecoder might be helpful.
You are using a colon here: charset: UTF-8
The correct code is
header('Content-Type: text/html; charset=utf-8');
Check your XML file starts with
<?xml version="1.0" encoding="UTF-8"?>

PHP DomXPath encoding issue after xpath

If I use echo $doc->saveHTML(); it will show the characters correctly, but once it reaches the XPath query that extracts the element, the issue comes back again.
I can't seem to display the characters properly. How do I convert them properly? I'm getting:
婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎
instead of proper Chinese. My HTML page header is:
<head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="gbk"/></head>
My PHP code:
$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.aG3Kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();
// Based on Article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchPage = mb_convert_encoding($html,"HTML-ENTITIES","GBK");
$doc->loadHTML($searchPage);
// echo $doc->saveHTML();
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");
foreach ($elements as $e) {
    //echo $e->nodeValue;
    echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
}
You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned from the XPath query is encoded as UTF-8, but you presumably want the output encoded as gbk (given that you've set the meta charset to "gbk").
So the final loop should be:
foreach ($elements as $e) {
    echo mb_convert_encoding($e->nodeValue,"gbk","utf-8");
}
The to_encoding is "gbk", and the from_encoding is "utf-8".
That said, the answer given by AgreeOrNot should work too, if you are happy with the page being encoded as UTF-8.
As for how the encoding process works: internally, DOMDocument uses UTF-8, which is why the results you get back from your XPath queries are UTF-8, and why you need to convert them to gbk with mb_convert_encoding if that is the character set you need.
When you call loadHTML, it attempts to detect the source encoding and then converts the input from that encoding to UTF-8. Unfortunately the detection algorithm doesn't always work very well.
For example, although your example page has set the charset meta tag, that meta tag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin-1. It would have worked if you had used an http-equiv meta tag specifying the Content-Type:
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
The alternative is to avoid the problem altogether by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
Since you've already converted the document to HTML entities, you don't need to convert the encoding when you print the result. So:
echo $e->nodeValue;
// echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
The reason you didn't get the correct output is that you put <meta charset="gbk"/> in your html while it should be <meta charset="utf-8"/>.

How to keep the Chinese or other foreign language as they are instead of converting them into codes?

DOMDocument seems to convert Chinese characters into codes, for instance,
你的乱发 will become ä½ çš„ä¹±å‘
How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
Below is my simple test,
$dom = new DOMDocument();
$dom->loadHTML($html);
If I add this below before loadHTML(),
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
I get,
&#20320;&#30340;&#20081;&#21457;
Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; still is not the 你的乱发 I am after....
DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
$dom = new DOMDocument();
$dom->loadHTML($html);
You're using the loadHTML function to load an HTML chunk. By default DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1); however, most often the charset (sic!) is meta-information provided next to the string you're using, not inside it. To make this more complicated, that meta-information can even be inside the string.
Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.
I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper
It injects an encoding hint at the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
I just stumbled upon this thread when searching for a solution to a similar problem. After loading the HTML properly and doing some parsing with XPath etc., my text ends up like this:
&#20320;&#30340;&#20081;&#21457;
This displays fine in the body of the HTML, but won't display properly in a style or script tag (e.g. when setting Chinese fonts).
To fix this, do the reverse of what lauthiamkok did:
$html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES");
If for any reason the first workaround doesn't work for you, try this conversion.
I'm pretty sure ä½ çš„ä¹±å‘ is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way your UTF-8 text got saved as Windows Latin 1....

Why Does DOM Change Encoding?

$string = file_get_contents('http://example.com');
if ('UTF-8' === mb_detect_encoding($string)) {
    $dom = new DOMDocument();
    // hack to preserve UTF-8 characters
    $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
    $dom->preserveWhiteSpace = false;
    $dom->encoding = 'UTF-8';
    $body = $dom->getElementsByTagName('body');
    echo htmlspecialchars($body->item(0)->nodeValue);
}
This changes all UTF-8 characters to Å, ¾, ¤ and other rubbish. Is there any other way how to preserve UTF-8 characters?
Don't post answers telling me to make sure I am outputting it as UTF-8, I made sure I am.
Thanks in advance :)
I had similar problems recently, and eventually found this workaround: convert all the non-ASCII characters to HTML entities before loading the HTML.
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($string);
In case it is definitely the DOM screwing up the encoding, this trick did it for me a while back the other way round (accepting ISO-8859-1 data). DOMDocument should be UTF-8 by default in any case but you can still try:
$dom = new DOMDocument('1.0', 'utf-8');
I had to add a UTF-8 header to get the correct view:
header('Content-Type: text/html; charset=utf-8');
At the top of the script where your PHP code lies (the code you posted here), make sure you send a UTF-8 header. I bet your encoding is some variant of Latin-1 right now. Yes, I know the remote webpage is UTF-8, but this PHP script isn't.
