DOMDocument breaks encoding?

I run the following code:
$page = '<p>Ä</p>';
$DOM = new DOMDocument;
$DOM->loadHTML($page);
echo 'source:'.$page;
echo 'dom: '.$DOM->getElementsByTagName('p')->item (0)->textContent;
and it outputs the following:
source: Ä
dom: Ã
So why does the text's encoding break when it passes through DOMDocument?

Here's a workaround that adds the proper encoding via meta header:
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);
I'm not sure if that's the actual character set you're trying to use, but adjust where necessary
See also: domdocument character set issue
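As a rough check of this workaround, the sketch below parses the same paragraph with and without the meta hint and prints the resulting bytes. It assumes the dom extension is loaded; firstParagraphText is a hypothetical helper introduced only for this illustration.

```php
<?php
// "Ä" (U+00C4) written as an escape so the example does not depend on
// how this source file itself is saved. In UTF-8 it is the bytes C3 84.
$page = "<p>\u{00C4}</p>";

// Hypothetical helper: text of the first <p> after parsing with loadHTML().
function firstParagraphText(string $markup): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    return $dom->getElementsByTagName('p')->item(0)->textContent;
}

$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />';

$withoutHint = firstParagraphText($page);
$withHint    = firstParagraphText($meta . $page);

echo 'without hint: ', bin2hex($withoutHint), "\n"; // mangled: not the bytes of "Ä"
echo 'with hint:    ', bin2hex($withHint), "\n";    // c384, the UTF-8 bytes of "Ä"
```

Comparing bin2hex output rather than echoing the text keeps the terminal's or browser's own encoding out of the picture.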

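DOMDocument appears to be treating the input as ISO-8859-1 and converting it to UTF-8. In this conversion, the two UTF-8 bytes of Ä (0xC3 0x84) become two separate characters, which Windows-1252 would render as Ã„. Here's the catch: that second character (byte 0x84) is an invisible control character in ISO-8859-1, but is „ in Windows-1252. This is why you are seeing no second character in your output.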
You can fix this by calling utf8_decode on the output of textContent, or using UTF-8 as your page's character encoding.
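To see the mechanism byte by byte, here is a minimal sketch that reproduces the misreading without DOMDocument at all. It assumes the mbstring extension is available, and the variable names are only illustrative. The final step performs the same UTF-8 to ISO-8859-1 conversion that utf8_decode does.

```php
<?php
// "Ä" stored as UTF-8 is the two bytes 0xC3 0x84.
$utf8 = "\xC3\x84";

// Misread those bytes as ISO-8859-1 and re-encode them as UTF-8.
$asLatin1 = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The same bytes misread as Windows-1252 instead.
$asCp1252 = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c383c284: "Ã" plus the invisible control character U+0084
echo bin2hex($asCp1252), "\n"; // c383e2809e: "Ã" plus "„" (U+201E)

// Converting back to ISO-8859-1 restores the original UTF-8 bytes.
$restored = mb_convert_encoding($asLatin1, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c384
```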

Related

Why does DomDocument prepend this character (Â) when I use the degree (°) symbol?

I have a section of code that parses some content from a webpage and I can't figure out why it's inserting the Â in front of the degree symbol.
I've replicated what I'm seeing in my application in the php interpreter:
$ php -a
php > $dom=new domDocument;
php > $dom->loadHTML("<ol><li>What if I use a ° symbol here...</li></ol>");
php > $xpath = new DOMXpath($dom);
php > $steps = $xpath->query("//li");
php > foreach($steps as $step) { echo $step->nodeValue; }
What if I use a Â° symbol here...

The problem is that the default encoding of DOMDocument::loadHTML is ISO-8859-1, while your input is a UTF-8 encoded string. You need to tell DOMDocument that you're using a different charset.
You can do that with
$dom->loadHTML("<?xml encoding=\"utf-8\" ?><ol><li>What if I use a ° symbol here...</li></ol>");

Maybe an encoding issue?
Normally DomDocument uses UTF-8.
But browsers tend to use different encodings when displaying the page. To force UTF-8 encoding you could add a tag like
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" >
to your head element

Mojibake issue in XML retrieved over cURL

I'm retrieving this XML feed over PHP cURL and outputting it in a textarea on my page. The problem is, it's coming back full of mojibake characters. The feed itself is fine; it's only when output on my page that the chars appear.
Pound signs (£) are coming back as Â£, for example.
I've tried throwing UTF-8 at the issue, as suggested in the answer to this question.
ini_set('default_charset', 'UTF-8');
header("Content-Type:text/html; charset=UTF-8");
And in the HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
and even by outputting the cURL response via utf8_encode(), yet still they persist.
$ch = curl_init($feed_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
echo '<textarea>'.utf8_encode($xml).'</textarea>';
I even tried swapping these chars out, but that didn't cut it.
$xml = strtr($xml, array('Â£' => ''));
Am I powerless here, or is there something I can do?

Use htmlentities (http://php.net/manual/en/function.htmlentities.php) before displaying the XML content in an HTML page, also change $ch to $xml in that call, so:
echo '<textarea>'.htmlentities($xml).'</textarea>';

utf8_encode() will treat the input as latin-1 and convert it to utf-8. If the input is utf-8 this will be a double encoding - that's what you're seeing.
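One way to avoid the double encoding is to convert only when the bytes are not already valid UTF-8. The sketch below assumes the mbstring extension and assumes that any non-UTF-8 input is ISO-8859-1 (the same assumption utf8_encode makes); ensureUtf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not
// already valid UTF-8. Assumption: anything else is ISO-8859-1.
function ensureUtf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already UTF-8: converting again would yield "Â£"
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$utf8Pound   = "\xC2\xA3"; // "£" encoded as UTF-8
$latin1Pound = "\xA3";     // "£" encoded as ISO-8859-1

echo bin2hex(ensureUtf8($utf8Pound)), "\n";   // c2a3
echo bin2hex(ensureUtf8($latin1Pound)), "\n"; // c2a3

// An unconditional conversion double-encodes the UTF-8 input:
$doubleEncoded = mb_convert_encoding($utf8Pound, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // c382c2a3, displayed as "Â£"
```

The check is a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, so trusting a declared encoding is preferable when one is available.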
Check the XML string you're fetching from the URL. The encoding of an XML file is usually in the XML processing instruction:
<?xml version="1.0" encoding="utf-8"?>
<document-element/>
Loaded into DOM, XMLReader or SimpleXML it will always be converted to UTF-8. Any value you read using the APIs will be UTF-8.
If you like to output the UTF-8 XML into the textarea of your HTML page you need to escape the special characters.
echo '<textarea>'.htmlspecialchars($xml).'</textarea>';
This will escape characters like < and >, but that is necessary. Imagine the XML containing the string </textarea>: it would break your HTML page. The browser will decode &lt; and the other entities before displaying them.
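A small sketch of that escaping step, using a made-up XML string that contains both a pound sign and a stray closing textarea tag. The sample data and the explicit flags are assumptions for illustration. Passing 'UTF-8' explicitly avoids relying on the default_charset setting.

```php
<?php
// Sample UTF-8 XML (illustrative only) containing a closing textarea tag.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . "<note>\u{00A3}5 </textarea> &amp; more</note>";

// Escape markup characters; non-ASCII characters such as "£" pass through untouched.
$escaped = htmlspecialchars($xml, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo '<textarea>' . $escaped . '</textarea>';
```

Because only the markup characters are escaped, the pound sign stays a plain UTF-8 character and the browser shows the original XML text inside the textarea.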

PHP DomXPath encoding issue after xpath

If I use echo $doc->saveHTML(); the characters display correctly, but once the document goes through XPath to extract the element, the issue is back.
I can't seem to display the characters properly. How do I convert them properly? I'm getting:
婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎
Instead of proper Chinese:
<head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="gbk"/></head>
My PHP code:
$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.aG3Kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();
// Based on Article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchPage = mb_convert_encoding($html,"HTML-ENTITIES","GBK");
$doc->loadHTML($searchPage);
// echo $doc->saveHTML();
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");
foreach ($elements as $e) {
//echo $e->nodeValue;
echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
}

You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned from the XPath query is encoded as UTF-8, but you presumably want the output encoded as gbk (given that you've set the meta charset to "gbk").
So the final loop should be:
foreach ($elements as $e) {
echo mb_convert_encoding($e->nodeValue,"gbk","utf-8");
}
The to_encoding is "gbk", and the from_encoding is "utf-8".
That said, the answer given by AgreeOrNot should work too, if you are happy with the page being encoded as UTF-8.
As for how the encoding process works: internally DOMDocument uses UTF-8, which is why the results you get from your XPath queries are UTF-8, and why you need to convert them to gbk with mb_convert_encoding if that is the character set you need.
When you call loadHTML, it attempts to detect the source encoding, and then convert the input from that encoding to UTF-8. Unfortunately the detection algorithm doesn't always work very well.
For example, although your example page has set the charset metatag, that metatag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin1. It would have worked if you had used an http-equiv metatag specifying the Content-Type.
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
The alternative is to avoid the problem altogether by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
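The whole round trip can be sketched as follows. This is not the asker's code: it swaps in mb_encode_numericentity for the "HTML-ENTITIES" pseudo-encoding, which is deprecated as of PHP 8.2, and it uses a two-character GBK string as a stand-in for the fetched page. It assumes the dom and mbstring extensions and that "GBK" is an encoding name your mbstring build recognises.

```php
<?php
// "你好" encoded as GBK (bytes C4 E3 BA C3), standing in for the fetched page.
$gbkHtml = "<h3>\xC4\xE3\xBA\xC3</h3>";

// Step 1: normalise to UTF-8. Argument order is (string, to, from).
$utf8Html = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');

// Step 2: turn every code point from U+0080 upwards into a numeric entity,
// so the markup is pure ASCII and loadHTML() has nothing to guess.
$convmap   = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiHtml = mb_encode_numericentity($utf8Html, $convmap, 'UTF-8');
echo $asciiHtml, "\n"; // <h3>&#20320;&#22909;</h3>

// Step 3: parse and query; values read from the DOM are UTF-8.
$doc = new DOMDocument();
$doc->loadHTML($asciiHtml);
$text = $doc->getElementsByTagName('h3')->item(0)->nodeValue;

// Step 4: convert back to GBK only if the output page is served as GBK.
$gbkText = mb_convert_encoding($text, 'GBK', 'UTF-8');
echo bin2hex($gbkText), "\n"; // c4e3bac3
```

If the output page is served as UTF-8, step 4 is unnecessary and $text can be echoed directly.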

Since you've already converted the document to html entities, you don't need to convert encoding when you print the result. So:
echo $e->nodeValue;
// echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
The reason you didn't get the correct output is that you put <meta charset="gbk"/> in your html while it should be <meta charset="utf-8"/>.

PHP Encoding of Special Characters iso-8859-1

My PHP script parses a web site and pulls out an HTML DIV that looks like this (and saves it as a string)
<div id="merchantinfo">The following merchants: Nautica®, Brookstone®, Teds® ©2012 Blabla</div>
I store this as $merchantList (string).
However, when I output the data to the webpage
echo $merchantList
The encoding gets messed up and displays as:
NauticaÃ‚Â®, BrookstoneÃ‚Â®, TedsÃ‚Â® Â©2012 Blabla
I tried adding the following to the display page:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
But that didn't do anything. --Thanks
EDIT:: ------------
For the question, the accepted answer is correct.
But I realized my actual issue was slightly different.
The initial parsing using DOMDocument::loadHTML had already mangled the UTF-8 encoding, causing the string to save as
<div id="merchantinfo">The following merchants: NauticaÃ®, BrookstoneÃ®, TedsÃ® ©2012 Blabla</div>
This was solved by:
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($html);

Use:
ini_set('default_charset', 'UTF-8');
And do not use iso-8859-1. Use UTF-8.
From the mojibake you posted the input string is utf-8, not iso-8859-1.

You just need to use the htmlspecialchars_decode function, for example:
$string = '&quot;hello dude&quot;';
$decodechars = htmlspecialchars_decode($string);
echo $decodechars; // output : "hello dude"

How to keep the Chinese or other foreign language as they are instead of converting them into codes?

DOMDocument seems to convert Chinese characters into codes, for instance,
你的乱发 will become ä½ çš„ä¹±å‘
How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
Below is my simple test,
$dom = new DOMDocument();
$dom->loadHTML($html);
If I add this below before loadHTML(),
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
I get,
&#20320;&#30340;&#20081;&#21457;
Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 I am after.

DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
$dom = new DOMDocument();
$dom->loadHTML($html);
You're using the loadHTML function to load an HTML chunk. By default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1); however, most often the charset (sic!) is meta-information provided alongside the string you're using, not inside it. To make this more complicated, that meta-information can even be inside the string.
Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.
I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper
It injects an encoding hint on the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).

I just stumbled upon this thread when searching for a solution to a similar problem. After loading the HTML properly and doing some parsing with XPath etc., my text ends up like this:
&#20320;&#30340;&#20081;&#21457;
This displays fine in the body of the HTML, but won't display properly in a style or script tag (e.g. when setting Chinese fonts).
To fix this, do the reverse of what lauthiamkok did:
$html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES");
If for any reason the first workaround doesn't work for you, try this conversion.

I'm pretty sure ä½ çš„ä¹±å‘ is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way your UTF-8 text got saved as Windows Latin 1....
