I have a section of code that parses some content from a webpage, and I can't figure out why it's inserting an Â in front of the degree symbol.
I've replicated what I'm seeing in my application in the php interpreter:
$ php -a
php > $dom=new domDocument;
php > $dom->loadHTML("<ol><li>What if I use a ° symbol here...</li></ol>");
php > $xpath = new DOMXpath($dom);
php > $steps = $xpath->query("//li");
php > foreach($steps as $step) { echo $step->nodeValue; }
What if I use a Â° symbol here...
The problem is that the default encoding of DOMDocument::loadHTML is ISO-8859-1, while your input is a UTF-8 encoded string. You need to tell DOMDocument that you're using a different charset.
You can do that with
$dom->loadHTML("<?xml encoding=\"utf-8\" ?><ol><li>What if I use a ° symbol here...</li></ol>");
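To see the effect of the hint, here is a minimal sketch that loads the same UTF-8 fragment with and without it and prints the resulting bytes. The helper name firstItemText is made up for illustration, and the sample string is shortened from the question.

```php
<?php
// Sketch only: firstItemText() is a hypothetical helper, not part of DOM.
function firstItemText(string $markup): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    return $xpath->query('//li')->item(0)->nodeValue;
}

// "\u{00B0}" is the degree sign, stored as the UTF-8 bytes C2 B0.
$html = "<ol><li>Use a \u{00B0} symbol here</li></ol>";

$withoutHint = firstItemText($html);
$withHint    = firstItemText('<?xml encoding="utf-8" ?>' . $html);

// Comparing hex dumps avoids any confusion caused by the terminal's own encoding.
echo bin2hex($withoutHint), "\n";
echo bin2hex($withHint), "\n";
```

If the two dumps differ, the input was misread in the first call; the second dump should contain c2b0 exactly once.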
Maybe an encoding issue?
Normally DomDocument uses UTF-8.
But browsers tend to use different encodings when displaying the page. To force UTF-8 encoding you could add a tag like
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" >
to your head element
Related
How can i remove only � (using curl To get data)
$str = "Check this out <a href=�http://www.somewebsite.com�>Somewebsite</a>, this is a great website
Windows� (XP 32bit/Vista/7/8/8.1)";
I just want � to be removed.
I tried
$output = preg_replace("/[^A-Za-z0-9]/","",$str);
It removes the HTML as well, but I want to keep the HTML.
Instead of doing a bad work-around like that, you should fix your charset issue instead. Your problem is likely that you don't use the same character-encoding in all levels of your application/scripts. Anything that has or can be set to a specific character-encoding, should be set to the same. The most general ones are below.
Save the document as UTF-8 (or UTF-8 w/o BOM). (If you're using Notepad++, it's Format -> Convert to UTF-8 or UTF-8 w/o BOM.)
The header in both PHP and HTML should be set to UTF-8
HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, inside the <head>-tag in your document.
PHP: header('Content-Type: text/html; charset=utf-8'); - PHP headers have to be set BEFORE any output is made (no HTML, no whitespace, no echo/print - nothing).
There are other aspects as well that might need to be set to UTF-8, it depends on what kind of PHP functions you are using and so on. But the above is generally a good start.
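As a sketch of fixing the bytes instead of stripping them: if the fetched data is not valid UTF-8, it can be converted once right after it arrives. The source encoding is not stated in the question, so Windows-1252 below is purely an assumption, and both the sample bytes and the helper name toUtf8 are made up for illustration.

```php
<?php
// Hypothetical input: bytes as they might arrive from a Windows-1252 page.
// In Windows-1252, \x94 is a right curly quote and \xAE is the registered sign.
$raw = "<a href=\x94http://www.somewebsite.com\x94>Somewebsite</a> Windows\xAE";

// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from an
// ASSUMED source encoding (adjust it to whatever the remote site really uses).
function toUtf8(string $bytes, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedSource);
}

$clean = toUtf8($raw);

// The markup is untouched; only the non-ASCII bytes were re-encoded.
echo $clean, "\n";
```

The HTML survives because nothing is removed; the bytes that used to show up as the replacement character become real UTF-8 characters.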
I'm trying to parse French text from an HTML element using DOMDocument and XPath. The problem is that the output encoding is incorrect.
Here is a text in French:
à la téléchargez mêmes
What I see on output:
Ã  la tÃ©lÃ©chargez mÃªmes
PHP code:
<?php
$html = '<div id="demo">à la téléchargez mêmes</div>';
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
echo $xpath->query("//div[@id='demo']")->item(0)->nodeValue;
Thanks for any suggestions.
With this command:
$doc->loadHTML($html);
you're commanding the DOMDocument to load your string $html
$html = '<div id="demo">à la téléchargez mêmes</div>';
with the ISO-8859-1 encoding.
But the string you use there was not viewed or typed by you in ISO-8859-1; it was typed in UTF-8.
So technically speaking, you've typed it wrong there ;)
Then on the other hand, when you command with your script to return a value:
$xpath->query("//div[@id='demo']")->item(0)->nodeValue;
that value will be UTF-8 encoded (scroll down to the Notes section and read about the character encoding).
To get a better view on the document, just output it directly after the call to loadHTML so that you can better see what is going on (echo $doc->saveHTML();, beautified):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div id="demo">
&Atilde;&nbsp;la t&Atilde;&copy;l&Atilde;&copy;chargez m&Atilde;&ordf;mes
</div>
</body>
</html>
As you can see, you've explicitly told it to insert Atilde, the non-breaking space, and all these other characters. The string was taken as HTML 4.0, and since the HTML in your string didn't specify a character encoding, the default encoding (ISO-8859-1) was used.
So for what you do there, you can further read on with existing material that covers this and has even more information:
PHP DomDocument failing to handle utf-8 characters (☆) (Jul 2012)
How to keep the Chinese or other foreign language as they are instead of converting them into codes? (Apr 2012)
In addition to the answer given in the first of the two, there is another way to do this in your case:
$saved = libxml_use_internal_errors(true);
$result = $doc->loadHTML('<?xml>' . $html);
                         ########
libxml_use_internal_errors($saved);
if ($result) {
$doc->removeChild($doc->documentElement->previousSibling);
}
This example not only adds proper error handling and a return-value check for whether the HTML could actually be loaded, it also prefixes your string with the magic sequence "<?xml>" that switches loadHTML into UTF-8 mode. After the HTML string has been loaded as UTF-8, the DOMProcessingInstruction is removed again. The encoding will remain:
$xpath = new DOMXpath($doc);
echo $xpath->query("//div[@id='demo']")->item(0)->nodeValue;
# prints "à la téléchargez mêmes" now
Find it demonstrated online here across many different PHP versions: http://3v4l.org/TT3SM
If I use echo $doc->saveHTML(); it shows the characters correctly, but once I use XPath to extract the element, the issues are back again.
I can't seem to display the characters properly. How do I convert them properly? I'm getting:
婢跺繐顒滈拺鍙ョ瀵偓鐞涱偊鈧繑妲戦挅鍕綍婢舵牕顨� 闂€鍌溾敄缂侊綀濮虫稉濠呫€� 娑擃叀顣荤純鎴犵綍閺冭泛鐨绘總鍏呯瑐鐞涳綀鏉藉▎
Instead of proper Chinese:
<head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="gbk"/></head>
My PHP code:
$html = file_get_contents('http://item.taobao.com/item.htm?spm=a2106.m874.1000384.41.aG3Kbi&id=20811635147&_u=o1ffj7oi9ad3&scm=1029.newlist-0.1.16&ppath=&sku=');
$doc = new DOMDocument();
// Based on Article http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258
$searchPage = mb_convert_encoding($html,"HTML-ENTITIES","GBK");
$doc->loadHTML($searchPage);
// echo $doc->saveHTML();
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3");
foreach ($elements as $e) {
//echo $e->nodeValue;
echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
}
You have the to_encoding and from_encoding parameters the wrong way around in your last call to mb_convert_encoding. The content returned from the XPath query is encoded as UTF-8, but you presumably want the output encoded as gbk (given that you've set the meta charset to "gbk").
So the final loop should be:
foreach ($elements as $e) {
echo mb_convert_encoding($e->nodeValue,"gbk","utf-8");
}
The to_encoding is "gbk", and the from_encoding is "utf-8".
That said, the answer given by AgreeOrNot should work too, if you are happy with the page being encoded as UTF-8.
As for how the encoding process works: internally DOMDocument uses UTF-8, which is why the results you get from your XPath queries are UTF-8, and why you need to convert them to gbk with mb_convert_encoding if that is the character set you need.
When you call loadHTML, it attempts to detect the source encoding, and then convert the input from that encoding to UTF-8. Unfortunately the detection algorithm doesn't always work very well.
For example, although your example page has set the charset metatag, that metatag is not recognised by loadHTML, so it defaults to assuming the source encoding is Latin1. It would have worked if you had used an http-equiv metatag specifying the Content-Type.
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
The alternative is to avoid the problem altogether by converting all non-ASCII characters to HTML entities (as you have done). That way it doesn't matter whether loadHTML detects the character encoding correctly, because there won't be any characters that need converting.
Since you've already converted the document to html entities, you don't need to convert encoding when you print the result. So:
echo $e->nodeValue;
// echo mb_convert_encoding($e->nodeValue,"utf-8","gbk");
The reason you didn't get the correct output is that you put <meta charset="gbk"/> in your html while it should be <meta charset="utf-8"/>.
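The points made in these answers can be combined into one sketch: convert the GBK bytes to UTF-8 once, load them with an explicit hint, and keep everything UTF-8 until the final output. The sample markup below is a stand-in for the downloaded page (which is not reproduced here), and the final conversion back to GBK is only needed if the page you emit really is GBK.

```php
<?php
// Stand-in for the downloaded page: two Chinese characters, encoded as GBK.
$utf8Sample = "<h3>\u{4E2D}\u{6587}</h3>";
$gbkHtml    = mb_convert_encoding($utf8Sample, 'GBK', 'UTF-8');

// Step 1: convert once, up front, to the encoding DOMDocument uses internally.
$utf8Html = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');

// Step 2: load with an explicit hint so loadHTML does not fall back to Latin-1.
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $utf8Html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Step 3: values coming out of the DOM are UTF-8.
$xpath = new DOMXPath($doc);
$text  = $xpath->query('//h3')->item(0)->nodeValue;
echo bin2hex($text), "\n";

// Step 4 (optional): convert back only if the output page is declared as GBK.
// Note the argument order: to-encoding first, from-encoding second.
$gbkOut = mb_convert_encoding($text, 'GBK', 'UTF-8');
```

Converting to UTF-8 before loading also sidesteps the "HTML-ENTITIES" target of mb_convert_encoding, which is deprecated as of PHP 8.2.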
I run the following code:
$page = '<p>Ä</p>';
$DOM = new DOMDocument;
$DOM->loadHTML($page);
echo 'source:'.$page;
echo 'dom: '.$DOM->getElementsByTagName('p')->item (0)->textContent;
and it outputs the following:
source: Ä
dom: Ã
so, I don't understand why when the text comes through DOMDocument its encoding becomes broken?
Here's a workaround that adds the proper encoding via meta header:
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);
I'm not sure if that's the actual character set you're trying to use, but adjust where necessary
See also: domdocument character set issue
DOMDocument appears to be treating the input as ISO-8859-1 and converting it to UTF-8. In this conversion, Ä becomes Ã„. Here's the catch: that second character does not exist in ISO-8859-1, but does exist in Windows-1252. This is why you are seeing no second character in your output.
You can fix this by calling utf8_decode on the output of textContent, or using UTF-8 as your page's character encoding.
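The double encoding can be reproduced without DOMDocument at all. This sketch re-encodes the UTF-8 bytes of Ä as if they were ISO-8859-1, which is what the misdetection amounts to, and then reverses it. It uses mb_convert_encoding rather than utf8_decode, since utf8_decode is deprecated as of PHP 8.2; the variable names are illustrative only.

```php
<?php
$original = "\u{00C4}"; // "Ä" stored as UTF-8: bytes C3 84

// Pretend each byte is one ISO-8859-1 character and re-encode it as UTF-8.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n"; // c384
echo bin2hex($misread), "\n";  // c383c284

// Going from UTF-8 back to ISO-8859-1 undoes the damage, which is what the
// utf8_decode suggestion relies on.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c384
```

The byte 84 becomes the code point U+0084, an invisible control character in ISO-8859-1, which is why only the Ã is visible in the output.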
DOMDocument seems to convert Chinese characters into codes, for instance,
你的乱发 will become ä½ çš„ä¹±å‘
How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
Below is my simple test,
$dom = new DOMDocument();
$dom->loadHTML($html);
If I add this below before loadHTML(),
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
I get,
&#20320;&#30340;&#20081;&#21457;
Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 I am after....
DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
$dom = new DOMDocument();
$dom->loadHTML($html);
If you're using the loadHTML function to load an HTML chunk, DOMDocument by default expects that string to be in HTML's default encoding (ISO-8859-1). However, most often the charset (sic!) is meta-information provided next to the string you're using, not inside it. To make this more complicated, that meta-information can even be inside the string.
Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.
I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper
It injects an encoding hint on the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
I just stumbled upon this thread while searching for a solution to a similar problem. After loading the HTML properly and doing some parsing with XPath etc., my text ends up like this:
&#20320;&#30340;&#20081;&#21457;
This displays fine in the body of the HTML, but won't display properly in a style or script tag (e.g. when setting Chinese fonts).
To fix this, do the reverse of what lauthiamkok did:
$html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES");
if for any reason the first workaround doesn't work for you, try this conversion.
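As an alternative sketch for the same reverse step: html_entity_decode turns numeric character references back into UTF-8 text and does not depend on the "HTML-ENTITIES" pseudo-encoding, which is deprecated as of PHP 8.2. The sample string is the entity form of the four characters from the question.

```php
<?php
// Entity-encoded text as it might come out of the DOM after the earlier workaround.
$encoded = '&#20320;&#30340;&#20081;&#21457;';

// Decode the numeric references into real UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($decoded), "\n";       // raw UTF-8 bytes, three per character here
echo mb_strlen($decoded, 'UTF-8'), "\n"; // number of characters after decoding
```

Unlike the entity form, the decoded string is safe to place inside script or style content, where character references are not interpreted.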
I'm pretty sure ä½ çš„ä¹±å‘ is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way your UTF-8 text got saved as Windows Latin 1....