Well, apparently, PHP and its standard libraries have some problems, and DOMDocument isn't an exception.
There are workarounds for UTF-8 characters when loading an HTML string with $dom->loadHTML().
However, I haven't found a way to do this when loading HTML from a file with $dom->loadHTMLFile(). While it reads and sets the encoding from <meta /> tags, the problem strikes back if I haven't defined those, for instance when loading a fragment of HTML (a template part such as footer.html) rather than a fully built HTML document.
So, how do I preserve UTF-8 characters when loading HTML from a file that doesn't have its <meta /> tags present, when defining them is not an option?
Update
footer.html (the file is encoded in UTF-8 without BOM):
<div id="footer">
<p>My sūpēr ōzōm ūtf8 štrīņģ</p>
</div>
index.php:
$dom = new DOMDocument;
$dom->loadHTMLFile('footer.html');
echo $dom->saveHTML(); // results in all familiar effed' up characters
Thanks in advance!
Try a hack like this one:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper
Several others are listed in the user comments here: http://php.net/manual/en/domdocument.loadhtml.php. It is also important that your document head include a meta tag to specify the encoding FIRST, directly after the <head> tag.
I would suggest using my answer here: https://stackoverflow.com/a/12846243/816753 and instead of adding another <head>, wrap your entire fragment in
<html>
<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>
<body><!-- your content here --></body>
</html>
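A minimal sketch of that wrapping idea, assuming the fragment is already UTF-8. The helper name loadWrappedFragment() is made up for illustration and is not part of DOMDocument; suppressing parser warnings is also an assumption about what is acceptable for your input.

```php
<?php
// Hypothetical helper: wrap a UTF-8 fragment in a minimal document whose
// <meta> tag declares the charset, then parse the result.
function loadWrappedFragment(string $fragment): DOMDocument
{
    $wrapped = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . '</head><body>' . $fragment . '</body></html>';

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadWrappedFragment('<div id="footer"><p>My sūpēr ōzōm ūtf8 štrīņģ</p></div>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The text is read back through textContent here, because that property does not depend on how saveHTML() chooses to serialise non-ASCII characters.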
While I'm not sure about how to go about solving the problem with ->loadHTMLFile(), have you considered using file_get_contents() to get the HTML, run mb_convert_encoding() on that string, then pass that value in to ->loadHTML()?
Edit: Also, when you initialize DOMDocument, are you giving it the $encoding argument?
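For the loadHTMLFile() case, one possible sketch is to read the file yourself and pass the string to loadHTML() with the encoding hint from the hack above. The function name loadUtf8HtmlFile() is hypothetical, and the code assumes the file really is UTF-8; I have not checked it against every libxml version.

```php
<?php
// Hypothetical helper, not part of DOMDocument: read a UTF-8 fragment file
// ourselves and hand it to loadHTML() with an encoding hint.
function loadUtf8HtmlFile(string $path): DOMDocument
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Unable to read {$path}");
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The leading processing instruction makes the parser assume UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the node list first: removing nodes while iterating the live
    // childNodes list can skip entries.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }
    $dom->encoding = 'UTF-8';

    return $dom;
}
```

Usage would be something like $dom = loadUtf8HtmlFile('footer.html'); followed by the usual DOM calls.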
The key is for your browser only. Once the page is all built up, your browser should display the page correctly if it has the meta at the end.
You can always try to use the utf8_decode (or encode, I'm never sure lol) function before echo'ing the data like so:
echo utf8_decode($dom->saveHTML());
Related
I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither display correctly. I am using PHP 5.3.
What I see:
ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åºã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ãã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å¦ã
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åºã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
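To see why the output looks the way it does, here is a small sketch that reproduces the misreading by hand with mb_convert_encoding() (this needs the mbstring extension). It only illustrates the mechanism and is not what DOMDocument runs internally.

```php
<?php
// "ū" (U+016B) is two bytes in UTF-8: 0xC5 0xAB.
$utf8 = "\u{016B}";

// Pretend those bytes are ISO-8859-1 and re-encode them as UTF-8, which is
// roughly what happens when the parser assumes the wrong input encoding.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

printf("bytes in original: %d\n", strlen($utf8));
printf("characters after misreading: %d\n", mb_strlen($misread, 'UTF-8'));
echo $misread, "\n";

// The damage is reversible as long as no byte was dropped on the way.
$repaired = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);
```

Each byte of a multi-byte character becomes its own character, which is why one Japanese character turns into three garbage characters in the question.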
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
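Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A sketch of the same idea using mb_encode_numericentity() instead follows; the helper name toNumericEntities() is made up for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric
// entity so the input is plain ASCII and the parser's encoding guess
// no longer matters.
function toNumericEntities(string $utf8Html): string
{
    // Map code points 0x80..0x10FFFF, no offset, keep all bits.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8Html, $map, 'UTF-8');
}

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities($profile));
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

ASCII markup is left alone, so tags and existing entities in the string go through unchanged.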
The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly when used on Unix, but they work on Windows.
The workaround is very simple:
If you try the default, you will get the error you described
$str = $dom->saveHTML(); // saves incorrectly
All you have to do is save as follows:
$str = $dom->saveHTML($dom->documentElement); // saves correctly
This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().
Update
As suggested by "Jack M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:
$str = utf8_decode($dom->saveHTML($dom->documentElement));
Note
English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single byte characters in UTF-8)
The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, ...etc.)
I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.
Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).
Also in case of HTML, make sure you have declared the correct encoding using meta tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.
This took me a while to figure out but here's my answer.
Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:
$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
// error message
}
else {
// process
}
This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, PHP settings, and all the rest of the remedies offered here and elsewhere. Here's what works:
$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}
etc. Now everything's right with the world.
You could prefix a line enforcing utf-8 encoding, like this:
@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
And you can then continue with the code you already have, like:
$doc->saveXML()
Use correct header for UTF-8
Don't get satisfied by "it works".
@cmbuckley in his accepted answer advised prepending <?xml encoding="utf-8" ?> to the document. However, using an XML declaration in an HTML document is a bit weird. HTML is not XML (unless it is XHTML), and the declaration can confuse browsers and other software on the way to the client (which may be the source of the failures reported by others).
I successfully used HTML5 declaration:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $profile);
echo $dom->saveHTML();
If you use another standard, use the correct header for it. DOMDocument follows the standards quite pedantically and seems to support HTML5, too (if not in your case, try updating the libxml extension).
You must feed the DOMDocument a version of your HTML with a header that makes sense.
Just like HTML5.
$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;
Maybe it is a good idea to keep your HTML as valid as you can, so you don't get into issues when you start to query around :-) and stay away from htmlentities!!!! That's an unnecessary back and forth that wastes resources.
keep your code insane!!!!
Use it for correct result
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
This operation
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
It is a bad way, because special symbols like &lt; and &gt; can be in $profile, and they will not be converted a second time by mb_convert_encoding. It is a hole for XSS and incorrect HTML.
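A short sketch of what the conversion does and does not touch, using mb_encode_numericentity() as a stand-in for the 'HTML-ENTITIES' conversion (an assumption on my part, since the two are not identical). The htmlspecialchars() step is my own suggestion for untrusted text, not something from the answer above.

```php
<?php
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Existing entities and ASCII markup pass through untouched; only the
// non-ASCII character is rewritten.
$input  = "&lt;b&gt; \u{016B} <i>x</i>";
$output = mb_encode_numericentity($input, $map, 'UTF-8');
echo $output, "\n";

// Untrusted *text* still needs escaping before it is placed into markup.
$userText = "<script>alert(1)</script> \u{016B}";
$safe = mb_encode_numericentity(
    htmlspecialchars($userText, ENT_QUOTES, 'UTF-8'),
    $map,
    'UTF-8'
);
echo $safe, "\n";
```

In other words, the entity conversion is an encoding workaround only, and escaping has to be handled separately.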
Works fine for me:
$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode( $dom->saveHTML());
The only thing that worked for me was the accepted answer of
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
HOWEVER
This brought about new issues, of having <?xml encoding="utf-8" ?> in the output of the document.
The solution for me was then to do
foreach ($doc->childNodes as $xx) {
if ($xx instanceof \DOMProcessingInstruction) {
$xx->parentNode->removeChild($xx);
}
}
Some solutions told me that to remove the xml header, that I had to perform
$dom->saveXML($dom->documentElement);
This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.
The problem is that when you add a parameter to the DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you'll need to avoid using the parameter and use old string functions to find what you are looking for.
I think the previous answer works for you, but since this workaround didn't work for me, I'm adding that answer to help people who may be in my case.
I've looked at other answers (php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it, DOMDocument->saveHTML() converting to space) and either they don't apply to my situation, or I'm not understanding them.
I'm feeding some HTML into $dom like this...
$dom = new DOMDocument;
$dom->loadHTML($table_data_for_db);
I then do some stuff with it, then output it like this..
$table_data_for_db = $dom->saveHTML();
echo $table_data_for_db;
The problem is that special characters such as → end up like this →.
1.) Is there a way around this?
2.) Is there another way in PHP other than using DOMDocument, loadHTML, etc. to strip out sections of HTML? Like, if I want to remove <style id="fraction_class"> and all of its contents, is there another way?
Thank you.
I am using the following function to get the inner html of html string
function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument('1.0', 'UTF-8');
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
my html string also contains unicode character. here is example of html string
$html = '<div>Thats True. Yes it is well defined آپ مجھے تم کہہ کر پکاریں</div>';
When I use the above function
$output = DOMinnerHTML($html);
the output is as below
$output = '<div>Thats True. Yes it is well defined
کے۔سلطا</div>';
the actual unicode characters converted to numeric values.
I have debugged the code and found that in DOMinnerHTML function before the following line
$innerHTML .= trim($tmp_dom->saveHTML());
if I echo
echo $tmp_dom->textContent;
It shows the actual unicode characters but after saving to $innerHTML it outputs the numeric symbols.
Why it is doing that.
Note: please don't suggest me html_entity_decode like functions to convert numeric symbols to real unicode characters because, I also have user formatted data in my html string, that I don't want to convert.
Note: I have also tried by putting the
<meta http-equiv="content-type" content="text/html; charset=utf-8">
before my html string but no difference.
I had a similar problem. After reading the above comment, and after further investigation, I found a very simple solution.
All you have to do is use html_entity_decode() to convert the output of saveHTML(), as follows:
// Create a new dom document
$dom = new DOMDocument();
// .... Do some stuff, adding nodes, ...etc.
// the html_entity_decode function will solve the unicode issue you described
$result = html_entity_decode($dom->saveHTML());
// echo your output
echo $result;
This will ensure that unicode characters are displayed properly
Good question, and you did an excellent job narrowing down the problem to a single line of code that caused things to go haywire! This allowed me to figure out what is going wrong.
The problem is with the DOMDocument's saveHTML() function. It is doing exactly what it is supposed to do, but its design is not what you wanted.
saveHTML() converts the document into a string "using HTML formatting" - which means that it does HTML entity encoding for you! Sadly, this is not what you wanted. Comments in the PHP docs also indicate that DOMDocument does not handle utf-8 especially well and does not do very well with fragments (as it automatically adds html, doctype, etc).
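One way to avoid the temporary document altogether is to serialise each child through the document that owns it. This is only a sketch: innerHtml() is a made-up name, the input is assumed to be UTF-8, and whether saveHTML($node) emits raw UTF-8 or entities may depend on your PHP and libxml versions, so treat the output format as something to verify locally.

```php
<?php
// Hypothetical replacement for the DOMinnerHTML() helper in the question.
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Serialise each child through the document that owns it,
        // instead of importing it into a fresh DOMDocument.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8"><div><b>آپ</b> مجھے</div>');
echo innerHtml($dom->getElementsByTagName('div')->item(0)), "\n";
```

Unlike the original helper, this takes a DOMElement rather than an HTML string, so the string has to be loaded into a DOMDocument first.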
Check out this comment for a proposed solution by simply using another class: alternative to DOMDocument
After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <body>, and DOCTYPE, I decided that a better solution is needed.
So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/
Currently, the main highlights are:
SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).
saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off). Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem). SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding that extra garbage that DOMDocument does.
encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output. SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.
mb_convert_encoding($html,'HTML-ENTITIES','UTF-8');
This worked for me
I'm using PHP's DOMDocument object to parse some HTML (fetched with cURL). When I get an element by ID and output it, any empty <span> </span> tags get an additional character and become <span>Â </span>.
The Code:
<?php
$document = new DOMDocument();
$document->validateOnParse = true;
$document->loadHTML( curl_exec($handle) );
curl_close($handle);
$element = $document->getElementById( __ELEMENT_ID__ );
echo $document->saveHTML();
echo $document->saveHTML($element);
?>
The $document->saveHTML() command behaves as expected and prints out the entire page. BUT, as I said above, the echo $document->saveHTML($element) command transforms empty <span> tags into <span>Â </span>.
This happens to all <span> </span> tags within $element.
What in this process (of getting the element by ID and outputting the element) is inserting this extra character? I could work around it, but I'm more interested in getting to the root.
I was able to fix the problem by setting the character encoding of the page. The page I was fetching did not have a defined character encoding, and my page was just a snippet without defined header info. When I added
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
The problem disappeared.