saveHTML() doesn't output special characters properly

saveHTML() doesn't output special characters properly - php

I've looked at other answers (php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it, DOMDocument->saveHTML() converting to space) and either they don't apply to my situation, or I'm not understanding them.
I'm feeding some HTML into $dom like this...
$dom = new DOMDocument;
$dom->loadHTML($table_data_for_db);
I then do some stuff with it, then output it like this..
$table_data_for_db = $dom->saveHTML();
echo $table_data_for_db;
The problem is that special characters such as → end up like this â†’.
1.) Is there a way around this?
2.) Is there another way in PHP other than using DOMDocument, loadHTML, etc. to strip out sections of HTML? Like, if I want to remove <style id="fraction_class"> and all of its contents, is there another way?
Thank you.

Related

The Actual Unicode Characters automatically converted to Numeric values using DOMDocument->saveHTML()

I am using the following function to get the inner html of html string
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument('1.0', 'UTF-8');
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML .= trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
my html string also contains unicode character. here is example of html string
$html = '<div>Thats True. Yes it is well defined آپ مجھے تم کہہ کر پکاریں</div>';
When I use the above function
$output = DOMinnerHTML($html);
the output is as below
$output = '<div>Thats True. Yes it is well defined
کے۔سلط&#1575</div>';
the actual unicode characters converted to numeric values.
I have debugged the code and found that in DOMinnerHTML function before the following line
$innerHTML .= trim($tmp_dom->saveHTML());
if I echo
echo $tmp_dom->textContent;
It shows the actual unicode characters but after saving to $innerHTML it outputs the numeric symbols.
Why it is doing that.
Note: please don't suggest me html_entity_decode like functions to convert numeric symbols to real unicode characters because, I also have user formatted data in my html string, that I don't want to convert.
Note: I have also tried by putting the
<meta http-equiv="content-type" content="text/html; charset=utf-8">
before my html string but no difference.

I had a similar problem. After reading the above comment, and after further investigation, I found a very simple solution.
All you have to do is use html_entity_decode() to convert the output of saveHTML(), as follows:
// Create a new dom document
$dom = new DOMDocument();
// .... Do some stuff, adding nodes, ...etc.
// the html_entity_decode function will solve the unicode issue you described
$result = html_entity_decode($dom->saveHTML();
// echo your output
echo $result;
This will ensure that unicode characters are displayed properly

Good question, and you did an excellent job narrowing down the problem to a single line of code that caused things to go haywire! This allowed me to figure out what is going wrong.
The problem is with the DOMDocument's saveHTML() function. It is doing exactly what it is supposed to do, but it's design is not what you wanted.
saveHTML() converts the document into a string "using HTML formatting" - which means that it does HTML entity encoding for you! Sadly, this is not what you wanted. Comments in the PHP docs also indicate that DOMDocument does not handle utf-8 especially well and does not do very well with fragments (as it automatically adds html, doctype, etc).
Check out this comment for a proposed solution by simply using another class: alternative to DOMDocument
After seeing many complaints about certain DOMDocument shortcomings,
such as bad handling of encodings and always saving HTML fragments
with , , and DOCTYPE, I decided that a better solution is
needed.
So here it is: SmartDOMDocument. You can find it at
http://beerpla.net/projects/smartdomdocument/
Currently, the main highlights are:
SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing
functionality (see example below).
saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain
and tags, it adds them automatically (yup, there are no
flags to turn this behavior off). Thus, when you call
$doc->saveHTML(), your newly saved content now has and
DOCTYPE in it. Not very handy when trying to work with code fragments
(XML has a similar problem). SmartDOMDocument contains a new function
called saveHTMLExact() which does exactly what you would want - it
saves HTML without adding that extra garbage that DOMDocument does.
encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output. SmartDOMDocument tries
to work around this problem by enhancing loadHTML() to deal with
encoding correctly. This behavior is transparent to you - just use
loadHTML() as you would normally.

mb_convert_encoding($html,'HTML-ENTITIES','UTF-8');
This worked for me

simplexml_load_string() != simplexml_import_dom()?

If I load an HTML page using DOMDocument::loadHTMLFile() then pass it to simplexml_import_dom() everything is fine, however, if I using $dom->saveHTML() to get a string representation from the DOMDocument then use simplexml_load_string(), I get nothing. Actually, if I use a very simple page it will work, but as soon as there is anything more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip out CR's and newlines from the formatted HTML text before using the contents as they have nothing to do with the content but get inserted into the SimpleXMLElement object, which is rather tedious.

Is it something to do with HTML not being parsable XML?
YES! HTML is a far less strict syntax so simplexml_load_string will not work with it by itself. This is because simplexml is simple and HTML is convoluted. On the other hand, DOMDocument is designed to be able to read the convoluted HTML structure, which means that since it can make sense of HTML and simplexml can make sense of it, you can bridge the proverbial gap there.
<!-- Valid HTML but not valid XML -->
<ul>
<li>foo
<li>bar
</ul>

HTML may or may not be valid XML. when you use loadHTMLFile it doesnt necessarily have to be well formed xml because the DOM is an HTML one so different rules, but when you pass a string to SimpleXML it must indeed be well formed.

If I get your question correclty and you simply want no whitespace in your output, then there is no need to use simplexml here.
Use: DOMDocument::preservewhitespace
like:
$dom->preserveWhiteSpace = false;
before saveHTML and you're set.

Need help with regular expressions in PHP

I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other htmHTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore it's spaces, I just put them here because else some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<) like that, The text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole ting.
What can I do? I've been struggling with this all day.

PHP Tidy is your friend. Don't use regexes.

Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine for given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample wherein this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.

As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DomDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then extract the content as the NodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach($html->getElementsByTagName('a') as $link) {
$anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.

php output xml produces parse error "’"

Is there any function that I can use to parse any string to ensure it won't cause xml parsing problems? I have a php script outputting a xml file with content obtained from forms.
The thing is, apart from the usual string checks from a php form, some of the user text causes xml parsing errors. I'm facing this "’" in particular. This is the error I'm getting Entity 'rsquo' not defined
Does anyone have any experience in encoding text for xml output?
Thank you!
Some clarification:
I'm outputting content from forms in a xml file, which is subsequently parsed by javascript.
I process all form inputs with: htmlentities(trim($_POST['content']), ENT_QUOTES, 'UTF-8');
When I want to output this content into a xml file, how should I encode it such that it won't throw up xml parsing errors?
So far the following 2 solutions work:
1) echo '<content><![CDATA['.$content.']]></content>';
2) echo '<content>'.htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'),ENT_QUOTES, 'UTF-8').'</content>'."\n";
Are the above 2 solutions safe? Which is better?
Thanks, sorry for not providing this information earlier.

You take it the wrong way - don't look for a parser which doesn't give you errors. Instead try to have a well-formed xml.
How did you get ’ from the user? If he literally typed it in, you are not processing the input correctly - for example you should escape & to &. If it is you who put the entity there (perhaps in place of some apostrophe), either define it in DTD (<!ENTITY rsquo "&x2019;">) or write it using a numeric notation (’), because almost every of the named entities are a part of HTML. XML defines only a few basic ones, as Gumbo pointed out.
EDIT based on additions to the question:
In #1, you escape the content in the way that if user types in ]]> <°)))><, you have a problem.
In #2, you are doing the encoding and decoding which result in the original value of the $content. the decoding should not be necessary (if you don't expect users to post values like & which should be interpreted like &).
If you use htmlspecialchars() with ENT_QUOTES, it should be ok, but see how Drupal does it.

html_entity_decode($string, ENT_QUOTES, 'UTF-8')

Enclose the value within CDATA tags.
<message><![CDATA[’]]></message>
From the w3schools site:
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of an character entity.
Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors script code can be defined as CDATA.
Everything inside a CDATA section is ignored by the parser.

The problem is that your htmlentities function is doing what it should - generating HTML entities from characters. You're then inserting these into an XML document which doesn't have the HTML entities defined (things like ’ are HTML-specific).
The easiest way to handle this is keep all input raw (i.e. don't parse with htmlentities), then generate your XML using PHP's XML functions.
This will ensure that all text is properly encoded, and your XML is well-formed.
Example:
$user_input = "...<>&'";
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);

I had a similar problem that the data i needed to add to the XML was already being returned by my code as htmlentities() (not in the database like this).
i used:
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
or if it was not already in htmlentities()
just the below should work
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars($string, ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
basically using htmlspecialchars with ENT_XML1 should get user imputed data into XML safe data (and works fine for me):
htmlspecialchars($string, ENT_XML1, 'UTF-8');

Use htmlspecialchars() will solve your problem. See the post below.
PHP - Is htmlentities() sufficient for creating xml-safe values?

This worked for me. Some one facing the same issue can try this.
htmlentities($string, ENT_XML1)
With special characters conversion.
htmlspecialchars(htmlentities($string, ENT_XML1))

htmlspecialchars($trim($_POST['content'], ENT_XML1, 'UTF-8');
Should do it.

PHP DOMDocument - get html source of BODY

I'm using PHP's DOMDocument to parse and normalize user-submitted HTML using the loadHTML method to parse the content then getting a well-formed result via saveHTML:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML();
echo($well_formed);
This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.

The quick solution to your problem is to use an xPath expression to grab the body.
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));
A word of warning here. Sometimes loadHTML will throw a warning when it encounters certainly poorly formed HTML documents. If you're parsing those kind of HTML documents, you'll need to find a better html parser [self link warning].

IN your case, you do not want to work with an HTML document, but with an HTML fragment -- a portion of HTML code ;; which means DOMDocument is not quite what you need.
Instead, I would rather use something like HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove all
malicious code (better known as XSS)
with a thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
And, if you try your portion of code :
<div><p>Hello World
Using the demo page of HTMLPurifier, you get this clean HTML as an output :
<div><p>Hello World</p></div>
Much better, isn't it ? ;-)
(Note that HTMLPurfier suppots a wide range of options, and that taking a look at its documentation might not hurt)

Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).
You can find it here: http://beerpla.net/projects/smartdomdocument

This was taken from another post and worked perfectly for my use:
$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);

TL;DR: $dom->saveHTML($dom->documentElement->lastChild);
Where $dom->documentElement->lastChild is the body-node but could be every other available DOMNode of the document.
Actucally the DOMDocument::saveHTML-method itself is capable of doing what you want.
It takes a DOMNode-object as the first argument to output a subset of the document.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML($dom->documentElement->lastChild);
echo($well_formed);
There are several ways of retrieving the body-node. Here are 2:
$bodyNode = $dom->documentElement->lastChild;
$bodyNode = $dom->getElementsByTagName('body')->item(0);
From the PHP Manual
public DOMDocument::saveHTML(?DOMNode $node = null): string|false
Parameters
node
Optional parameter to output a subset of the document.
https://www.php.net/manual/en/domdocument.savehtml.php

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.