DOMDocument escaping end chars in PHP


I have a problem with the DOMDocument class. I use it to edit an HTML template, and the template contains this meta tag:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
After editing, the trailing "/" of this tag is gone, even though I never touched the tag, and the file no longer works.
This is the script:
$textValue = $company.'<br />'.$firstName.' '.$lastName.'<br />'.$adress;
$values = array($company, $firstName.' '.$lastName, $adress);
$document = new DOMDocument;
$document->loadHTMLFile($dir.'temp/OEBPS/signature.html');
$dom = $document->getElementById('body');
for ($i = 0; $i < count($values); $i++) {
    $dom->appendChild($document->createElement('p', $values[$i]));
}
$document->saveHTMLFile($dir.'temp/OEBPS/signature.html');
echo 'signature added <br />';

Please see the answers to this question: Why doesn't PHP DOM include slash on self closing tags?
In short, DOMDocument->saveHTMLFile() serializes its internal structure as plain HTML rather than XHTML. If you absolutely need XHTML, you can use DOMDocument->saveXML() (or DOMDocument->save() to write to a file; there is no saveXMLFile() method), which produces XML and therefore self-closing tags. The catch is that some HTML elements, such as <script> and <style>, must not be self-closed, so you have to put a space in their content to keep them from collapsing.
I would recommend just ignoring the issue unless you really have to fix it. Self-closing tags are a relic of XHTML and are not required in HTML5.
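A minimal sketch of the difference described above. The markup is a made-up stand-in for signature.html, not the asker's actual file. The HTML serializer writes the meta tag without a trailing slash, while saveXML() on the root element writes childless elements as self-closing tags:

```php
<?php
// Hypothetical stand-in for the template; only the meta tag matters here.
$template = '<html><head>'
    . '<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>'
    . '</head><body id="body"><p>Signature</p></body></html>';

$document = new DOMDocument();
$document->loadHTML($template);

// HTML serialization: void elements are written without a trailing slash.
$asHtml = $document->saveHTML();

// XML serialization of the root element: childless elements are self-closed.
$asXml = $document->saveXML($document->documentElement);

echo $asHtml, PHP_EOL, $asXml, PHP_EOL;
```

Passing the root element to saveXML() leaves out the XML declaration and doctype, which you may need to add back yourself if the consumer expects them.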

Related

How can I strip html tags except some of them?

I need to remove all HTML tags from a PHP string except:
<p>
<em>
<small>
I know the strip_tags() function is good, but it strips all HTML tags. How can I tell it to remove everything except the tags above?

You should check out the manual: Example #1 strip_tags() example
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, these tags will not be stripped.
strip_tags($string, '<p><em><small>');
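A short sketch of that call on a made-up input string (the sample markup is only an illustration). Tags outside the allowed list are removed, but their text content stays:

```php
<?php
// Made-up sample input containing both allowed and disallowed tags.
$string = '<div><p>Keep <em>this</em> and <small>this</small>, '
    . 'but drop <a href="#">this link</a>.</p></div>';

// The second argument lists the tags that should survive.
$clean = strip_tags($string, '<p><em><small>');

echo $clean, PHP_EOL;
// <p>Keep <em>this</em> and <small>this</small>, but drop this link.</p>
```

Keep in mind that strip_tags() leaves the attributes of allowed tags untouched, so it is not a sanitizer on its own.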

According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// select every element that carries at least one attribute
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
    $element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements are removed, change the query. For example, to remove all elements with the class myclass, it must read "//*[@class='myclass']".
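As a rough sketch of the class-based variant, here is the same idea on a smaller made-up snippet. The LIBXML_HTML_NODEFDTD flag is my addition to keep the output free of a doctype; the query only matches elements whose class attribute is exactly myclass:

```php
<?php
// Made-up sample markup: one paragraph carries the class to be removed.
$data = '<div><p>Stay</p><p class="myclass">Remove</p><p>Stay too</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Note the @ in front of the attribute name.
foreach ($xpath->query("//*[@class='myclass']") as $element) {
    $element->parentNode->removeChild($element);
}

$result = $dom->saveHTML();
echo $result;
```

For elements that carry several classes at once, the query would need a contains()-style test instead of a plain equality check.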

HTML Purifier - Escape disallowed tags instead of stripping

I'm using HTML Purifier to sanitize user input. I have a list of allowed elements configured, which means that any tag not in the allowed list is stripped. Code below:
require_once "HTMLPurifier.standalone.php";
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.AllowedElements', array('strong','b','em','i'));
$purifier = new HTMLPurifier($config);
$safe_html = $purifier->purify($dirty_html);
Rather than only retaining their contents, I would like the elements that are not included in the list to be escaped and sent back as text.
To illustrate, given the whitelist shown above, the following input string:
<a href="javascript:alert('XSS')"><strong>CLAIM YOUR PRIZE</strong></a>
turns into "<strong>CLAIM YOUR PRIZE</strong>", because a is not whitelisted. Similarly,
<b>Check the article <a href="http://example.com/">here</a></b>
becomes "<b>Check the article here</b>".
Is there a way to turn the above two examples into the following:
&lt;a href="javascript:alert('XSS')"&gt;<strong>CLAIM YOUR PRIZE</strong>&lt;/a&gt;
<b>Check the article &lt;a href="http://example.com/"&gt;here&lt;/a&gt;</b>
purely by adjusting HTML Purifier's configuration without resorting to regular expression-based "hacks"? If there is, then I'd like to know how it's done.

The setting Core.EscapeInvalidTags should be what you're looking for:
require_once(__DIR__ . '/library/HTMLPurifier.auto.php');
$dirty_html = '<a href="javascript:alert(\'XSS\')"><strong>CLAIM YOUR PRIZE<div></div></strong></a>';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.AllowedElements', array('strong','b','em','i'));
$config->set('Core.EscapeInvalidTags', true);
$purifier = new HTMLPurifier($config);
$safe_html = $purifier->purify($dirty_html);
echo $safe_html . PHP_EOL;
...gives:
&lt;a href="javascript:alert('XSS')"&gt;<strong>CLAIM YOUR PRIZE&lt;div /&gt;</strong>&lt;/a&gt;
I threw in the invalid child element <div></div> so you can see what happens: HTML Purifier will still 'alter' the original HTML because it parses it (<div></div> becomes <div />), but the information remains (and is escaped to &lt;div /&gt;).

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!

Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get p element nodes
        $nodes = $p_element->childNodes;
        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
                // one match is enough; stop so the p is not wrapped twice
                break;
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
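The same wrapping can probably be expressed more compactly with an XPath query that selects only paragraphs having a direct a child. This is a sketch on made-up markup; the example.com link is a placeholder, since the asker's actual links are not shown:

```php
<?php
// Made-up sample: one paragraph with a link, one without.
$html = '<html><body>'
    . '<p><a href="http://example.com/">Check this site</a></p>'
    . '<p>No link here</p>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Snapshot the matches before changing the tree.
foreach (iterator_to_array($xpath->query('//p[a]')) as $p) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'highlighter');
    // Put the span where the p was, then move the p inside it.
    $p->parentNode->replaceChild($span, $p);
    $span->appendChild($p);
}

$out = $doc->saveHTML();
echo $out;
```

The predicate can be tightened further, for instance to //p[a[contains(@href, 'example.com')]], if only links to a particular host should be wrapped.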

You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
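To illustrate the limit argument of preg_replace, here is a small sketch on a made-up two-paragraph string. With a limit of 1, only the first match is wrapped:

```php
<?php
// Made-up input with two paragraphs.
$html = '<p>one</p><p>two</p>';

// Fourth argument: replace at most one match.
$out = preg_replace(
    '/<p>(.*?)<\/p>/s',
    '<span class="highlighter"><p>$1</p></span>',
    $html,
    1
);

echo $out, PHP_EOL;
// <span class="highlighter"><p>one</p></span><p>two</p>
```

This only behaves predictably on simple, flat markup like the sample; nested or irregular HTML is better handled with a parser.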

Fixing unclosed HTML tags

I am working on a blog layout and need to create an abstract of each post (say, the 15 latest) to show on the homepage. The content is already formatted with HTML tags by the Textile library. If I use substr to get the first 500 chars of a post, the main problem I face is how to close the unclosed tags.
e.g
<div>.......................</div>
<div>...........
<p>............</p>
<p>...........| 500 chars
</p>
</div>
What I get is two unclosed tags, <p> and <div>. The p won't cause much trouble, but the div messes up the whole page layout. Any suggestions on how to track the opening tags and close them, manually or otherwise?

There are lots of methods that can be used:
Use a proper HTML parser, like DOMDocument
Use PHP Tidy to repair the un-closed tag
Some would suggest HTML Purifier

As ajreal said, DOMDocument is a solution.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
";
$doc = new DOMDocument();
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage: DOMDocument is natively included in PHP, unlike PHP Tidy.

You can use DOMDocument to do it, but be careful of string encoding issues. Also, you'll have to use a complete HTML document, then extract the components you want. Here's an example:
function make_excerpt ($rawHtml, $length = 500) {
    // append an ellipsis and "More" link
    $content = substr($rawHtml, 0, $length)
        . '… More >';
    // Detect the string encoding
    $encoding = mb_detect_encoding($content);
    // pass it to the DOMDocument constructor
    $doc = new DOMDocument('', $encoding);
    // Must include the content-type/charset meta tag with $encoding
    // Bad HTML will trigger warnings, suppress those
    @$doc->loadHTML('<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset='
        . $encoding . '"></head><body>' . trim($content) . '</body></html>');
    // extract the components we want
    $nodes = $doc->getElementsByTagName('body')->item(0)->childNodes;
    $html = '';
    $len = $nodes->length;
    for ($i = 0; $i < $len; $i++) {
        $html .= $doc->saveHTML($nodes->item($i));
    }
    return $html;
}
$html = "<p>.......................</p>
<p>...........
<p>............</p>
<p>...........| 500 chars";
// output fixed html
echo make_excerpt($html, 500);
Outputs:
<p>.......................</p>
<p>...........
</p>
<p>............</p>
<p>...........| 500 chars… More ></p>
If you are using WordPress you should wrap the substr() invocation in a call to wpautop - wpautop(substr(...)). You may also wish to test the length of the $rawHtml passed to the function, and skip appending the "More" link if it isn't long enough.
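The length check suggested above could look roughly like this. It is a sketch, not a drop-in replacement: the function name make_safe_excerpt and the ' [more]' marker are made up for illustration, the input is assumed to be UTF-8, and libxml_use_internal_errors() stands in for the @ operator:

```php
<?php
function make_safe_excerpt(string $rawHtml, int $length = 500, string $marker = ' [more]'): string
{
    // Only truncate and append the marker when the input is actually too long.
    $content = strlen($rawHtml) > $length
        ? substr($rawHtml, 0, $length) . $marker
        : $rawHtml;

    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
        . '</head><body>' . $content . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of <body>, now with balanced tags.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $html .= $doc->saveHTML($node);
    }
    return $html;
}

// Truncated input: the open <div> and <p> get closed again.
echo make_safe_excerpt('<div><p>' . str_repeat('x', 50) . '</p></div>', 20), PHP_EOL;
```

Because substr() counts bytes, the cut can still land in the middle of a tag or a multibyte character, so the result is best treated as a best-effort repair.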

Highlighting Text: How to echo HTML DOM element with all tags

I want to highlight specified keywords in the body of an HTML document. At first I used preg_replace to put a <span> around the keywords, but of course that caused problems if the keyword was part of a tag, like the letter "i" (as in <li>). So instead, I'm using DOMDocument::loadHTMLFile(path) to load the document, and then running preg_replace on the value of each child node.
So far, so good. I can echo out the plain text of the document with the appropriate words highlighted and no interference from tags. But I need to echo the entire body of the text including the tags after the changes, and I don't know how. Here's what I have so far:
if (file_exists('mss/'.$link_title)) {
$descfile = DOMDocument::loadHTMLFile('mss/'.$link_title);
foreach ($descfile->childNodes as $e) {
$desc_output = $e->nodeValue;
$desc_output = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $desc_output);
}
echo ???
}
What should I echo?

If you change your code to:
$e->nodeValue = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $e->nodeValue);
You can probably use DOMDocument::saveHTML():
http://php.net/manual/de/domdocument.savehtml.php
to output your entire HTML document.
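One caveat, as far as I can tell: markup assigned to nodeValue is stored as plain text, so saveHTML() will print the span tags escaped rather than as elements. A possible workaround is to split each text node and build real span elements. The sketch below assumes a non-empty literal keyword; the helper name highlight_keyword is made up for illustration:

```php
<?php
function highlight_keyword(DOMDocument $doc, string $keyword): void
{
    $xpath = new DOMXPath($doc);
    // The capture group makes preg_split return the matches as well.
    $pattern = '/(' . preg_quote($keyword, '/') . ')/i';

    // Snapshot the text nodes before changing the tree.
    foreach (iterator_to_array($xpath->query('//body//text()')) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                // Odd indexes hold the matched keyword: wrap it in a span.
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li>item i</li></ul></body></html>');
highlight_keyword($doc, 'i');
$out = $doc->saveHTML();
echo $out;
```

Since only text nodes are touched, the "i" inside the li tag name is left alone, which was the original problem with running preg_replace on the raw markup.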
