HTML Purifier strip elements with a certain attribute - php

Is there any way to make HTML Purifier strip elements that have a certain attribute?
I'm using HTML Purifier to reduce a full webpage to just its basic content so I can index and search it.
I want to be able to add an attribute like data-no-index to some wrapper elements so that they are ignored.
This is my HTML Purifier setup:
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'h1,h2,h3,h4,h5,h6,p,a[href],ul,ol,li,img[src]');
$purifier = new HTMLPurifier($config);
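One possible approach, as a sketch rather than a built-in HTML Purifier feature: HTML.Allowed filters by element and attribute name, so the marked wrappers could instead be removed with PHP's DOM extension before the page is purified. The helper name removeMarkedElements is made up for this example. The sketch assumes the attribute name is a trusted, hard-coded value and does not deal with character encodings.

```php
<?php
// Hypothetical helper: removes every element that carries $attribute,
// together with its contents, and returns the remaining document as HTML.
// Assumes $attribute is a trusted, hard-coded name (it is placed in an XPath query).
function removeMarkedElements(string $html, string $attribute = 'data-no-index'): string
{
    $dom = new DOMDocument();

    // Real-world pages are rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The XPath result is a snapshot, so nodes can be removed while looping.
    foreach ($xpath->query('//*[@' . $attribute . ']') as $node) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

$page = '<html><body><p>Indexed text</p>'
      . '<div data-no-index><p>Navigation, ads, ...</p></div></body></html>';

$stripped = removeMarkedElements($page);
```

The resulting $stripped string could then be passed to $purifier->purify() with the setup shown in the question.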

Related

Remove script tag from parsed html with simple html dom parser

I am using Simple HTML DOM Parser with PHP to parse a page, using the code below.
foreach ($html->find('div#ProductDescription_Tab') as $description)
{
    $comments = $description->find('.hsn_comments', 0);
    $comments->outertext = '';
    echo $description->outertext;
}
This gives me the parsed data along with javascript "script" tags. How can I remove these script tags?
OK, so I figured it out myself: just use the Advanced HTML DOM library. It is fully compatible with Simple HTML DOM and gives you much more control, so it is very simple to remove what you want from the parsed HTML. For example:
//to remove script tag
$scripts = $description->find('script')->remove;
//to remove css style tag
$style = $description->find('style')->remove;
// to remove a div with class name findify-element
$findify = $description->find('div.findify-element')->remove;
https://sourceforge.net/projects/advancedhtmldom/

HTML Purifier - iframe and scripts

I'm using HTML Purifier in my project.
My html is something like this. (containing simple html element + script + iframe)
<p>content...<p>
<iframe></iframe>
<script>alert('abc');</script>
<p>content2</p>
With default config, it turned into this
<p>content...</p>
<p></p>
<p>Content2</p>
But if I set the config like this...
$config->set('HTML.Trusted', true);
$config->set('HTML.SafeIframe', true);
I got this
<p>content...</p>
<p>
<iframe></iframe>
<script type="text/javascript"><!--//--><![CDATA[//><!--
alert('abc');
//--><!]]></script>
</p>
<p>content2</p>
Is there any way to use HTML Purifier to completely remove the 'script' tag but preserve the 'iframe' tag? Or is there an alternative to HTML Purifier?
I've tried
$config->set('Filter.YouTube', true);
$config->set('URI.SafeIframeRegexp', '%^https://(www.youtube.com/embed/|player.vimeo.com/video/)%');
But it turned out that the 'script' tag was still there.
[edited]
full example.
$config = HTMLPurifier_Config::createDefault();
$html = "<p>content...<p><iframe ...></iframe><script>alert('abc');</script><p>content2</p>";
$config->set(
'HTML.ForbiddenElements',
'script'
);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($html);
Result
<p>content...</p><p></p><p>content2</p>
You were half on the right track. If you set HTML.SafeIframe to true and URI.SafeIframeRegexp to the URLs you want to accept (%^https://(www.youtube.com/embed/|player.vimeo.com/video/)% works fine), an input example of:
<p>content...<p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
<script>alert('abc');</script>
<p>content2</p>
...turns into...
<p>content...</p><p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
</p><p>content2</p>
Explanation: HTML.SafeIframe allows the <iframe> tag, but HTML Purifier still expects a whitelist for the URLs that the iframe can contain, since otherwise an <iframe> opens too much malicious potential. URI.SafeIframeRegexp supplies the whitelist (in the form of a regex that needs to be matched).
See if that works for you!
Code
This is the code that made the transformation I just mentioned:
$dirty = '<p>content...<p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
<script>alert(\'abc\');</script>
<p>content2</p>';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.SafeIframe', true);
$config->set('URI.SafeIframeRegexp', '%^https://(www.youtube.com/embed/|player.vimeo.com/video/)%');
$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($dirty);
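To get a feel for which iframe URLs the whitelist above would accept, the pattern can be tried with preg_match outside HTML Purifier. This is only a sketch: it assumes the directive's regex is matched against the full iframe URL, and the stricter variant with escaped dots is a suggestion of mine, not part of the answer. The test URLs are made up.

```php
<?php
// Pattern copied from the answer; note that its dots are unescaped,
// so each "." matches any character.
$fromAnswer = '%^https://(www.youtube.com/embed/|player.vimeo.com/video/)%';
// Stricter variant (a suggestion, not from the answer) with literal dots.
$stricter = '%^https://(www\.youtube\.com/embed/|player\.vimeo\.com/video/)%';

// Made-up sample URLs for illustration.
$urls = [
    'https://www.youtube.com/embed/blep',
    'https://player.vimeo.com/video/123',
    'https://evil.example/embed/blep',
    'https://wwwXyoutube.com/embed/blep',
];

$results = [];
foreach ($urls as $url) {
    $results[$url] = [
        'fromAnswer' => preg_match($fromAnswer, $url) === 1,
        'stricter'   => preg_match($stricter, $url) === 1,
    ];
}

foreach ($results as $url => $match) {
    printf("%-40s answer: %-3s stricter: %s\n",
        $url,
        $match['fromAnswer'] ? 'yes' : 'no',
        $match['stricter'] ? 'yes' : 'no');
}
```

The last URL is the interesting case: it passes the pattern from the answer because of the unescaped dot, but not the stricter one.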
Regarding HTML.Trusted
I implore you to never set HTML.Trusted to true if you don't fully trust each and every one of the people submitting the HTML.
Amongst other things, it allows forms in your input HTML to survive the purification unmolested, which (if you're purifying for a website, which I assume you are) makes phishing attacks trivial. It also lets style tags in your input survive unscathed. Some things will still be stripped (any HTML tag that HTML Purifier doesn't know anything about, which includes most HTML5 tags, as well as various JavaScript attribute handlers), but there are enough attack vectors that you might as well not be purifying if you use this directive. As Ambush Commander once put it:
You shouldn't be using %HTML.Trusted anyway; it really ought to be named %HTML.Unsafe or something.
Consider using a full-fledged HTML parser like Masterminds html5-php. HTML code would then be parsed without undesired alterations like wrapping IFRAME in P, and you would be able to manipulate the resulting DOM tree the way you want, including removing some elements while keeping other ones.
For example, the following code could be used for removing SCRIPT elements from the document:
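// Copy the live node list into an array first; removing nodes while
// iterating the live list directly would skip elements.
foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}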
And note that code like this:
<script type="text/javascript"><!--//--><![CDATA[//><!--
alert('abc');
//--><!]]></script>
is obsolete. The modern HTML5 equivalent is:
<script>alert('abc');</script>
exactly as in your source code before being processed by HTML Purifier.

HTMLPurifier without htmlspecialchars

I am using HTMLPurifier with a simple TinyMCE WYSIWYG editor. If I don't use htmlspecialchars, would it be open to XSS attacks? This is what I'm doing
$detail = $purifier->purify($detail);
to purify the data for that textarea. If I use htmlspecialchars, it strips all basic tags as well, which is not user friendly for a WYSIWYG editor. But the problem is that this allows the <script> tag as well.
And if I change the config setting to
$config->set('ExtractStyleBlocks.1', true);
it no longer allows < and > for the <script> tag; it converts < and > for <script> only. But the page then shows <p>This is paragraph</p>, <strong>This text is bold</strong> and so on as literal text. It shouldn't show <p> and other simple tags to the user, only the formatted text.
How can I get rid of this problem?
Please help. Thanks for your time.
Edit
Here is my HTMLPurifier initialization
$config = HTMLPurifier_Config::createDefault();
//$config->set('ExtractStyleBlocks', true);
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
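Getting data from the database:
while (mysqli_stmt_fetch($stmt1)) {
    // Escape plain-text fields once; escaping $title a second time would double-encode it.
    $id = htmlspecialchars($id);
    $title = htmlspecialchars($title);
    // $detail holds rich text, so it goes through HTML Purifier instead of htmlspecialchars.
    $detail = $purifier->purify($detail);
    $posts .= "<div id='date_news'><div id='news_holder$id' class='news_holder'><h3 id='show_title'>".$title."</h3>".$detail."</div>";
}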
HTML for $detail
At Database
<p><strong>Alu Vazi</strong></p>
<p>I love alu vazi with&lt;script&gt;alert("XSS")&lt;/script&gt;</p>
User screen
Alu Vazi
I love alu vazi with<script>alert("XSS")</script>
OK, following my comment, try adding this to your HTML Purifier config. It should be enabled by default, but it's worth a shot.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
Edit
<p>I love alu vazi with&lt;script&gt;alert("XSS")&lt;/script&gt;</p>
You've already escaped the <script> tag here so HTML Purifier has nothing to parse. It will be output on the page as a result but you have effectively neutralised the XSS attempt.
In your code something is already escaping HTML characters before saving to the database.
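The difference between a real tag and an escaped one can be illustrated with PHP's DOM extension. This is a sketch for illustration only: it does not use HTML Purifier, hasScriptElement is a made-up helper name, and the two sample strings are assumptions about what the raw and stored values look like.

```php
<?php
// Illustration only: uses PHP's DOM extension, not HTML Purifier, to show
// whether a string contains a real <script> element or just escaped text.
function hasScriptElement(string $html): bool
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->getElementsByTagName('script')->length > 0;
}

// Assumed raw editor input: contains an actual <script> element.
$raw = '<p>I love alu vazi with<script>alert("XSS")</script></p>';
// Assumed stored value: the angle brackets were escaped before saving.
$stored = '<p>I love alu vazi with&lt;script&gt;alert("XSS")&lt;/script&gt;</p>';

$rawHasScript = hasScriptElement($raw);
$storedHasScript = hasScriptElement($stored);

var_dump($rawHasScript, $storedHasScript);
```

In the escaped string the parser only sees text inside the paragraph, which is why a sanitizer has no script element to remove there.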

configuring HTMLPurifier to disable hyperlinks

I am trying to disable hyperlinks and show them as plain text using HTMLPurifier, but I can't get it right. Here is my code:
$html = '<a href="http://www.localhost.com/">link</a><b>test</b>';
require_once 'include/htmlpurifier/library/HTMLPurifier.auto.php';
$Config = HTMLPurifier_Config::createDefault();
$Config->set('AutoFormat.DisplayLinkURI', true);
$purifier = new HTMLPurifier();
$html = $purifier->purify($html);
echo $html;
The current output is:
<a href="http://www.localhost.com/">link</a><b>test</b>
What is the problem? The output should be:
<a>link</a> (http://www.localhost.com/)<b>test</b>
First problem: you're not passing the config object to the HTML Purifier constructor, so it doesn't work.
Second problem: you haven't actually told HTML Purifier to remove href attributes from a tags. I'm not really sure what will happen to DisplayLinkURI if you do that though.
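A corrected version of the question's setup might look like the sketch below. The include path is the one from the question, and the input string is assumed from the expected output quoted there. As far as I can tell, the AutoFormat.DisplayLinkURI injector removes the href itself and appends the URL in parentheses, so forbidding href separately may not be necessary; treat that as an assumption to verify against your HTML Purifier version.

```php
<?php
// Path taken from the question; adjust it to wherever HTML Purifier is installed.
require_once 'include/htmlpurifier/library/HTMLPurifier.auto.php';

// Input assumed from the expected output given in the question.
$html = '<a href="http://www.localhost.com/">link</a><b>test</b>';

$config = HTMLPurifier_Config::createDefault();
$config->set('AutoFormat.DisplayLinkURI', true);

// The config object must be passed to the constructor;
// without it HTML Purifier silently falls back to its defaults.
$purifier = new HTMLPurifier($config);

echo $purifier->purify($html);
```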

Is it possible to configure HTML Purifier to strip tags that have an attribute with a specific value?

I am using HTML Purifier to sanitize HTML input.
Here is my HTML Purifier config:
$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.AllowedAttributes', "*.style,a.href,a.target,img.src,img.height,img.width");
$config->set('HTML.AllowedElements','a,p,ol,li,ul,b,u,strike,br,span,img,div');
$config->set('HTML.ForbiddenAttributes', "*#class,div#*");
$config->set('Attr.AllowedFrameTargets', array('_blank'));
$config->set('CSS.AllowedProperties', array('text-decoration', 'font-weight', 'font-style'));
$config->set('AutoFormat.RemoveSpansWithoutAttributes', true);
$config->set('AutoFormat.RemoveEmpty', true);
$purifier = new HTMLPurifier($config);
$sanitized = $purifier->purify($data);
Works like a charm.
BUT...
I am wondering if it is possible to configure HTML Purifier such that it will strip any element that does not have an attribute with a SPECIFIC value.
For example, I might want to remove
<p class="badParagraph" />
but not
<p class="goodParagraph" />
Does anyone know if this is possible and, if so, how to go about it?
Thanks!
As was mentioned in the corresponding HTML Purifier thread, there is not presently an easy way of doing this. Injectors are a fairly general mechanism by which you could implement this; in which case look at RemoveEmpty for some inspiration.
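If writing an injector is more than is needed, a simpler alternative (not an HTML Purifier feature, and not what the answer describes) is to drop the unwanted elements with PHP's DOM extension before calling purify(). The sketch below uses a made-up helper name, removeElementsByClass, treats the class attribute as a list of whitespace-separated tokens, and does not deal with character encodings.

```php
<?php
// Hypothetical helper: removes elements whose class attribute contains
// $class as one of its whitespace-separated tokens, then returns the fragment.
function removeElementsByClass(string $fragment, string $class): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single root so libxml does not add <html>/<body>.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The XPath result is a snapshot, so nodes can be removed while looping.
    foreach ($xpath->query('//*[@class]') as $element) {
        $tokens = preg_split('/\s+/', trim($element->getAttribute('class')));
        if (in_array($class, $tokens, true)) {
            $element->parentNode->removeChild($element);
        }
    }

    // Serialize only the children of the wrapper <div>.
    $result = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }

    return $result;
}

$data = '<p class="badParagraph">unwanted</p><p class="goodParagraph">wanted</p>';
$filtered = removeElementsByClass($data, 'badParagraph');
```

The filtered string would then go through $purifier->purify() as before, so the DOM pass only decides what to drop and HTML Purifier still does the sanitizing.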
