HtmlPurifier - allow data attribute

HtmlPurifier - allow data attribute - php

I'm trying to allow some data-attribute with htmlPurifier for all my span but no way...
I have this string:
<p>
<span data-time-start="1" data-time-end="5" id="5">
<word class="word">My</word>
<word class="word">Name</word>
</span>
<span data-time-start="6" data-time-end="15" id="88">
<word class="word">Is</word>
<word class="word">Zooboo</word>
</span>
<p>
My htmlpurifier config:
$this->HTMLpurifierConfigInverseTransform = \HTMLPurifier_Config::createDefault();
$this->HTMLpurifierConfigInverseTransform->set('HTML.Allowed', 'span,u,strong,em');
$this->HTMLpurifierConfigInverseTransform->set('HTML.ForbiddenElements', 'word,p');
$this->HTMLpurifierConfigInverseTransform->set('CSS.AllowedProperties', 'font-weight, font-style, text-decoration');
$this->HTMLpurifierConfigInverseTransform->set('AutoFormat.RemoveEmpty', true);
I purify my $value like this:
$purifier = new \HTMLPurifier($this->HTMLpurifierConfigInverseTransform);
var_dump($purifier->purify($value));die;
And get this :
<span>My Name</span><span>Is Zoobo</span>
But how to conserve my data attributes id, data-time-start, data-time-end in my span ?
I need to have this :
<span data-time-start="1" data-time-end="5" id="5">My Name</span data-time-start="6" data-time-end="15" id="88"><span>Is Zoobo</span>
I tried to test with this config:
$this->HTMLpurifierConfigInverseTransform->set('HTML.Allowed', 'span[data-time-start],u,strong,em');
but error message :
User Warning: Attribute 'data-time-start' in element 'span' not
supported (for information on implementing this, see the support
forums)
Thanks for your help !!
EDIT 1
I tried to allow ID in the firdt time with this code line:
$this->HTMLpurifierConfigInverseTransform->set('Attr.EnableID', true);
It doesn't work for me ...
EDIT 2
For data-* attributes, I add this line but nothing happened too...
$def = $this->HTMLpurifierConfigInverseTransform->getHTMLDefinition(true);
$def->addAttribute('sub', 'data-time-start', 'CDATA');
$def->addAttribute('sub', 'data-time-end', 'CDATA');

HTML Purifier is aware of the structure of HTML and uses this knowledge as basis of its white-listing process. If you add a standard attribute to a whitelist, it doesn't allow arbitrary content for that attribute - it understands the attribute and will still reject content that makes no sense.
For example, if you had an attribute somewhere that took numeric values, HTML Purifier would still deny HTML that tried to enter the value 'foo' for that attribute.
If you add custom attributes, just adding it to the whitelist does not teach HTML Purifier how to handle the attributes: What data can it expect in those attributes? What data is malicious?
There's extensive documentation how you can tell HTML Purifier about the structure of your custom attributes here: Customize
There's a code example for the 'target' attribute of the <a>-tag:
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
$def = $config->getHTMLDefinition(true);
$def->addAttribute('a', 'target', 'Enum#_blank,_self,_target,_top');
That would add target as a field that accepts only the values "_blank", "_self", "_target" and "_top". That's a bit stricter than the actual HTML definition, but for most purposes entirely sufficient.
That's the general approach you will need to take for data-time-start and data-time-end. For possible configuration, check out the official HTML Purifier documentation (as linked above). My best guess from your example is that you don't want Enum#... but Number, like this...
$def->addAttribute('span', 'data-time-start', 'Number');
$def->addAttribute('span', 'data-time-end', 'Number');
...but check it out and see what suits your use-case best. (While you're implementing this, don't forget you also need to list the attributes in the whitelist as you're currently doing.)
For id, you should include Attr.EnableID = true as part of your configuration.
I hope that helps!

If anyone else lands here (like I did) for the id attribute not working, and more weirdly not working in all cases.
In version 4.8.0 Attr.ID.HTML5 was added and reflects the usage of relaxed format introduced for HTML5.
For example, numeric values were not allowed, as well as values that start with a number. The following examples are all valid in HTML5, but only the first three are valid for pre-HTML5 (the default behaviour of the purifier):
foo (both)
foo-bar (both)
foo-10 (both)
10 (HTML5 only)
10-foo (HTML5 only)

Related

HTML Purifier - iframe and scripts

I'm using HTML Purifier in my project.
My html is something like this. (containing simple html element + script + iframe)
<p>content...<p>
<iframe></iframe>
<script>alert('abc');</script>
<p>content2</p>
With default config, it turned into this
<p>content...</p>
<p></p>
<p>Content2</p>
But if I set the config like this...
$config->set('HTML.Trusted', true);
$config->set('HTML.SafeIframe', true);
I got this
<p>content...</p>
<p>
<iframe></iframe>
<script type="text/javascript"><!--//--><![CDATA[//><!--
alert('abc');
//--><!]]></script>
</p>
<p>content2</p>
Is there anyway to use HTML Purifier to completely remove 'script' tag but preserve 'iframe' tag? Or other alternative to HTML Purifier?
I've tried
$config->set('Filter.YouTube', true);
$config->set('URI.SafeIframeRegexp', '%^https://(www.youtube.com/embed/|player.vimeo.com/video/)%');
But it turned out that the 'script' tag still there.
[edited]
full example.
$config = HTMLPurifier_Config::createDefault();
$html = "<p>content...<p><iframe ...></iframe><script>alert('abc');</script><p>content2</p>";
$config->set(
'HTML.ForbiddenElements',
'script'
);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($html);
Result
<p>content...</p><p></p><p>content2</p>

You were half on the right track. If you set HTML.SafeIframe to true and URI.SafeIframeRegexp to the URLs you want to accept (%^https://(www.youtube.com/embed/|player.vimeo.com/video/)% works fine), an input example of:
<p>content...<p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
<script>alert('abc');</script>
<p>content2</p>
...turns into...
<p>content...</p><p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
</p><p>content2</p>
Explanation: HTML.SafeIframe allows the <iframe> tag, but HTML Purifier still expects a whitelist for the URLs that the iframe can contain, since otherwise an <iframe> opens too much malicious potential. URI.SafeIframeRegexp supplies the whitelist (in the form of a regex that needs to be matched).
See if that works for you!
Code
This is the code that made the transformation I just mentioned:
$dirty = '<p>content...<p>
<iframe src="https://www.youtube.com/embed/blep"></iframe>
<script>alert(\'abc\');</script>
<p>content2</p>';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.SafeIframe', true);
$config->set('URI.SafeIframeRegexp', '%^https://(www.youtube.com/embed/|player.vimeo.com/video/)%');
$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($dirty);
Regarding HTML.Trusted
I implore you to never set HTML.Trusted to true if you don't fully trust each and every one of the people submitting the HTML.
Amongst other things, it allows forms in your input HTML to survive the purification unmolested, which (if you're purifying for a website, which I assume you are) makes phishing attacks trivial. It allows your input to use style tags which survive unscathed. There are some things it will still strip (any HTML tag that HTML Purifier doesn't actually know anything about, i.e. most HTML5 tags being some of them, various JavaScript attribute handlers as well), but there are enough attack vectors that you might as well not be purifying if you use this directive. As Ambush Commander once put it:
You shouldn't be using %HTML.Trusted anyway; it really ought to be named %HTML.Unsafe or something.

Consider using a full-fledged HTML parser like Masterminds html5-php. HTML code would then be parsed without undesired alterations like wrapping IFRAME in P, and you would be able to manipulate the resulting DOM tree the way you want, including removing some elements while keeping other ones.
For example, the following code could be used for removing SCRIPT elements from the document:
foreach ($dom->getElementsByTagName('script') as $script) {
$script->parentNode->removeChild($script);
}
And note that code like this:
<script type="text/javascript"><!--//--><![CDATA[//><!--
alert('abc');
//--><!]]></script>`
is obsolete. The modern HTML5 equivalent code is :
<script>alert('abc');</script>
exactly as in your source code before being processed by HTML Purifier.

configure htmlpurifier with html5 and css3

I am using htmlpurifier. I have some doubts which are as below.
1- My config file contain
$config->set('HTML.Trusted' ,true);
$config->set('CSS.Trusted', true);
But a simple Google is landing me to pages where there are recommendation not to use *.Trusted as "true".
I am not able to understand why should we should not set *.Trusted to true? Can you please explain me. Because if I remove it than I wont get inline css? Even CSS.AllowTricky is not helping.
2- I found that HTML5 and CSS3 selectors are not allowed.
like the code at htmlpurifier/library/HTMLPurifier/Filter/ExtractStyleBlocks.php
// - No Unicode support
// - No escapes support
// - No string support (by proxy no attrib support)
// - element_name is matched against allowed
// elements (some people might find this
// annoying...)
// - Pseudo-elements one of :first-child, :link,
// :visited, :active, :hover, :focus
// handle ruleset
$selectors = array_map('trim', explode(',', $selector));
$new_selectors = array();
foreach ($selectors as $sel) {
//some code to filter css selectors
}
do not contain any code which can allow selectors like '[class*="grid-"]'. Hence all such css is getting removed after purification. Is ther some way to allow all CSS3?
3- Is there some way to allow all HTML 5 tags? for example if we have html like
<section class="mainhead">
<div class="subhead"> </div> </section>
then purifier removes and due to which some css like
.mainhead .subhead { //some css}
wont work.

Setting HTML.Trusted to true enables unsafe elements, such as script and forms. If you, for example, want to allow forms, but don't want to allow scripts, just add them to HTML.ForbiddenElements:
$config->set('HTML.Trusted', true);
$config->set('HTML.ForbiddenElements', ['script']);
CSS3 selectors are not supported - there is a pending Pull Request, but it has been inactive for more than a year (as of Aug 2019). I don't think this will change any soon.
As for HTML5 support - you can use an extension package https://github.com/xemlock/htmlpurifier-html5 (which I'm the author of), which adds spec compliant definitions of HTML5 elements. It's usage is almost the same the bare HTML Purifier - you just have to replace HTMLPurifier_Config with HTMLPurifier_HTML5Config.

htmlpurifier removes all the formatting done by CKEDitor

I am using CKEditor to let the user post their comments. I thought to use the htmlpurifier to secure my html. But when I tried it, it actually removes all the formatting done by CKEditor.
The CKEditor generated the following html
<div class="originalpost"><span style="color:#B22222;">
<em><u><strong><span style="font-size:250%;">
This is Pakistan</span></strong></u></em></span></div>
After purifying with htmlpurifier the html became like this
<div class=""originalpost""><span><em><u><strong>
<span>This is Pakistan</span></strong></u></em></span></div>
It actually removes all the inline css styles and also class=""originalpost"" is not understand able.
I have used the following way to purify the html with htmlpurifier
require_once("path\HTMLPurifier.auto.php");
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$html = "xyzhtml";
$clean_html = $purifier->purify($html);
I want to keep the user formatting, How can I configure htmlpurifier to keep the user formatting also don't change the inline css.

It actually removes all the inline css styles
Inline styles are indeed dangerous - JavaScript can be injected into them using url(), IE's dodgy expression() and browser-specific behavioural extensions.
HTMLPurifier can parse inline styles and filter the dangerous properties and values. You can turn this on by including style in your whitelisted attributes.
$config->set('HTML.AllowedAttributes', '*.style, ...');
style is not included in the default attribute list because parsing styles is a lot of extra complexity (with accompanying chance of bugs) and most applications don't need it.
You can configure the properties that are permitted using %CSS.AllowedProperties if you wish.
I can't reproduce the " problem but certainly ensuring PHP's magic_quotes_gpc option is turned off is an all-round good thing...

I bet that you need to turn off Sybase quotes.

Is it possible to configure HtmlPurifer to strip tags that have an attribute with a specific value?

I am using HtmlPurifer to sanitize html input.
Here is my HtmlPurifer config:
$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.AllowedAttributes', "*.style,a.href,a.target,img.src,img.height,img.width");
$config->set('HTML.AllowedElements','a,p,ol,li,ul,b,u,strike,br,span,img,div');
$config->set('HTML.ForbiddenAttributes', "*#class,div#*");
$config->set('Attr.AllowedFrameTargets', array('_blank'));
$config->set('CSS.AllowedProperties', array('text-decoration', 'font-weight', 'font-style'));
$config->set('AutoFormat.RemoveSpansWithoutAttributes', true);
$config->set('AutoFormat.RemoveEmpty', true);
$purifier = new HTMLPurifier($config);
$sanitized = $purifier->purify($data);
Works like a charm.
BUT...
I am wondering if it is possible to configure HtmlPurifer such that it will strip any element that does not have an attribute with a SPECIFIC value.
For example, I might want to remove
<p class="badParagraph" />
but not
<p class="goodParagraph" />
Does anyone know if this is possible and, if so, how to go about it?
Thanks!

As was mentioned in the corresponding HTML Purifier thread, there is not presently an easy way of doing this. Injectors are a fairly general mechanism by which you could implement this; in which case look at RemoveEmpty for some inspiration.

Creating an inline 'Jargon' helper in PHP

I have an article formatted in HTML. It contains a whole lot of jargon words that perhaps some people wouldn't understand.
I also have a glossary of terms (MySQL Table) with definitions which would be helpful to there people.
I want to go through the HTML of my article and find instances of these glossary terms and replace them with some nice JavaScript which will show a 'tooltip' with a definition for the term.
I've done this nearly, but i'm still having some problems:
terms are being found within words (ie: APS is in Perhaps)
I have to make sure that it doesn't do this to alt, title, linked text, etc. So only text that doesn't have any formatting applied. BUT it needs to work in tables and paragraphs.
Here is the code I have:
$query_glossary = "SELECT word FROM glossary_terms WHERE status = 1 ORDER BY LENGTH(word) DESC";
$result_glossary = mysql_query_run($query_glossary);
//reset mysql via seek so we don't have to do the query again
mysql_data_seek($result_glossary,0);
while($glossary = mysql_fetch_array($result_glossary)) {
//once done we can replace the words with a nice tip
$glossary_word = $glossary['word'];
$glossary_word = preg_quote($glossary_word,'/');
$article['content'] = preg_replace_callback('/[\s]('.$glossary_word.')[\s](.*?>)/i','article_checkOpenTag',$article['content'],10);
}
And here is the PHP function:
function article_checkOpenTag($matches) {
if (strpos($matches[0], '<') === false) {
return $matches[0];
}
else {
$query_term = "SELECT word,glossary_term_id,info FROM glossary_terms WHERE word = '".escape($matches[1])."'";
$result_term = mysql_query_run($query_term);
$term = mysql_fetch_array($result_term);
# CREATING A RELEVENT LINK
$glossary_id = $term['glossary_term_id'];
$glossary_link = SITEURL.'/glossary/term/'.string_to_url($term['word']).'-'.$term['glossary_term_id'];
# SOME DESCRIPTION STUFF FOR THE TOOLTIP
if(strlen($term['info'])>400) {
$glossary_info = substr(strip_tags($term['info']),0,350).' ...<br /> Read More';
}
else {
$glossary_info = $term['info'];
}
return ' '.$term['word'].'',$glossary_info,400,1,0,1).'">'.$matches[1].'</a> '.$matches[2];
}
}

Move the load from server to client. Assuming that your "dictionary of slang" changes not frequently and that you want to "add nice tooltips" to words across a lot of articles, you can export it into a .js file and add a corresponding <script> entry into your pages - just a static file easily cacheable by a web-browser.
Then write a client-side js-script that will try to find a dom-node where "a content with slang" is put, then parse out the occurences of the words from your dictionary and wrap them with some html to show tooltips. Everything with js, everything client-side.
If the method is not suitable and you're going to do the job within your php backend, at least consider some caching of processed content.
I also see that you insert a description text for every "jargon word" found within content. What if a word is very frequent across an article? You get overhead. Make that descriptions separate, put them into JS as an object. The task is to find words which have a description and just mark them using some short tag, for instance <em>. Your js-script should find that em`s, pick a description from the object (associative array with descriptions for words) and construct a tooltip dynamically on "mouse over" event.

Interestingly enough, I was searching exactly NOT for a question like yours, but while reading I realized that your question is one that I had been through quite some time ago
It was basically a system to parse a dictionary and spits augmented HTML.
My suggestion would include instead:
Use database if you want, but a cached generated CSV file could be faster to use as dictionary
Use a hook in your rendering system to parse the actual content within this dictionary
caching of the page could be useful too
I elaborated a solution on my blog (in French, sorry for that). But it outlines basically something that you can actually use to do that.
I called it "ContentAbbrGenerator" as a MODx plugin. But the raw of the plugin can be applied outside of the established structure.
Anyway you can download the zip file and get the RegExes and find a way around it.
My objective
Use one file that is read to get the kind of html decoration.
Generate html from within author entered content that doesnt know about accessibility and tags (dfn and or abbr)
Make it re-usable.
Make it i18n-izable. That is, in french, we use the english definition but the adaptative technology reads the english word in french and sounds weird. So we had to use the lang="" attribute to make it clear.
What I did
Is basically that the text you give, gets more semantic.
Imagine the following dictionary:
en;abbr;HTML;Hyper Text Markup Language;es
en;abbr;abbr;Abbreviation
Then, the content entered by the CMS could spit a text like this:
<p>Have you ever wanted to do not hassle with HTML abbr tags but was too lazy to hand-code them all!? That is my solution :)</p>
That gets translated into:
<p>Have you ever wanted to do not hassle with <abbr title="Hyper Text Markup Language" lang="es">HTML</abbr> <abbr title="Abbreviation">abbr</abbr> tags but was too lazy to hand-code them all!? That is my solution :)</p>
All depends from one CSV file that you can generate from your database.
The conventions I used
The file /abbreviations.txt is publicly available on the server (that could be generated) is a dictionary, one definition per accronym
An implementation has only to read the file and apply it BEFORE sending it to the client
The tooltips
I strongly recommend you use the tooltip tool that even Twitter Bootstrap implements. It basically reads the title of any marked up tags you want.
Have a look there: Bootstrap from Twitter with Toolip helper.
PS: I'm very sold to the use of the patterns Twitter put forward with this Bootstrap project, it's worth a look!!

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.