HTML Purifier - Disable JavaScript - php

I use HTML Purifier to disable JavaScript in a textarea.
My problem is:
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier();
$va = $purifier->purify($va);
This removes script tags, but does not remove <a href='javascript:...'>link</a>.
What should I do to remove the bad links and retain good links?

Try setting the URI.AllowedSchemes whitelist.
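A minimal sketch of what that could look like, assuming HTML Purifier is installed and that 'HTMLPurifier.auto.php' is reachable on your include path (both are assumptions about your setup). Note that the snippet in the question creates $config but never passes it to the purifier, so any setting made on it would be ignored. The $dirty variable and the chosen scheme list are illustrative only.

```php
<?php
// Assumption: HTML Purifier is installed and this autoloader path is valid for your setup.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist of URI schemes; javascript: is not listed, so such hrefs should be dropped.
$config->set('URI.AllowedSchemes', array('http' => true, 'https' => true, 'mailto' => true));

// Hand the config to the purifier, otherwise the settings above are never applied.
$purifier = new HTMLPurifier($config);

$dirty = '<a href="javascript:alert(1)">bad</a> <a href="http://example.com/">good</a>';
echo $purifier->purify($dirty);
```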

The live demo is indeed filtering both href="javascript:... and onclick. You can see the demo here.
Maybe you are using an older version?

Use regular expressions to scan the textarea's content for invalid / unwanted tags.

Related

How to properly escape HTML editor content?

So I am using the TinyMCE editor and have handled getting the content into the text area by using htmlspecialchars(), which works fine, but I'm a little confused about the other side of using a WYSIWYG editor: the content output part.
I am using HTML Purifier to output the content, but from what I understand I've just been doing for example:
$purifierConfig = HTMLPurifier_Config::createDefault();
$purifierConfig->set('HTML.Allowed', 'p');
$Purifier = new HTMLPurifier($purifierConfig);
$input = $Purifier->purify($input);
I've only tested with p tags, but does this mean I will have to go through everything TinyMCE uses and add it to the allowed list? Or is there a better way of tackling safe output from a WYSIWYG editor?
Yes, you need to set all allowed tags you want, separated by a comma. You can also specify what attributes are allowed by enclosing them with brackets:
$purifierConfig = HTMLPurifier_Config::createDefault();
$purifierConfig->set('HTML.Allowed', 'p,a[href],b,i,strong,em');
$Purifier = new HTMLPurifier($purifierConfig);
$input = $Purifier->purify($input);
I guess for a better understanding, the printDefinition can help.

Get plain text from ckeditor

I want to know if it's possible to get plain text (text without html code) when I submit my form having a ckeditor textarea.
In fact, I want to have a simple textarea with the spellchecker option of ckeditor.
p.s. I am using vtiger 6.
It's possible, and given that you're using PHP a decent-ish solution would be HTML Purifier. Assuming you can install it via PEAR (which is the most straightforward way to do it), your example would look like:
require_once 'HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', ''); // Allow nothing (dotted key; the three-argument form is the older, deprecated style)
$purifier = new HTMLPurifier($config);
return $purifier->purify($_REQUEST['field_name']);
You're going to want to remove most of the editor buttons and other options from CKeditor as well to prevent your users from adding a lot of formatting that won't survive filtration.
However:
Using the full CKEditor stack for spellcheck is overkill! Why not just use a standalone, jQuery spellchecker like this one?
You can use the PHP function that strips HTML tags from a string, strip_tags:
$plaintext = strip_tags($_POST['mytexteditor']);
You can also allow certain tags:
$plaintext_with_ps = strip_tags($_POST['mytexteditor'], '<p>');
Don't attempt to use it as a security measure, however.
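A short illustration of why strip_tags is not a security measure, using only built-in PHP. The input string is made up for the example. Attributes on allowed tags are left untouched, and the text inside a removed script element stays in the output.

```php
<?php
// Made-up input: an allowed <p> carrying an event handler, plus a <script> element.
$input = '<p onclick="alert(1)">Hello</p><script>alert(2)</script>';

// Only <p> is allowed, but strip_tags() does not inspect attributes at all.
$out = strip_tags($input, '<p>');

echo $out, PHP_EOL;
// prints: <p onclick="alert(1)">Hello</p>alert(2)
```

The onclick handler survives, which is why a whitelist-based sanitizer is the safer choice for untrusted input.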

Disabling all html tags, except some tags I want to allow

I want to allow tags like <b>, <h1>, <h2> but still disabling HTML in posts.
How can I do that? Can I do it with htmlspecialchars?
Thanks for your help. The site and the users are really helpful :).
You can use strip_tags (http://fr.php.net/manual/en/function.strip-tags.php); the second argument allows some tags.
Use strip_tags().
http://php.net/manual/en/function.strip-tags.php
$stripped = strip_tags($text, "<b><h1><h2>");
Caveat emptor. There might be some security implications there.
There is a great tool called HTML Purifier that does exactly what you are asking for :) You should check it out: http://htmlpurifier.org/
Another alternative is to use BBCode which can be found here: http://nbbc.sourceforge.net/
Good luck :D

How to remove a link from content using php?

$text = file_get_contents('http://www.example.com/file.php?id=name');
echo preg_replace('#<a.*?>.*?</a>#i', '', $text)
the link contains this content:
text text text. <br><a href='http://www.example.com' target='_blank' title='title' style='text-decoration:none;'>name</a>
What is the problem with this script?
You can't parse HTML with regular expressions. Use an XML/HTML parser.
Tempted to flag your question, but there's no option for "Report user for summoning Cthulhu"
I'd recommend reading: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
RegEx is very poor and not at all intended to parse HTML. That's why there are HTML parsing libraries. Find and use one for PHP. :)
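As a rough sketch of the parser-based route, here is one way to drop the link with PHP's built-in DOMDocument. The removeLinks helper is hypothetical (it does not appear in the thread), and the sample string is shortened from the one in the question. It assumes the input is an HTML fragment; non-ASCII input may need an explicit encoding hint, which is left out here.

```php
<?php
// Hypothetical helper (not from the thread): drops every <a> element, including its text.
function removeLinks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    // Wrap the fragment so it can be pulled back out of the parsed document.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Copy the live node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
        $link->parentNode->removeChild($link);
    }

    // The wrapper is the first <div> in document order; serialize its children only.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = "text text text. <br><a href='http://www.example.com' target='_blank'>name</a>";
echo removeLinks($text), PHP_EOL;
```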
Use <a[^>]+>[^<]*</a> (works fine as long as there's just text and no tags inside the a element).
Use strip_tags this way:
$t = 'http://yoururl.com/test1.php';
$t1 = file_get_contents($t);
$text = strip_tags($t1);
This should strip all the tags, including the links, from the page you are reading (the link text itself remains). Check the reference anyway, as it may not work for complicated elements: http://php.net/manual/en/function.strip-tags.php

Question regarding the PHP function preg_replace

I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy dude.
To make a regular expression ungreedy, use the U modifier.
To let . match newlines as well, so that a match can span multiple lines, use the s modifier.
Knowing that, to remove all paragraphs, use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explanation:
I use the # delimiter so that I don't have to escape the / characters (which makes the pattern more readable).
<p[^>]*> : matches an opening paragraph tag (with optional attributes, such as a style attribute)
(.*)? : everything (in "ungreedy" mode)
</p> : obviously, the closing paragraph tag
Hope that helps!
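For reference, this is how that pattern behaves when passed to preg_replace. The sample HTML is made up for the example. One caveat worth knowing: because <p[^>]*> only checks that the tag name starts with "p", it would also match an opening <pre> tag.

```php
<?php
// Made-up sample: two paragraphs (one spanning two lines) between other elements.
$html = "<h1>Title</h1><p class=\"intro\">one\ntwo</p><p>three</p><div>keep</div>";

// s: the dot also matches newlines; U: quantifiers are ungreedy by default.
$result = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);

echo $result, PHP_EOL;
// prints: <h1>Title</h1><div>keep</div>
```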
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. This makes it easier to sanitize input and prevent XSS attacks. There's a well-known library called HTML Purifier that, although large and somewhat slow, gives very good results when purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
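<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
// Find the first <table> anywhere in the document.
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // removeChild() must be called on the node's direct parent, not on the root element.
    $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>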
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
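To make the "could cause problems" part concrete, here is a small comparison on a made-up string. The lazy variant (.+?) is an addition for illustration and is not one of the patterns suggested above; it still breaks on nested div elements, which is one more reason to prefer the DOM approach.

```php
<?php
// Made-up sample: two divs with a paragraph in between that should be kept.
$html = '<div>a</div><p>keep</p><div>b</div>';

// Text-only version: each div is matched on its own.
$textOnly = preg_replace('#<div>[^<]+</div>#', '', $html);

// Greedy version: .+ runs from the first <div> to the LAST </div>.
$greedy = preg_replace('#<div>.+</div>#', '', $html);

// Lazy version (illustrative addition): stops at the nearest </div>.
$lazy = preg_replace('#<div>.+?</div>#', '', $html);

var_dump($textOnly); // string(11) "<p>keep</p>"
var_dump($greedy);   // string(0) ""
var_dump($lazy);     // string(11) "<p>keep</p>"
```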
PSEUDO CODE
function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes regex metacharacters (and the # delimiter) in the literal HTML.
    return preg_replace('#^' . preg_quote($html_you_want_to_replace, '#') . '#', '', $html_dom);
}
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>
HTML After
<div>I'm next</div>
I know it's a hack job
