Smiley Replace within CDATA of an HTML-String - php

I have a simple problem :( I need to replace text smileys with the corresponding smiley image. That is not really complex, but I must replace only the smileys that appear outside of HTML tags. A short example:
Text:
Thats a good example :/ .. with a link inside.
I want to replace ":/" with the image of this smiley.
What is the best way to do that?

I won't try to create some super script, but think about it: smileys are almost always surrounded by spaces. So str_replace ' :/ ' with the smiley. You might ask, "what about a smiley at the end of a sentence (where it would be used the most)?" Just check for at least one space on either the left or the right of a potential smiley.
Using the above scripts:
$smiley_array = array(
":) " => "<a href...>",
" :)" => "<a href...>",
":/ " => "<a href...>",
" :/" => "<a href...>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);
If you would rather not type everything twice, you can generate the array from a single smiley array.
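Generating the padded array could look roughly like this. It is only a sketch: the image file names are placeholders, and unlike the array above it keeps the matched space in the output so words do not run into the image.

```php
<?php
// Hypothetical base map: smiley code => replacement markup (placeholder file names).
$smileys = array(
    ':)' => '<img src="smile.gif" alt="smile">',
    ':/' => '<img src="unsure.gif" alt="unsure">',
);

// Build both space-padded variants and keep the space in the replacement.
$padded = array();
foreach ($smileys as $code => $markup) {
    $padded[$code . ' '] = $markup . ' ';
    $padded[' ' . $code] = ' ' . $markup;
}

$str = 'Nice :) but http://example.com is untouched :/';
$out = str_replace(array_keys($padded), array_values($padded), $str);
echo $out, "\n";
```

The "://" inside the URL is left alone because it has no space on either side of the ":/".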

Why don't you use some special characters around your smiley text, like this maybe: -:/-
This makes your smiley text unique and easy to recognize.

Use preg_replace with a lookahead assertion. (PCRE lookbehinds cannot contain an unbounded repetition, so a pattern like (?<!<[^<>]*) will not compile.) Example:
$smileys = array(
    ':/' => '<img src="..." alt=":/">'
);
foreach ($smileys as $smile => $img) {
    // match the smiley only if no ">" follows before the next "<"
    $text = preg_replace('#' . preg_quote($smile, '#') . '(?![^<>]*>)#',
        $img, $text);
}
The regex should match only smileys that are not inside angle brackets. This might be slow if you have a lot of false positives.
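Another way to keep replacements out of tags is to split the string into tags and text first and touch only the text parts. The following is a minimal sketch of that idea, not a full HTML parser: the function name and the image file name are made up for illustration, and it assumes the markup has no stray "<" or ">" inside attribute values.

```php
<?php
// Hypothetical helper: replace smileys only in the text between tags.
function replace_smileys_outside_tags($html, array $smileys)
{
    // Split into text and tags; with DELIM_CAPTURE the tags land on the odd indexes.
    $parts = preg_split('#(<[^<>]*>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even indexes are plain text, so it is safe to replace here.
            $parts[$i] = strtr($part, $smileys);
        }
    }
    return implode('', $parts);
}

$html = 'Thats a good example :/ .. with a <a href="http://example.com/">link</a> inside.';
$out = replace_smileys_outside_tags($html, array(':/' => '<img src="unsure.gif" alt="unsure">'));
echo $out, "\n";
```

The "://" in the href survives because the whole opening tag is one untouched part.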

I wouldn't know about the best way, only the way I would do it.
Build an array with the smiley codes as the keys and the links as the values. Then use str_replace: pass an array of the keys (the smiley codes) as the search argument and an array of the values as the replace argument.
For instance, suppose you have something like this:
$smiley_array = array(":)" => "<a href...>",
":(" => "<a href=....>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);
EDIT: In case this could accidentally replace other instances with smiley-links you should consider using regexes with preg_replace. Obviously preg_replace is slower than str_replace.

You can use regex, or the extra sloppy version of the above:
$smiley_array = array(":)" => "<a href...>",
":(" => "<a href=....>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace("://", "%%QF%%", $str);
$str = str_replace($codes, $links, $str);
$str = str_replace("%%QF%%", "://", $str);
Actually, assuming str_replace processes the pairs in array order, this should work:
$smiley_array = array("://" => "%%QF%%", ":)" => "<a href...>",
":(" => "<a href=....>", "%%QF%%" => "://");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);
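A variant of the same idea without the placeholder, sketched with strtr(): when given an array, strtr() tries the longest keys first and never rescans text it has already replaced, so mapping "://" to itself shields URLs from the ":/" smiley. The image file names are placeholders.

```php
<?php
$map = array(
    '://' => '://', // longer key wins, so URLs pass through unchanged
    ':)'  => '<img src="smile.gif" alt="">',
    ':/'  => '<img src="unsure.gif" alt="">',
);

$str = 'See http://example.com :/ or not :)';
$out = strtr($str, $map);
echo $out, "\n";
```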

Possible overkill (increased cpu/load), but 99.99999999% safe:
<?php
$n = new DOMDocument();
$n->loadHTML('<p>Thats a good example :/ .. with a link inside.</p>');
$x = new DOMXPath($n);
$instances = $x->query('//text()[contains(.,\':/\')]'); // or use '//text()' for all text nodes
foreach ($instances as $node) {
    if ($node instanceof DOMText && preg_match_all('/:\//', $node->data, $matches, PREG_OFFSET_CAPTURE)) {
        // walk the matches back to front so the earlier offsets stay valid after each split
        // (the offsets are byte offsets, so this assumes single-byte text before each smiley)
        foreach (array_reverse($matches[0]) as $match) {
            $newnode = $node->splitText($match[1]);
            $newnode->replaceData(0, strlen($match[0]), '');
            $img = $n->createElement('img');
            $img->setAttribute('src', 'smily.gif');
            $img = $newnode->parentNode->insertBefore($img, $newnode);
            //var_dump($match);
        }
    }
}
var_dump($n->saveHTML());
?>
But in reality you do not want to do this very often: save once, show many times. If you let users edit the HTML (be it in a WYSIWYG editor or otherwise), the 'return' transformation (img to text) is a whole lot lighter. It is up to you to expand this to different smileys: one monster regex to match them all (or several smaller ones / strstr() calls for readability) and an array mapping smiley to src (e.g. array(':/'=>'frown.gif')) would be the way to go.
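The "one regex plus a smiley-to-src array" expansion could be sketched as follows. This is an illustration under assumptions, not the answerer's code: the file names, the sample markup, and the choice to rebuild each text node from a document fragment are all made up here.

```php
<?php
// Hypothetical map of smiley code => image file.
$smileys = array(':/' => 'frown.gif', ':)' => 'smile.gif');

// One alternation for all codes; preg_quote() escapes the special characters.
$pattern = '/(' . implode('|', array_map(function ($code) {
    return preg_quote($code, '/');
}, array_keys($smileys))) . ')/';

$doc = new DOMDocument();
$doc->loadHTML('<p>Good :) or <a href="http://example.com/">bad</a> :/</p>');
$xpath = new DOMXPath($doc);

// Copy the node list to an array first, because the loop edits the tree.
foreach (iterator_to_array($xpath->query('//text()')) as $node) {
    $parts = preg_split($pattern, $node->data, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no smiley in this text node
    }
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes hold the matched smiley codes.
            $img = $doc->createElement('img');
            $img->setAttribute('src', $smileys[$part]);
            $img->setAttribute('alt', $part);
            $fragment->appendChild($img);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($fragment, $node);
}

$html = $doc->saveHTML();
echo $html;
```

Attribute values such as the href are never text nodes, so the "://" in the link is not touched.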

Related

Converting text to smiley if multiple smileys are combined together not working

I'm trying to convert text ($icon) to a smiley image ($image). I used to do it with str_replace(), but that seems to perform the replacements sequentially, and as such it also replaces items in previously converted results (for example in the <img> tag).
I am now using the following code:
foreach($smiliearray as $image => $icon){
$pattern[]="/(?<!\S)" . preg_quote($icon, '/') . "(?!\S)/u";
$replacement[]=" <img src='$image' border='0' alt=''> ";
}
$text = preg_replace($pattern,$replacement,$text);
This code works, but only if the smiley code is surrounded by whitespace. So if someone types ":);)", it is not caught as two separate smileys, whereas ":) ;)" is.
How can I fix it so that a string of smileys (not separated by spaces) is also converted?
Note that there can be unlimited kinds of smiley codes and smiley images. I do not know beforehand which ones, because other people can submit codes and smileys, so it is not just ":)" and ";)", but can also be "rofl", ":eh", ":-{", etc.
I can partially fix it by adding a \W non-word to the end of the second lookaround: (?!\S\W), and further by adding a second $pattern and $replacement with a \W in the first lookaround. But I don't think that is the way it should be done, and it only partially solves it.
I used to do it with str_replace(), but that seems to perform the
replace sequentially and as such it also replaces items in previously
converted results...
A good and true reason to use strtr(). You don't even need Regular Expressions:
<?php
// I assume your original array looks like this
$origSmileys = [
"/1.png" => ':)',
"/2.png" => ':(',
"/3.png" => ':P',
"/4.png" => '>:('
];
// sample input string
$str = " I'm :) but :(>:(:( now :P";
// iterating over smileys to add html tag
$newSmileys = array_map(function($value) {
return "<img src='$value' border='0' alt=''>";
}, array_flip($origSmileys));
// replace
echo strtr($str, $newSmileys);
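The difference between the two functions is easy to see with a word-style code. In this small sketch the codes and file names are invented for illustration: the second code, "smile", also occurs inside the markup produced for the first one.

```php
<?php
// Hypothetical codes: the word code "smile" also appears in the first image path.
$map = array(
    ':)'    => "<img src='/smile.png'>",
    'smile' => "<img src='/grin.png'>",
);

$input = 'ok :)';

// str_replace() works pair by pair, so the second pair rewrites the first result.
$sequential = str_replace(array_keys($map), array_values($map), $input);

// strtr() makes a single pass and never rescans what it has replaced.
$single_pass = strtr($input, $map);

echo $sequential, "\n";
echo $single_pass, "\n";
```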

Extracting links from a piece of text in PHP except ignoring image links

I have this piece of text, and I want to extract the links from it. Some links will have tags and some will just be there in plain format. But I also have images, and I don't want their links.
How would I extract links from this piece of text while ignoring image links? So basically and google.com should both be extracted.
string(441) "<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>"
I have tried the following, but it's incomplete:
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}
Using just that one string to test, the following works for me:
$str = '<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
preg_match('~a href="(.*?)"~', $str, $strArr);
Using a href ="..." in the preg_match() call fills $strArr with two values: the full match and the captured link to google. (preg_match() stops at the first match; use preg_match_all() to collect every link.)
Array
(
[0] => a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg"
[1] => https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg
)
I would try something like this.
Find and remove images tags:
$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);
Find and collect URLs.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);
Output Urls:
print_r($match);
Good luck!
I played around with this a lot more and have an answer that may better suit what you are trying to do with a bit of "future proofing"
$str = '<p class="fr-tag">Please visit www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
$str = str_replace('&nbsp;',' ',$str);
$strArr = explode(' ',$str);
$len = count($strArr);
for($i = 0; $i < $len; $i++){
if(stristr($strArr[$i],'http') || stristr($strArr[$i],"www")){
$matches[] = $strArr[$i];
}
}
echo "<pre>";
print_r($matches);
echo "</pre>";
I went back and analyzed your string and noticed that if you translate the &nbsp; entities to spaces, you can then explode the string into an array, step through it, and add any element that contains http or www to the $matches array to be processed later. The output is pretty clean and easy to work with, and you also get rid of most of the HTML markup this way.
Note that this probably isn't the best way to do it. I haven't tested it with any string other than the one you offered, so there is room for optimization.
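A DOM-based sketch of the same task, for comparison. The sample markup is an assumption (the anchor and the image URL are made up to resemble the question's string): it collects href values from anchors, then bare URLs from text nodes, and never looks at img attributes.

```php
<?php
// Assumed sample input: one real anchor, one bare URL, one image.
$html = '<p>Please visit <a href="https://www.google.co.uk/">this link</a> and this http://d.pr/i/1i2Xu '
      . '<img alt="Image title" src="https://example.com/shot.png" width="300"></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = array();

// 1) href values of real anchors
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// 2) bare URLs in text nodes that are not already inside an anchor
foreach ($xpath->query('//text()[not(ancestor::a)]') as $text) {
    if (preg_match_all('#\bhttps?://[^\s<>"]+#i', $text->nodeValue, $m)) {
        $links = array_merge($links, $m[0]);
    }
}

print_r($links);
```

Because the image URL lives in an attribute rather than a text node, it is skipped without any special handling.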

PHP replace words to links except images

My code is
$words = array();
$links = array();
$result = mysql_query("SELECT `keyword`, `link` FROM `articles` where `link`!='".$act."' ")
or die(mysql_error());
$i = 0;
while($row = mysql_fetch_array( $result ))
{
if (!empty($row['keyword']))
{
$words[$i] = '/(?<!(src="|alt="))'.$row['keyword'].'/i';
$links[$i] = ''.$row['keyword'].'';
$i++;
}
}
$text = preg_replace($words, $links, $text);
I want to replace Hello with Guys except img src and alt.
From
Say Hello my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
I want
Say Guys my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
The current code, replaces only when my keyword has only 1 word.
EDIT: the previously suggested correction was not relevant.
Still:
I would suggest not using any regex here, only str_replace, if you have a performance constraint.
You must also replace the legacy mysql_* functions (removed in PHP 7) with mysqli or PDO: http://php.net/manual/en/function.mysql-fetch-array.php
EDIT: I can't believe it took me that long to understand that you're trying to parse big chunks of HTML with regular expressions.
Read the answer to this question:
RegEx match open tags except XHTML self-contained tags
Edit: I updated the code to work better.
I'm unsure exactly what the issue is, but looking at your code I wouldn't be surprised if the negative lookbehind isn't matching multiple-word strings where the "keyword" is not the first word after the src or alt. It might be possible to beef up the regex, but IMHO a complicated regex might be a little too brittle for your HTML parsing needs. I'd recommend doing some basic HTML parsing yourself and doing a simple string replace in the right places.
Here's some basic code. There is certainly a much better solution than this, but I'm not going to spend too much time on this. Probably, rather than inserting html in a text node, you should create a new html a element with the right attributes. Then you wouldn't have to decode it. But this would be my basic approach.
$text = "Lorem ipsum <img src=\"lorem ipsum\" alt=\"dolor sit amet\" /> dolor sit amet";
$result = array(
array('keyword' => 'lorem', 'link' => 'http://www.google.com'),
array('keyword' => 'ipsum', 'link' => 'http://www.bing.com'),
array('keyword' => 'dolor sit', 'link' => 'http://www.yahoo.com'),
);
$doc = new DOMDocument();
$doc->loadHTML($text);
$xpath = new DOMXPath($doc);
foreach($result as $row) {
if (!empty($row['keyword'])) {
$search = $row['keyword'];
$replace = ''.$row['keyword'].'';
$text_nodes = $xpath->evaluate('//text()');
foreach($text_nodes as $text_node) {
$text_node->nodeValue = str_ireplace($search, $replace, $text_node->nodeValue);
}
}
}
echo html_entity_decode($doc->saveHTML());
The $result data structure is meant to be similar to result of your mysql_fetch_array(). I'm only getting the children of the root for the created html DOMDocument. If the $text is more complicated, it should be pretty easy to traverse more thoroughly through the document. I hope this helps you.
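The idea of creating real a elements instead of writing markup into text nodes (which removes the need for html_entity_decode) might look like the sketch below. The keyword row and the link URL are placeholders, and the sample text is taken from the question.

```php
<?php
// Hypothetical keyword => URL rows, shaped like the $result array above.
$rows = array(
    array('keyword' => 'hello', 'link' => 'http://www.example.com/hello'),
);

$doc = new DOMDocument();
$doc->loadHTML('<p>Say Hello my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear"></p>');
$xpath = new DOMXPath($doc);

foreach ($rows as $row) {
    $pattern = '/(' . preg_quote($row['keyword'], '/') . ')/i';
    // Skip text already inside a link; attribute values are never text nodes.
    foreach (iterator_to_array($xpath->query('//text()[not(ancestor::a)]')) as $node) {
        $parts = preg_split($pattern, $node->data, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the matched keyword, with its original casing.
                $a = $doc->createElement('a');
                $a->setAttribute('href', $row['link']);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }
}

$out = $doc->saveHTML();
echo $out;
```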

Add id attribute to hyperlinks through PHP Regular Expressions

I am still relatively new to regular expressions and feel my code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My function looks like this:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id tag like I want but it breaks the hyperlink. For example:
If the original code is : Link
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
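To touch every link rather than only the first one, the same DOM calls can be put in a loop. A small sketch follows; the sample markup and the "link-N" id scheme are assumptions (ids have to be unique within a document, so a counter is used here).

```php
<?php
$html = '<p><a href="http://www.google.com">Link</a> and <a href="http://example.com/">another</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$i = 0;
foreach ($doc->getElementsByTagName('a') as $a) {
    // The existing href is left exactly as it was parsed.
    $a->setAttribute('id', 'link-' . $i);
    $i++;
}

// Print only the paragraph instead of the whole generated document.
$out = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $out, "\n";
```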
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
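A quick way to see this, using a made-up link for illustration: the character class from the question stops at the colon, while a class that only excludes quotes captures the whole URL.

```php
<?php
$tag = '<a href="http://www.google.com">Link</a>';

// Character class from the question: word characters, dot and hyphen only.
preg_match('/href=["\']?([\w.-]*)/', $tag, $narrow);

// Anything up to the closing quote.
preg_match('/href=["\']([^"\']*)["\']/', $tag, $wide);

echo $narrow[1], "\n"; // http
echo $wide[1], "\n";   // http://www.google.com
```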
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);

Is my anti XSS method OK for allowing user HTML in PHP?

I am working on finding a good way to handle user submitted data, in this case allowing HTML, and to make it as safe and fast as I can.
I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer here. I partially agree. htmlpurifier has the best open source code out there for filtering user submitted HTML, but their solution is very bulky and does not perform well on a high traffic site. I might even use their solution someday, but for now my goal is to find a more lightweight method.
I have been using the 2 functions below for about two and a half years with no problems so far, but I think it is time to get some input from the pros on here, if they will help me.
The first function is called FilterHTML($string); it is run before user data is saved to a MySQL database. The second function is called format_db_value($text, $nl2br = false), and I use it on pages where I show the user submitted data.
Below the 2 functions is a bunch of the XSS codes I found on http://ha.ckers.org/xss.html. I ran them through these 2 functions to see how effective my code is. I am somewhat pleased with the results: they blocked every code I tried, but I know it is obviously still not 100% safe.
Can you guys please look over it and give me any advice for my code itself or even on the whole html filtering concept.
I would like to do a whitelist approach someday but htmlpurifier is the only solution I have found worth using for that and as I mentioned it is not lightweight as I would like.
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
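One caveat about the function above: the /e modifier used in the two entity-decoding lines was deprecated in PHP 5.5 and removed in PHP 7.0. A sketch of the same two steps with preg_replace_callback() follows; the function name is made up, and the modulo only mirrors the way chr() wraps values above 255.

```php
<?php
// Hypothetical stand-in for the two /e lines in FilterHTML().
function decode_numeric_entities($string)
{
    // decimal notation, e.g. &#106
    $string = preg_replace_callback('/&#(\d+)/', function ($m) {
        return chr((int) $m[1] % 256);
    }, $string);

    // hex notation, e.g. &#x6A
    $string = preg_replace_callback('/&#x([a-f0-9]+)/i', function ($m) {
        return chr(hexdec($m[1]) % 256);
    }, $string);

    return $string;
}

$decoded = decode_numeric_entities('&#0000106&#x61&#118&#X61');
echo $decoded, "\n";
```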
Below function is used when showing user submitted code on a webpage
function format_db_value($text, $nl2br = false) {
if (is_array($text)) {
$tmp_array = array();
foreach ($text as $key => $value) {
$tmp_array[$key] = format_db_value($value);
}
return $tmp_array;
} else {
$text = htmlspecialchars(stripslashes($text));
if ($nl2br) {
return nl2br($text);
} else {
return $text;
}
}
}
The codes below are from ha.ckers.org and they all seem to fail on my functions above
I did not try every one on that site, though; there are a lot more. These are just some of them.
The original code is on the top line of each set and the code after running through my functions is on the line below it.
<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii
<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>
<IMG SRC=javascript:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041>
<IMG SRC=F MLEJNALN !>
<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
<IMG SRC=...alert('XSS')>
<IMG SRC="jav
ascript:alert('XSS');">
<IMG SRC=...alert('XSS');">
perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
<BODY onload!#$%&()*~+-_.,:;?#[/|\]^`=alert("XSS")>
...
<iframe src=http://ha.ckers.org/scriptlet.html <
...
<LAYER SRC="http://ha.ckers.org/scriptlet.html"></LAYER>
......
<META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css>; REL=stylesheet">
...; REL=stylesheet">
<IMG STYLE="xss:...(alert('XSS'))">
<IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">
<XSS STYLE="xss:...(alert('XSS'))">
<XSS STYLE="xss:expression(alert('XSS'))">
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>
<IMG
SRC
=...
a
l
e
r
t
(
'
X
S
S
'
)
"
>
The only way to be sure is to whitelist the tags and attributes that users can use, and to write strict regexps to validate the allowed values of attributes. If you want to allow attributes such as "style", then you have additional complexity.
Blacklisting alone might make an attack harder for some people, but it will not make it any harder for a person who uses a technique you have not heard of yet.
I'd try using regexps to add missing closing tags to what users entered, replace <br> with <br />, and so on, then parse it using SimpleXML, then iterate over it and remove any tag that is not in the whitelist, any attribute that is not in the whitelist for the given tag, and any attribute whose value does not conform to a precise regexp for that attribute. After all that I'd use asXML() to get the text back. I'd start with a minimal set of tags and attributes and add new ones as needed, being especially careful with anything that may contain a URL.
Here are four alternatives:
Pear's HTML_Safe
HTML_Sanitizer
htmLawed
HTML_Filter
IMHO htmLawed is the best: lean, fast, full HTML coverage, most flexible, with a black OR white list for tags AND attributes. Safe? It defeats all the ha.ckers XSS codes.
How about using PHP's native HTML parser?
I was curious about it, so I wrote some code for testing (requires PHP 5.3.6+):
$badHtml = file_get_contents('badHtml.txt');
$html = sprintf('<div id="input">%s</div>', $badHtml);
// tidy is not required, but may fix invalid markup
$tidy = new \tidy();
$tidy->parseString($html, array(), 'utf8');
$tidy->cleanRepair();
$dom = new \DomDocument('1.0', 'UTF-8');
libxml_use_internal_errors(true);
$dom->loadHtml($tidy);
$input = $dom->getElementById('input');
// tag as key, attributes as values
$allowed = array(
'table' => array('border'),
'tbody' => array(),
'tr' => array(),
'td' => array(),
'th' => array(),
'img' => array('src', 'alt'),
'p' => array(),
'ul' => array(),
'ol' => array(),
'li' => array(),
'a' => array('href', 'title'),
'strong' => array(),
'em' => array(),
'sub' => array(),
'sup' => array(),
);
$walk = function(\DomNode $node) use($allowed, &$walk){
// only check tags
if($node->nodeType !== XML_ELEMENT_NODE)
return;
if(!isset($allowed[$node->nodeName]))
return $node->parentNode->removeChild($node);
// iterate over a copy, because attributes are removed inside the loop
foreach(iterator_to_array($node->attributes) as $key => $attr){
if(!in_array($key, $allowed[$node->nodeName], true)){
$node->removeAttribute($key);
continue;
}
// expect URLs here
if(!in_array($key, array('href', 'src'), true))
continue;
if(!filter_var($attr->value, FILTER_VALIDATE_URL))
return $node->parentNode->removeChild($node);
}
array_map($walk, iterator_to_array($node->childNodes));
};
// convert DOMNodeList to array because this way the bad stuff
// can be removed within the loop
array_map($walk, iterator_to_array($input->childNodes));
// export HTML
$sanitized = $dom->saveHtml($input);
The output, without running Tidy:
Seems ok. Or did it remove too much? :)
Should be way faster than HTMLPurifier, theoretically more secure since it's less permissive, and probably faster than the regexes too.
