PHP replace words to links except images

PHP replace words to links except images - php

My code is
$words = array();
$links = array();
$result = mysql_query("SELECT `keyword`, `link` FROM `articles` where `link`!='".$act."' ")
or die(mysql_error());
$i = 0;
while($row = mysql_fetch_array( $result ))
{
if (!empty($row['keyword']))
{
$words[$i] = '/(?<!(src="|alt="))'.$row['keyword'].'/i';
$links[$i] = ''.$row['keyword'].'';
$i++;
}
}
$text = preg_replace($words, $links, $text);
I want to replace Hello with Guys except img src and alt.
From
Say Hello my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
I want
Say Guys my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
The current code, replaces only when my keyword has only 1 word.

EDIT: the previsouly suggested correction was not relevant.
Still:
I would suggest you not to use any regex but only str_replace in your case if you have a performance constraint.
You must change your MySQL functions that are legacy: http://php.net/manual/en/function.mysql-fetch-array.php

EDIT: I can't believe it took me that long to understand that you're trying to parse big chunks of HTML with regular expressions.
Read the answer to this question:
RegEx match open tags except XHTML self-contained tags

Edit: I updated the code to work better.
I'm unsure exactly what the issue is but looking at your code I wouldn't be surprised that the negative look behind regex isn't matching multiple word strings where the "keyword" is not the first word after the src or alt. It might possible to beef up the regex, but IMHO a complicated regex might be a little too brittle for your html parsing needs. I'd recommend doing some basic html parsing yourself and doing a simple string replace in the right places.
Here's some basic code. There is certainly a much better solution than this, but I'm not going to spend too much time on this. Probably, rather than inserting html in a text node, you should create a new html a element with the right attributes. Then you wouldn't have to decode it. But this would be my basic approach.
$text = "Lorem ipsum <img src=\"lorem ipsum\" alt=\"dolor sit amet\" /> dolor sit amet";
$result = array(
array('keyword' => 'lorem', 'link' => 'http://www.google.com'),
array('keyword' => 'ipsum', 'link' => 'http://www.bing.com'),
array('keyword' => 'dolor sit', 'link' => 'http://www.yahoo.com'),
);
$doc = new DOMDocument();
$doc->loadHTML($text);
$xpath = new DOMXPath($doc);
foreach($result as $row) {
if (!empty($row['keyword'])) {
$search = $row['keyword'];
$replace = ''.$row['keyword'].'';
$text_nodes = $xpath->evaluate('//text()');
foreach($text_nodes as $text_node) {
$text_node->nodeValue = str_ireplace($search, $replace, $text_node->nodeValue);
}
}
}
echo html_entity_decode($doc->saveHTML());
The $result data structure is meant to be similar to result of your mysql_fetch_array(). I'm only getting the children of the root for the created html DOMDocument. If the $text is more complicated, it should be pretty easy to traverse more thoroughly through the document. I hope this helps you.

Related

preg_replace and preg_match_all to move img from wordpress $content

I am using preg_replace to delete from $content certain <img>:
$content=preg_replace('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/','',$content);
When I am now displaying the content using wordpress the_content function, I did indeed remove the <img>s from $content:
I'd like beforehand to get this images to place them elsewhere in the template. I am using the same regex pattern with preg_match_all:
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
But I can't get my imgs?
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
print_r($matches);
Array ( [0] => Array ( ) )

assuming and hopefully you are using php5, this is a task for DOMDocument and xpath. regex with html elements mostly will work, but check the following example from
<img alt=">" src="/path.jpg" />
regex will fail. since there aren't many guarantees in programming, take the guarantee that xpath will find EXACTLY what you want, at a perfomance cost, so to code it:
$doc = new DOMDocument();
$doc->loadHTML('<span><img src="com.png" /><img src="com2.png" /></span>');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//span/img');
$html = '';
foreach($imgs as $img){
$html .= $doc->saveXML($img);
}
now you have all img elements in $html, use str_replace() to remove them from $content, and from there you can have a drink and be pleased that xpath with html elements is painless, just a little slower
ps. i couldnt be be bother understanding your regex, i just think xpath is better in your situation

at the end i have used preg_replace_callback:
$content2 = get_the_content();
$removed_imgs = array();
$content2 = preg_replace_callback('#(?!<img.+?id="featured_img".*?\/>)(<img.+? />)#',function($r) {
global $removed_imgs;
$removed_imgs[] = $r[1];
return '';
},$content2);
foreach($removed_imgs as $img){
echo $img;
}

Using preg_replace_callback() to extract all images from a string of HTML

Tricky preg_replace_callback function here - I am admittedly not great at PRCE expressions.
I am trying to extract all img src values from a string of HTML, save the img src values to an array, and additionally replace the img src path to a local path (not a remote path). Ie I might have, surrounded by a lot of other HTML:
img src='http://www.mysite.com/folder/subfolder/images/myimage.png'
And I would want to extract myimage.png to an array, and additionally change the src to:
src='images/myimage.png'
Can that be done?
Thanks

Does it need to use regular expressions? Handling HTML is normally easier with DOM functions:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("http://stackoverflow.com"));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName("img");
$data = array();
foreach($items as $item) {
$data[] = array(
"src" => $item->getAttribute("src"),
"alt" => $item->getAttribute("alt"),
"title" => $item->getAttribute("title"),
);
}
print_r($data);

Do you need regex for this? Not necessary. Are regex the most readable solution? Probably not - at least unless you are fluent in regex. Are regex more efficient when scanning large amounts of data? Absolutely, the regex are compiled and cached upon first appearance. Do regex win the "least lines of code" trophy?
$string = <<<EOS
<html>
<body>
blahblah<br>
<img src='http://www.mysite.com/folder/subfolder/images/myimage.png'>blah<br>
blah<img src='http://www.mysite.com/folder/subfolder/images/another.png' />blah<br>
</body>
</html>
EOS;
preg_match_all("%<img .*?src=['\"](.*?)['\"]%s", $string, $matches);
$images = array_map(function ($element) { return preg_replace("%^.*/(.*)$%", 'images/$1', $element); }, $matches[1]);
print_r($images);
Two lines of code, that's hard to undercut in PHP. It results in the following $images array:
Array
(
[0] => images/myimage.png
[1] => images/another.png
)
Please note that this won't work with PHP versions prior to 5.3 unless you replace the anonymous function with a proper one.

Smiley Replace within CDATA of an HTML-String

i have got a simple problem :( I need to replace text smilies with the according smiley-image. ok.. thats not really complex, but now i have to replace only smilie appereances outside of HTML Tags. short examplae:
Text:
Thats a good example :/ .. with a link inside.
i want to replace ":/" with the image of this smiley...
ok, how to do that the best way?

I won't try to create some super script but think about it.... smilies are just about always surrounded by spaces. So str replace ' :/ ' with the smiley. You could be saying "what about a smiley at the end of a sentence(where it would be used the most)". Well just check for at least one space on either the left or the right of a potential smiley.
Using the above scripts:
$smiley_array = array(
":) " => "<a href...>",
" :)" => "<a href...>",
":/ " => "<a href...>",
" :/" => "<a href...>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);
If you rather not have to type everything twice you can generate the array from a single smiley array.

Why don't you just try to use some special chars around your smiley text like this maybe -:/-
This will make your smiley text some kind of unique and easy to recognize

Use preg_replace with a lookbehind assertion. Example:
$smileys = array(
':/' => '<img src="..." alt=":/">'
);
foreach ($smileys as $smile => $img) {
$text = preg_replace('#(?<!<[^<>]*)' . preg_quote($smile, '#') . '#',
$img, $text);
}
The regex should match only smileys that are not inside angle brackets. This might be slow if you have a lot of false positives.

I wouldn't know about the best way, only the way I would do it.
Build an array having the smiley codes as the keys and the link as the value. The use str_replace. Pass as "needle" an array of the keys (the smiley codes) and as "replace" an array of the values.
For instance, suppose you have something like this:
$smiley_array = array(":)" => "<a href...>",
":(" => "<a href=....>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);
EDIT: In case this could accidentally replace other instances with smiley-links you should consider using regexes with preg_replace. Obviously preg_replace is slower than str_replace.

You can use regex, or the extra sloppy version of the above:
$smiley_array = array(":)" => "<a href...>",
":(" => "<a href=....>");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace("://", "%%QF%%", $str);
$str = str_replace($codes, $links, $str);
$str = str_replace("%%QF%%", "://", $str);
Actually, assuming str_replace follows the array sorting...
this should work:
$smiley_array = array("://" => "%%QF%%", ":)" => "<a href...>",
":(" => "<a href=....>", "%%QF%%" => "://");
$codes = array_keys($smiley_array);
$links = array_values($smiley_array);
$str = str_replace($codes, $links, $str);

Possible overkill (increased cpu/load), but 99.99999999% safe:
<?php
$n = new DOMDocument();
$n->loadHTML('<p>Thats a good example :/ .. with a link inside.</p>');
$x = new DOMXPath($n);
$instances = $x->query('//text()[contains(.,\':/\')]');//or use '//*[child::text()]' for all textnodes
foreach($instances as $node){
if($node instanceof DOMText && preg_match_all('/:\//',$node->wholeText,$matches,PREG_OFFSET_CAPTURE|PREG_SET_ORDER)){
foreach($matches[0] as $match){
$newnode = $node->splitText($match[1]);
$newnode->replaceData(0,strlen($match[0]),'');
$img = $n->createElement('img');
$img->setAttribute('src','smily.gif');
$img = $newnode->parentNode->insertBefore($img,$newnode);
//var_dump($match);
}
}
}
var_dump($n->saveHTML());
?>
But in reality you do not want to do this all that often, save once, show many, if you are letting users edit the html (beit in wysiwyg or elsewise, the 'return' transformation (img to text) is a whole lot lighter. Up to you to expand with different smilies (one monster regex to match them, or several smaller ones / strstr()'s for readability, and a array for smiley to src (e.g. array(':/'=>'frown.gif')) would be the way to go.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.

First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.

try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php

The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.

I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php

It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Is my anti XSS method OK for allowing user HTML in PHP?

I am working on finding a good way to make user submitted data, in this case allow HTML and have it be as safe and fast as I can.
I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer here. I do agree partially. htmlpurifier has the best open source code out there for filtering user submitted HTML but there solution is very bulky and is not good for performance on a high traffic site. I might even use there solution someday but for now my goal is to find a more lightweight method.
I have been using the 2 functions below for about 2 and a half years now with no problems yet but I think it is time to take some input from the pro's on here if they will help me.
The first function is called FilterHTML($string) it is ran before user data is saved to a mysql database. The second function is called format_db_value($text, $nl2br = false) and I use it on a page where I plan to show the user submitted data.
Below the 2 functions is a bunch of the XSS codes I found on http://ha.ckers.org/xss.html and I then ran them on these 2 functions to see how affective my code is, I am somewhat pleased with the results, they did block out every code I tried but I know it is still not 100% safe obviously.
Can you guys please look over it and give me any advice for my code itself or even on the whole html filtering concept.
I would like to do a whitelist approach someday but htmlpurifier is the only solution I have found worth using for that and as I mentioned it is not lightweight as I would like.
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
Below function is used when showing user submitted code on a webpage
function format_db_value($text, $nl2br = false) {
if (is_array($text)) {
$tmp_array = array();
foreach ($text as $key => $value) {
$tmp_array[$key] = format_db_value($value);
}
return $tmp_array;
} else {
$text = htmlspecialchars(stripslashes($text));
if ($nl2br) {
return nl2br($text);
} else {
return $text;
}
}
}
The codes below are from ha.ckers.org and they all seem to fail on my functions above
I did not try everyone on that site though there is a lot more, this is just some of them.
The original code is on the top line of each set and the code after running through my functions is on the line below it.
<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii
<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>
<IMG SRC=javascript:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041>
<IMG SRC=F MLEJNALN !>
<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
<IMG SRC=...alert('XSS')>
<IMG SRC="jav
ascript:alert('XSS');">
<IMG SRC=...alert('XSS');">
perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
<BODY onload!#$%&()*~+-_.,:;?#[/|\]^`=alert("XSS")>
...
<iframe src=http://ha.ckers.org/scriptlet.html <
...
<LAYER SRC="http://ha.ckers.org/scriptlet.html"></LAYER>
......
<META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css>; REL=stylesheet">
...; REL=stylesheet">
<IMG STYLE="xss:...(alert('XSS'))">
<IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">
<XSS STYLE="xss:...(alert('XSS'))">
<XSS STYLE="xss:expression(alert('XSS'))">
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>
<IMG
SRC
=...
a
l
e
r
t
(
'
X
S
S
'
)
"
>

Only way to be sure is to whitelist the tags and attributes that they can use and write strict regexps to validate allowed values of attributes. If you want to allow attributes such as "style" then you have additional complexity.
Blacklisting only might make attack for some people harder but it will not make it any harder for the person that uses technique you have not heard of yet.
I'd try using regexp to add missing closing tags to what users entered and replace <br> with <br /> and so on, then parse it using SimpleXML, then iterate over it and remove any tag that is not in whitelist, any attribute that is not in the whitelist for given tag, and any attribute that has a value that does conform to precise regexp for this attribute. After all I'd use asXML() to get the text back. I'd start with minimal set of tags and attributes and add new ones as needed being especially careful of anything that may contain url.

Here is four alternatives :
Pear's HTML_Safe
HTML_Sanitizer
htmLawed
HTML_Filter

IMHO htmlawed is the best -- lean, fast, full HTML coverage, most flexible... black OR white list for tags AND attributes. Safe? Defeats all the ha.ckers XSS codes

How about using PHP's native HTML parser?
I was curious about it, so I've wrote some code for testing (requires PHP 5.3.6+):
$badHtml = file_get_contents('badHtml.txt');
$html = sprintf('<div id="input">%s</div>', $badHtml);
// tidy is no required, but may fix invalid markup
$tidy = new \tidy();
$tidy->parseString($html, array(), 'utf8');
$tidy->cleanRepair();
$dom = new \DomDocument('1.0', 'UTF-8');
libxml_use_internal_errors(true);
$dom->loadHtml($tidy);
$input = $dom->getElementById('input');
// tag as key, attributes as values
$allowed = array(
'table' => array('border'),
'tbody' => array(),
'tr' => array(),
'td' => array(),
'th' => array(),
'img' => array('src', 'alt'),
'p' => array(),
'ul' => array(),
'ol' => array(),
'li' => array(),
'a' => array('href', 'title'),
'strong' => array(),
'em' => array(),
'sub' => array(),
'sup' => array(),
);
$walk = function(\DomNode $node) use($allowed, &$walk){
// only check tags
if($node->nodeType !== XML_ELEMENT_NODE)
return;
if(!isset($allowed[$node->nodeName]))
return $node->parentNode->removeChild($node);
foreach($node->attributes as $key => $attr){
if(!in_array($key, $allowed[$node->nodeName], true))
$node->removeAttribute($key);
// expect URLs here
if(!in_array($key, array('href', 'src'), true))
continue;
if(!filter_var($attr->value, FILTER_VALIDATE_URL))
return $node->parentNode->removeChild($node);
}
array_map($walk, iterator_to_array($node->childNodes));
};
// convert DOMNodeList to array because this way the bad stuff
// can be removed within the loop
array_map($walk, iterator_to_array($input->childNodes));
// export HTML
$sanitized = $dom->saveHtml($input);
The output, without running Tidy:
Seems ok. Or did it remove too much? :)
Should be way faster than HTMLPurifier, theoretically more secure since it's less permissive, and probably faster than the regexes too.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP replace words to links except images - php

EDIT: the previsouly suggested correction was not relevant. Still: I would suggest you not to use any regex but only str_replace in your case if you have a performance constraint. You must change your MySQL functions that are legacy: http://php.net/manual/en/function.mysql-fetch-array.php

EDIT: I can't believe it took me that long to understand that you're trying to parse big chunks of HTML with regular expressions. Read the answer to this question: RegEx match open tags except XHTML self-contained tags

Related

preg_replace and preg_match_all to move img from wordpress $content

Using preg_replace_callback() to extract all images from a string of HTML

Smiley Replace within CDATA of an HTML-String

Using regex to remove HTML tags

Is my anti XSS method OK for allowing user HTML in PHP?

Categories

Resources