Multiple regular expression interfere - php

I use regex to create html tags in plain text. like this
loop
$SearchArray[] = "/\b(".preg_quote($user['name'], "/").")\b/i";
$ReplaceArray[] = '$1';
-
$str = preg_replace($SearchArray, $ReplaceArray, $str);
I'm looking for a way to not match $user['name'] in a tag.

You could use preg_replace_callback()
for 5.3+:
$callback = function($match) using ($user) {
return ''.$match[1].'';
};
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $string);
for 5.2+:
$method = 'return \'\'.$match[1].\'\';';
$callback = create_function('$match', $method);
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $string);

So the problem is that you're making several passes over the document, replacing a different user name in each pass, and you're afraid you'll unintentionally replace a name inside a tag that was created in a previous pass, right?
I would try to do all of the replacements in one pass, using preg_replace_callback as #ircmaxwell suggested, and one regex that can match any legal user name. In the callback function, you look up the matched string to see if it's a real user's name. If it is, return the generated link; if not, return the matched string for reinsertion.

It looks like you're trying to add a bunch of anchors to a document. Have you thought of using SimpleXML. This assumes that the anchor tags are part of a larger xhtml document.
//$xhtml_doc is some xhtml doc's path
$doc = simplexml_load_file($xhtml);
//NOTE: find the parent element for all these anchors (maybe with xpath)
//example: $parent = $doc->xpath('//div[#id=parent]');
foreach($user as $k => $v){
$anchor = $doc->addChild('a', $v['name']);
$anchor->addAttribute('href', $v['url']);
}
return $doc->asXML();
simpleXML helps me a lot in these situations. It'll be a lot faster than regex, even if this isn't exactly what you want to do.

Related

Stop Script Tags But Allow Others [duplicate]

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie

Replacing multiple href values using a regular expression in PHP

I know this question has more of a WordPress background to it, but I'm hoping it's just my lack of PHP knowledge that is the problem here.
I have a regex that looks for <a> tags such as: Find Out More, which are generated from content inputted by the user. Once it finds this content a foreach loop runs over the matched text and uses a WordPress function $postId = url_to_postid( $url ); to convert the URL into a PostID.
This is an example output: Find Out More.
This works fine as long as there is one link in each piece of matched text. However if there are two or more links, it sets every link to have the same href which is incorrect.
I'm sure this is something to do with how I've got my loop running. The code I'm using is below:
<?php
$string = get_field('sample_text_box');
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
preg_match_all($pattern, $string, $matches);
$urls = $matches[0];
foreach($urls as $key => $url) {
$postId[$key] = url_to_postid( $url );
}
$newstring = preg_replace($pattern , $postId[$key] , $string);
echo $newstring;
?>
The line get_field('sample_text_box'); is an Advanced Custom Field function which "returns the value of the specified field". You can read about it here if it helps: https://www.advancedcustomfields.com/resources/get_field/
Thanks!
The problem with your code is that you are indeed replacing the pattern with a single value of $postId[$key], where $key is the last value assigned in foreach.
You also can not pass $postId instead because preg_replace expects both pattern and replacement be of the same type - whether strings, or arrays.
However, the problem is easily solved with preg_replace_callback function:
$pattern = '/(?<=href\="|\')([^"\']+)(?="|\')/m';
$new_string = preg_replace_callback($pattern, function ($matches) {
return isset($matches[1]) ? url_to_postid($matches[1]) : $matches[0];
}, $string);

PHP: How to strip tags in a string with certain attributes that have certain values?

I need a function like this:
function strip_tags_with_attribute_values($string, $allowedTags, $allowedAttribute, $allowedValue) {
...
}
And it must produce results like this:
$str = '<p class="bla">hello1</p><p class='bla2'>hello2</p>';
echo strip_tags_with_attribute_values($str, '<p>', 'class', 'bla');
Must produce:
hello1<p class='bla2'>hello2</p>
Why do I need this?
Users copy and paste text from word into the FCKEditor (in Drupal). I need to strip out all the style attributes from the p and span tags.
In your case, using something as simple as
$str = preg_replace("/<p class=\"(bla)\">(.+?)<\/p>/is", "$2", $str);
Should work. If you want arguments, you could try
function strip_tags_with_attribute_values($str, $tag, $att, $val)
{
$pat = "/<{$tag} {$att}=\"{$val}\">(.+?)<\/{$tag}>/is";
$str = preg_replace($pat, "$1", $str);
return $str;
}
Or something similar. This wont work correctly if a tag has multiple attributes. If that's your case, you'll probably want to try using a DOM object or XPATH to strip those out.

preg_replace apply string function (like urlencode) in replacement

i want to parse all links in html document string in php in such way: replace href='LINK' to href='MY_DOMAIN?URL=LINK', so because LINK will be url parameter it must be urlencoded. i'm trying to do so:
preg_replace('/href="(.+)"/', 'href="http://'.$host.'/?url='.urlencode('${1}').'"', $html);
but '${1}' is just string literal, not founded in preg url, what need i do, to make this code working?
Well, to answer your question, you have two choices with Regex.
You can use the e modifier to the regex, which tells preg_replace that the replacement is php code and should be executed. This is typically seen as not great, since it's really no better than eval...
preg_replace($regex, "'href=\"http://{$host}?url='.urlencode('\\1').'\"'", $html);
The other option (which is better IMHO) is to use preg_replace_callback:
$callback = function ($match) use ($host) {
return 'href="http://'.$host.'?url='.urlencode($match[1]).'"';
};
preg_replace_callback($regex, $callback, $html);
But also never forget, don't parse HTML with regex...
So in practice, the better way of doing it (The more robust way), would be:
$dom = new DomDocument();
$dom->loadHtml($html);
$aTags = $dom->getElementsByTagName('a');
foreach ($aTags as $aElement) {
$href = $aElement->getAttribute('href');
$href = 'http://'.$host.'?url='.urlencode($href);
$aElement->setAttribute('href', $href);
}
$html = $dom->saveHtml();
Use the 'e' modifier.
preg_replace('/href="([^"]+)"/e',"'href=\"http://'.$host.'?url='.urlencode('\\1').'\"'",$html);
http://uk.php.net/preg-replace - example #4

Allow user submitted HTML in PHP

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie

Categories