i want to parse all links in html document string in php in such way: replace href='LINK' to href='MY_DOMAIN?URL=LINK', so because LINK will be url parameter it must be urlencoded. i'm trying to do so:
preg_replace('/href="(.+)"/', 'href="http://'.$host.'/?url='.urlencode('${1}').'"', $html);
but '${1}' is just string literal, not founded in preg url, what need i do, to make this code working?
Well, to answer your question, you have two choices with Regex.
You can use the e modifier to the regex, which tells preg_replace that the replacement is php code and should be executed. This is typically seen as not great, since it's really no better than eval...
preg_replace($regex, "'href=\"http://{$host}?url='.urlencode('\\1').'\"'", $html);
The other option (which is better IMHO) is to use preg_replace_callback:
$callback = function ($match) use ($host) {
return 'href="http://'.$host.'?url='.urlencode($match[1]).'"';
};
preg_replace_callback($regex, $callback, $html);
But also never forget, don't parse HTML with regex...
So in practice, the better way of doing it (The more robust way), would be:
$dom = new DomDocument();
$dom->loadHtml($html);
$aTags = $dom->getElementsByTagName('a');
foreach ($aTags as $aElement) {
$href = $aElement->getAttribute('href');
$href = 'http://'.$host.'?url='.urlencode($href);
$aElement->setAttribute('href', $href);
}
$html = $dom->saveHtml();
Use the 'e' modifier.
preg_replace('/href="([^"]+)"/e',"'href=\"http://'.$host.'?url='.urlencode('\\1').'\"'",$html);
http://uk.php.net/preg-replace - example #4
Related
I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie
so I have the code
function getTagContent($string, $tagname) {
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
print_r($matches);
}
and then I call
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$html = file_get_contents($url);
getTagContent($html,"title");
but then it shows that there are no matches, while if you open the source of the url there clearly exist a title tag....
what did I do wrong?
try DOM
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$doc = new DOMDocument();
$dom = $doc->loadHTMLFile($url);
$items = $doc->getElementsByTagName('title');
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "\n";
}
The 'title' tag is not on the same line as its closing tag, so your preg_match doesn't find it.
In Perl, you can add a /s switch to make it slurp the whole input as though on one line: I forget whether preg_match will let you do so or not.
But this is just one of the reasons why parsing XML and variants with regexp is a bad idea.
Probably because the title is spread on multiple lines. You need to add the option s so that the dot will also match any line returns.
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
Have your php function getTagContent like this:
function getTagContent($string, $tagname) {
$pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
preg_match($pattern, $string, $matches);
print_r($matches);
}
It is important to use non-greedy match all .*? for matching text between start and end of tag and equally important is to use flags s for DOTALL (matches new line as well) and i for ignore case comparison.
I'm creating a simple search for my application.
I'm using PHP regular expression replacement (preg_replace) to look for a search term (case insensitive) and add <strong> tags around the search term.
preg_replace('/'.$query.'/i', '<strong>$0</strong>', $content);
Now I'm not the greatest with regular expressions. So what would I add to the regular expression to not replace search terms that are in a href of an anchor tag?
That way if someone searched "info" it wouldn't change a link to "http://something.com/this_<strong>info</strong>/index.html"
I believe you will need conditional subpatterns] for this purpose:
$query = "link";
$query = preg_quote($query, '/');
$p = '/((<)(?(2)[^>]*>)(?:.*?))*?(' . $query . ')/smi';
$r = "$1<strong>$3</strong>";
$str = ''."\n".'A Link'; // multi-line text
$nstr = preg_replace($p, $r, $str);
var_dump( $nstr );
$str = 'Its not a Link'; // non-link text
$nstr = preg_replace($p, $r, $str);
var_dump( $nstr );
Output: (view source)
string(61) "<a href="/Link/foo/the_link.htm">
A <strong>Link</strong></a>"
string(31) "Its not a <strong>Link</strong>"
PS: Above regex also takes care of multi-line replacement and more importantly it ignores matching not only href but any other HTML entity enclosed in < and >.
EDIT: If you just want to exclude hrefs and not all html entities then use this pattern instead of above in my answer:
$p = '/((<)(?(2).*?href=[^>]*>)(?:.*?))*?(' . $query . ')/smi';
I'm not 100% what you are ultimately after here, but from what I can, it's a sort of "search phrase" highlighting facility, which highlights keywords so to speak. If so, I suggest having a look at the Text Helper in CodeIgniter. It provides a nice little function called highlight_phrase and this could do what you are looking for.
The function is as follows.
function highlight_phrase($str, $phrase, $tag_open = '<strong>', $tag_close = '</strong>')
{
if ($str == '')
{
return '';
}
if ($phrase != '')
{
return preg_replace('/('.preg_quote($phrase, '/').')/i', $tag_open."\\1".$tag_close, $str);
}
return $str;
}
You may use conditional subpatterns, see explanation here: http://cz.php.net/manual/en/regexp.reference.conditional.php
preg_replace("/(?(?<=href=\")([^\"]*\")|($query))/i","\\1<strong>\\2</strong>",$x);
In your case, if you have whole HTML, not just href="", there is an easier solution using 'e' modifier, which enables you using PHP code in replacing matches
function termReplacer($found) {
$found = stripslashes($found);
if(substr($found,0,5)=="href=") return $found;
return "<strong>$found</strong>";
}
echo preg_replace("/(?:href=)?\S*$query/e","termReplacer('\\0')",$x);
See example #4 here http://cz.php.net/manual/en/function.preg-replace.php
If your expression is even more complex, you can use regExp even inside termReplacer().
There is a minor bug in PHP: the $found parameter in termReplacer() needs to be stripslashed!
I use regex to create html tags in plain text. like this
loop
$SearchArray[] = "/\b(".preg_quote($user['name'], "/").")\b/i";
$ReplaceArray[] = '$1';
-
$str = preg_replace($SearchArray, $ReplaceArray, $str);
I'm looking for a way to not match $user['name'] in a tag.
You could use preg_replace_callback()
for 5.3+:
$callback = function($match) using ($user) {
return ''.$match[1].'';
};
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $string);
for 5.2+:
$method = 'return \'\'.$match[1].\'\';';
$callback = create_function('$match', $method);
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $string);
So the problem is that you're making several passes over the document, replacing a different user name in each pass, and you're afraid you'll unintentionally replace a name inside a tag that was created in a previous pass, right?
I would try to do all of the replacements in one pass, using preg_replace_callback as #ircmaxwell suggested, and one regex that can match any legal user name. In the callback function, you look up the matched string to see if it's a real user's name. If it is, return the generated link; if not, return the matched string for reinsertion.
It looks like you're trying to add a bunch of anchors to a document. Have you thought of using SimpleXML. This assumes that the anchor tags are part of a larger xhtml document.
//$xhtml_doc is some xhtml doc's path
$doc = simplexml_load_file($xhtml);
//NOTE: find the parent element for all these anchors (maybe with xpath)
//example: $parent = $doc->xpath('//div[#id=parent]');
foreach($user as $k => $v){
$anchor = $doc->addChild('a', $v['name']);
$anchor->addAttribute('href', $v['url']);
}
return $doc->asXML();
simpleXML helps me a lot in these situations. It'll be a lot faster than regex, even if this isn't exactly what you want to do.
I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie