I have a PHP script which processes user input. I need to escape all special characters, but also make links clickable (turn them into <a> elements). What I need is:
function specialCharsAndLinks($text) {
// magic goes here
}
$inp = "http://web.page/index.php?a1=hi&a2=hello\n<script src=\"http://bad-website.com/exploit.js\"></script>";
$out = specialCharsAndLinks($inp);
echo $out;
The output should be (in HTML):
<a href="http://web.page/index.php?a1=hi&a2=hello">http://web.page/index.php?a1=hi&amp;a2=hello</a>
&lt;script src=&quot;http://bad-website.com/exploit.js&quot;&gt;&lt;/script&gt;
Note that the ampersand in the link stays as a literal & in the href attribute, but is converted to &amp; in the visible text of the link.
When viewed in a browser:
http://web.page/index.php?a1=hi&a2=hello
<script src="http://bad-website.com/exploit.js"></script>
I eventually solved it with:
function process_text($text) {
$text = htmlspecialchars($text);
$url_regex = "/(?:http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+(?:\/\S*)?/";
$text = preg_replace_callback($url_regex, function($matches){
return '<a href="' . htmlspecialchars_decode($matches[0]) . '">' . $matches[0] . '</a>';
}, $text);
return $text;
}
The first line html-encodes the input.
The second line defines the URL regex. Could be improved, but working for now.
The 3rd line uses preg_replace_callback, a function which is like preg_replace, but instead of supplying it with a replacement string, you supply a replacement function that returns the replacement string.
The 4th line is the replacement function itself. It's quite self-documenting: htmlspecialchars_decode undoes the actions of htmlspecialchars, thereby making the link valid again if it contained an ampersand.
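Putting the pieces together, here is the whole approach as a runnable sketch (the sample URL is the one from the question; the exact entity output assumes PHP's default htmlspecialchars flags):

```php
<?php
// Escape everything first, then re-linkify URLs; decode entities only
// inside the href attribute so the raw URL survives there.
function process_text($text) {
    $text = htmlspecialchars($text);
    $url_regex = '/(?:https?|ftps?):\/\/[a-zA-Z0-9\-.]+(?:\/\S*)?/';
    return preg_replace_callback($url_regex, function ($matches) {
        // href gets the decoded (raw) URL; the visible text stays escaped.
        return '<a href="' . htmlspecialchars_decode($matches[0]) . '">'
             . $matches[0] . '</a>';
    }, $text);
}

echo process_text("http://web.page/index.php?a1=hi&a2=hello");
// → <a href="http://web.page/index.php?a1=hi&a2=hello">http://web.page/index.php?a1=hi&amp;a2=hello</a>
```

One caveat: because \S* is greedy, a URL immediately followed by other escaped characters from the input can pull trailing entities into the link, so the regex is a starting point rather than a bulletproof matcher.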
Try this:
$urlEscaped = htmlspecialchars("http://web.page/index.php?a1=hi&a2=hello");
$aTag = '<a href="' . $urlEscaped . '">Hello</a>';
echo $aTag;
Your example doesn't work because when you escape the whole HTML tag, the tag will never be processed by the browser; it will just be displayed as plain text.
As you can see, Stack Overflow escapes our whole input (questions, answers, ...) so that we can actually see the code instead of letting the browser process it.
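To make the contrast concrete (example.com is just a placeholder), compare escaping the whole tag with escaping only the URL:

```php
<?php
// Escaping the whole tag turns it into visible text:
echo htmlspecialchars('<a href="http://example.com">Hello</a>');
// → &lt;a href=&quot;http://example.com&quot;&gt;Hello&lt;/a&gt;

// Escaping only the URL and building the tag around it keeps it clickable
// (an escaped ampersand inside an href is still valid HTML):
$urlEscaped = htmlspecialchars("http://web.page/index.php?a1=hi&a2=hello");
echo '<a href="' . $urlEscaped . '">Hello</a>';
// → <a href="http://web.page/index.php?a1=hi&amp;a2=hello">Hello</a>
```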
Related
I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text, then the URL itself, then stick them into an A HREF link. I am worried, though, that maybe I am missing a way for someone to inject XSS, because I am leaving in a decent number of characters. Is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
$words = preg_replace('/[^-_#0-9a-zA-Z.]/', '', $words);
$url = preg_replace('/[^-A-Za-z0-9+&#\/%?=~_|!:.]/', '', $url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.
If you have already passed the input through htmlspecialchars (which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars on it:
function doMarkdownLinks($s) {
return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
}, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.
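For reference, the complete function with a quick usage check (the sample input is made up):

```php
<?php
// Escape the whole string first, then rebuild only the Markdown links.
function doMarkdownLinks($s) {
    return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
        return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
    }, htmlspecialchars($s));
}

echo doMarkdownLinks('[flower](http://www.yahoo.com/flower.html) <script>');
// → <a href="http://www.yahoo.com/flower.html">flower</a> &lt;script&gt;
```

Everything outside the bracket syntax stays escaped; only the recognized link is turned back into markup.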
I have my own function called xss which returns cleaned text. I want to know if it is enough, or if I have some bug there:
function xss($str,$html = false)
{
if($html){
// HTML Purifier called here
}else{
return str_replace(array('&', '"', "'", '<', '>'), array('&amp;', '&quot;', '&#039;', '&lt;', '&gt;'), $str);
}
}
I don't want to use strip_tags because it deletes all tags. I want to keep them, but replace them with the safe entities. Is it safe to replace just these characters?
You should take into account some other characters that might be problematic. For example ':' and ';'. Imagine that you return the text inside a tag attribute (like an href), then the attacker could use something like 'http://yoursite.com/blah/?parameter=javascript:alert(String.fromCharCode(65))' and that would be reflected as:
...
<a href="javascript:alert(String.fromCharCode(65))">
...
Or they could use ';' to start a new statement inside a script if anything gets reflected there. Try to cover all possible problematic characters.
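One way to act on that advice is to whitelist URL schemes before echoing a user-supplied href. This is an illustrative sketch, not code from the answer; safeHref is a hypothetical helper name:

```php
<?php
// Hypothetical helper: allow only http/https URLs into an href attribute.
function safeHref($url) {
    $url = trim($url);
    // parse_url() exposes the scheme so javascript:, data:, etc. can be rejected.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, array('http', 'https'), true)) {
        return '#'; // reject anything that is not plain http(s)
    }
    return htmlspecialchars($url, ENT_QUOTES);
}

echo '<a href="' . safeHref('javascript:alert(1)') . '">link</a>';
// → <a href="#">link</a>
echo '<a href="' . safeHref('http://yoursite.com/blah/?parameter=1') . '">link</a>';
// → <a href="http://yoursite.com/blah/?parameter=1">link</a>
```

Checking the scheme explicitly avoids the blacklist problem: instead of chasing every dangerous character, only known-good schemes get through.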
I made this function
function echoSanitizer($var)
{
$var = htmlspecialchars($var, ENT_QUOTES);
$var = nl2br($var, false);
$var = str_replace(array("\\r\\n", "\\r", "\\n"), "<br>", $var);
$var = htmlspecialchars_decode($var);
return stripslashes($var);
}
Would it be safe from xss attacks?
htmlspecialchars to neutralize HTML tags
nl2br for the new lines
str_replace to convert the \r\n to <br>
htmlspecialchars_decode to convert back the original characters
stripslashes to remove escaping backslashes (e.g. those added by magic quotes)
Why do I need all of that? Because I want to preview what the users typed in, and I wanted a WYSIWYG-style preview for them to see. Some of the input comes from a textarea box, and I wanted the line breaks to be preserved, so nl2br is needed.
Mainly I'm asking about htmlspecialchars_decode, because it's new to me. Is it safe? As a whole, is the function I made safe if I use it to display user input?
(No database involved in this scenario.)
In your case htmlspecialchars_decode() makes the function unsafe. Users must not be allowed to insert < character unescaped, because that allows them to create arbitrary tags (and filtering/blacklisting is a cat and mouse game you can't win).
At the very minimum, < must be escaped as &lt;.
If you only allow plain text with newlines, then:
nl2br(htmlspecialchars($text_with_newlines, ENT_QUOTES));
is safe to output in HTML (except inside <script> or attributes that expect JavaScript or URLs such as onclick and href (in the latter case somebody could use javascript:… URL)).
If you want to allow users to use HTML tags, but not exploit your page, then a correct function to do this won't fit in a Stack Overflow post (it would be thousands of lines long and require a full HTML parser, processing of URLs and CSS, etc.); you'll have to use something heavy-weight like HTML Purifier.
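A quick check of the safe plain-text path described above:

```php
<?php
// Escape HTML metacharacters first, then convert newlines to <br /> tags.
$text = "Hello <script>alert(1)</script>\nWorld";
echo nl2br(htmlspecialchars($text, ENT_QUOTES));
// → Hello &lt;script&gt;alert(1)&lt;/script&gt;<br />
// World
```

The order matters: escaping after nl2br would escape the <br /> tags it just inserted.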
A user enters URLs in a box like this:
google.net
google.com
I then try to validate / check the URLs, so:
function check_input($data) {
$data = trim($data);
$data = mysql_real_escape_string($data);
return $data;
}
After validation:
$flr_array = explode("\n", $flr_post);
So, I can validate each URL. But mysql_real_escape_string finds the line breaks between URLs and escapes them, so print_r shows:
google.net\r\ngoogle.com
My URLs should look like this:
google.net
google.com
How do I remove the \r, because it breaks everything else?
Is there a better way to validate URLs?
I tried str_replace, but no luck.
The best way to validate URLs is to use PHP's filter_var() function, like so:
if ( ! filter_var($url, FILTER_VALIDATE_URL)) {
echo 'BAD URL';
} else {
echo 'GOOD_URL';
}
That's where the difference between single and double quotes comes into the picture. Inside single quotes, '\r\n' is four literal characters rather than a carriage return and newline, so this will not split on real line breaks:
$flr_array = explode('\r\n', $flr_post);
It needs double quotes: explode("\r\n", $flr_post).
Use preg_split instead,
$parts = preg_split('/[\n\r]+/', $data);
That'll split anywhere there's one or more \n or \r.
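Combining the two suggestions into one runnable sketch (the sample $flr_post is made up): split on any run of newline characters, then validate each line with filter_var().

```php
<?php
// Hypothetical sample input: URLs separated by Windows-style line breaks.
$flr_post = "http://google.net\r\nhttp://google.com\r\nnot a url";

$valid = array();
foreach (preg_split('/[\r\n]+/', trim($flr_post)) as $line) {
    $line = trim($line);
    if (filter_var($line, FILTER_VALIDATE_URL)) {
        $valid[] = $line; // keep only lines that parse as URLs
    }
}

print_r($valid); // the two google URLs pass; "not a url" is dropped
```

Note that FILTER_VALIDATE_URL requires a scheme, so a bare google.net (as in the question) fails validation; prepend http:// before validating if users may omit it.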
What are you doing the mysql_real_escape_string for? Is this intended for a database later on? Don't escape BEFORE you do other processing; that processing can break the escaping m_r_e_s() does and still leave you vulnerable to SQL injection.
m_r_e_s() MUST be the LAST thing you do to a string before it's used in an sql query string.
You should use regular expressions to validate the URLs in your $flr_array.
With preg_match(), if there is a match, it will fill the $matches variable with results (if you provided it in your function call). This is what php.net has to say about it:
"If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches1 will have the text that matched the first captured parenthesized subpattern, and so on."
You can use nl2br(), which inserts HTML line breaks before all newlines in a string.
Example :
<?php
echo nl2br("Welcome\r\nThis is my HTML document");
?>
Output :
Welcome<br />
This is my HTML document
Source : http://php.net/manual/en/function.nl2br.php
I have a form where an user can post a global notice into the system (for other users to see).
The system outputs HTML directly from the DB (when a user wants to see a notice).
I'd like to allow some html tags to stay intact and to have the rest of them with htmlspecialchars() applied.
I already tried to apply
str_replace($search, $replace, htmlspecialchars($str))
strategy, but it seems to be really slow. Too slow, actually. And it's not guaranteed to always work. Is there an alternative for this?
I wanted something that does the strip_tags() job, except that instead of stripping tags it would apply htmlspecialchars to the disallowed tags.
ADD(ed) info (by request):
$str can be any size you can think of. I tested with a big string (1M characters, generated randomly with some allowed and some disallowed tags inside; all tags had attributes) to cover one of the worst-case scenarios, with the logic: if it works for this, it should work for simpler cases.
The server took 5s to process the complete str_replace (with htmlspecialchars). This test was made on my computer, which has a 2GHz CPU and DDR3 RAM.
Both $search and $replace contain a total of 7 replacements. Still, they do not always work: in some cases $search gives false positives or false negatives.
To clarify, I apply these changes while saving to the DB and not while retrieving from the DB.
You might try this code (should be improved):
function callback(array $matches) {
return htmlspecialchars_decode($matches[0]);
}
$str = 'some <i>string</i> <b>with</b> tags '
. 'some <a href=\'#\'>some link</a> '
. '<img alt="" src="http://sstatic.net/stackoverflow/img/favicon.ico"/><hr/>';
$str = htmlspecialchars($str);
$str = preg_replace_callback('#(<(i|a)(?: .+?)?>.*?</(\1)>|<(?:img)(?: .*?)?/>)#', 'callback', $str);
echo $str;
Regular expression looks (should look) for 2 types of strings:
<tag attributes>content</tag>, with the tag part being the same for the opening
and closing tag, and attributes and content being optional
<tag attributes/>, with attributes being optional
Tags are listed in (i|a) part for <tag></tag> types of tags and (?:img) for <tag/> types of tags.
If it finds matching tags, it passes content to callback() function which converts it back by using htmlspecialchars_decode(). This is necessary for decoding quotes and other encoded characters in the list of attributes.
I'm not sure if it works in all cases, i.e., if it matches all necessary tags. If it works in general, then the pattern and the callback() function should be improved so that callback() decodes only the <, > characters and the list of attributes; the content of the tags (i.e., the some link part in <a href='#'>some link</a>) must not be decoded.
str_replace along with htmlspecialchars ISN'T slow.
Probably you have a bottleneck somewhere else.