php - Clean user input using preg_replace_callback and ord()?


I have a forum-style text box and I would like to sanitize the user input to stop potential XSS and code insertion. I have seen htmlentities used, but others have said that the &, #, % and : characters need to be encoded as well, and it seems the more I look, the more potentially dangerous characters pop up. Whitelisting is problematic as there are many valid text options beyond a-zA-Z0-9. I have come up with this code. Will it work to stop attacks and be secure? Is there any reason not to use it, or a better way?
function replaceHTML ($match) {
return "&#" . ord ($match[0]) . ";";
}
$clean = preg_replace_callback ( "/[^ a-zA-Z0-9]/", "replaceHTML", $userInput );
EDIT:_____________________________
I could of course be wrong, but it is my understanding that htmlentities only replaces & < > " (and ' if ENT_QUOTES is turned on). This is probably enough to stop most attacks (and frankly probably more than enough for my low traffic site). In my obsessive attention to detail, however, I dug further. A book I have warns to also encode # and % for "shutting down hex attacks". Two websites I found warned against allowing : and --. It's all rather confusing to me, and it led me to explore converting all non-alphanumeric characters. If htmlentities does this already then great, but it does not seem to. Here are the results from code I ran, copied after clicking view source in Firefox.
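original (random characters to test):
<b>5:</b>gjla<hi>#''*&$!j-l:4
preg_replace_callback:
&#60;b&#62;5&#58;&#60;&#47;b&#62;gjla&#60;hi&#62;&#35;&#39;&#39;&#42;&#38;&#36;&#33;j&#45;l&#58;4
htmlentities (w/ ENT_QUOTES):
&lt;b&gt;5:&lt;/b&gt;gjla&lt;hi&gt;#&#039;&#039;*&amp;$!j-l:4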
htmlentities appears to not be encoding those other characters like :
Sorry for the wall of text. Is this just me being paranoid?
EDIT #2: ___________

All you need to do to stop XSS attacks is use htmlspecialchars().
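A minimal sketch of that advice, assuming the value ends up in an HTML text node or a quoted attribute and the page is served as UTF-8. The helper name e() is hypothetical, not something from the question.

```php
<?php
// Hypothetical helper: escape a value at output time, not when it is stored.
function e(string $value): string
{
    // ENT_QUOTES also encodes single quotes; the charset should match the page.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$userInput = '<b>5:</b>gjla<hi>#\'\'*&$!j-l:4';
echo '<p>' . e($userInput) . '</p>', "\n";
```

Characters such as : # and % are left alone here; they have no special meaning to the HTML parser in this context.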

That is exactly what htmlentities does already:
http://codepad.viper-7.com/NDZMa3
It will convert (spaced to prevent stackoverflow double encoding):
"& # amp ;"
to
"& # amp; # amp ;"

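The space ' ' can be changed to \s in your regex, and adding the i modifier at the end of the regex makes it case-insensitive, so a-z covers both cases. You also don't need to translate your characters to sequences manually; a callback that passes the matched text to htmlentities can do it. Note that the callback receives the array of matches, so htmlentities cannot be passed by name directly, and the u modifier keeps multibyte UTF-8 characters in one piece:
$clean = preg_replace_callback('/[^a-z0-9\s]/iu', function ($m) { return htmlentities($m[0], ENT_QUOTES, 'UTF-8'); }, $userInput);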

Related

What could the purpose of replacing %20 with spaces before doing PHP rawurlencode() be?

It's a pretty silly question, sorry. There is a big and rather complex system that has a bug and I managed to track it down to this piece
return str_replace('%2F', '/', rawurlencode(str_replace('%20', ' ', $key)));
There is a comment explaining why slashes are replaced - to preserve path structure, e.g. encoded1/encoded2/etc. However there is no explanation whatsoever why %20 is replaced with space and that part is the direct cause of a bug. I am tempted to just remove str_replace() but it looks like it was placed there for some reason and I have a feeling that I'll break something else by doing this. Has anyone encountered anything similar? Perhaps it's a dirty fix for some PHP bug? Any guesses and insights are highly appreciated!

Doing so would prevent %20 (encoded space) from being encoded to %2520. However, it only serves to prevent double escaped spaces; other special characters would still get double encoded.
This is a sign of bad code; strings that are passed into this function shouldn't be allowed to have encoded characters in the first place.
I would recommend creating unit tests that cover all referencing code and then refactor this function to remove the str_replace() to make sure it doesn't break the tests.
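To make the double-encoding point concrete, here is a small sketch. The wrapper name encodeKey() is hypothetical; its body is the line from the question.

```php
<?php
// The expression from the question, wrapped under a hypothetical name.
function encodeKey(string $key): string
{
    return str_replace('%2F', '/', rawurlencode(str_replace('%20', ' ', $key)));
}

// rawurlencode() encodes "%" itself as "%25", which is where double encoding comes from.
echo rawurlencode('a%20b'), "\n";   // a%2520b
echo encodeKey('dir/a%20b'), "\n";  // dir/a%20b   (an already-encoded space survives)
echo encodeKey('dir/a%26b'), "\n";  // dir/a%2526b (any other sequence is still double encoded)
```

Assertions like these could serve as the starting point for the unit tests suggested above, before the inner str_replace() is removed.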

First thing that jumps to mind is as a mitigation technique against double encoding.
Not that I would recommend doing such a thing this way, as it would get real messy real quickly (and one would already wonder why only that entity, perhaps 'they' never experienced issues with any others... yet).

It could be the result of a misunderstanding of rawurlencode() vs urlencode()
urlencode() replaces spaces with + signs
If the original author thought that rawurlencode() did the same thing, they would be attempting to pre-encode the spaces so they don't get turned into +s
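The difference between the two functions is easy to check; the file name below is just an illustrative value.

```php
<?php
$name = 'my file.txt';

echo urlencode($name), "\n";     // my+file.txt
echo rawurlencode($name), "\n";  // my%20file.txt
```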

Mitigate xss attacks when building links

I posted this question a while back and it is working great for finding and 'linkifying' links from user generated posts.
Linkify Regex Function PHP Daring Fireball Method
<?php
if (!function_exists("html")) {
function html($string){
return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
}
if ( false === function_exists('linkify') ):
function linkify($str) {
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '<a href="' . $url . '">' . "$input</a>";
}, $str);
}
endif;
echo "<div>" . linkify(html($row_rsgetpost['userinput'])) . "</div>";
?>
I am concerned that I may be introducing a security risk by inserting user generated content into a link. I am already escaping user content coming from my database with htmlspecialchars($string, ENT_QUOTES, 'UTF-8') before running it through the linkify function and echoing back to the page, but I've read on OWASP that link attributes need to be treated specially to mitigate XSS. I am thinking this function is ok since it places the user-generated content inside double quotes and has already been escaped with htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), but would really appreciate someone with xss expertise to confirm this. Thanks!

First off, data must NEVER be escaped before entering the database; this is a very serious mistake. It is not only insecure, it also breaks functionality: changing the values of strings is data corruption and affects string comparison. The approach is insecure because XSS is an output problem. When you are inserting data into the database you do not know where it will appear on the page. For instance, even if you use this function, the following code is still vulnerable to XSS:
<a href="javascript:alert(1)" \>
In terms of your regular expression, my initial reaction was: well, this is a horrible idea. There are no comments on how it's supposed to work, and it makes heavy use of NOT operators; blacklisting is always worse than whitelisting.
So I loaded up Regex Buddy and in about 3 minutes I bypassed your regex with this input:
https://test.com/test'onclick='alert(1);//
No developer wants to write a vulnerability; vulnerabilities are caused by a gap between how the programmer thinks the application works and how it actually works. In this case I would assume you never tested this regex, and it's a gross oversimplification of the problem.
HTMLPurifier is a PHP library designed to clean HTML; it consists of THOUSANDS of regular expressions. It's very slow, and it is bypassed on a fairly regular basis. So if you go this route, make sure to update regularly.
In terms of fixing this flaw, I think you're best off using htmlspecialchars($string, ENT_QUOTES, 'UTF-8') and then enforcing that the string starts with 'http'. HTML encoding is a form of escaping, and the browser automatically decodes the value, so the URL is left intact.
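A rough sketch of that fix, assuming the URL passed in is raw (not yet HTML-escaped) user input. The function name safeAnchor() is hypothetical, and the scheme check is deliberately strict.

```php
<?php
// Hypothetical helper: build an anchor from an untrusted, unescaped URL.
function safeAnchor(string $url): ?string
{
    // Allow only http:// and https:// so schemes such as javascript: are rejected.
    if (!preg_match('#^https?://#i', $url)) {
        return null;
    }

    // Escape once, at output time, for both the attribute and the link text.
    $escaped = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

    return '<a href="' . $escaped . '">' . $escaped . '</a>';
}

var_dump(safeAnchor('javascript:alert(1)'));  // NULL
echo safeAnchor("https://test.com/test'onclick='alert(1);//"), "\n";
```

With the single quotes encoded, the bypass input shown earlier stays inside the href value instead of opening a new attribute.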

Because the data is going into an attribute, it should be URL (or percent) encoded:
return '<a href="' . urlencode($url) . '">' . "$input</a>";
Technically it should then also be HTML encoded:
return '<a href="' . htmlspecialchars(urlencode($url)) . '">' . "$input</a>";
but no browsers I know of care, and consequently no one does it. It also sounds like you might be doing this step already, and you don't want to do it twice.

Your regular expression is looking for URLs that start with http or https. The expression seems to be relatively safe, as it does not detect anything that is not a URL.
The XSS vulnerability comes from how the URL is escaped as an HTML attribute value. That means making sure that the URL cannot prematurely terminate the attribute string and then add extra attributes to the HTML tag, as @Rook has mentioned.
So I cannot really think of a way an XSS attack could be performed against the following code, which is as suggested by @tobyodavies but without urlencode, which does something else:
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '<a href="' . $url . '">' . "$input</a>";
}, $str);
Note that I have also added a small shortcut for checking the http prefix.
Now the anchor links that you generate are safe.
However you should also sanitize the rest of the text. I suppose that you don't want to allow any html at all and display all the html as clear text.

Firstly, as the PHP documentation states, htmlspecialchars only escapes
" '&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
". javascript: is still used in regular programming, so why : isn't escaped is beyond me.
Secondly, if !html only expects the characters you think will be entered, not the representation of those characters that can be entered and are seen as valid. the utf-8 character set, and every other character set supports multiple representations for the same character. Also, your false statement allows 0-9 and a-z, so you still have to worry about base64 characters. I'd call your code a good attempt, but it needs a ton of refining. That or you could just use htmlpurifier, which people can still bypass. I do think that it is awesome that you set the character set in htmlspecialchars, since most programmers don't understand why they should do that.

preg_replace on xss code

Can this code help to sanitize malicious code in user submit form?
function rex($string) {
    $patterns = array();
    $patterns[0] = '/=/i';
    $patterns[1] = '/javascript:/i';
    $replacements = array();
    $replacements[0] = '';
    $replacements[1] = '';
    return preg_replace($patterns, $replacements, $string);
}
I have also included htmlentities() to prevent XSS on the client side. Is the code shown safe enough to prevent attacks?

You don't need that if you are using htmlentities. To prevent XSS you can even just use htmlspecialchars.
Just make sure that you use htmlspecialchars on all data that is printed as plain text in your HTML response.
See also: the answers to "Does this set of regular expressions FULLY protect against cross site scripting?"

Your substitutions may help, but you're better off using a pre-rolled solution like PHP's data filters. Then you can easily limit the datatype to what you expect.

htmlentities alone will do the trick. No need to replace anything at all.

No.
http://ha.ckers.org/xss.html

Your first replacement rule is useless as it can be easily circumvented by using eval and character encoding (and an equal sign isn't necessary for XSS attacks anyway).
Your second rule can be very likely circumvented on at least some browsers by using things like javascript : or java\script:.
In short, it doesn't help much. If you want to show plain text, htmlentities is probably fine (there are exotic attacks which take advantage of unusual character encodings and browser stupidity to launch XSS attacks without any special characters, but that only works on specific browsers - cough IE cough - in specific situations). If you want to put user input in URLs or other attributes, it is not necessarily enough.
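A short sketch of why the two rules add little. rex() below is the function from the question, condensed; the payloads are illustrative inputs, not an exhaustive list.

```php
<?php
// The filter from the question, condensed.
function rex($string) {
    return preg_replace(array('/=/i', '/javascript:/i'), array('', ''), $string);
}

$payload = '<script>alert(1)</script>';

// Neither "=" nor "javascript:" appears, so the payload passes through unchanged.
echo rex($payload), "\n";

// The replacement is a single pass, so removing the inner match leaves "javascript:" behind.
echo rex('javajavascript:script:alert(1)'), "\n";           // javascript:alert(1)

// Escaping at output time neutralizes the markup regardless of its content.
echo htmlspecialchars($payload, ENT_QUOTES, 'UTF-8'), "\n"; // &lt;script&gt;alert(1)&lt;/script&gt;
```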

How to handle user input with a mixture of HTML and punctuation?

I have a form field that includes a mixture of HTML and text. I want users to be able to use basic HTML and punctuation.
Currently I am using mysql_real_escape_string and preg_replace to sanitise the data and insert it into the database. My understanding is that preg_replace is the best way to strip any characters that are not in a white list of allowed characters and that mysql_real_escape_string protects from SQL injection.
//How I collect and sanitise the data...
$var=mysql_real_escape_string(
preg_replace("/[^A-Za-z0-9-?!$##()\"'.:;\\#,_ =\/<> ]/",'',$_POST['var'])
);
However, it keeps breaking when the hash character is used.
My questions are:
1) Is there a more efficient way to do this?
2) If this is the best way, what am I doing wrong?
The characters that I need to allow are: all alphanumeric characters and:
? ! # # $ % & ( ) - . , : ; ' " < > / + =
Thanks!

Why not just use strip_tags() and limit it to the tags you need?
strip_tags ($str,"<br>")
You could then do other "sanitation" that is not quite as invasive.
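One caveat worth checking before relying on this: strip_tags() keeps the attributes of any tag it allows, so an allowed tag can still carry an event handler. A small sketch with an illustrative input:

```php
<?php
$input = '<b onclick="alert(1)">bold</b><script>alert(2)</script>';

// Allow only <b>; other tags are removed, but their text content stays.
$kept = strip_tags($input, '<b>');

echo $kept, "\n";
```

Here the script tags are dropped while the onclick attribute on the allowed <b> tag is still present, which is why the additional sanitation mentioned above matters.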

Since many non-alphanumeric characters have special meanings in a regex, you should escape all of them. So
preg_replace("/[^A-Za-z0-9-?!$##()\"'.:;\\#,_ =\/<> ]/",'',$_POST['var'])
becomes (there are a few that probably don't need escaping, but it doesn't hurt)
preg_replace("/[^A-Za-z0-9-\?\!\$\#\#\(\)\"\'\.\:\;\\#\,\_ \=\/\<\> ]/",'',$_POST['var'])

PHP Regex for human names

I've run into a bit of a problem with a Regex I'm using for humans names.
$rexName = '/^[a-z\' -]+$/i';
Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?
EDIT:, just threw the Jürgen name against a regex creator, and it splits the word up at the ü letter...
http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches
EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?
$rexSafety = "/^[^<,\"#/{}()*$%?=>:|;#]*$/i";
(now which ones of these can actually be used in any hacking attempt?)
For instance, this allows ' and - signs, yet you need a ; to make it work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?

I would really say : don't try to validate names : one day or another, your code will meet a name that it thinks is "wrong"... And how do you think one would react when an application tells him "your name is not valid" ?
Depending on what you really want to achieve, you might consider using some kind of blacklist / filters, to exclude the "not-names" you thought about : it will maybe let some "bad-names" pass, but, at least, it shouldn't prevent any existing name from accessing your application.
Here are a few examples of rules that come to mind :
no number
no special character, like "~{()}#^$%?;:/*§£ø and probably some others
no more than 3 spaces ?
none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
(but, if they don't want to give you their name, they still won't, even if you forbid them from typing some random letters ; they could just use a real name... which is not theirs)
Yes, this is not perfect ; and yes, it will let some non-names pass... But it's probably way better for your application than telling someone "your name is wrong" (yes, I insist ^^ )
And, to answer a comment you left under one other answer :
"I could just forbid the most common characters for SQL injection and XSS attacks,"
About SQL injection : you must escape your data before sending it to the database ; and, if you always escape that data (you should !), you don't have to care about what users may input or not : as it is escaped, always, there is no risk for you.
Same about XSS : as you always escape your data when outputting it (you should !), there is no risk of injection ;-)
EDIT : if you just use that regex like that, it will not work quite well :
The following code :
$rexSafety = "/^[^<,\"#/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
var_dump('bad name');
} else {
var_dump('ok');
}
Will get you at least a warning :
Warning: preg_match() [function.preg-match]: Unknown modifier '{'
You must escape at least some of those special chars ; I'll let you dig into PCRE Patterns for more information (there is really a lot to know about PCRE / regex ; and I won't be able to explain it all)
If you actually want to check that none of those characters is inside a given piece of data, you might end up with something like that :
$rexSafety = "/[\^<,\"#\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
var_dump('bad name');
} else {
var_dump('ok');
}
(This is a quick and dirty proposition, which has to be refined!)
This one says "ok" (well, I definitely hope my own name is ok!)
And the same example with some specials chars, like this :
$rexSafety = "/[\^<,\"#\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
var_dump('bad name');
} else {
var_dump('ok');
}
Will say "bad name"
But please note I have not fully tested this, and it probably needs more work ! Do not use this on your site unless you have tested it very carefully !
Also note that a single quote can be helpful when trying to do an SQL injection... But it is probably a character that is legal in some names... So, just excluding some characters might not be enough ;-)

PHP’s PCRE implementation supports Unicode character properties that span a larger set of characters. So you could use a combination of \p{L} (letter characters), \p{P} (punctuation characters) and \p{Zs} (space separator characters), together with the u modifier so that the pattern and the subject are treated as UTF-8:
/^[\p{L}\p{P}\p{Zs}]+$/u
But there might be characters that are not covered by these character categories while there might be some included that you don’t want to be allowed.
So I advise you against using regular expressions on a datum with such a vague range of values as a real person’s name.
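For illustration, here is how such a pattern behaves on a few sample values. This is a sketch that assumes UTF-8 input and adds the u modifier; the candidate names are made up for the example.

```php
<?php
// The u modifier makes PCRE treat both pattern and subject as UTF-8.
$namePattern = '/^[\p{L}\p{P}\p{Zs}]+$/u';

foreach (['Jürgen', "O'Brien", 'Böb', '<script>', 'R2-D2'] as $candidate) {
    $verdict = preg_match($namePattern, $candidate) === 1 ? 'ok' : 'rejected';
    echo $candidate, ' => ', $verdict, "\n";
}
```

Letters with umlauts and the apostrophe are accepted, while < and > (math symbols) and digits fall outside the three categories. That also shows the limits mentioned above: the categories decide, not any notion of what a name is.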
Edit: As you edited your question, I now see that you just want to prevent certain code injection attacks. You are better off escaping those characters rather than rejecting them as a potential attack attempt.
Use mysql_real_escape_string or prepared statements for SQL queries, htmlspecialchars for HTML output and other appropriate functions for other languages.
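A minimal sketch of the prepared-statement route, using PDO with an in-memory SQLite database purely to keep the example self-contained (this assumes the pdo_sqlite extension is available; the table and column names are made up).

```php
<?php
// In-memory SQLite keeps the sketch self-contained; any PDO driver works the same way.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

$name = "Jürgen O'Brien <b>";

// The placeholder keeps the value out of the SQL text entirely.
$stmt = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$stmt->execute([':name' => $name]);

// The name is stored unchanged, quote and all.
$stored = $pdo->query('SELECT name FROM users')->fetchColumn();

// Escape only at the point where the value is written into HTML.
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'), "\n";
```

No character has to be rejected: the quote is harmless to the query, and the angle brackets are harmless once escaped for output.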

That's a problem with no easy general solution. The thing is that you really can't predict what characters a name could possibly contain. Probably the best solution is to define a negative character mask to exclude some special characters you really don't want to end up in a name.
You can do this using:
$regexp = "/^[^<put unwanted characters here>]+$/";

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.
