PHP Regex for human names - php

I've run into a bit of a problem with a regex I'm using for human names.
$rexName = "/^[a-z' -]+$/i";
Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?
EDIT: I just threw the name Jürgen against a regex creator, and it splits the word up at the letter ü...
http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches
EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?
$rexSafety = "/^[^<,\"#/{}()*$%?=>:|;#]*$/i";
(Now, which of these can actually be used in a hacking attempt?)
For instance, this allows ' and - signs, yet you need a ; to make an attack work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?

I would really say: don't try to validate names. One day or another, your code will meet a name that it thinks is "wrong"... And how do you think someone would react when an application tells them "your name is not valid"?
Depending on what you really want to achieve, you might consider using some kind of blacklist or filter to exclude the "not-names" you can think of. It may let some "bad names" pass, but at least it shouldn't prevent any existing name from accessing your application.
Here are a few examples of rules that come to mind :
no number
no special character, like "~{()}#^$%?;:/*§£ø and probably some others
no more than 3 spaces?
none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
(but if they don't want to give you their name, they still won't, even if you forbid them from typing random letters: they could just use a real name that is not theirs)
Yes, this is not perfect; and yes, it will let some non-names pass... But it's probably way better for your application than telling someone "your name is wrong" (yes, I insist ^^ )
And, to answer a comment you left under one other answer :
I could just forbid the most common characters for SQL injection and XSS attacks,
About SQL injection: you must escape your data before sending it to the database; and if you always escape that data (you should!), you don't have to care about what users may or may not input: as it is always escaped, there is no risk for you.
The same goes for XSS: as long as you always escape your data when outputting it (you should!), there is no risk of injection ;-)
EDIT: if you use that regex as it stands, it will not work well:
The following code :
$rexSafety = "/^[^<,\"#/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}
Will get you at least a warning :
Warning: preg_match() [function.preg-match]: Unknown modifier '{'
You must escape at least some of those special characters: here the unescaped / inside the character class ends the pattern early, so PHP reads the { that follows as a modifier. I'll let you dig into the PCRE Patterns documentation for more information (there is really a lot to know about PCRE / regex, and I won't be able to explain it all).
If you actually want to check that none of those characters is inside a given piece of data, you might end up with something like this:
$rexSafety = "/[\^<,\"#\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}
(This is a quick and dirty proposition, which has to be refined!)
This one says "ok" (well, I definitely hope my own name is ok!)
And the same example with some special characters, like this:
$rexSafety = "/[\^<,\"#\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}
Will say "bad name"
But please note that I have not fully tested this, and it probably needs more work! Do not use this on your site unless you have tested it very carefully!
Also note that a single quote can be helpful when attempting an SQL injection... But it is probably a character that is legal in some names... So just excluding some characters might not be enough ;-)

PHP’s PCRE implementation supports Unicode character properties that span a larger set of characters. So you could use a combination of \p{L} (letter characters), \p{P} (punctuation characters) and \p{Zs} (space separator characters), together with the u modifier so that the pattern and the subject are treated as UTF-8:
/^[\p{L}\p{P}\p{Zs}]+$/u
But there might be characters that are not covered by these character categories while there might be some included that you don’t want to be allowed.
So I advise against using regular expressions on a datum with such a vague range of values as a real person’s name.
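As a rough illustration of the Unicode-property approach, here is a minimal sketch. The helper name looksLikeName is made up for this example, and the u modifier is assumed so that UTF-8 input is read as characters rather than bytes. It is only meant to show what the three categories let through, not to be a recommended validator.

```php
<?php
// Hypothetical helper: accepts only letters, punctuation and space separators.
function looksLikeName(string $name): bool
{
    // The u modifier makes PCRE treat both pattern and subject as UTF-8.
    return preg_match('/^[\p{L}\p{P}\p{Zs}]+$/u', $name) === 1;
}

// "\u{FC}" is ü and "\u{F6}" is ö, written as escapes to avoid file-encoding issues.
$candidates = ["J\u{FC}rgen", "B\u{F6}b", "O'Brien", 'Anne-Marie', 'R2D2', '<b>Bob</b>'];

foreach ($candidates as $candidate) {
    echo $candidate, ': ', looksLikeName($candidate) ? 'accepted' : 'rejected', PHP_EOL;
}
```

Digits and symbols such as < fall outside the three categories, so the last two candidates are rejected, while the apostrophe and hyphen count as punctuation and pass.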
Edit: Now that you have edited your question, I see that you just want to prevent certain code injection attacks. You are better off escaping those characters than rejecting them as a potential attack attempt.
Use mysql_real_escape_string or prepared statements for SQL queries, htmlspecialchars for HTML output and other appropriate functions for other languages.
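To make the prepared-statement option concrete, here is a minimal sketch using PDO. The in-memory SQLite database, the users table and the sample value are assumptions chosen only so the snippet is self-contained; a real application would connect to its own server.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

// A value full of characters that a blacklist would try to reject.
$name = "Robert'); DROP TABLE users; --";

// The value is bound as a parameter, so it is never parsed as SQL text.
$insert = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$insert->execute([':name' => $name]);

// Read it back: the stored value is the input, character for character.
$stored = $pdo->query('SELECT name FROM users')->fetchColumn();
$count  = (int) $pdo->query('SELECT COUNT(*) FROM users')->fetchColumn();
```

With this approach a name such as O'Brien needs no special treatment on the way into the database.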

That's a problem with no easy general solution. The thing is that you really can't predict what characters a name could possibly contain. Probably the best solution is to define a negative character mask to exclude some special characters you really don't want to end up in a name.
You can do this using:
$regexp = "/^[^<put unwanted characters here>]+$/";

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.

Related

Regular Expression as Php input filter For Pdo DSN host or host+port

Right up front: I dislike regular expressions. I want to allow the domain, or domain + port, part of the DSN to be entered in a single input. localhost should be an option, as should subdomains.
The best I could find came from an article called "Domain name regular expression example",
which provides this expression for Java:
^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$
I realized that it almost works, but the escaped period is written \\. there and should be \. in PHP.
From the php.net manual some PDO_MYSQL DSN examples are:
mysql:host=localhost;dbname=testdb
mysql:host=localhost;port=3307;dbname=testdb
The only part I want to perform the regular expression on is
localhost
localhost;port=3307
This is to be used for a filter of a HTML form as part of a Php based installation of a Php app (hope this makes sense).
So this is what I came up with:
'/^((?!-)[a-z0-9-]{1,63}(?<!-)(\.){0,1})+([a-z]{0,9})(?<!\.)((;port=){1}[0-9]{2,6}){0,1}$/i'
It is important that the string does not start or end with hyphens or contain whitespace.
Here is something more in depth https://gist.github.com/CrandellWS/bc0cbcbb1df5c4b4361a
and a link to the overall project https://github.com/CrandellWS/ams
Can this expression be shortened or optimized in order to help prevent end-user errors?
More importantly, since regular expressions are not my strong point: are there any gotchas that can be prevented? Please explain how and why.
For my reference, these two sites have been immensely helpful in figuring out regular expressions: http://www.regexr.com/ and http://txt2re.com/
If you only want to check whether the input is valid (without caring about match groups):
^[^-][a-z0-9-]{0,63}[^-](\.[a-z]{0,9})*(;port=[0-9]{2,6})?$
If you are not so exact you could test:
^[^-][a-z0-9-]*[^-](\.[a-z]+)*(;port=[0-9]+)?$
or
^[^-][\w-]*[^-](\.\w+)*(;port=\d+)?$
But essentially, every time you shorten it you lose accuracy.
Update 1:
[\w\d-]* vs [A-Za-z0-9-]{1,63}: here the length of the string is no longer checked
? vs {0,1} is equivalent (just shorter)
\d vs [0-9] is equivalent (just shorter)
\w vs [A-Za-z0-9_] is equivalent (just shorter)
And no negative lookbehinds (?<! ...) they make everything a bit complicated
Lost accuracy: some entries that shouldn't be valid are now accepted, since the length checks are missing and the underscore is now allowed (it was not before).
Update 2:
To prevent whitespace at the beginning and at the end, just use this:
^[^\s-][\w-]*[^\s-](\.\w+)*(;port=\d+)?$
[^\s-] ... excludes only spaces or hyphens, any other character is allowed (even a dot)
But to get closer to your expression (without lookbehind)
^\w[\w-]*\w(\.\w+)*(;port=\d+)?$
and, to remove the underscores (it is a bit longer):
^[a-z0-9][a-z0-9-]*[a-z0-9](\.[a-z0-9-]+)*(;port=\d+)?$
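To see how the last pattern above behaves on the inputs from the question, here is a small sketch. The function name isHostWithOptionalPort and the added i modifier are assumptions made for the example; the pattern itself is copied unchanged.

```php
<?php
// Hypothetical helper wrapping the pattern above, made case-insensitive.
function isHostWithOptionalPort(string $input): bool
{
    $pattern = '/^[a-z0-9][a-z0-9-]*[a-z0-9](\.[a-z0-9-]+)*(;port=\d+)?$/i';
    return preg_match($pattern, $input) === 1;
}

$samples = [
    'localhost',
    'localhost;port=3307',
    'db.example.com;port=3306',
    '-localhost',      // leading hyphen
    'localhost-',      // trailing hyphen
    'local host',      // whitespace
    'localhost;port=', // port keyword without digits
];

foreach ($samples as $sample) {
    echo $sample, ' => ', isHostWithOptionalPort($sample) ? 'valid' : 'invalid', PHP_EOL;
}
```

Note that this shortened pattern still does not enforce the 63-character label limit of the original expression.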
I suggest making it stricter, like this:
example
It doesn't consider unix_socket, and it's not short, but it is simple to understand. You can try to make it more precise.
UPDATED
Also try this example
Let me surprise you that parameters in DSN can go in random order.
This is to be used for a filter of a HTML form as part of a Php based installation of a Php app
For the life of me, I can't understand why you would torture users by asking them to write a DSN-like string (which they likely know nothing about) and then torture yourself verifying it, instead of just asking for separate host and (optional) port fields, like any installation script in the world does.
Let me suggest that you familiarize yourself with some existing installation scripts before starting your own. The one from WordPress will do.
It just occurred to me that maybe you need help with PHP conditionals. Here you go:
if (isset($_POST['dbhost'])) {
    // Fall back to the default MySQL port when none was submitted.
    if (!empty($_POST['dbport'])) {
        $DB_PORT = (int) $_POST['dbport'];
    } else {
        $DB_PORT = 3306;
    }
    $DB_HOST = $_POST['dbhost'];
    $DB_DATABASE = $_POST['dbname'];
    $DB_USERNAME = $_POST['dbuser'];
    $DB_PASSWORD = $_POST['dbpass'];
    // Double quotes are required for the variables to be interpolated.
    $DB_DSN = "mysql:host=$DB_HOST;port=$DB_PORT;dbname=$DB_DATABASE";
}
This simple code will solve all your problems without Regular Expressions you don't like. I hope that your dislikes do not extend to simple conditionals though.

Why not backslash every empty space to prevent mysql injections

I've been wondering about this for maybe a few months now, but I still don't know the answer, other than possible performance concerns. So, long story short: instead of having all this PDO code everywhere, why not just put a backslash before every character?
$String = $_POST["attack"]; // SOME THING' OR 1 = 1 --
$String = fFilter( $String ); // \S\O\M\E\ \T\H\I\N\G\'\ \O\R\ \1\ \=\ \1\ \-\-
Now, I haven't been into this SQL stuff in a while, so I can't give a perfect example, but basically the SQL string would look like this: SELECT * FROM account WHERE id = '\S\O\M\E\ \T\H\I\N\G\'\ \O\R\ \1\ \=\ \1\ \-\-'. Something like that always seemed pretty safe to me, but I haven't heard of anyone using it, or of a reason not to use it. I always see advice that filtering HTML and the like isn't good, but I don't see why you couldn't just escape every single character, since any attack would then look like \a\t\t\a\c\k.
Because placing a backslash before certain characters changes their meaning entirely. For instance, \t is a tab character, not t, so \a\t\t\a\c\k would be transformed to:
a<TAB><TAB>ack (where <TAB> stands for a literal tab character)
A full list of such sequences is given at:
http://dev.mysql.com/doc/refman/5.5/en/string-literals.html
As several other people have mentioned, use parameterized queries, not input escaping.
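The effect can be reproduced without a database. The sketch below is only an approximation: backslashEveryChar is a hypothetical stand-in for the fFilter() idea from the question, and approximateMysqlUnescape is a hand-written imitation of the escape sequences listed on the MySQL page linked above, not MySQL's actual parser.

```php
<?php
// Hypothetical stand-in for fFilter(): put a backslash before every byte.
function backslashEveryChar(string $input): string
{
    // In the replacement, \\\\ (in source) yields one literal backslash; $0 is the match.
    return preg_replace('/./s', '\\\\$0', $input);
}

// Rough imitation of how MySQL reads backslash sequences in a quoted string.
function approximateMysqlUnescape(string $literal): string
{
    $special = ['0' => "\0", 'b' => "\x08", 'n' => "\n", 'r' => "\r", 't' => "\t", 'Z' => "\x1A"];

    return preg_replace_callback('/\\\\(.)/s', function (array $m) use ($special) {
        $c = $m[1];
        if ($c === '%' || $c === '_') {
            return '\\' . $c; // these two keep their backslash outside LIKE patterns
        }
        return $special[$c] ?? $c; // any other sequence just drops the backslash
    }, $literal);
}

$escaped = backslashEveryChar('attack');
$seenByServer = approximateMysqlUnescape($escaped);

// var_export makes the embedded tab characters visible in the output.
var_export($seenByServer);
echo PHP_EOL;
```

Every t, n, r, b, 0 or Z in the input silently turns into a control character, so the stored value no longer equals what the user typed.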

how to replace '\\\' to '\'?

My code is not working, and I don't want to use str_replace, because there may be more than 3 slashes to replace. How can I do the job using preg_replace?
My code looks like this:
<?php
$str='<li>
<span class=\"highlight\">Color</span>
Can\\\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
$str=preg_replace("#\\+#","\\",$str);
echo $str;
There is merit in the other answers, but to me it looks like what you're actually trying to accomplish is something very different. In the PHP code, \\\' is not three slashes followed by an apostrophe; it's one escaped slash followed by an escaped apostrophe, and in the rendered output that's exactly what you see: a slash followed by an apostrophe (with no need to escape them in the rendered HTML). It's important to realize that the escape character is not actually part of the string; it's merely a way to represent a character that normally has a very different meaning within PHP. In this case, an apostrophe normally terminates a string literal. What looks like 4 characters in PHP is actually only 2 characters in the string.
If this is the extent of your code, there's no need for string manipulation or regular expressions. What you actually need is just this:
<?php
$str='<li>
<span class="highlight">Color</span>
Can\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
echo $str;
?>
Only one escape character is needed here for the apostrophe, and in the rendered HTML you will see no slashes at all.
Further Reading:
Escape sequences
The root of this problem is actually in how the text was written into your database, and it was likely caused by magic_quotes_gpc; this was used in older versions of PHP and was a really bad idea.
The best fix
This requires a few steps:
Fix the script that puts the HTML inside your database by disabling magic_quotes_gpc.
Write a script that reads all existing database entries, applies stripslashes() and saves the changes.
Fix the presentation part (though that may need no changes at all).
Alternative patch
Use stripslashes() before you present the HTML.
Use this pattern (the regex needs an escaped backslash, which takes four backslashes in a PHP string literal):
preg_replace('#\\\\+#', '\\\\', $text);
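Counting backslashes through two layers of escaping is error-prone, so here is a minimal sketch that can be run to check the behaviour. To collapse runs of backslashes, PCRE has to receive two backslashes followed by +, which means four backslashes in a single-quoted PHP literal. The sample input is built with chr(92) purely to keep the test string unambiguous.

```php
<?php
$bs = chr(92); // a single backslash, built without any escaping

// "Can" + three backslashes + "'t", as in the question's data.
$input = 'Can' . str_repeat($bs, 3) . "'t";

// The literal '#\\\\+#' reaches PCRE as #\\+# : one or more backslashes.
// The replacement '\\\\' reaches PCRE as \\ : one literal backslash.
$output = preg_replace('#\\\\+#', '\\\\', $input);

echo $input, PHP_EOL;
echo $output, PHP_EOL;
```

If no backslash should survive at all, stripslashes() or an empty replacement string may be the simpler choice.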
This replaces two or more \ symbols preceding a ' symbol with \'
$theConvertedString = preg_replace("/\\\\{2,}'/", "\\\\'", $theSourceString);
Ideally, you shouldn't have code causing this issue in the first place, so I would look at why you have \\' in your code to begin with. If you've manually put it in your variables, take it out. Often, this also happens with multiple calls to addslashes() or mysql_real_escape_string(), or with a cheap hosting provider's automatic escaping of all POST request variables combined with your server-side PHP code doing the same.

User input to database

Suppose that we expect just strings or numbers in the data sent by a user. Is it safe enough to check the data with the ereg and preg_match functions? Is there a way to fool them? Should we still use mysql_real_escape_string?
This will be a short answer...
Use PDO:
Docs: http://php.net/manual/en/book.pdo.php
For example, Zend Framework uses this engine.
"Safe enough" is relative to your own needs. If you want to avoid mysql_real_escape_string for some reason, then I first want to ask why.
My answer is: sure... depending on your conditions.
You can preg_match against [0-9a-z] and there is nothing to fear. Try passing a multibyte character to be sure. As long as your code refuses to do anything when the input does not match your requirements, there is no tricky workaround that I know of to slip malicious characters past such a strict rule.
But the term "string" is very open. Does that include punctuation? What kind? If you allow standard injection characters in what you call a "string", then my answer is no longer "sure".
But I still recommend mysql_real_escape_string() on all user-submitted data, no matter how you try to purify it beforehand.
If you use a regex to match against valid input, and it succeeds, then the user input is valid. That being said, if you don't have any malicious characters in valid input (particularly quotes or potentially multibyte characters), then you don't need to call mysql_real_escape_string. The same principle applies to something like:
$user_in_num = intval( $_POST['in_num']); // Don't need mysql_real_escape_string here
So something like the following:
$subject = $_POST['string_input'];
// Reject anything that is not made up entirely of ASCII letters and digits.
if (!preg_match('/\A[a-z0-9]+\z/i', $subject))
{
    exit('Invalid input');
}
It is fine / safe to use $subject in an SQL query once the preg_match succeeds.
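A whitelist check of this kind is easy to get subtly wrong, so here is a small sketch for experimenting with it. The helper name isAsciiAlnum is invented for the example; \A and \z are used on the assumption that a trailing newline should also be rejected, which $ alone would tolerate.

```php
<?php
// Hypothetical helper: true only for non-empty strings of ASCII letters and digits.
function isAsciiAlnum(string $value): bool
{
    // \A and \z anchor to the very start and very end of the subject.
    return preg_match('/\A[a-z0-9]+\z/i', $value) === 1;
}

$inputs = ['abc123', "abc\n", "a' OR 1=1", "J\u{FC}rgen", ''];

foreach ($inputs as $input) {
    // json_encode is used only to make newlines and non-ASCII bytes visible.
    echo json_encode($input), ' => ', isAsciiAlnum($input) ? 'accepted' : 'rejected', PHP_EOL;
}
```

The multibyte name is rejected here, which is exactly the trade-off discussed above: a rule strict enough to be safe without escaping is also strict enough to refuse real input.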

Regex as first line of defense against XSS

I had a regex as the first line of defense against XSS.
public static function standard_text($str)
{
    // pL matches letters
    // pN matches numbers
    // pZ matches whitespace
    // pPc matches underscores
    // pPd matches dashes
    // pPo matches normal punctuation
    return (bool) preg_match('/^[\pL\pN\pZ\p{Pc}\p{Pd}\p{Po}]++$/uD', (string) $str);
}
It is actually from Kohana 2.3.
This runs on publicly entered text (no HTML ever), and denies the input if it fails this test. The text is always displayed with htmlspecialchars() (or, more specifically, Kohana's flavour, which adds the character set amongst other things). I also put a strip_tags() on output.
The client had a problem when he wanted to enter some text with parentheses. I thought about modifying or extending the helper, but I also had a second thought: if I allow double quotes, is there really any reason why I need to validate at all?
Can I just rely on the escaping on output?
It's never secure to rely on regexes for filtering dangerous XSS attacks. And although you are not relying on them alone, output escaping and input filtering, when used correctly, will kill all kinds of attacks. Therefore, there is no point in having regexes as a "first line of defense" when their help isn't really needed. As you and your client have discovered, they only complicate things when used like this.
Long story short: if you use htmlentities or htmlspecialchars to escape your output, you need neither the regexes nor strip_tags.
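As a minimal sketch of escaping on output, the snippet below passes text containing quotes, parentheses and markup through htmlspecialchars. The wrapper name e() and the choice of ENT_QUOTES with UTF-8 are assumptions for the example; Kohana's own helper may set different flags.

```php
<?php
// Hypothetical wrapper: escape text for HTML element content or a quoted attribute.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Input that the regex-based validator would have refused or mishandled.
$comment = 'He said "hi" (twice) <script>alert(1)</script>';

echo '<p>', e($comment), '</p>', PHP_EOL;
```

The parentheses and quotes the client wanted are displayed as typed, while the script tag is rendered as inert text.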
