Regexp character filter

In my code, I use a regexp I googled somewhere, but I don't understand it. :)
preg_match("/^[\p{L} 0-9\-]{4,25}$/", $login)
What does that \p{L} mean? I know what it does: it matches all letters, national characters included.
My second question: I want to sanitize user input for in-game chat, so I'm starting with the regexp above, but I want to allow most special characters. What's the shortest way to do it? Has someone already prepared a regexp for this?

For \p, see Unicode character properties: it requires the character to belong to a specific Unicode category (letter, number, ...).
For your filter, it depends on what exactly you want to filter, but I think looking at the Unicode character classes is the way to go (adding individually any other characters that seem useful to you).

The regular expression means:
It matches any string of 4 to 25 characters in which every character is a letter, a space, a digit or a dash.
\p{L} means literally: a character that matches the property "L", where "L" stands for "any letter".
To understand how regexps work:
http://en.wikipedia.org/wiki/Regular_expression
http://www.php.net/manual/en/regexp.reference.unicode.php
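A minimal sketch of how the pattern behaves, assuming the input is UTF-8. The u modifier is an addition that is not in the original pattern; it makes \p{L} apply to whole characters rather than single bytes. The function name isValidLogin is made up for illustration.

```php
<?php
// Hypothetical helper: true when $login is 4 to 25 characters long and every
// character is a letter (any script), a space, an ASCII digit or a dash.
function isValidLogin(string $login): bool
{
    // ^ and $ anchor the match to the whole string;
    // the u modifier treats pattern and subject as UTF-8.
    return preg_match('/^[\p{L} 0-9\-]{4,25}$/u', $login) === 1;
}

var_dump(isValidLogin("Zo\u{00EB}-42")); // letters, a dash and digits
var_dump(isValidLogin('abc'));           // shorter than 4 characters
var_dump(isValidLogin('name_1'));        // underscore is not in the class
```

The length limits count characters, not bytes, only because of the u modifier.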

If you want to include most characters why not just exclude the ones that you are not allowing?
You can do this with the ^ in your character class
[^characters I don't want]
Disclaimer: Blacklisting might not be the best approach depending on what you're trying to do, and it has to be more thorough than whitelisting.
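Something like the following could illustrate the negated character class for a chat message. The excluded characters (control characters and the angle brackets), the 200-character limit and the name isAllowedChatMessage are assumptions chosen for the example, not a recommended list.

```php
<?php
// Hypothetical helper: accepts a message of 1 to 200 characters as long as
// none of them is in the exclusion list.
function isAllowedChatMessage(string $message): bool
{
    // [^...] matches any single character NOT listed inside the brackets.
    // \p{Cc} is the Unicode category for control characters.
    // \z anchors at the very end of the string (unlike $, it does not
    // tolerate a trailing newline).
    return preg_match('/^[^<>\p{Cc}]{1,200}\z/u', $message) === 1;
}

var_dump(isAllowedChatMessage('gg wp! :) 100%'));
var_dump(isAllowedChatMessage('<b>hi</b>'));
```

With the u modifier, preg_match returns false for invalid UTF-8, so such input is rejected as well.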

Related

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)

[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some languages interpret it as the negation of nothing, so it matches every character; that is what gskinner (i.e. ActionScript 3) seems to be doing.
I would never use this, because it is ambiguous.
The most readable way is to use ., the metacharacter that matches every character except newlines. If newlines are also wanted, just add the modifier s, which enables dotall mode; this is exactly what you wanted to achieve with [^].
A workaround that is sometimes used is a character class like [\s\S] or [\w\W]. These also match every character (including newlines), because they combine a predefined character class with its negation.
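Both alternatives can be sketched in PHP as below. As far as I know, PCRE (which backs preg_match) treats a ] directly after [^ as a literal character, so the original [^] is read as the start of an unterminated class rather than as "any character". The sample $html string is made up for the example.

```php
<?php
$html = '<a href="http://example.com/page">link</a>';

// Option 1: lazy dot; the s modifier lets . match newlines as well.
preg_match('/(?<=href=").+?(?=")/s', $html, $m1);

// Option 2: [\s\S] matches any character without needing a modifier.
preg_match('/(?<=href=")[\s\S]+?(?=")/', $html, $m2);

var_dump($m1[0], $m2[0]);
```

Both calls capture the text between the quotes of the href attribute.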

how to regex the 'metacharacters'

I want to write a pattern for a password field so that users must use metacharacters in their passwords (metacharacters like :!##$%^&*() ). I searched for one but didn't find any pattern. Is it possible to write such a pattern?
Thanks in advance

If you are looking for a regexp that will match all special characters, you have a few ways you could go:
1. You could write a regexp that excludes alphanumerics:
   /[^a-zA-Z0-9]/
2. You could select the special characters that you are interested in and, carefully escaping the ones that have special meaning in regexps with a backslash, write your own regexp for that specific set of characters.
3. If you know what charset will be involved and if it's reasonably small (so not UTF-8, which will be huge), you can go through it and identify the special chars and then do #2 above. This might be feasible if you are 100% certain that all data will come in as (for example) ASCII chars.
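The first two options might look roughly like this in PHP. The function names are hypothetical, and the explicit character list simply reuses the characters from the question.

```php
<?php
// Option 1 (hypothetical helper): anything that is not an ASCII letter or
// digit counts as "special". Note that this also accepts spaces and the
// bytes of non-ASCII letters.
function hasNonAlphanumeric(string $password): bool
{
    return preg_match('/[^a-zA-Z0-9]/', $password) === 1;
}

// Option 2 (hypothetical helper): an explicit list. Inside a character class
// few characters need escaping; ^ is escaped here only for clarity.
function hasListedSpecial(string $password): bool
{
    return preg_match('/[!#$%\^&*()]/', $password) === 1;
}

var_dump(hasNonAlphanumeric('secret 1'), hasListedSpecial('secret 1'));
```

The explicit list is stricter: a space satisfies the first check but not the second.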

If you mean that the password must contain at least one of those characters, then something like this might work:
if (!preg_match('/['.preg_quote('!##$%^&*()', '/').']/', $password)) {
    // fails: the password contains none of the required characters
}

Yes. You can put most of them into your regular expression like any other character. For those having a special meaning in regular expressions, prefix them with \ or \\ depending on your programming language.

Regex, encoding, and characters that look a like

First, a brief example: let's say I have this regex, /[0-9]{2}°/, and this text, "24º". The text won't match, obviously... (?) Really, it depends on the font.
Here is my problem: I do not have control over which characters the user types, so I need to cover all possibilities in the regex, /[0-9]{2}[°º]/, or even better, ensure that the text contains only the character I'm expecting, °. But I can't just remove the unknown characters, otherwise the regex won't work; I need to replace them with the look-alike character I'm expecting. I have done this with a little function that maps each "look-alike" to "what I expect" and replaces it. The problem is that I have not covered all possibilities. For example, today I found a new -; now we have three of them, just like LaTeX =D - -- ---. Cool, but the regex didn't work.
Does anyone know how I might solve this?

There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.

Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degrees Fahrenheit and Celsius. Unfortunately, there are tons of minus signs.

Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
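A sketch of the mapping approach combined with the u modifier, assuming the input is already UTF-8. The look-alike list is illustrative and not exhaustive, and the names DEGREE_LOOKALIKES, normalizeDegree and hasTwoDigitDegrees are made up for this example.

```php
<?php
// Characters to fold into the degree sign U+00B0 (illustrative list only).
const DEGREE_LOOKALIKES = [
    "\u{00BA}", // masculine ordinal indicator
    "\u{02DA}", // ring above
    "\u{2070}", // superscript zero
    "\u{2218}", // ring operator
];

// Replace every listed look-alike with the real degree sign.
function normalizeDegree(string $text): string
{
    return str_replace(DEGREE_LOOKALIKES, "\u{00B0}", $text);
}

// Normalize first, then match against the single expected character.
function hasTwoDigitDegrees(string $text): bool
{
    return preg_match('/[0-9]{2}\x{00B0}/u', normalizeDegree($text)) === 1;
}

var_dump(hasTwoDigitDegrees("24\u{00BA}"));
```

Normalizing before matching keeps the regex itself simple; new look-alikes only need to be added to the list.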

I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html

OK, if you're looking to pull the temperature, you'll probably need to change a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature, then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).
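A possible PHP version of this idea, assuming UTF-8 input. It tightens the pattern slightly compared with the answer above: the following character must also not be a digit or whitespace, and a lookbehind stops the match from starting in the middle of a longer number. The function name extractTemperature is hypothetical.

```php
<?php
// Hypothetical helper: returns the 1 to 3 digit number that is directly
// followed by some sign (any character that is not a letter, digit or
// whitespace), or null when there is none.
function extractTemperature(string $text): ?int
{
    // The u modifier makes the class consume one whole UTF-8 character
    // instead of a single byte.
    $pattern = '/(?<![0-9])([0-9]{1,3})[^a-zA-Z0-9\s]/u';
    if (preg_match($pattern, $text, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

var_dump(extractTemperature("It is 24\u{00BA} outside"));
var_dump(extractTemperature('24 C'));
```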

how locale aware is preg_replace in php?

If I do preg_replace('/[^a-zA-Z0-9\s-_]/','',$val) in a multilingual application, will it handle things like accented characters or Russian characters? If not, how can I filter user input to only allow the above characters, but with locale awareness?
thanks!
codecowboy.

The only useful information I can find is from this page of the manual, which states:
"A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w."
Still, I wouldn't bet that it works the way you want. To be sure, maybe using Unicode matching would be better; you'll probably have to try it to be certain.
About Unicode, the manual says this:
"Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE."
So, it might be a safer solution... curious about it, should I add ^^

No, a-zA-Z only matches the ASCII letters. To match any letter or number in any language, you need to use the Unicode properties of the regex engine, together with the u modifier so that the pattern and subject are treated as UTF-8:
preg_replace('/[^\p{L}\p{N}]/u', '', $string);
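A sketch that keeps the whitespace, hyphen and underscore from the question's original class while using Unicode properties, assuming UTF-8 input. The function name keepWordCharacters is made up for illustration.

```php
<?php
// Hypothetical helper: removes everything except letters and numbers of any
// script, whitespace, underscore and hyphen.
function keepWordCharacters(string $value): string
{
    // The hyphen is placed last in the class so it is taken literally.
    $clean = preg_replace('/[^\p{L}\p{N}\s_-]/u', '', $value);

    // preg_replace returns null on error, for example invalid UTF-8 input.
    return $clean ?? '';
}

var_dump(keepWordCharacters("H\u{00E9}llo, \u{043C}\u{0438}\u{0440}! 42_a-b"));
```

Returning an empty string on error is one possible choice; rejecting the input outright may suit some applications better.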

How to check real names and surnames - PHP

Here's my problem:
I want to check whether a user has entered a real name and surname by checking that they contain only letters (of any alphabet) and ' or -, in PHP.
I found a solution here (but I don't remember the link) for checking whether a string has only letters:
preg_match('/^[\p{L} ]+$/u',$name)
but I'd like to allow ' and - too. (The charset is UTF-8.)
Can anyone help me, please?

A little off-topic, but what exactly is the point of validating names?
It's not to prevent fraud; if people are trying to give you a fake name, they can easily type a string of random letters.
It's not to prevent mistakes; typing a punctuation character is only one of the many mistakes you could make, and an unlikely one at that.
It's not to prevent code injection; you should be preventing that by properly encoding your outputs, regardless of what characters they contain.
So why do we all do it?

Looks like you just need to modify the regex: [\p{L}' -]+

(International) names can contain many characters: spaces, 's, dashes, normal letters, umlauts, accents, ...
EDIT: The point is: how can you be sure that all letters (of all languages), dash, ' and space are enough? Are there no names that contain a dot (what about "Dr. No"?), a colon or some other character?
EDIT2: Thanks to the user 'some', probably from Sweden (who left a comment), we now know that there is a Swedish name 'Andreas J:son Friberg'. Remember the colon!

Depending on the character set you want to permit, you'll just need to make sure that the characters you want to support are inside the '[]' portion of the regex. Since the '-' character has special meaning in this context (it creates a range), it needs to be the first or last item in the list, or be escaped with a backslash.
The \p{L} means match any character with the property of being a letter. \w has a similar meaning, but also includes the '_' character, which you probably don't want.
preg_match('/^[A-Za-z \'-]+$/i',$name);
This would match most common names, though if you want to support foreign character sets, you'll need a more exotic regex.

This should also do it
/[\w'-]+/i

If the charset is UTF-8, then you have a problem: how are you going to check for Central and Eastern European Latin characters (diacritics), or for names in Cyrillic, Chinese or Japanese? That would be a hell of a regex.

Note that the example you provided does not check to ensure that the user has both a surname and given names, though I would argue that that is how it should be. You shouldn't assume a person has more than one name. I am currently working on a PHP application which deals with people's names in context, and if I have discovered anything it's that you cannot make such assumptions :) Even many non-celebrities have just one name.
Using the Unicode categories as in \p{L} was a good idea, as people will obviously have all sorts of characters from other languages in their names. However, as well as \p{L}, you will also have to take into account combining marks, i.e. accents, umlauts etc. that people add as extra characters.
So, maybe immediately after \p{L} I'd add \p{Mc}
I'd end up with
preg_match('/^[\pL\p{Mc} \'-]+$/u', $name)
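One caveat worth checking: \p{Mc} covers only spacing combining marks, while common accents such as U+0301 (combining acute accent) are in the non-spacing category Mn. A variation using \p{M}, which covers all mark categories, could look like the sketch below. This is a swapped-in alternative rather than the pattern from the answer above, and the name looksLikeName is hypothetical.

```php
<?php
// Hypothetical helper: letters of any script, any combining mark, space,
// apostrophe and hyphen.
function looksLikeName(string $name): bool
{
    return preg_match('/^[\p{L}\p{M} \'-]+$/u', $name) === 1;
}

// "Rene" followed by a separate combining acute accent (decomposed form).
$decomposed = "Rene\u{0301}";

var_dump(looksLikeName($decomposed));
var_dump(preg_match('/^[\pL\p{Mc} \'-]+$/u', $decomposed));
```

Names typed on some systems arrive in decomposed form, which is where the difference between the two patterns shows up.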
