Difference between these two regular expressions? - php

What's the difference between these two regular expressions (used with PHP's preg_match())?
/^[0-9\x{06F0}-\x{06F9}]{1,}$/u
/^[0-9\x{06F0}-\x{06F9}\x]{1,}$/u
What's the meaning of the last \x in the second pattern?

It's interpreted as \x00 (the null character) but it's almost certainly a bug caused by sloppy editing or copy and paste.

http://www.regular-expressions.info/unicode.html
...Since \x by itself is not a valid regex token...

I think the second pattern is not valid.
According to http://www.regular-expressions.info/unicode.html, \x is only meaningful when it is followed by the Unicode code point:
Since \x by itself is not a valid regex token, \x{1234} can never be
confused to match \x 1234 times.

This is odd. PHP's notation for a Unicode character is \x{}, and Perl's is the same.
But PHP has the /u modifier for regexes, which I assume stands for Unicode. Perl has no direct equivalent of it.
In a Perl regex, \x## is parsed as a hex escape, where ## is two hex digits denoting a single-byte character. With \x or \x#, Perl warns about an illegal hex digit being ignored (it expects exactly two digits) and uses only the valid hex digits in the sequence. With no digits at all, as in a bare \x, it falls back to the \0 (NUL) character.
However, any \x{} notation is fine, and \x{0} is equivalent to \x{}. \x{0}-\x{ff} is treated as ASCII, while \x{100} and above is treated as Unicode.
So \x is a valid hex/Unicode escape sequence, but on its own it is assumed to be hex and is incomplete, and it is probably not something that should be left to the parser's default behavior.

As far as I can tell, the second \x is actually an invalid character. Do both expressions work?
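One way to settle the question is to run both patterns against a subject that contains a NUL byte. The sketch below assumes PHP 7+ (for the "\u{...}" string escape); the variable names are made up for illustration, and how a bare \x is handled may depend on the PCRE version bundled with your PHP build, so treat the result for the second pattern as something to observe rather than rely on.

```php
<?php
// Both patterns from the question; single quotes pass the backslashes through to PCRE.
$strict = '/^[0-9\x{06F0}-\x{06F9}]{1,}$/u';
$sloppy = '/^[0-9\x{06F0}-\x{06F9}\x]{1,}$/u';

$persianDigits = "\u{06F1}\u{06F2}\u{06F3}"; // Extended Arabic-Indic digits 1, 2, 3
$withNul       = "12\0";                      // ASCII digits followed by a NUL byte

$results = [
    'strict, Persian digits' => preg_match($strict, $persianDigits),
    'strict, NUL byte'       => preg_match($strict, $withNul),
    'sloppy, Persian digits' => preg_match($sloppy, $persianDigits),
    // If the bare \x is read as \x00, this one reports a match.
    'sloppy, NUL byte'       => preg_match($sloppy, $withNul),
];

var_dump($results);
```

If the last entry differs from the second one, the trailing \x has silently widened the character class, which supports the "editing accident" reading.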

Related

When do I need u-modifier in PHP regex?

I know that the PHP PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier to handle the input and the regex as UTF-8.
But do I really always need it? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything similar.
For example
preg_match('/^[\da-f]{40}$/', $string); to check if string has format of a SHA1 hash
preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every char that is not an ASCII letter or digit
preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting inner content of +(XYZ)
These regexes contain only single-byte ASCII symbols, so they should work on every input, regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but since I cut off some ASCII chars at the beginning and end of the string, this should work on UTF-8 as well, correct?
Can anyone tell me if I'm overlooking something?
There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
The second expression may give you more spacers than you expect; for example:
echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).
This is more dangerous:
echo preg_replace('/^(.)/', "0", "💩");
// => 0???
Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
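The difference between byte mode and UTF-8 mode can be seen by running the same two patterns with and without /u. This is a minimal sketch assuming PHP 7+ (for the "\u{1F4A9}" escape, which avoids depending on the source file's encoding); the variable names are illustrative only.

```php
<?php
$emoji = "\u{1F4A9}"; // one code point, four bytes in UTF-8

// Byte mode: each of the four bytes counts as a separate non-alphanumeric "character".
$byteMode = preg_replace('/[^a-zA-Z0-9]/', '0', $emoji);

// UTF-8 mode: the four bytes are decoded as a single character.
$utf8Mode = preg_replace('/[^a-zA-Z0-9]/u', '0', $emoji);

// Byte mode: "." consumes only the first byte and leaves three stray bytes behind.
$dotByteMode = preg_replace('/^(.)/', '0', $emoji);

// UTF-8 mode: "." consumes the whole character.
$dotUtf8Mode = preg_replace('/^(.)/u', '0', $emoji);

var_dump($byteMode);             // string(4) "0000"
var_dump($utf8Mode);             // string(1) "0"
var_dump(strlen($dotByteMode));  // int(4): "0" plus three orphaned bytes
var_dump($dotUtf8Mode);          // string(1) "0"
```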
The Unicode modifier u allows proper detection of accented characters, which are always multibyte in UTF-8.
preg_match('/([\w ]{2,})/', 'baz báz báž', $match);
// $match[0] = "baz b" ... wrong, accented/multibyte chars silently ignored
preg_match('/([\w ]{2,})/u', 'baz báz báž', $match);
// $match[0] = "baz báz báž" ... correct
Use it also for safe detection of whitespace:
preg_replace('/\s+/u', ' ', $txt); // works reliably e.g. with EOLs (line endings)
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
You will need this when you have to compare Unicode characters, such as Korean or Japanese.
In other words, if you are only comparing strings that contain no non-ASCII characters (such as plain English), you don't need to use this flag.

PHP - regex to allow unicode characters

I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=##£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=##$£€!?:;%,.\\'\\\"()&+\\/-]/u
However, this does not work with the £ or € characters: nothing is returned. I need to accept these characters; I have tried escaping them, but that doesn't work.
I also want to create a regex that is similar to plain A-Za-z but allows accented characters. How can I do that?
From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that first you have to make sure the input string is proper UTF-8 text.
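A quick way to check that precondition is to let PCRE itself validate the subject. The sketch below is one possible approach, not the only one; isValidUtf8() is a hypothetical helper name, and the "\u{...}" escape assumes PHP 7+.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
function isValidUtf8(string $text): bool
{
    // An empty pattern with /u matches any valid subject. On malformed UTF-8,
    // preg_match() returns false and preg_last_error() is PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $text) === 1;
}

$good = "caf\u{00E9}"; // "café" encoded as UTF-8
$bad  = "caf\xE9";     // "café" in ISO-8859-1: a lone 0xE9 byte is not valid UTF-8

var_dump(isValidUtf8($good));                        // bool(true)
var_dump(isValidUtf8($bad));                         // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

If the check fails, the input has to be converted to UTF-8 (or rejected) before any /u pattern can match it.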
Secondly, have you heard of Unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example, you could use \p{Sc} to match currency symbols (\p{S} matches all symbols), or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/u.
This will, though, match pretty much nothing, since it allows nearly every character: a ^ at the start of a character class (the part between [ and ]) means "match everything that is not in this class".
On top of that, your regex only matches one character at a time. If you want to match whole runs of characters, add a + after the closing ] so the match continues until the pattern fails.
So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.
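As a rough illustration of the category approach, here is a sketch of a filter and a letters-only check. The function names and the exact set of allowed categories are assumptions made for the example, not the asker's actual rules; the input is assumed to be valid UTF-8, and the "\u{...}" escapes need PHP 7+.

```php
<?php
// Hypothetical filter: keep letters, combining marks, digits, currency symbols
// and whitespace, and strip everything else. Adjust the class to your own rules.
function filterInput(string $input): string
{
    return preg_replace('/[^\p{L}\p{M}\p{N}\p{Sc}\s]+/u', '', $input);
}

// Letters (plus combining marks) in any script: a Unicode-aware ^[A-Za-z]+$.
function isLettersOnly(string $input): bool
{
    return preg_match('/^[\p{L}\p{M}]+\z/u', $input) === 1;
}

// "Grüße <b>£5</b> €3!" loses the angle brackets, the slash and the "!".
echo filterInput("Gr\u{00FC}\u{00DF}e <b>\u{00A3}5</b> \u{20AC}3!"), "\n";

var_dump(isLettersOnly("Zo\u{00EB}"));  // bool(true)
var_dump(isLettersOnly("Zo\u{00EB}1")); // bool(false)
```

\p{M} is included so that accents stored as separate combining marks are accepted as well as precomposed letters.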

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L}. It matches any kind of letter from any language, and it also works on its own if you don't want to use a character class [].
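For tokenizing, the property escapes can be combined into a "word" class and used with preg_match_all(). This is a minimal sketch; treating a word as a run of letters, combining marks, digits and underscores is an assumption for the example, and the sample text is written with "\u{...}" escapes (PHP 7+) so it does not depend on the file encoding.

```php
<?php
// Sample text: a Hebrew word, an ASCII word with underscore and digit, a Cyrillic word.
$text = "\u{05E9}\u{05DC}\u{05D5}\u{05DD} world_1 \u{043C}\u{0438}\u{0440}!";

// The u modifier is required so that subject and pattern are read as UTF-8.
preg_match_all('/[\p{L}\p{M}\p{N}_]+/u', $text, $matches);
$tokens = $matches[0];

print_r($tokens); // three tokens: the Hebrew word, "world_1", the Cyrillic word
```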

Generating all Unicode characters not in ASCII scheme in PHP?

This regular expression is supposed to match all non-ASCII characters, i.e. everything outside code points 0-127:
/[^\x00-\x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range may be simple with range(0, 1114111). Then I should convert each decimal number to hexadecimal with the dechex() function.
After that, how can I convert the hexadecimal number to the actual character? And how can I exclude characters that are already in the ASCII scheme?
It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier, then first of all you must use UTF-8 encoding for both the regular expression and the subject, and the regex engine will automatically interpret legal UTF-8 byte sequences as just one character. In this mode the regular expression [^\x00-\x7F] will match all characters outside the Basic Latin (ASCII) block, including those with code points greater than 255. You will also need to generate the UTF-8 representation of each character (given its code point) manually.
If you do not use the /u modifier, then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^\x00-\x7F] regex (because it's only going to be matching individual bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of the characters you will again need custom code that depends on the specific encoding.
I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII code points, just start from code point 0x80 (skipping 0x00 to 0x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
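For the curious, the experiment in /u mode could look roughly like this. It is a sketch under a few assumptions: codePointToUtf8() is a hand-written helper (not a PHP built-in) that performs the manual UTF-8 encoding mentioned above, and the surrogate range U+D800 to U+DFFF is skipped because it cannot be encoded as valid UTF-8.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by hand.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$pattern    = '/[^\x00-\x7F]/u';
$mismatches = 0;

for ($cp = 0; $cp <= 0x10FFFF; $cp++) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        continue; // surrogates are not valid in UTF-8, so /u would reject them
    }
    $isMatch = preg_match($pattern, codePointToUtf8($cp)) === 1;
    if ($isMatch !== ($cp > 0x7F)) {
        $mismatches++; // the pattern disagreed with "code point above 0x7F"
    }
}

echo "mismatches: $mismatches\n";
```

The loop makes a little over a million preg_match() calls, so it is a one-off experiment rather than something to run on every request.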

Would this regular expression work?

^([a-zA-Z0-9!##$%^&*|()_\-+=\[\]{}:;\"',<.>?\/~`]{4,})$
Would this regular expression work for these rules?
Must be at least 4 characters
Characters can be a mix of alphabet (capitalized/non-capitalized), numeric, and the following characters: ! # # $ % ^ & * ( ) _ - + = | [ { } ] ; : ' " , < . > ? /
It's intended to be a password validator. The language is PHP.
Yes?
Honestly, what are you asking for? Why don't you test it?
If, however, you want suggestions on improving it, some questions:
What is this regex checking for?
Why do you have such a large set of allowed characters?
Why don't you use /\w/ instead of /0-9a-zA-Z_/?
Why do you have the whole thing in ()s? You don't need to capture the whole thing, since you already have the whole thing, and they aren't needed to group anything.
What I would do is check the length separately, and then check against a regex to see if it has any bad characters. Your list of good characters seems to be sufficiently large that it might just be easier to do it that way. But it may depend on what you're doing it for.
EDIT: Now that I know this is PHP-centric, /\w/ is safe because PHP uses the PCRE library, which is not exactly Perl, and in PCRE, \w will not match Unicode word characters. Thus, why not check for length and ensure there are no invalid characters:
if (strlen($string) >= 4 && preg_match('/[\s~\\\\]/', $string) === 0) {
    // valid password: long enough, and no whitespace, tilde or backslash
}
Alternatively, use the little-used POSIX character class [[:graph:]]. It should work pretty much the same in PHP as it does in Perl. [[:graph:]] matches any alphanumeric or punctuation character, which sounds like what you want, and [[:^graph:]] should match the opposite. To test if all characters match graph:
preg_match('/^[[:graph:]]+$/', $string) == 1
To test if any characters don't match graph:
preg_match('/[[:^graph:]]/', $string) == 0
You forgot the comma (,) and full stop (.) and added the tilde (~) and grave accent (`) that were not part of your specification. Additionally just a few characters inside a character set declaration have to be escaped:
^([a-zA-Z0-9!##$%^&*()|_\-+=[\]{}:;"',<.>?/~`]{4,})$
And that as a PHP string declaration for preg_match:
'/^([a-zA-Z0-9!##$%^&*()|_\\-+=[\\]{}:;"\',<.>?\\/~`]{4,})$/'
I noticed that you essentially have all of ASCII, except for backslash, space and the control characters at the start, so what about this one, instead?
^([!-\[\]-~]{4,})$
You are extra escaping and aren't using some predefined character classes (such as \w, or at least \d).
Apart from that, and given that you anchor at both the beginning and the end (so the regex only matches if the whole string matches), it looks correct:
^([a-zA-Z\d\-!$##$%^&*()|_+=\[\]{};,."'<>?/~`]{4,})$
If you really mean to use this as a password validator, it reeks of insecurity:
Why are you allowing 4 chars passwords?
Why are you forbidding some characters? PHP can't handle some? Why would you care? Let the user enter the characters he pleases, after all you'll just end up storing a hash + salt of it.
No. That regular expression would not work for the rules you state, for the simple reason that $ by default matches before the final character if it is a newline. You are allowing password strings like "1234\n".
The solution is simple. Either use \z instead of $, or apply the D modifier to the regex.
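The trailing-newline problem is easy to reproduce. The sketch below reuses the compact character class suggested earlier in this thread (printable ASCII except space and backslash) purely for illustration; the variable names are made up.

```php
<?php
$class = '[!-\[\]-~]'; // printable ASCII except space and backslash

$dollar  = '/^' . $class . '{4,}$/';  // "$" also matches just before a final newline
$slashZ  = '/^' . $class . '{4,}\z/'; // "\z" matches only at the very end of the subject
$dollarD = '/^' . $class . '{4,}$/D'; // D (PCRE_DOLLAR_ENDONLY) makes "$" behave like "\z"

$password = "1234\n";

var_dump(preg_match($dollar, $password));  // int(1): the newline slips through
var_dump(preg_match($slashZ, $password));  // int(0)
var_dump(preg_match($dollarD, $password)); // int(0)
```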
