Can someone explain this regular expression? - php

/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu
This is the regular expression that CakePHP uses to validate alphanumeric strings. I don't understand what Ll, Lm, Lt, etc. are. Since this validates alphanumeric strings, they should test for digits and letters. Could someone explain this expression a little?
Thank you.

Ll, Lm, Lo, Lt, Lu and Nd are Unicode character classes (general categories). The L* codes are letter categories, and Nd is the decimal-digit category.
See here at around 1/3 of the page:
http://www.regular-expressions.info/unicode.html
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
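To see what the pattern from the question accepts in practice, here is a minimal sketch. The function name isUnicodeAlphanumeric is made up for illustration and is not CakePHP's API; only the pattern itself comes from the question. It assumes UTF-8 input.

```php
<?php
// Hypothetical helper wrapping the pattern quoted in the question.
function isUnicodeAlphanumeric(string $value): bool
{
    // Letters (Ll, Lm, Lo, Lt, Lu) and decimal digits (Nd) only.
    // u: treat pattern and subject as UTF-8. m: ^ and $ also match at line breaks.
    $pattern = '/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu';
    return preg_match($pattern, $value) === 1;
}

var_dump(isUnicodeAlphanumeric('abc123'));      // bool(true)
var_dump(isUnicodeAlphanumeric('Ünïcödé٣'));    // bool(true): accented letters, Arabic-Indic digit
var_dump(isUnicodeAlphanumeric('hello world')); // bool(false): space is not a letter or digit
var_dump(isUnicodeAlphanumeric('foo_bar'));     // bool(false): underscore is punctuation
var_dump(isUnicodeAlphanumeric("abc\n!!!"));    // bool(true): with m, one alphanumeric line is enough
```

The last case is worth noticing: because of the m modifier, a multi-line string passes as soon as any single line is purely alphanumeric.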

The codes between the curly brackets (Ll, Lm, Lt, etc.) are classes of Unicode characters. A quick Google search for Unicode character classes turns up, for example, the following list: http://www.siao2.com/2005/04/23/411106.aspx

If you regularly stumble upon weird regular expressions, try one of these tools: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world (although I'm not sure whether they explain those Unicode placeholders). Otherwise check out the list on http://regular-expressions.info/

Related

REGEX - how to do diacritic-insensitive in preg_match?

Is there a way to use preg_match (e.g. perhaps via a flag) to do diacritic-insensitive matches?
For example, say I'd like it to match:
cafe
café
I know I can do a regex like this: caf[eé]. This regex will work as long as I don't come across any other diacritic variations of e, like: ê è ë ē ĕ ě ẽ ė ẹ ę ẻ.
Of course, I could just list all of those diacritic variations in my regex, such as caf[eêéèëēĕěẽėẹęẻ]. As long as I don't miss anything, I'll be good. But I would need to do this for every letter of the alphabet, which is tedious and error-prone.
It is not an option for me to find and replace the diacritic letters in the subject with their non-diacritic counterparts. I need to preserve the subject as-is.
The ideal solution for me is for the regex to be diacritic-insensitive. With the example above, I want my regex to simply be: cafe. Is this possible?
If you're open to matching a letter from any language (which includes letters with diacritics), then you could use \p{L} or \p{Letter} as shown here: https://regex101.com/r/UBGQI6/3
According to regular-expressions.info,
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
The only catch is that \p{L} matches any letter at all: you can't restrict it to the diacritic variants of one particular letter (such as È for E), and you can't limit the match to English letters.
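A small sketch of that trade-off, assuming UTF-8 input. The pattern below is only an illustration of the \p{L} idea applied to the cafe example, not a full diacritic-insensitive matcher.

```php
<?php
// "caf" followed by exactly one letter of any kind (illustrative pattern).
$pattern = '/^caf\p{L}$/u';

$results = [];
foreach (['cafe', 'café', 'cafè', 'cafz', 'caf3'] as $word) {
    $results[$word] = preg_match($pattern, $word) === 1;
}

var_dump($results['cafe']); // bool(true)
var_dump($results['café']); // bool(true)
var_dump($results['cafè']); // bool(true)
var_dump($results['cafz']); // bool(true): the catch, any letter is accepted
var_dump($results['caf3']); // bool(false): a digit is not a letter
```

So the subject is left untouched, as required, but the match is looser than "e with or without a diacritic".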

Regex - Match only unicode alphabet not numbers

I'm using PHP and trying to write a regular expression that matches letters of any language but not numbers.
I've tried /\p{L}+/, but it matches Unicode letters and numbers too. I'm checking against Arabic and English. English digits don't pass, which is expected, but Arabic digits do pass, which is not.
Is there another regular expression that matches only letters in any language?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that you can use the u modifier, which has two functions:
it expands classical shorthand character classes like \w and \d to Unicode characters (and not only ASCII characters)
it forces the string to be seen as a Unicode string
So you can use: /\pL+/u
Note that in your particular case the first behavior is not needed. You can switch on only the second behavior with: /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern)
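As a quick check of the u modifier, here is a minimal sketch. The helper name isLettersOnly is hypothetical, the pattern is anchored so that the whole string must consist of letters, and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: true only if every character is a Unicode letter.
function isLettersOnly(string $text): bool
{
    // The u modifier makes PCRE read the subject as UTF-8 code points
    // instead of single bytes, so \pL is tested against whole characters.
    return preg_match('/^\pL+$/u', $text) === 1;
}

var_dump(isLettersOnly('hello'));  // bool(true): English letters
var_dump(isLettersOnly('مرحبا'));  // bool(true): Arabic letters
var_dump(isLettersOnly('١٢٣'));    // bool(false): Arabic-Indic digits are numbers (Nd)
var_dump(isLettersOnly('abc123')); // bool(false)
```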

Replace all but letters - explanation

I would like to modify a string and remove all but English letters (a-z, A-Z). Note that white space should also be removed.
This post provides two answers: Remove everything except letters from PHP string
$new_string = preg_replace('/\PL/u', '', $old_string);
$new_string = preg_replace('/[^a-z]/i','',$old_string);
I understand the second answer, but not the first. The first had the highest votes.
Is the first the better answer? Please explain what it is doing.
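\p{xx} matches a character that has the Unicode property xx, and \P{xx} (capital P) matches a character that does not have it. In this particular case, L means "letter", so \PL matches anything that is not a letter. PHP supports the \P{xx} syntax, which is why /\PL/u works.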
Note, that L includes the following properties: Ll, Lm, Lo, Lt and Lu (check full list in documentation). That means, L will include:
Lower case letter (Ll)
Modifier letter (Lm)
Other letter (Lo)
Title case letter (Lt)
Upper case letter (Lu)
That means \PL fits the requirement "all except letters" better, but it will keep non-English letters such as French accented letters (é is in Ll, É is in Lu), while [^a-z] with the i modifier (equivalent to [^a-zA-Z]) is stricter and leaves only the letters specified in the group.
And, of course, \P{xx} only makes sense in terms of Unicode, so the u modifier is needed there.
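To make the difference concrete, here is a short sketch running both replacements from the question on the same made-up sample string (UTF-8 assumed):

```php
<?php
$old_string = 'Café 123 déjà-vu!'; // sample input, chosen for illustration

// Remove every character that is NOT a Unicode letter.
$unicode_letters = preg_replace('/\PL/u', '', $old_string);

// Remove every character that is NOT an ASCII letter a-z / A-Z.
$ascii_letters = preg_replace('/[^a-z]/i', '', $old_string);

var_dump($unicode_letters); // string(12) "Cafédéjàvu"
var_dump($ascii_letters);   // string(7) "Cafdjvu"
```

Which one is "better" depends on the requirement: the question asks for English letters only, which is what the second pattern delivers.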
\pL is the Unicode property for letters
\pN is the Unicode property for numbers
[a-z] does not match accented letters such as é, à, ç, è.
how can i use preg_match with alphanumeric and unicode acceptance?

regex unable to allow apostrophe

I am experiencing a strange problem with a regular expression I have already used before.
The goal is to allow the user to enter his name, with letters, hyphen, and apostrophes if needed in a php form.
My regex is:
"/^[\w\s'àáâãäåçèéêëìíîïðòóôõöùúûüýÿ-]+$/i"
But... everything is allowed except the apostrophe. Escaping it changes nothing. Why?
To deal with Unicode characters, you can do the following (the u modifier makes PCRE treat the pattern and the subject as UTF-8):
/^[\pN\pL\pP\pZ]+$/u
where:
\pN stands for any number
\pL stands for any letter
\pP stands for any punctuation
\pZ stands for any space
It matches names like:
d'Alembert
d’Alembert (note the different apostrophe from the one above)
Jean-François
O'Connors
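A minimal sketch of this check, assuming UTF-8 input and the u modifier. The function name isPlausibleName is invented for the example.

```php
<?php
// Hypothetical helper: accepts numbers, letters, punctuation and separators only.
function isPlausibleName(string $name): bool
{
    return preg_match('/^[\pN\pL\pP\pZ]+$/u', $name) === 1;
}

var_dump(isPlausibleName("d'Alembert"));    // bool(true): ASCII apostrophe
var_dump(isPlausibleName('d’Alembert'));    // bool(true): typographic apostrophe (U+2019)
var_dump(isPlausibleName('Jean-François')); // bool(true)
var_dump(isPlausibleName("O'Connors"));     // bool(true)
var_dump(isPlausibleName('Jean<François')); // bool(false): < is a symbol, not punctuation
```

Note that this class is fairly permissive: it also lets through digits and every punctuation character, so it may be worth narrowing if only hyphens and apostrophes are wanted.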

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Or try \p{L} on its own if you don't want to use a character set []. It matches any kind of letter from any language.
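For the tokenizing use case, a rough sketch with preg_match_all might look like the following. The sample text and the function name tokenizeWords are made up, and UTF-8 input is assumed.

```php
<?php
// Hypothetical tokenizer: returns runs of letters (any script) and underscores.
function tokenizeWords(string $text): array
{
    // Without the u modifier PCRE would work byte by byte on UTF-8 input.
    preg_match_all('/[\pL_]+/u', $text, $matches);
    return $matches[0];
}

$tokens = tokenizeWords('hello שלום foo_bar 42');
print_r($tokens); // three tokens: hello, שלום, foo_bar (the digits are skipped)
```

If digits should count as word characters too, adding \pN to the class would be one way to do it.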
