REGEX - how to do diacritic-insensitive in preg_match? - php

Is there a way to use preg_match (e.g. perhaps via a flag) to do diacritic-insensitive matches?
For example, say I'd like it to match:
cafe
café
I know I can do a regex like this: caf[eé]. This regex will work as long as I don't come across any other diacritic variations of e, like: ê è ë ē ĕ ě ẽ ė ẹ ę ẻ.
Of course, I could just list all of those diacritic variations in my regex, such as caf[eêéèëēĕěẽėẹęẻ]. And as long as I don't miss anything, I'll be good. But I would need to do this for every letter in the alphabet, which is tedious and error-prone.
It is not an option for me to find and replace the diacritic letters in the subject with their non-diacritic counterparts. I need to preserve the subject as-is.
The ideal solution for me is for the regex itself to be diacritic-insensitive. With the example above, I want my regex to simply be: cafe. Is this possible?

If you're open to matching a letter from any language (which includes characters with diacritics), then you could use \p{L} or \p{Letter} as shown here: https://regex101.com/r/UBGQI6/3
According to regular-expressions.info,
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
The only catch is that \p{L} cannot single out a particular base letter: it matches È, but it also matches every other letter, so you can't limit the match to the diacritic variants of one English letter.
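A minimal sketch of the \p{L} approach and its catch, assuming UTF-8 input with precomposed characters. The helper name matchesCafLetter is hypothetical and only for illustration:

```php
<?php
// Hypothetical helper: true when $word is "caf" followed by exactly one letter.
function matchesCafLetter(string $word): bool
{
    // The u modifier makes PCRE read pattern and subject as UTF-8, so \p{L}
    // consumes a whole code point such as U+00E9 rather than a single byte.
    return preg_match('/^caf\p{L}$/u', $word) === 1;
}

var_dump(matchesCafLetter('cafe'));         // bool(true)
var_dump(matchesCafLetter("caf\u{00E9}"));  // bool(true), "café"
var_dump(matchesCafLetter('cafx'));         // bool(true), the catch: any letter matches
var_dump(matchesCafLetter('caf3'));         // bool(false), a digit is not a letter
```

The third call shows why this is looser than a true diacritic-insensitive match: the pattern accepts any letter in the last position, not only variants of e.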

Related

How to build a regular expression in PHP, containing alphabetic and accent/circumflex/etc characters?

I have never worked with regular expressions. And now, at my job, I need a regular expression that accepts a string with:
(It is a full name that can have so many variants and many languages)
Blank spaces
Any quantity of numbers
Any quantity of alphabetic characters, including grave, acute, circumflex, tilde, diaeresis, ring above, cedilla, etc. This is all variants of each letter. Example (A, À, Á, Â, Ã, Ä, Å)
Latin special characters (ñ Ñ, ç Ç)
German special character (ß)
En dash (-)
I am reading and studying documentation now, but I am stuck.
You could use something like this:
^[0-9\wÀ-ž\s\-]+$
0-9 for numbers
\w for word characters
À-ž for the special characters
\s for spaces
\- for the -
Wrapping these inside [] makes a character class, which matches any one of the characters listed, and putting a + after the class requires at least one character from that class.
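As a sketch of how that pattern could be used in PHP: the function name isValidFullName is hypothetical, the range À-ž is written as \x{00C0}-\x{017E} so the example does not depend on the source file's encoding, and the u modifier is an addition assumed here so that the range is read as code points rather than bytes.

```php
<?php
// Hypothetical validator built on the character class from the answer.
function isValidFullName(string $name): bool
{
    // \x{00C0}-\x{017E} is the same range as À-ž, spelled with code points.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^[0-9\w\x{00C0}-\x{017E}\s\-]+$/u', $name) === 1;
}

// "José Nuñez-Straße 42": accents, ñ, ß, a hyphen, spaces and digits.
var_dump(isValidFullName("Jos\u{00E9} Nu\u{00F1}ez-Stra\u{00DF}e 42")); // bool(true)
var_dump(isValidFullName('John@Doe'));                                  // bool(false)
```

Note that \w already covers the digits 0-9 and the underscore, so the explicit 0-9 is redundant but harmless.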

PHP REGEX to find uppercase sentence in html tag

I am trying to create regex to find uppercase sentence in html tag. Here is an example:
<span style="font-family:Arial; font-size:11pt; font-weight:bold">RESSONÂNCIA MAGNÉTICA</span></p>
I got this regex: ^<span style="font-family:Arial; font-size:11pt; font-weight:bold">+[A-Z]+<\/span><\/p>
However it is not working properly. It is missing spaces and letters with accentuation.
You seem to have a very specific case in mind. @Mariano pointed out a sweet way to grab uppercase characters that is Unicode-safe (nice work!), but maybe coming at this a little differently will help.
You mentioned wanting uppercase sentences. I assume that means more than uppercase letters: punctuation and all manner of other characters are okay too. Maybe think about what isn't okay? If the only thing not allowed inside that tag is lowercase letters, your match (inside the tag) could be [^a-z]+, which matches anything that isn't a lowercase letter from a to z.
preg_replace("/^<span style=\"font-family:Arial; font-size:11pt; font-weight:bold\">([^a-z]+)<\/span><\/p>/u", '$1', $input_lines);
And if you want to grab the contents of any span, you could use something like this:
preg_replace("/^<span[^>]+>([^a-z]+)<\/span>/u", '$1', $input_lines);
Or to handle lowercase letters with accents:
preg_replace("/^<span[^>]+>([^\p{Ll}]+)<\/span>/u", '$1', $input_lines);
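To see why the \p{Ll} variant matters, here is a small sketch comparing the two negated classes on a string that contains one accented lowercase letter. The variable names are made up for the example:

```php
<?php
// "RESSONâNCIA" with a lowercase "â" (U+00E2) in the middle.
$subject = "RESSON\u{00E2}NCIA";

// [^a-z] only excludes ASCII lowercase, so the accented "â" slips through.
$asciiCheck = preg_match('/^[^a-z]+$/u', $subject);

// [^\p{Ll}] excludes every Unicode lowercase letter, including "â".
$unicodeCheck = preg_match('/^[^\p{Ll}]+$/u', $subject);

var_dump($asciiCheck);   // int(1)
var_dump($unicodeCheck); // int(0)
```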
You're using [A-Z] that only matches A to Z. This can be solved using Unicode categories
Use \p{Lu} to match characters with the Uppercase_Letter Unicode property.
In order to use the above, set the /u (Unicode modifier) in your pattern.
Don't forget to include spaces (your example has 1).
This will match what you want: [\p{Lu} ]+
Code:
preg_replace("/^<span style=\"font-family:Arial; font-size:11pt; font-weight:bold\">([\p{Lu} ]+)<\/span><\/p>/u", '$1', $input_lines);
Demo online
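If the goal is to pull the uppercase text out rather than replace around it, the same pattern can be used with preg_match. This is only a sketch; the function name extractUppercaseHeading is hypothetical, and the span attributes are copied from the question's example:

```php
<?php
// Hypothetical helper: returns the uppercase text inside the span, or null.
function extractUppercaseHeading(string $html): ?string
{
    // [\p{Lu} ]+ accepts uppercase letters from any script, plus spaces.
    $pattern = '/^<span style="font-family:Arial; font-size:11pt; font-weight:bold">'
             . '([\p{Lu} ]+)<\/span><\/p>/u';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$open = '<span style="font-family:Arial; font-size:11pt; font-weight:bold">';

// "RESSONÂNCIA MAGNÉTICA" written with code point escapes.
$heading = "RESSON\u{00C2}NCIA MAGN\u{00C9}TICA";

var_dump(extractUppercaseHeading($open . $heading . '</span></p>') === $heading);   // bool(true)
var_dump(extractUppercaseHeading($open . 'Mixed Case</span></p>'));                 // NULL
```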
I suggested using \p{Lu} in a previous answer, but you're probably not interested in matching Arabic, German special chars or whatever Uppercase_Letter category matches.
Keep it simple:
Just add the special chars you want inside the character class. For example, and I'm guessing it's Portuguese you're matching:
[A-ZÁÂÃÀÇÉÊÍÓÔÕÚ ]+

Replace all but letters - explanation

I would like to modify a string and remove all but English letters (a-z, A-Z). Note that white space should also be removed.
This post provides two answers Remove everything except letters from PHP string
$new_string = preg_replace('/\PL/u', '', $old_string)
$new_string = preg_replace('/[^a-z]/i','',$old_string);
I understand the second answer, but not the first. The first had the highest votes.
Is the first the better answer? Please explain what it is doing.
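\PL is a negated Unicode property escape: \p{xx} matches a character that has property xx, and \P{xx} matches a character that does not. In this particular case, L means "letter", so \PL matches anything that is not a letter. PHP's PCRE supports \p{xx} and \P{xx}, which is why /\PL/u works.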
Note, that L includes the following properties: Ll, Lm, Lo, Lt and Lu (check full list in documentation). That means, L will include:
Lower case letter (Ll)
Modifier letter (Lm)
Other letter (Lo)
Title case letter (Lt)
Upper case letter (Lu)
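That means \PL fits the requirement "all except letters" better in general, but it will keep accented letters such as French é (which belongs to Ll) and letters from every other script, while /[^a-z]/i (same as [^a-zA-Z]) is stricter and leaves only the letters specified in the class. Since the question asks for English letters only, the second pattern is the one that meets that requirement.
And, of course, \P{xx} only makes sense in terms of Unicode, so the /u modifier is needed there.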
\pL is the unicode property for letters
\pN is the unicode property for numbers
[a-z] doesn't take care of éàçè....
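A short sketch that runs both patterns from the question on the same input makes the difference concrete. The sample string and variable names are invented for the example:

```php
<?php
// "Café 123 déjà-vu!" written with code point escapes.
$old = "Caf\u{00E9} 123 d\u{00E9}j\u{00E0}-vu!";

// Remove everything that is not a Unicode letter: accented letters survive.
$unicodeLetters = preg_replace('/\PL/u', '', $old);

// Remove everything that is not an ASCII letter: accented letters are dropped.
$asciiLetters = preg_replace('/[^a-z]/i', '', $old);

var_dump($unicodeLetters === "Caf\u{00E9}d\u{00E9}j\u{00E0}vu"); // bool(true)
var_dump($asciiLetters);                                         // string(7) "Cafdjvu"
```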
how can i use preg_match with alphanumeric and unicode acceptance?

PHP and regexp to accept only Greek characters in form

I need a regular expression that accepts only Greek chars and spaces for a name field in my form (PHP).
I've tried several findings on the net but no luck. Any help will be appreciated.
Full letters solution, with accented letters:
/^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$/
I'm not too current on the Greek alphabet, but if you wanted to do this with the Roman alphabet, you would do this:
/^[a-zA-Z\s]*$/
So to do this with Greek, you replace a and z with the first and last letters of the Greek alphabet. If I remember right, those are α and ω. So the code would be:
/^[α-ωΑ-Ω\s]*$/
The other answers here didn't work for me. Greek Unicode characters are included in the following two blocks
Greek and Coptic U+0370 to U+03FF (normal Greek letters)
Greek Extended U+1F00 to U+1FFF (Greek letters with diacritics)
The following regex matches whole Greek words:
[\u0370-\u03ff\u1f00-\u1fff]+
I will let the reader translate that to whichever programming language format they may be using.
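For PHP specifically, a possible translation looks like the sketch below. PCRE does not accept \uXXXX, so the code points are written as \x{XXXX} and the u modifier is added. The function name isGreekWord and the anchors ^ and $ are additions for the example:

```php
<?php
// Hypothetical helper: true when $word consists only of code points from the
// "Greek and Coptic" and "Greek Extended" blocks.
function isGreekWord(string $word): bool
{
    return preg_match('/^[\x{0370}-\x{03FF}\x{1F00}-\x{1FFF}]+$/u', $word) === 1;
}

// "Ελληνικά" (monotonic) and a polytonic word starting with U+1F00.
$modern    = "\u{0395}\u{03BB}\u{03BB}\u{03B7}\u{03BD}\u{03B9}\u{03BA}\u{03AC}";
$polytonic = "\u{1F00}\u{03B3}\u{03AC}\u{03C0}\u{03B7}";

var_dump(isGreekWord($modern));    // bool(true)
var_dump(isGreekWord($polytonic)); // bool(true)
var_dump(isGreekWord('Hello'));    // bool(false)
```

The two blocks also contain a few non-letter and unassigned code points, so this is a block check rather than a strict letter check.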
See also
Unicode charts
To elaborate on leo pal's answer, an even more complete regex, which would accept even capital accented Greek characters, would be the following:
/^[α-ωΑ-ΩίϊΐόάέύϋΰήώΊΪΌΆΈΎΫΉΏ\s]+$/
With this, you get:
α-ω - lowercase letters
Α-Ω - uppercase letters
ίϊΐόάέύϋΰήώ - lowercase letters with all (modern) diacritics
ΊΪΌΆΈΎΫΉΏ - uppercase letters with all (modern) diacritics
\s - any whitespace character
Note: The above does not take into account ancient Greek diacritics (ᾶ, ἀ, etc.).
What worked for me was /^[a-zA-Z\p{Greek}]+$/u
source: http://php.net/manual/fr/function.preg-match.php#105324
Greek & Coptic in utf-8 seem to be in the U+0370 - U+03FF range. Be aware: a space, a -, a . etc. are not....
I just noticed on the excellent site https://regexr.com/ that the Greek characters range from "Ά" (902, U+0386) to "ώ" (974, U+03CE), with a few code points in between that are not alphabet characters: "·" (903, U+0387) and the unassigned code points 907 and 909 (U+038B and U+038D). U+03A2 (930) is also unassigned.
So a range [Ά-ώ] will cover 99.99% of the cases!
(?![·\u038B\u038D\u03A2])[Ά-ώ] covers 100%. (I have not tested this in PHP, though.)
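A possible PHP version of the lookahead idea is sketched below, untested against real form data. It assumes the code points to exclude are U+0387 (the "·" at decimal 903), U+038B and U+038D (decimal 907 and 909), and additionally U+03A2, which is also unassigned inside the range. The function name isModernGreekLetters is hypothetical:

```php
<?php
// Hypothetical helper: every character must lie in U+0386..U+03CE and must
// not be one of the excluded non-letter code points.
function isModernGreekLetters(string $text): bool
{
    // The negative lookahead rejects the excluded code points before the
    // range class consumes the character.
    $pattern = '/^(?:(?![\x{0387}\x{038B}\x{038D}\x{03A2}])[\x{0386}-\x{03CE}])+$/u';

    return preg_match($pattern, $text) === 1;
}

var_dump(isModernGreekLetters("\u{0386}\u{03CE}")); // bool(true), both ends of the range
var_dump(isModernGreekLetters("\u{03B1}\u{0387}")); // bool(false), ano teleia excluded
var_dump(isModernGreekLetters("\u{03A2}"));         // bool(false), unassigned code point
```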
The modern Greek alphabet in Unicode is in the U+0386 - U+03CE range.
So the regex you need to accept Greek only characters is:
$regex_gr = '/^[\x{0386}-\x{03CE}]+$/u';
or (with spaces)
$regex_gr_with_spaces = '/^[\x{0386}-\x{03CE}\s]+$/u';

Can someone explain this regular expression?

/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu
This is the regular expression that CakePHP uses to validate alphanumeric strings. I am unable to understand what Ll, Lm, Lt etc. are. This is meant to validate alphanumeric strings, so it should test for numbers and characters. Could someone explain this expression a little?
Thank you.
Ll, Lm, Lo, Lt, Lu, Nd are unicode character classes.
See here at around 1/3 of the page:
http://www.regular-expressions.info/unicode.html
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
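To make the categories concrete, here is a small sketch that runs the pattern exactly as quoted in the question against a few inputs. Whether current CakePHP versions still use this exact pattern is not verified here, and the constant name ALNUM is made up for the example:

```php
<?php
// The pattern as quoted in the question: letters (Ll, Lm, Lo, Lt, Lu) and
// decimal digits (Nd) from any script, with the m and u modifiers.
const ALNUM = '/^[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Nd}]+$/mu';

var_dump(preg_match(ALNUM, 'abc123'));                // int(1)
// "Über" followed by ARABIC-INDIC DIGIT THREE (U+0663, category Nd).
var_dump(preg_match(ALNUM, "\u{00DC}ber\u{0663}"));   // int(1)
var_dump(preg_match(ALNUM, 'abc 123'));               // int(0), space is not allowed
// With the m modifier, ^ and $ match per line, so one valid line is enough.
var_dump(preg_match(ALNUM, "abc\n!!!"));              // int(1)
```

The last call is worth noticing: because of the m modifier, a multi-line string passes as soon as any single line is alphanumeric.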
The codes between the curly brackets (Ll, Lm, Lt, etc.) are classes of Unicode characters. A quick Google search for Unicode character classes produces, for example, the following list: http://www.siao2.com/2005/04/23/411106.aspx
If you regularly stumble upon weird regular expressions, try one of these: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world - albeit I'm not sure if they explain those (mostly Unicode?) placeholders. Otherwise check out the list on http://regular-expressions.info/
