PHP preg_match with Croatian Characters - php

I'm new to regular expressions, but with a little searching on StackOverflow I managed to get what I want: the check returns true if two or more words are separated by commas, and false otherwise, including when the input ends with a comma and nothing follows it. However, I am having a problem with Croatian characters (č, ć, ž, đ, š, upper and lowercase). My current preg_match looks like this:
if (preg_match('/^(([a-zA-Z0-9]+\\s*,\\s*)+(\\s*)([a-zA-Z0-9]+))$/', $data))
{
//do stuff
}
But the problem with this approach is that it won't return true if the input contains Č, ć, ž, and so on. I know that is because [a-zA-Z] doesn't match these characters. So, my question is how to write a regex that will return true with Croatian characters. Also, if this could be done more simply, feel free to comment; I would like to hear your suggestions. BTW, I built this with the help of regex101.com.

The \p{L} shorthand class and the u option make it possible to match Unicode letters.
This program returns FOUND!:
$data = "Čdd, ćdd, žddd";
if (preg_match('/^(([\\p{L}0-9]+\\s*,\\s*)+(\\s*)([\\p{L}0-9]+))$/u', $data))
{
echo "<h1>FOUND!</h1>";
}
As per Regular-Expressions.info:
You can match a single character belonging to the "letter" category
with \p{L}.
and another of its pages, devoted to PHP regex:
You should specify /u for regular expressions that use \x{FFFF}, \X or
\p{L} to match Unicode characters, graphemes, properties or scripts.
PHP will interpret '/regex/u' as a UTF-8 string rather than as an
ASCII string.
Also, see one of the examples on the preg_match documentation page:
For those who search for a unicode regular expression example using
preg_match here it is:
Check for Persian digits preg_match( "/[^\x{06F0}-\x{06F9}\x]+/u" ,
'۱۲۳۴۵۶۷۸۹۰' );
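The check from the answer above can be wrapped in a small function. This is only a sketch: the function name isCommaSeparatedWords is hypothetical, and the pattern is a slightly flattened rewrite of the original that should accept the same inputs (a word, followed by one or more ", word" groups).

```php
<?php
// Hypothetical helper (not from the original answer): true when $data holds
// two or more comma-separated words made of Unicode letters or ASCII digits.
function isCommaSeparatedWords(string $data): bool
{
    // \p{L} matches any Unicode letter; the u modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    return preg_match('/^[\p{L}0-9]+(?:\s*,\s*[\p{L}0-9]+)+$/u', $data) === 1;
}

var_dump(isCommaSeparatedWords('Čdd, ćdd, žddd')); // bool(true)
var_dump(isCommaSeparatedWords('Čdd'));            // bool(false): only one word
var_dump(isCommaSeparatedWords('Čdd, ćdd,'));      // bool(false): trailing comma
```

Comparing the result with === 1 keeps the three possible return values of preg_match (1, 0, or false on error) from being mixed up.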

Related

Regular expression to match unicode block, or index range

I'm trying to create a regular expression that will match any characters in a unicode block - specifically the Mathematical Alphanumeric Symbols block.
The intention is to identify content that uses these Unicode characters to fake formatting, such as bold or italic text, where it isn't generally supported. There are plenty of websites, like this one, that help users convert text.
I've tried using the shorthand property code, but it doesn't seem to match all characters I'd expect from the block.
preg_match('/\p{Sm}/i', '𝟮') === 1; // false
It doesn't appear as though PHP supports the named variants either, so I can't do something like \p{Math}.
I believe I need to target the block range - which is from U+1D400 - U+1D7FF, but I cannot work out how to correctly build this regex. This is how I thought I would have it work, but it doesn't appear to work.
preg_match('/\x{1D400}-\x{1D7FF}/i', '𝗮') === 1; // false
I would expect none of these characters to match (typed straight on my keyboard):
abcdefghijklmnopqrstuvwxyz0123456789
I would expect every single one of these characters to match (same as above, converted to Math bold using the link above):
𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗
I'm guessing that this expression might work, not sure though:
$re = '/[\x{1D400}-\x{1D7FF}]+/su';
$str = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
For reference, the Unicode symbol categories are:
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
Reference
RegEx for Mathematical Alphanumeric Symbols
Unicode Regular Expressions
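A minimal sketch of the block-range approach follows. The function name containsMathAlphanumerics is hypothetical. Two points explain the failed attempts in the question: the styled characters are classified as letters and digits rather than math symbols, so \p{Sm} does not cover them, and a code point range only works inside a character class [...] together with the u modifier.

```php
<?php
// Hypothetical helper: true when $text contains at least one code point from
// the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF).
function containsMathAlphanumerics(string $text): bool
{
    // The range must sit inside [...]; the u modifier is required because
    // code points this large are rejected outside UTF-8 mode.
    return preg_match('/[\x{1D400}-\x{1D7FF}]/u', $text) === 1;
}

var_dump(containsMathAlphanumerics('abcdefghijklmnopqrstuvwxyz0123456789')); // bool(false)
var_dump(containsMathAlphanumerics('plain and 𝐛𝐨𝐥𝐝'));                       // bool(true)

// Removing the styled characters instead of only detecting them:
echo preg_replace('/[\x{1D400}-\x{1D7FF}]+/u', '', 'a𝐛c'), "\n"; // ac
```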

Regex for validating and sanitizing all english and non-english unicode alphabet characters in PHP

While there have been many questions regarding the non-english characters regex issue I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library which would help me to filter non-english input.
Could you please suggest a regular expression that would allow
all english alphabet characters (abc...)
all non-english alphabet characters (šýüčá...)
spaces
case insensitive
in validation as well as sanitization. Essentially, I want either preg_match to return false when the input contains anything else than the 4 points above or preg_replace to get rid of everything except these 4 categories.
I was able to create
'/^((\p{L}\p{M}*)|(\p{Cc})|(\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works well when validating input but not when sanitizing it.
EDIT:
User enters 'český [jazyk]' as an input. Using '/^[\p{L}\p{Zs}]+$/u' in preg_match, the script determines that the string contains disallowed characters (in this case '[' and ']'). Next I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass into preg_replace to match all characters that are not allowed by the regular expression stated above?
I think all you need is a character class like:
^[\p{L}\p{Zs}]+$
It means: The whole string (or line, with (?m) option) can only contain Unicode letters or spaces.
Have a look at the demo.
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
To remove all symbols that are not Unicode letters or spaces, use this code:
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);
The output of the sample program:
český jazyk
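Validation and sanitization can share the same character class, as in the sketch below. Both function names are hypothetical, and the null fallback is an assumption about how errors should be handled, not part of the original answer.

```php
<?php
// Hypothetical helpers built on the same character class, [\p{L}\p{Zs}].

// Validation: the whole string must consist of letters and space separators.
function isLettersAndSpaces(string $input): bool
{
    return preg_match('/^[\p{L}\p{Zs}]+$/u', $input) === 1;
}

// Sanitization: the negated class removes everything the validator rejects.
function keepLettersAndSpaces(string $input): string
{
    // preg_replace() returns null on error (for example invalid UTF-8),
    // so fall back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{Zs}]+/u', '', $input) ?? '';
}

$input = 'český [jazyk]';
var_dump(isLettersAndSpaces($input));                       // bool(false)
echo keepLettersAndSpaces($input), "\n";                    // český jazyk
var_dump(isLettersAndSpaces(keepLettersAndSpaces($input))); // bool(true)
```

If the input may contain decomposed accents (a base letter followed by a combining mark), adding \p{M} to both classes would keep those marks from being stripped.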

Regex - Match only unicode alphabet not numbers

I'm using PHP, and trying to write a regular expression that matches any alphabet in any language but not numbers.
I've tried /\p{L}+/, but it matches Unicode letters and numbers too. I'm checking against Arabic and English. English numbers don't pass, which is expected, but Arabic numbers pass, which is not.
Is there another regular expression that matches only letters in any language?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that you can use the u modifier, which has two functions:
it extends the classic shorthand character classes such as \w and \d to Unicode characters (not only ASCII characters)
it forces the string to be read as a Unicode (UTF-8) string
So you can use: /\pL+/u
Note that in your particular case the first behavior is not needed; you can switch on only the second behavior with /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern).
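A short sketch of the difference, using sample strings chosen here for illustration: Arabic letters belong to category L, while Arabic-Indic digits belong to category Nd, so an anchored \pL+ with the u modifier accepts the first and rejects the second. Without the modifier the subject is read byte by byte, which is presumably why the digits slipped through in the question.

```php
<?php
// Sample inputs (assumptions for this illustration, not from the question).
$letters = 'مرحبا'; // Arabic letters
$digits  = '١٢٣';   // Arabic-Indic digits

// With the u modifier the subject is decoded as UTF-8 before matching.
var_dump(preg_match('/^\pL+$/u', $letters)); // int(1)
var_dump(preg_match('/^\pL+$/u', $digits));  // int(0)
var_dump(preg_match('/^\pL+$/u', 'abc123')); // int(0)
```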

JavaScript/PHP Regular Expression

I'm trying to match a first name and a last name with something like this:
$pattern = '/[a-zA-Z\-]{3,30} +[a-zA-Z]+/';
This works great, except when I have a first name like Mélissa Smith.
My match becomes Lissa Smith.
How do I match all special characters like é?
In JavaScript, you can add a Unicode character range to A-Za-z (the range \u0080-\uffff also covers non-letter characters, so this is permissive):
"Mélissa Smith".match( /[a-zA-Z\u0080-\uffff-]{3,30} +[a-zA-Z\u0080-\uffff]+/ )
returns: ["Mélissa Smith"]
Put the regex into Unicode mode with the /u modifier and use an appropriate Unicode character class instead of hardcoding just Latin letters:
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';
I also anchored the pattern between ^ and $ because otherwise it could end up matching things you didn't intend it to.
You have to keep in mind that when you do this, the input (as well as the pattern itself) must be encoded in UTF-8.
However, it has to be said that naively parsing names like this is not going to give you very good results. People's full names are way too involved for something this simple to work across the board.
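The anchored pattern above can be tried out as follows. The function name looksLikeFullName and the sample names are hypothetical; the pattern itself is the one from the answer.

```php
<?php
// Hypothetical helper wrapping the anchored Unicode pattern.
// Both the pattern and the input are assumed to be UTF-8 encoded.
function looksLikeFullName(string $name): bool
{
    return preg_match('/^(\pL|-){3,30}\s+\pL+$/u', $name) === 1;
}

var_dump(looksLikeFullName('Mélissa Smith'));   // bool(true)
var_dump(looksLikeFullName('Jean-Luc Picard')); // bool(true)
var_dump(looksLikeFullName('Al Smith'));        // bool(false): first name under 3 characters
var_dump(looksLikeFullName('Mélissa'));         // bool(false): no last name
```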
Try using the POSIX class [:alpha:] instead of a-zA-Z inside the brackets, for example [[:alpha:]-]. In PHP it should match accented letters as well, provided the pattern also uses the u modifier.
http://www.regular-expressions.info/posixbrackets.html

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L}. It matches any kind of letter from any language, and it can be used on its own if you don't want a character class [].
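For tokenizing, the class can be passed to preg_match_all, as in this sketch. The function name tokenizeWords is hypothetical, and adding \p{N} (any number) to the class is an assumption made here so that digits count as word characters, as they do in \w.

```php
<?php
// Hypothetical tokenizer: returns runs of letters, numbers and underscores.
function tokenizeWords(string $text): array
{
    // \p{L} = any letter, \p{N} = any number; the u modifier is needed
    // so that the UTF-8 subject is decoded into code points.
    preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);
    return $matches[0];
}

// Prints the four tokens: שלום, עולם, hello_world, 42
print_r(tokenizeWords('שלום עולם, hello_world 42!'));
```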
