PHP mb_ereg_match faulty match - php

I am trying to match some text with PHP's mb_ereg_match, and I am using this regex to match all non-word chars:
/[^-\w.]|[_]/u
I want to be able to match Unicode chars, which is why I am using mb_ereg.
With this input:
'γιωρ;γος.gr'
Which contains chars from the Greek alphabet.
I want to match the ';' and if it is matched to return -1 else return the input.
Whatever I do it doesn't match the ';' and returns the input.
I tried to use preg_match but it doesn't work as I want either.
Any suggestions?
Edit 1
I did a test and I found that it matches correctly if I convert my input to:
';γος.gr'
It also works fine with Latin chars.
Edit 2
If I get one of the following I want to print -1.
'γιωρ;γος.gr'
';γος.gr'
'γιωρ;.gr'
';.gr'
Else I want to get whatever the input is.
Edit 3
I did some more tests and it doesn't match any special char that is surrounded by UTF-8 chars.

You need to use \X with preg_match_all to match all Unicode chars:
\X
- an extended Unicode sequence
Also, see this \X description from Regular-Expression.info:
Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, and Ruby 2.0: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
And you can use the following snippet then:
$re = '/\X/u';
$str = "γιωρ;γος.gr";
preg_match_all($re, $str, $matches);
if (in_array(";", $matches[0])) {
echo -1;
}
else {
print_r($matches[0]);
}
See IDEONE demo
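For the "-1 or the input" behaviour described in the question, a single preg_match call with the u modifier may be enough. The sketch below is not taken from the answer above: the function name check_input is hypothetical, and it assumes that "allowed" means Unicode letters, combining marks, digits, hyphens and dots. As far as I know, mb_ereg_match only tries to match at the start of the string, which would be consistent with Edit 1, where ';' is the first character.

```php
<?php
// Hypothetical helper: returns -1 when the input contains any character
// outside the assumed allowed set, otherwise returns the input unchanged.
function check_input(string $input)
{
    // Allowed (assumption): letters, combining marks, digits, '-' and '.'.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $forbidden = '/[^-\p{L}\p{M}\p{N}.]/u';

    return preg_match($forbidden, $input) === 1 ? -1 : $input;
}

var_dump(check_input('γιωρ;γος.gr')); // int(-1)
var_dump(check_input('γιωργος.gr'));  // the input string, unchanged
```

Because the class is built from \p{L}, \p{M} and \p{N} rather than \w, the underscore is rejected without a separate |[_] branch.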

Related

How do i match with regex special chars that are not alphanumeric whilst ignoring emojis?

I'm currently having a problem: I don't know how to make a regex match special characters while ignoring emojis.
For example, I want to match the special chars that are not emojis in this string: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Currently my regex is:
[^\x00-\x7F]+
Current output: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Wanted output: 𝓉𝑒𝓈𝓉𝒾𝓃𝑔
How would I go about fixing this?
Maybe, this expression might work:
$re = '/[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]/u';
$str = '❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️';
$subst = '';
echo preg_replace($re, $subst, $str);
Output
𝓉𝑒𝓈𝓉𝒾𝓃𝑔️
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Reference:
javascript unicode emoji regular expressions
Use the following unicode regex:
[^\p{M}\p{S}]+
\p{M} matches characters intended to be combined with another character (here ️).
\p{S} matches symbols (❤ in this case).
Demo
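In PHP the same two properties can also be used the other way round, deleting marks and symbols instead of matching everything else. This is only a sketch: strip_symbols is a hypothetical name, the "\u{...}" escapes assume PHP 7 or later, and \p{S} also removes currency and math signs while \p{M} also removes combining accents in decomposed text, so it may be too aggressive for some input.

```php
<?php
// Hypothetical helper: removes symbols (\p{S}) and combining marks (\p{M}).
function strip_symbols(string $text): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to ''.
    return preg_replace('/[\p{M}\p{S}]+/u', '', $text) ?? '';
}

// U+2764 (heavy black heart) followed by U+FE0F (variation selector-16).
$heart = "\u{2764}\u{FE0F}";
echo strip_symbols($heart . 'test' . $heart), PHP_EOL; // test
```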
I think that your post's title does not match its body.
There is virtually no overlap between emoji and alphanumeric characters.
There are a couple of keycap emoji, but since their sequences beyond
the first digit don't overlap the alphanumerics, it's enough to put
a negative lookahead in front of the alphanumeric class.
'~(?![0-9]\x{FE0F}\x{20E3}|\x{2139})[\pL\pN]+~u'
https://regex101.com/r/1JcUqY/1

Regex for validating and sanitizing all english and non-english unicode alphabet characters in PHP

While there have been many questions regarding the non-english characters regex issue I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library which would help me to filter non-english input.
Could you please suggest me a regular expression which would allow
all english alphabet characters (abc...)
all non-english alphabet characters (šýüčá...)
spaces
case insensitive
in validation as well as sanitization. Essentially, I want either preg_match to return false when the input contains anything else than the 4 points above or preg_replace to get rid of everything except these 4 categories.
I was able to create
'/^((\p{L}\p{M}*)|(\p{Cc})|(\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works well when validating input but not when sanitizing it.
EDIT:
User enters 'český [jazyk]' as an input. Using '/^[\p{L}\p{Zs}]+$/u' in preg_match, the script determines that the string contains unallowed characters (in this case '[' and ']'). Next I would like to use preg_replace, to delete those unwanted characters. What regular expression should I pass into preg_replace to match all characters that are not specified by the regular expression stated above?
I think all you need is a character class like:
^[\p{L}\p{Zs}]+$
It means: The whole string (or line, with (?m) option) can only contain Unicode letters or spaces.
Have a look at the demo.
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
To remove all symbols that are not Unicode letters or spaces, use this code:
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);
The output of the sample program:
český jazyk
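The two patterns above can be wrapped into a validate-then-sanitize pair. This is a minimal sketch; the function names is_valid_name and sanitize_name are hypothetical and not part of any library.

```php
<?php
// Hypothetical helpers built on the two patterns from the answer above.

// True when the whole string consists of Unicode letters and space separators.
function is_valid_name(string $input): bool
{
    return preg_match('/^[\p{L}\p{Zs}]+$/u', $input) === 1;
}

// Removes every character that is not a Unicode letter or space separator.
function sanitize_name(string $input): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to ''.
    return preg_replace('/[^\p{L}\p{Zs}]+/u', '', $input) ?? '';
}

$raw = 'český [jazyk]';
$clean = is_valid_name($raw) ? $raw : sanitize_name($raw);
echo $clean, PHP_EOL;
```

The validation pattern is anchored with ^ and $, while the sanitizing pattern is a negated class without anchors, so it can remove unwanted characters anywhere in the string.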

Search for repeated arabic (hindi) numerals in a string

I am trying to determine whether a given string contains more than 4 consecutive Arabic (Hindi) numerals. To be specific, Arabic (Hindi) numerals are:
١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
which are Unicode code points U+0661 to U+0669
I tried :
if (preg_match("/\b(?:(?:١|٢|٣|٤|٥|٦|٧|٨|٩)\b\s*?){4}/", $str, $matches) > 0)
return true;
But it doesn't work at all (always returns false).
You can try the following regular expression. \p{N} matches any kind of numeric character in any script.
preg_match('~(?:\p{N}\s?){4,}~u', $str, $matches)
If you just want to match those specific characters, you could use the following instead.
preg_match('~(?:[\x{0660}-\x{0669}]\s?){4,}~u', $str, $matches)
Use a character class and quantify it. See this regex:
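/[١٢٣٤٥٦٧٨٩]{4,}/u
The u modifier is needed so that the class is read as characters rather than as individual bytes. Without it, your characters are not treated as word characters either, so \b never finds a word boundary next to them; remove it.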
Here is a regex demo.
As a note, if you are matching more than 4 characters, use {5,} instead.
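Put together as a function, the check might look like the sketch below. The name has_arabic_digit_run and the $min parameter are hypothetical; the range U+0660 to U+0669 includes the zero digit, which is an assumption beyond the nine digits listed in the question, and unlike the first answer this variant does not allow whitespace between digits.

```php
<?php
// Hypothetical helper: true when $text contains at least $min consecutive
// Arabic-Indic digits (U+0660 to U+0669, zero included by assumption).
function has_arabic_digit_run(string $text, int $min = 4): bool
{
    // \x{HHHH} is resolved by PCRE; the u modifier enables UTF-8 mode.
    $pattern = '/[\x{0660}-\x{0669}]{' . $min . ',}/u';

    return preg_match($pattern, $text) === 1;
}

// Four digits in a row, then only three.
var_dump(has_arabic_digit_run("abc \u{0661}\u{0662}\u{0663}\u{0664} def")); // bool(true)
var_dump(has_arabic_digit_run("abc \u{0661}\u{0662}\u{0663} def"));         // bool(false)
```

Calling it with $min set to 5 covers the "more than 4" reading of the question.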

PHP preg_match: any letter but no numbers (and symbols)

I am trying to set a validation rule for a field in my form that checks that the input only contains letters.
At first I tried to make a function that returned true if there were no numbers in the string, for that I used preg_match:
function my_format($str)
{
return preg_match('/^([^0-9])$', $str);
}
It doesn't matter how many times I look at the php manual, it seems like I won't get to understand how to create the pattern I want. What's wrong with what I made?
But I'd like to extend the question: I want the input text to contain any letter but no numbers nor symbols, like question marks, exclamation marks, and all those you can imagine. BUT the letters I want are not only a-z, I want letters with all kinds of accents, as those used in Spanish, Portuguese, Swedish, Polish, Serbian, Icelandic...
I guess this is no easy task and hard or impossible to do with preg_match. Is there any library that covers my exact needs?
If you're using UTF-8 encoded input, use a Unicode regex with the u modifier.
This one would match a string that only consists of letters and any kind of whitespace/invisible separators:
preg_match('~^[\p{L}\p{Z}]+$~u', $str);
function my_format($str)
{
return preg_match('/^\p{L}+$/u', $str);
}
Simpler than you think!
\p{L} matches any kind of letter from any language
First of all, Merry Christmas.
You are on the right track with the first one, just missing a + to match one or more non-number characters:
preg_match('/^([^0-9]+)$/', $str);
As you can see, 0-9 is a range, from number 0 to 9. The same applies to other cases such as a-z or A-Z: inside a character class, '-' is special and indicates a range. For 0-9 you can use the shorthand \d:
preg_match('/^([^\d]+)$/', $str);
For symbols, if your list is punctuations . , " ' ? ! ; : # $ % & ( ) * + - / < > = # [ ] \ ^ _ { } | ~, there is a shorthand.
preg_match('/^([^[:punct:]]+)$/', $str);
Combined you get:
preg_match('/^([^[:punct:]\d]+)$/', $str);
Use the [:alpha:] POSIX character class.
function my_format($str) {
return preg_match('/^[[:alpha:]]+$/u', $str);
}
The outer [] makes a character class that contains the POSIX class, and the + quantifier matches one or more alphabetic characters. The ^ and $ anchors make sure the whole string consists of letters; without them the pattern would match any string that contains at least one letter. With the u modifier, [:alpha:] matches accented characters as well.
If you want to include whitespace, just add \s to the class:
preg_match('/^[[:alpha:]\s]+$/u', $str);
EDIT: Sorry, I misread your question when I looked over it a second time and thought you wanted punctuation. I've taken it back out.
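For validation it matters whether the pattern is anchored. The sketch below uses \p{L} for the letter class (my choice for illustration, not the answer's [:alpha:]) and compares the same class with and without ^ and $; the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// Same character class, with and without anchors.
$unanchored = '/\p{L}+/u';   // succeeds if the string contains any letter
$anchored   = '/^\p{L}+$/u'; // succeeds only if every character is a letter

var_dump(preg_match($unanchored, 'abc123'));     // int(1)
var_dump(preg_match($anchored, 'abc123'));       // int(0)
var_dump(preg_match($anchored, "se\u{00F1}or")); // int(1), U+00F1 is a letter
```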

regular expression to detect consecutive numbers - not working for non-English input

Hi all, I have this code that checks for 5 or more consecutive digits:
if (preg_match("/\d{5}/", $input, $matches) > 0)
return true;
It works fine for input that is English, but it's tripping up when the input string contains Arabic/multibyte characters - it returns true sometimes even if there aren't numbers in the input text.
Any ideas ?
You appear to be using PHP.
Do this:
if (preg_match("/\d{5}/u", $input, $matches) > 0)
return true;
Note the 'u' modifier at the end of expression. It tells preg_* to use unicode mode for matching.
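One way to see what the modifier changes is to count what the dot matches in a multibyte string. This is a small illustration rather than a fix for the question; the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
$subject = "\u{00E9}";

// Without u the subject is read byte by byte; with u, code point by code point.
$bytes  = preg_match_all('/./', $subject);
$points = preg_match_all('/./u', $subject);

var_dump($bytes);  // int(2)
var_dump($points); // int(1)
```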
You have to set yourself up properly when you want to deal with UTF-8.
You can recompile php with the PCRE UTF-8 flag enabled.
Or, you can add the sequence (*UTF8) to the start of your regex. For example:
/(*UTF8)[[:alnum:]]/, input é, output TRUE
/[[:alnum:]]/, input é, output FALSE.
Check out http://www.pcre.org/pcre.txt, which contains lots of information about UTF-8 support in the PCRE library.
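In PCRE itself, even in UTF-8 mode, predefined character classes like \d and [[:digit:]] match only ASCII characters unless Unicode property support (the PCRE_UCP option) is also switched on, and whether the u modifier does that depends on the PHP version and build. To match potentially non-ASCII digits reliably, use the equivalent Unicode property, \p{Nd}: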
$s = "12345\xD9\xA1\xD9\xA2\xD9\xA3\xD9\xA4\xD9\xA5";
preg_match_all('~\p{Nd}{5}~u', $s, $matches);
See it in action on ideone.com
If you need to match specific characters or ranges, you can either use the \x{HHHH} escape sequence with the appropriate code points:
preg_match_all('~[\x{0661}-\x{0665}]{5}~u', $s, $matches);
...or use the \xHH form to input their UTF-8 encoded byte sequences:
preg_match_all("~[\xD9\xA1-\xD9\xA5]{5}~u", $s, $matches);
Notice that I switched to double-quotes for this last example. The \p{} and \x{} forms were passed through to be processed by the regex compiler, but this time we want the PHP compiler to expand the escape sequences. That doesn't happen in single-quoted strings.
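The quoting difference is easy to check with strlen, and the \p{Nd} example can be run on the same sample string. The variable names below are only for illustration, and the sketch assumes PCRE was built with Unicode property support.

```php
<?php
// Double quotes: PHP turns each \xHH into one byte before PCRE sees it.
$expanded = "\xD9\xA1";
// Single quotes: the backslashes stay, so the string is 8 ASCII characters.
$literal = '\xD9\xA1';

var_dump(strlen($expanded)); // int(2)
var_dump(strlen($literal));  // int(8)

// \p{Nd} with the u modifier matches both ASCII and Arabic-Indic digits.
$s = "12345\xD9\xA1\xD9\xA2\xD9\xA3\xD9\xA4\xD9\xA5";
$count = preg_match_all('~\p{Nd}{5}~u', $s, $matches);
var_dump($count); // int(2)
```

The first match is the ASCII run 12345 and the second is the five Arabic-Indic digits that follow it.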
