regular expression to detect numbers written as words - UTF-8 input - php

Thanks for the answers to "regular expression to detect numbers written as words".
I now have this working. However, I have the same requirement for number words written in Arabic (or any other language in UTF-8 input) rather than English, so:
if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0)
return true;
This does not work. I've googled, and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help much appreciated.

Note that \b may not be working as you expect. \b specifies a word boundary, but what is considered a word character by PCRE depends on what locale the script is running in (take a look towards the bottom of the PCRE escape sequences manual page):
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).
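Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter". Note that it requires a previous character to exist, so it cannot match at the very start of the string; the negative form (?<!\p{L}) also succeeds there. The u modifier is needed so that \P{L} is tested against whole characters rather than single bytes.
So all together it would look like:
/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u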
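The lookaround idea can be sketched as a small function. This is only an illustration under stated assumptions: the name hasFourNumberWords is hypothetical, the input is assumed to be valid UTF-8, and negative lookarounds (?<!\p{L}) and (?!\p{L}) are used on both sides of each word so that a match touching the start or end of the string is still accepted.

```php
<?php
// Hypothetical helper: true when $str contains four consecutive Arabic
// number words, separated by optional whitespace. Assumes UTF-8 input.
function hasFourNumberWords(string $str): bool
{
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';
    // (?<!\p{L}) : not preceded by a letter (also true at start of string)
    // (?!\p{L})  : not followed by a letter (also true at end of string)
    $pattern = '/(?<!\p{L})(?:(?:' . $words . ')(?!\p{L})\s*){4}/u';
    return preg_match($pattern, $str) === 1;
}

var_dump(hasFourNumberWords('واحد اثنان ثلاثة أربعة')); // four number words
var_dump(hasFourNumberWords('لدي أربعة أولاد'));        // only one number word
```

The first call should report true and the second false, since the second sentence contains a single number word.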

Convert both the pattern and $str to windows-1256, do the matching, then convert the $matches items back (if needed). This is the solution I came to after struggling with this for some time.
$pattern = "/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/";
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);
$str_windows1256 = iconv('utf-8', 'windows-1256', $str);
if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0)
    return true;
Here's a test example to check whether converting from UTF-8 lets Arabic letters match in preg_match:
<?php
$pattern = "/(واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)/";
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);
$test_cases = array(
    'لدي أربعة أولاد',
    'قفز الثعلب فوق الشجرة',
    'عندي خمسة أرانب',
);
foreach ($test_cases as $str) {
    $str_windows1256 = iconv('utf-8', 'windows-1256', $str);
    if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0) {
        echo $str, '<br />';
    }
}
When executed, it will output:
لدي أربعة أولاد
عندي خمسة أرانب
I removed some of the pattern to check if the plain check against Arabic works, which seems to be working.

You can use the pattern modifier u, which makes PCRE treat both the pattern and the subject string as UTF-8, so it works for any language that can be written in UTF-8.
if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/u", $str, $matches) > 0)
Resources :
Pattern modifiers
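To see what the u modifier changes, here is a minimal sketch that counts how many times a lone dot matches inside a short Arabic word. The variable names are only illustrative, and the source file is assumed to be saved as UTF-8.

```php
<?php
$word = 'صفر'; // three Arabic letters, two bytes each in UTF-8

$bytes  = preg_match_all('/./',  $word); // no u: the dot matches one byte at a time
$points = preg_match_all('/./u', $word); // u: the dot matches one code point at a time

echo $bytes, ' ', $points, PHP_EOL; // prints "6 3"
```

Without the modifier the engine sees six unrelated bytes, which is why \b, \w and character classes misbehave on such input.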

Related

Php mb_ereg_match faulty match

I am trying to match some text with PHP's mb_ereg_match, and I am using this regex to match all non-word chars:
/[^-\w.]|[_]/u
I want to be able to match Unicode chars; that's why I am using mb_ereg.
With this input:
'γιωρ;γος.gr'
which contains chars from the Greek alphabet.
I want to match the ';' and, if it is matched, return -1; otherwise return the input.
Whatever I do, it doesn't match the ';' and returns the input.
I tried to use preg_match, but it doesn't work as I want.
Any suggestions?
Edit 1
I did a test and found that it matches correctly if I change my input to:
';γος.gr'
Also works fine with latin chars.
Edit 2
If I get one of the following I want to print -1.
'γιωρ;γος.gr'
';γος.gr'
'γιωρ;.gr'
';.gr'
Else I want to get whatever the input is.
Edit 3
I did some more tests, and it doesn't match any special char that is surrounded by UTF-8 chars.
You need to use \X with preg_match_all to match all Unicode chars:
\X
- an extended Unicode sequence
Also, see this \X description from Regular-Expression.info:
Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, and Ruby 2.0: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
And you can use the following snippet then:
$re = '/\X/u';
$str = "γιωρ;γος.gr";
preg_match_all($re, $str, $matches);
if (in_array(";", $matches[0])) {
    echo -1;
}
else {
    print_r($matches[0]);
}
See IDEONE demo
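As an alternative sketch of the check described in the question, preg_match with the u modifier and Unicode properties can reject the input directly, without splitting it into graphemes first. The function name checkName is hypothetical, and the allowed set (Unicode letters and digits, hyphen, dot) is an assumption read off the question's pattern.

```php
<?php
// Hypothetical helper: returns -1 when $input contains any character that is
// not a Unicode letter, a Unicode digit, a hyphen or a dot; otherwise returns
// $input unchanged. The underscore is rejected too, as in the question.
// Assumes $input is valid UTF-8 (preg_match returns false otherwise).
function checkName(string $input)
{
    return preg_match('/[^\p{L}\p{N}.-]/u', $input) === 1 ? -1 : $input;
}

foreach (['γιωρ;γος.gr', ';γος.gr', 'γιωρ;.gr', ';.gr', 'γιωργος.gr'] as $name) {
    var_dump(checkName($name));
}
```

The four inputs containing ';' should yield -1, and the last one should come back unchanged.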

php preg_match get word with cyrillic characters

I am trying to get a word from a string, but the word may contain Cyrillic characters. Nothing I have tried works.
Please help me.
My code:
$str= "Продавец:В KrossАдын рассказать друзьям var addthis_config = {'data_track_clickback':true};";
$pattern = '/\s(\w*|.*?)\s/';
preg_match($pattern, $str, $matches);
echo $matches[0];
I need to get KrossАдын.
Thanks!
You can change the meaning of \w by using the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:
$pattern = '/\s(\w*|.*?)\s/u';
Note that the alternation in the pattern is pointless:
the second branch can match everything the first one can (everything matched by \w* can also be matched by .*?, because there is a whitespace on the right). Both subpatterns simply match the characters between two whitespaces.
Writing $pattern = '/\s(.*?)\s/u'; does exactly the same, or better:
$pattern = '/\s(\S*)\s/u';
which avoids the lazy quantifier.
If your goal is only to match ASCII and Cyrillic letters, the most efficient pattern (for character classes, the smaller the faster) will be:
$pattern = '~(*UTF8)[a-z\p{Cyrillic}]+~i';
(*UTF8) tells the regex engine that the subject string must be read as a UTF-8 string.
\p{Cyrillic} is a character class that contains only Cyrillic letters.
The issue is that your string contains non-ASCII UTF-8 characters, which \w will not match by default. Check this answer on StackOverflow for a solution: UTF-8 in PHP regular expressions
Essentially, you'll want to add the u modifier at the end of your expression, and use \p{L} instead of \w.
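A short sketch of the Cyrillic character class in use, on a shortened version of the question's string. It uses the u modifier rather than the (*UTF8) prefix, and the variable names are only illustrative.

```php
<?php
$str = 'Продавец:В KrossАдын рассказать друзьям';

// Collect every run of ASCII or Cyrillic letters; u makes the class work on
// code points, i makes [a-z] case-insensitive.
preg_match_all('~[a-z\p{Cyrillic}]+~iu', $str, $matches);

print_r($matches[0]);
echo $matches[0][2], PHP_EOL; // the third run is the mixed Latin/Cyrillic word
```

Picking the third element is specific to this sample string; a real extraction would need a rule for which word is wanted.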

Search for repeated arabic (hindi) numerals in a string

I am trying to determine whether a given string contains more than 4 consecutive Arabic (Hindi) numerals. To be specific, the Arabic (Hindi) numerals are:
١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
which are Unicode code points U+0661 to U+0669.
I tried :
if (preg_match("/\b(?:(?:١|٢|٣|٤|٥|٦|٧|٨|٩)\b\s*?){4}/", $str, $matches) > 0)
return true;
But it doesn't work at all (always returns false).
You can try the following regular expression. \p{N} matches any kind of numeric character in any script.
preg_match('~(?:\p{N}\s?){4,}~u', $str, $matches)
If you just want to match those specific characters, you could use the following instead.
preg_match('~(?:[\x{0660}-\x{0669}]\s?){4,}~u', $str, $matches)
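Use a character class and quantify it, with the u modifier so that the class contains the digits themselves rather than their individual bytes. See this regex:
/[١٢٣٤٥٦٧٨٩]{4,}/u
In your pattern the digits are not treated as word characters, so \b, which needs a word character on one side of the position, never matches around them; remove it.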
Here is a regex demo.
As a note, if you are matching more than 4 characters, use {5,} instead.
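The suggestions above can be combined into a small sketch. The function name hasArabicDigitRun and its $min parameter are hypothetical; the range U+0660 to U+0669 includes the Arabic-Indic zero as well as the nine digits listed in the question.

```php
<?php
// Hypothetical helper: true when $str contains at least $min Arabic-Indic
// digits (U+0660 to U+0669) in a row, each optionally followed by one
// whitespace character. Assumes UTF-8 input.
function hasArabicDigitRun(string $str, int $min = 4): bool
{
    // Single quotes: \x{....} is left for PCRE to interpret.
    $pattern = '~(?:[\x{0660}-\x{0669}]\s?){' . $min . ',}~u';
    return preg_match($pattern, $str) === 1;
}

var_dump(hasArabicDigitRun('هاتف ١٢٣٤٥'));   // five digits in a row
var_dump(hasArabicDigitRun('عندي ٣ أرانب')); // a single digit
```

Pass 5 as the second argument to require strictly more than four digits.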

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters, I think it is easier to first encode to UTF-8, apply the regex with the u modifier (Unicode), and then, if you need the original encoding, decode again (according to your locale). Note that utf8_encode assumes ISO-8859-1 input and is deprecated as of PHP 8.2, where mb_convert_encoding is the usual replacement. So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode back to your original encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
Your \b word boundaries trip over the ö because, without the u modifier, it is not treated as an alphanumeric character. By default PCRE works on ASCII letters.
If the input string is UTF-8, use the /u Unicode modifier so that non-English letters are treated as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
By the way, I would use preg_replace_callback (or /e, which has since been removed in PHP 7) and search for [A-Z] to do the lowercasing. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
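Putting the pieces together, here is a minimal sketch for the slug from the question. It assumes the input is already UTF-8 and that the mbstring extension is available; the function name removeShortWords is hypothetical. Negative lookarounds stand in for \b so that the match does not depend on how \b treats ö.

```php
<?php
// Hypothetical helper: lowercases a UTF-8 slug and drops words of one or two
// characters. Assumes UTF-8 input and the mbstring extension.
function removeShortWords(string $kw): string
{
    $kw = mb_strtolower($kw, 'UTF-8');
    // One or two letters/digits with no letter/digit directly before or after.
    $kw = preg_replace('/(?<![\p{L}\p{N}])[\p{L}\p{N}]{1,2}(?![\p{L}\p{N}])/u', '', $kw);
    $kw = preg_replace('/-+/', '-', $kw); // collapse runs of hyphens
    return trim($kw, '-');
}

echo removeShortWords('Test-Tes-Te-T-Schönheit-Test'), PHP_EOL;
// prints "test-tes-schönheit-test"
```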

regular expression to detect consecutive numbers - not working for non-English input

Hi all, I have this code that checks for 5 or more consecutive digits:
if (preg_match("/\d{5}/", $input, $matches) > 0)
return true;
It works fine for input that is English, but it's tripping up when the input string contains Arabic/multibyte characters - it returns true sometimes even if there aren't numbers in the input text.
Any ideas ?
You appear to be using PHP.
Do this:
if (preg_match("/\d{5}/u", $input, $matches) > 0)
return true;
Note the 'u' modifier at the end of expression. It tells preg_* to use unicode mode for matching.
You have to set yourself up properly when you want to deal with UTF-8.
You can recompile php with the PCRE UTF-8 flag enabled.
Or, you can add the sequence (*UTF8) to the start of your regex. For example:
/(*UTF8)[[:alnum:]]/, input é, output TRUE
/[[:alnum:]]/, input é, output FALSE.
Check out http://www.pcre.org/pcre.txt, which contains lots of information about UTF-8 support in the PCRE library.
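In PCRE itself, even in UTF-8 mode, predefined character classes like \d and [[:digit:]] match only ASCII characters unless the UCP option is also set (current PHP versions set it together with the u modifier). To match potentially non-ASCII digits regardless of that setting, use the equivalent Unicode property, \p{Nd}: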
$s = "12345\xD9\xA1\xD9\xA2\xD9\xA3\xD9\xA4\xD9\xA5";
preg_match_all('~\p{Nd}{5}~u', $s, $matches);
See it in action on ideone.com
If you need to match specific characters or ranges, you can either use the \x{HHHH} escape sequence with the appropriate code points:
preg_match_all('~[\x{0661}-\x{0665}]{5}~u', $s, $matches);
...or use the \xHH form to input their UTF-8 encoded byte sequences:
preg_match_all("~[\xD9\xA1-\xD9\xA5]{5}~u", $s, $matches);
Notice that I switched to double-quotes for this last example. The \p{} and \x{} forms were passed through to be processed by the regex compiler, but this time we want the PHP compiler to expand the escape sequences. That doesn't happen in single-quoted strings.
