How to match Japanese regular expression with u flag?

Something weird is happening with preg_match when I feed it the following strings. I am using the 'u' flag because I am trying to match a mixed Japanese string.
<?php
$subject="/hello/カメラ/";
$pattern='#^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/#u';
$result=preg_match($pattern,$subject);
echo $result; // 1
$subject="/hello/カレンダー/";
$pattern='#^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/#u';
$result=preg_match($pattern,$subject);
echo $result; // 0
?>
Notice that both $subject strings have the same structure, '/hello/katakana/', and the pattern is identical in both cases. Why, then, is the first $result 1 and the second one 0?
Is that a bug?
Update:
I am running PHP Version 5.5.24 on a Mac.

Many thanks to David Vartanian for his help.
To make the regular expression work for both cases, I had to update the pattern as follows.
$pattern='#^/hello/([\x{30A0}-\x{30FF}\x{3040}-\x{309F}\x{4E00}-\x{9FBF}\w\-]+)/#u';
However, as chris mentioned, the original pattern seems to work on PHP 5.5.9 and newer.
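For anyone who wants to check this on their own installation, here is a minimal sketch that runs the explicit-range pattern against both subjects. One likely explanation for the original behaviour (an assumption on my part, not something confirmed in this thread) is that the prolonged sound mark ー (U+30FC) has the Unicode script Common rather than Katakana, so \p{Katakana} alone does not cover it, and whether \w picks it up depends on the PCRE build. The explicit range \x{30A0}-\x{30FF} includes it either way.

```php
<?php
// Explicit code point ranges, as in the updated pattern above:
// Katakana (U+30A0-U+30FF), Hiragana (U+3040-U+309F), CJK ideographs (U+4E00-U+9FBF).
$pattern = '#^/hello/([\x{30A0}-\x{30FF}\x{3040}-\x{309F}\x{4E00}-\x{9FBF}\w\-]+)/#u';

$subjects = array('/hello/カメラ/', '/hello/カレンダー/');
$results = array();
foreach ($subjects as $subject) {
    // Store the captured path segment, or null when the pattern does not match.
    $results[$subject] = preg_match($pattern, $subject, $m) ? $m[1] : null;
}
print_r($results);
```

Both entries should contain the captured katakana word. The file has to be saved as UTF-8 for the u modifier to accept the subjects.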

You could use mb_ereg_match(). This function is designed for multibyte regular expressions; do not confuse it with the deprecated ereg_* functions. To use it, just remove the delimiters and the u modifier.
<?php
$subject="/hello/カメラ/";
$pattern='^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/';
$result = mb_ereg_match($pattern, $subject);
echo "<pre>";
print_r($result);
$subject="/hello/カレンダー/";
$pattern='^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/';
$result = mb_ereg_match($pattern, $subject);
print_r($result);

Related

Fixed-length regex lookbehind complains of variable-length lookbehind

Here is the code I am trying to run:
$str = 'a,b,c,d';
return preg_split('/(?<![^\\\\][\\\\]),/', $str);
As you can see, the regexp being used here is:
/(?<![^\\][\\]),/
Which is a simple fixed-length negative lookbehind for "preceded by something that isn't a backslash, then something that is!".
This regex works just fine on http://www.phpliveregex.com
But when I go and actually attempt to run the above code, I am spat back the error:
Warning: preg_split() [function.preg-split]: Compilation failed: lookbehind assertion is not fixed length at offset 13
To make matters worse, a fellow programmer tested the code on his 5.4.24 PHP server, and it worked fine.
This leads me to believe that my issues are related to the configuration of my server, which I have very little control over. I am told that my PHP version is 5.2.*
Are there any workarounds/alternatives to preg_split() that might not have this issue?
The problem is caused by a bug that was fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as
fixed-length in lookbehind assertions such as (?<=[^f]), leading to an
incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in November 2006. Since you still hit this bug, that PCRE version is apparently not installed on your server. For a preg_split-based workaround, you therefore have to use a pattern without a negated character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if the , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
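To make the odd-number-of-backslashes problem concrete, here is a small sketch that runs the first workaround pattern against a comma preceded by one, two, and three backslashes. I write \x5c for the backslash inside the regex to keep the escaping readable; the test subjects are made up for illustration.

```php
<?php
// Same lookbehind as the first workaround, with \x5c standing in for a backslash.
$pattern = '/(?<!(?<!\x5c)\x5c),/';

$pieces = array();
foreach (array(1, 2, 3) as $n) {
    // Build "a" + n backslashes + ",b".
    $subject = 'a' . str_repeat('\\', $n) . ',b';
    // Count how many pieces the split produces.
    $pieces[$n] = count(preg_split($pattern, $subject));
}
print_r($pieces);
```

With one backslash the string stays in one piece, and with two it is split, both as intended. With three backslashes it is also split, although the comma is escaped there. That is the case a fixed-length lookbehind cannot handle.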
On second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
I think this covers all the cases; the trailing backslashes were the tricky part.
Here is a way to avoid the negated character class (I write \x5c instead of a lot of backslashes to make it clearer):
$result = preg_split('/(?<!(?!\x5c).\x5c),/s', $str);
About the approach itself:
If you are trying to split on commas that are not escaped, a lookbehind is the wrong tool, since you can't check an undefined number of backslashes before the comma. You have several possibilities to solve this problem:
$result = preg_split('/(?:[^\x5c]|\A)(?:\x5c.)*\K,/s', $str);
or
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
or for PHP > 5.2.4
$result = preg_split('/\x5c{2}(*SKIP)(?!)|(?<!\x5c),/s', $str);
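As a quick check of the second variant (the one using (?<!\x5c) and \K), here is a minimal sketch with a made-up test string. It assumes a PCRE version that supports \K.

```php
<?php
// Subject: a,b\,c\\,d
// The comma after "b" is escaped; the comma after "c" follows an escaped backslash.
$str = 'a,b\\,c\\\\,d';

// (?:\x5c.)* consumes backslash-escaped pairs, \K drops them from the match,
// so only an unescaped comma acts as the separator.
$parts = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
print_r($parts);
```

The escaped comma stays inside the middle element, while the comma after the two backslashes splits the string.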
I think you are using an older PHP version, since your error occurs on PHP 5.1.6 or lower. On the other hand, the pattern works on PHP 5.2.16 or higher.

PHP regex lookahead not working as expected

I'm trying to match a version number like 2.3.3 Release fhf47fh and strip out the periods to get a desired result of 233.
I'm using the pattern /\d+(?=\.)\d+(?=\.)\d+/ with preg_match.
The lookahead for the period does not seem to work as expected.
thanks!
If you're looking to compare versions, you can split on the space and then use version_compare().
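A minimal sketch of that idea, assuming the version always comes first and is followed by a space. The helper name extractVersion and the comparison target 2.10.0 are made up for illustration.

```php
<?php
// Hypothetical helper: everything before the first space is taken as the version.
function extractVersion($label)
{
    $parts = explode(' ', trim($label), 2);
    return $parts[0];
}

$current = extractVersion('2.3.3 Release fhf47fh');

// version_compare() compares segment by segment, so 2.3.3 is older than 2.10.0.
$isOlder = version_compare($current, '2.10.0', '<');
var_dump($current, $isOlder);
```

Comparing this way avoids the ambiguity of collapsed numbers, where 2.3.3 and 23.3 would both become 233.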
If you just want the numeric representation, use preg_replace() to strip all non-digits from the version string.
$version = '2.3.3 Release';
echo preg_replace('/\D+/', '', $version);
This seemed to work for all my test cases.
preg_replace('/^(\d+)\.(\d+)\.(\d+).*$/', '$1$2$3', $version);
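To see how these options behave on the full string from the question, here is a short sketch. It also shows why the original pattern fails: a lookahead is zero-width, so the period is checked but never consumed.

```php
<?php
$version = '2.3.3 Release fhf47fh';

// (?=\.) asserts a period but does not consume it, so the next \d+
// is tried against "." and the overall match fails.
$lookahead = preg_match('/\d+(?=\.)\d+(?=\.)\d+/', $version);

// Stripping every non-digit also keeps the digits of the trailing hash.
$allDigits = preg_replace('/\D+/', '', $version);

// Capturing the three numeric parts and discarding the rest gives 233.
$compact = preg_replace('/^(\d+)\.(\d+)\.(\d+).*$/', '$1$2$3', $version);

var_dump($lookahead, $allDigits, $compact);
```

So the plain \D+ replacement is only safe when nothing after the version contains digits; the anchored pattern is the safer choice for strings like this one.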
I'd use something like this:
$pattern = '~\d+(?:\.*\d*)*~';
For the string you've provided in the question:
if (preg_match('~\d+(?:\.*\d*)*~', $version, $matches))
    echo $matches[0]; // => 2.3.3

PHP Regular Expression for Bengali Word/Sentence

I am developing a web application using PHP 5.3.x. Everything is working fine, but I am unable to solve a regular expression problem with Bengali text. Here is my code:
$value = '\u09AC\u09BE\u0982\u09B2\u09BE\u09A6\u09C7\u09B6';
$value = mb_convert_encoding($value, 'UTF-8', 'UTF-16BE');
//$value = 'বাংলাদেশ';
//$value = 'Bangladesh';
$pattern = '/^[\p{Bengali}]{0,100}$/';
//$pattern = '/^[\p{Latin}]{0,45}$/';
echo preg_match($pattern, $value);
Whether I pass a Bengali word or not, it always returns 0 (no match). In a Java EE application I used this regular expression:
\p{InBengali}
But in PHP it is not working! How do I solve this problem?
Maybe this will help you:
The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
From regex in Unicode
Just append u to the expression, as follows:
$value = 'বাংলাদেশ';
//$pattern = '/^[\p{Bengali}]{0,100}$/'; // wrong: no u modifier
$pattern = '/^[\p{Bengali}]{0,100}$/u'; // right
echo preg_match($pattern, $value);
I hope this helps anyone facing the same problem.
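A minimal sketch that ties this back to the code in the question. In the original snippet the \uXXXX escapes sit inside a single-quoted string, so PHP never interprets them. Using json_decode() to turn them into a UTF-8 string is my own suggestion here, not something from the answers above.

```php
<?php
// json_decode() turns JSON-style \uXXXX escapes into a UTF-8 string.
// The single quotes keep the backslashes literal for PHP itself.
$value = json_decode('"\u09AC\u09BE\u0982\u09B2\u09BE\u09A6\u09C7\u09B6"');

// With the u modifier, pattern and subject are treated as UTF-8.
$pattern = '/^[\p{Bengali}]{0,100}$/u';

$bengaliMatches = preg_match($pattern, $value);
$latinMatches   = preg_match($pattern, 'Bangladesh');
var_dump($value, $bengaliMatches, $latinMatches);
```

The decoded value matches the pattern, while the Latin string does not.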

php use regex with mb_split to achieve str_split functionality with UTF8

I'm trying to split a string into an array.
I've tried str_split(), but the problem is that characters like "äüöÄÜÖß" don't work (they become question marks).
So I'm trying to do the same with mb_split(), but I don't know how to get the right Regex for it.
Can you please help me?
Here is the code:
$arr = mb_split("\.", $str);
You might try:
$arr = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
For the /u modifier, see http://php.net/manual/en/reference.pcre.pattern.modifiers.php :
"u (PCRE8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5."
ok. that's it:
$arr = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
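For reference, a small sketch of what this produces for the characters mentioned in the question, assuming the source file is saved as UTF-8:

```php
<?php
$str = 'äüöÄÜÖß';

// The empty pattern matches between every pair of characters; with the u
// modifier a "character" is a UTF-8 code point rather than a single byte.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars);
```

PREG_SPLIT_NO_EMPTY drops the empty strings that would otherwise appear at the start and end of the result.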

PHP Split Problem

I am trying to use (and I've tried both) preg_split() and split() and neither of these have worked for me. Here are the attempts and outputs.
preg_split("^", "ItemOne^ItemTwo^Item.Three^");
//output - null or false when attempting to implode() it.
preg_split("\^", "ItemOne^ItemTwo^Item.Three^");
//output - null or false when attempting to implode() it. Attempted to escape the needle.
//SAME THING WITH split().
Thanks for your help...
Christian Stewart
split() is deprecated. You should use explode():
$arr = explode('^', "ItemOne^ItemTwo^Item.Three^");
Try
explode("^", "ItemOne^ItemTwo^Item.Three^");
since your search pattern isn't a regex.
Are you sure you're not just looking for explode?
explode('^', 'ItemOne^ItemTwo^Item.Three^');
Since you are using preg_split, you are trying to split the string by a regular expression. The circumflex (^) is a regular expression metacharacter, so it does not work unescaped in your example. Note that preg_split() also requires delimiters around the pattern, e.g. '/\^/'.
btw: preg_split is an alternative to split and not deprecated.
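To put the answers side by side, here is a short sketch of both approaches on the string from the question. The variable names are mine.

```php
<?php
$input = 'ItemOne^ItemTwo^Item.Three^';

// Literal separator: no regular expression needed.
$viaExplode = explode('^', $input);

// Regex version: the pattern needs delimiters (here "/") and an escaped caret.
// PREG_SPLIT_NO_EMPTY drops the empty piece after the trailing "^".
$viaPregSplit = preg_split('/\^/', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($viaExplode);
print_r($viaPregSplit);
```

Note that explode() keeps an empty string as the last element because the input ends with the separator, so filter it out if you don't want it.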
