PHP Regular Expression for Bengali Word/Sentence - php

I am developing a web application using PHP 5.3.x. Everything is working fine, but I am unable to solve a regular expression problem with Bengali text. Here is my code:
$value = '\u09AC\u09BE\u0982\u09B2\u09BE\u09A6\u09C7\u09B6';
$value = mb_convert_encoding($value, 'UTF-8', 'UTF-16BE');
//$value = 'বাংলাদেশ';
//$value = 'Bangladesh';
$pattern = '/^[\p{Bengali}]{0,100}$/';
//$pattern = '/^[\p{Latin}]{0,45}$/';
echo preg_match($pattern, $value);
Whether or not I pass a Bengali word, it never matches. In a Java EE application I used this regular expression:
\p{InBengali}
But in PHP it does not work. How do I solve this problem?

Maybe this will help you:
The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
From regex in Unicode

Just append the u modifier to the expression, as follows:
$value = 'বাংলাদেশ';
//$pattern = '/^[\p{Bengali}]{0,100}$/'; // wrong: no u modifier
$pattern = '/^[\p{Bengali}]{0,100}$/u'; //right
echo preg_match($pattern, $value);
I hope this helps anyone facing the same problem.
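A minimal sketch of the fix, for anyone who wants to try it in isolation. It assumes the input arrives as \uXXXX escapes, as in the question, and uses json_decode() to turn them into a real UTF-8 string (inside a single-quoted PHP literal those escapes are plain text, so mb_convert_encoding() from UTF-16BE does not decode them). The 'Bangladesh' comparison value is only there for contrast.

```php
<?php
// Decode the \uXXXX escapes by wrapping them in a JSON string literal.
$value = json_decode('"\u09AC\u09BE\u0982\u09B2\u09BE\u09A6\u09C7\u09B6"');

// The trailing u makes PCRE treat both pattern and subject as UTF-8.
$pattern = '/^[\p{Bengali}]{0,100}$/u';

$bengali = preg_match($pattern, $value);       // Bengali-script text
$latin   = preg_match($pattern, 'Bangladesh'); // Latin-script text

var_dump($bengali, $latin); // int(1), int(0)
```

Note that {0,100} also accepts an empty string; use {1,100} if at least one character is required.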

Related

Regular expression not working as intended when I use an emoji at the beginning of a string

My code is written in PHP. I am trying to store in my database subjects of the emails that I send, only after I remove the emojis that I include in the subject lines of those emails. I created this regular expression:
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/", "", $subject);
It works when I have the emoji at the end of the string, such as:
But if the emoji is at the beginning of the string, it does not work, and the entry is not even stored in my database:
Any issues that you can identify in my regular expression to achieve what I want?
UPDATE 1: Apparently the regular expression is just fine:
Add the "u" modifier to your regular expression to make it treat strings as UTF-8.
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/u", "", $subject);
Or use a built-in function to remove the Unicode characters from your string, e.g. iconv, utf8_decode, mb_convert_encoding, or recode.
$cleansubject = trim(iconv('UTF-8', 'ASCII//IGNORE', $subject));
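As a rough sketch of the u-modifier variant: the sample subject below is made up for illustration. One thing worth checking is the return value, because with the u modifier preg_replace() returns null when the subject is not valid UTF-8, and a null subject could plausibly cause trouble further down (for example at the database insert).

```php
<?php
// Hypothetical sample subject with an emoji at the start.
$subject = "🖥 Learning Online: Digital Marketing Course";

$stripped = preg_replace('/[^a-zA-Z0-9\s]/u', '', $subject);
if ($stripped === null) {
    // Typically means the subject was not valid UTF-8.
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

$cleansubject = trim($stripped);
echo $cleansubject, "\n"; // Learning Online Digital Marketing Course
```

The pattern also drops punctuation such as the colon, since only ASCII letters, digits, and whitespace are kept.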
This could be an encoding problem (3v4l example):
echo utf8_encode('⌨️,🖥,🖨, Learning Online: Digital Marketing Course');
// Output: ⌨ï¸,🖥,🖨, Learning Online: Digital Marketing Course
When you try to match using your pattern, this fails, but if you instead match any number of non-word characters without the global flag, you match the whole emoji.
Using preg_replace() with a limit of 1, this becomes:
$re = '/\W*/';
$str = 'â¨ï¸,ð¥,ð¨, Learning online: Digital Marketing Course';
$subst = '';
$result = preg_replace($re, $subst, $str, 1);
echo "The result of the substitution is ".$result;
// Output: Learning online: Digital Marketing Course

php - How to match Japanese regular expression with u flag?

Something weird is happening with preg_match when I feed it the following strings. I am using the 'u' flag because I am trying to match a mixed Japanese string.
<?php
$subject="/hello/カメラ/";
$pattern='#^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/#u';
$result=preg_match($pattern,$subject);
echo $result; // 1
$subject="/hello/カレンダー/";
$pattern='#^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/#u';
$result=preg_match($pattern,$subject);
echo $result; // 0
?>
Notice that both $subject values have the same construction, '/hello/katakana/', and the $pattern is identical. Why, then, is the first $result 1 and the second one 0?
Is that a bug?
Update:
I am running PHP Version 5.5.24 on a Mac.
Many thanks to David Vartanian for his help.
To make the regular expression work for both cases, I had to update the pattern the following way.
$pattern='#^/hello/([\x{30A0}-\x{30FF}\x{3040}-\x{309F}\x{4E00}-\x{9FBF}\w\-]+)/#u';
However, it seems like the older pattern works on PHP 5.5.9 and newer as mentioned by chris.
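A small sketch of the updated pattern applied to both subjects from the question. The likely reason the second subject is the problem case is the prolonged sound mark ー (U+30FC), which Unicode assigns to the Common script rather than Katakana, so \p{Katakana} alone does not cover it; the explicit range \x{30A0}-\x{30FF} does. The $captured array is just a helper for this demonstration.

```php
<?php
// Explicit code point ranges: Katakana, Hiragana, and CJK ideographs.
$pattern = '#^/hello/([\x{30A0}-\x{30FF}\x{3040}-\x{309F}\x{4E00}-\x{9FBF}\w\-]+)/#u';

$subjects = array('/hello/カメラ/', '/hello/カレンダー/');
$captured = array();

foreach ($subjects as $subject) {
    if (preg_match($pattern, $subject, $m)) {
        $captured[] = $m[1]; // the path segment after /hello/
    }
}

print_r($captured);
```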
You could use mb_ereg_match(); this function is designed for multibyte regular expressions and should not be confused with the deprecated ereg_* functions. To use it, just remove the delimiters and the u modifier.
<?php
$subject="/hello/カメラ/";
$pattern='^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/';
$result = mb_ereg_match($pattern, $subject);
echo "<pre>";
print_r($result);
$subject="/hello/カレンダー/";
$pattern='^/hello/([\p{Han}\p{Katakana}\p{Hiragana}\w\-]+)/';
$result = mb_ereg_match($pattern, $subject);
print_r($result);

PHP regex case-insensitive on cyrillic charset

I am using preg_replace and preg_match with PHP, working in this charset: Cyrillic Windows 1251.
I am trying to match a word using the case-insensitive modifier.
I ran these tests:
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$subject = 'Am I able to find MYCyrILlicWord1?';
$res = preg_replace($pattern, 'matched', $subject);
On UTF-8:
With the UTF-8 modifier in the pattern:
$pattern = '/myCyrillicWord1|myCyrillicWord2/iu';
$output = 'Am I able to find matched or not';
Without:
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
On Windows-1251:
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
The regex works on UTF-8 but not on Windows-1251.
Please note that I tested with Cyrillic characters such as 'х' and 'Х' (which look like the Latin letters 'x' and 'X').
Is this behavior normal?
How can I match my Cyrillic words in the Windows-1251 charset with the case-insensitive modifier?
Many thanks.
I don't think PCRE supports charsets, so your options are basically:
convert everything to UTF-8, process it, and then convert back, or
use manually crafted regexes for case-insensitivity, like /[Дд][Ыы][Кк]/ to match Дык, дыК, etc.
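The first option could look roughly like this. replaceInCp1251() is a hypothetical helper name, and the sketch assumes the PHP source file itself is saved as UTF-8 (so the pattern literal is UTF-8) and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: run a UTF-8 regex against a Windows-1251 string.
// $pattern and $replacement must be UTF-8; the subject and result are Windows-1251.
function replaceInCp1251($pattern, $replacement, $subjectCp1251)
{
    $utf8   = mb_convert_encoding($subjectCp1251, 'UTF-8', 'Windows-1251');
    $result = preg_replace($pattern, $replacement, $utf8);
    return mb_convert_encoding($result, 'Windows-1251', 'UTF-8');
}

// Build a Windows-1251 subject for the demonstration.
$subject = mb_convert_encoding('Найти ДЫК здесь', 'Windows-1251', 'UTF-8');

// The i and u modifiers together give Unicode-aware case-insensitive matching.
$output = replaceInCp1251('/дык/iu', 'matched', $subject);

$outputUtf8 = mb_convert_encoding($output, 'UTF-8', 'Windows-1251');
echo $outputUtf8, "\n"; // Найти matched здесь
```

The replacement text here is ASCII, so it survives the round trip unchanged; a Cyrillic replacement would also need to be supplied as UTF-8.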

Using delimiters with preg_match

I am having difficulty understanding the preg_match function. An example explains it best:
$subject="XY=abC%3Fedr%3Damp;35";
I am trying to extract
bC%3Fed
using preg_match and store it in a variable:
if(preg_match($pattern, $subject, $matches))
{
$string = $matches[1];
}
echo $string;
Here are the different variations that I tried for $pattern.
I want to use # as the delimiter:
#bC(.*?)#
#bC.*?#
I just don't understand why it's not working; I guess something is wrong in the $pattern.
Please don't use complicated regex and try to fix my attempt as the aim here is to understand how preg_match works and what is wrong here.
Regards
Using # as the delimiter is OK, but the regex is wrong. I guess you want:
#(bC.*?)r# // captures bC and the following characters up to, but not including, the next 'r'
A good starting point to learn the regex syntax is the PCRE manual
Example:
$subject="XY=abC%3Fedr%3Damp;35";
$pattern="#(bC.*?)r#";
preg_match($pattern, $subject, $matches);
$string = $matches[1];
echo $string; // bC%3Fed
The ? after .* switches the greediness of the pattern. By default patterns are greedy: they try to find the longest match. So your .*? means any character, any count, smallest match. Because there is nothing after it to anchor it, the smallest possible match is an empty string.
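To see the difference side by side, here is a short sketch using the subject from the question. The variable names $unanchored and $anchored are only for this illustration.

```php
<?php
$subject = "XY=abC%3Fedr%3Damp;35";

// Lazy group with nothing after it: the shortest possible match is empty.
preg_match('#bC(.*?)#', $subject, $unanchored);

// The literal r after the group forces .*? to expand up to the first r.
preg_match('#(bC.*?)r#', $subject, $anchored);

var_dump($unanchored[0]); // string(2) "bC"
var_dump($unanchored[1]); // string(0) ""
var_dump($anchored[1]);   // string(7) "bC%3Fed"
```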

Problems with PHP, preg_replace & regular expressions

I'm trying to run this php command:
preg_replace($regexp, $replace, $text, $maxsingle);
Where the vars are:
$regexp = '/(?!(?:[^<\\[]+[>\\]]|[^>\\]]+<\\/a>))\\b(שלום)\\b/imsU';
$replace = '<a title="$1" href="http://stackoverflow.com">$1</a>';
$text is a long post
$maxsingle = 3;
When the text I'm trying to match (in the above case "שלום") is in English everything works. However, when the text is Hebrew, it doesn't match anything...
Any ideas how to make Hebrew work with preg_replace?
Thanks.
Try using the /u (utf-8) flag
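A simplified sketch of why the flag matters here. It keeps only the \b...\b part of the original pattern, and the sample text is made up. Without u, PCRE works byte by byte, and the bytes of a Hebrew letter are not word characters, so \b finds no word boundary around the word. With u, PHP (at least in builds with a reasonably recent PCRE) also makes \w and \b Unicode-aware.

```php
<?php
// Hypothetical sample text containing the word from the question.
$text = 'אמרתי שלום לכולם';

$withoutU = preg_match('/\b(שלום)\b/', $text);  // byte mode
$withU    = preg_match('/\b(שלום)\b/u', $text); // UTF-8 mode

var_dump($withoutU, $withU); // int(0), int(1)
```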
