PHP regex case-insensitive on Cyrillic charset

I am using preg_replace and preg_match with PHP, working in this charset: Cyrillic Windows 1251.
I am trying to match a word using the case-insensitive modifier.
I made these tests :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$subject = 'Am I able to find MYCyrILlicWord1?';
$res = preg_replace($pattern, 'matched', $subject);
On UTF-8 :
With the utf-8 modifier in the pattern :
$pattern = '/myCyrillicWord1|myCyrillicWord2/iu';
$output = 'Am I able to find matched or not';
Without :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
On Windows 1251 :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
The regex is functional on UTF-8 but not on Windows 1251.
Please note that I tested with Cyrillic characters like 'х' and 'Х' (which look like the Latin letters 'x' and 'X').
Is this behavior normal?
How can I match my Cyrillic words in the Windows 1251 charset with the case-insensitive modifier?
Many thanks.

I don't think PCRE supports charsets, so your options are basically:
1. convert everything to UTF-8, process, and then convert back, or
2. use manually crafted regexes for case-insensitivity, like /[Дд][Ыы][Кк]/ to match Дык, дыК, etc.
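The convert, process, convert-back option can be sketched roughly as follows. This is a minimal sketch, assuming the mbstring extension is available and that the pattern itself is written in UTF-8 with the /iu modifiers; the helper name replaceInCp1251 is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: run a UTF-8 pattern against a Windows-1251 subject
// by converting to UTF-8, replacing, and converting back.
function replaceInCp1251(string $utf8Pattern, string $replacement, string $cp1251Subject): string
{
    // Decode the Windows-1251 bytes into UTF-8 so the /u modifier can be used.
    $utf8Subject = mb_convert_encoding($cp1251Subject, 'UTF-8', 'Windows-1251');

    $result = preg_replace($utf8Pattern, $replacement, $utf8Subject);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    // Encode the result back into Windows-1251 for the rest of the application.
    return mb_convert_encoding($result, 'Windows-1251', 'UTF-8');
}
```

If the PHP source file itself is saved as Windows 1251, the pattern would also need converting to UTF-8 before it is passed in.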

Related

PHP trim non-letters Unicode

I need to trim a string of all characters except letters from any languages in UTF-8. For an early test this was working fine until obviously I started using UTF-8 non-Latin letters:
<?php
$s = '\$5ı龢abc';
echo '<p>'.$s.'</p>';
while (!preg_match('/([\p{L}]+)/u', $s[0]))
{
$s = substr($s, 1);
echo '<p>'.$s.'</p>';
}
?>
This currently outputs the following:
$5ı龢abc
$5ı龢abc
5ı龢abc
ı龢abc
�龢abc
龢abc
��abc
�abc
abc
I would like the final output to be: ı龢abc. I'm not quite sure what I'm missing however?
Using individual character indexing doesn't work, since PHP isn't aware of "characters" in strings, and merely indexes bytes. This is obviously a problem with multi-byte characters. But you're doing it way too manually anyway; just replace all non-letter characters at the beginning of the string:
$s = preg_replace('/^\P{L}*/u', '', $s);
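To make the byte-versus-character point concrete, here is a small sketch built around that one-line replacement. The function name trimLeadingNonLetters is made up for this example.

```php
<?php
// Hypothetical wrapper around the one-line fix: strip every leading
// character that is not a Unicode letter.
function trimLeadingNonLetters(string $s): string
{
    // \P{L} means "anything that is not a letter"; /u makes PCRE read UTF-8.
    // Fall back to the input if preg_replace fails (e.g. invalid UTF-8).
    return preg_replace('/^\P{L}*/u', '', $s) ?? $s;
}

$s = '$5ı龢abc';
echo strlen($s), "\n";                 // number of bytes, which is what $s[0] indexes into
echo trimLeadingNonLetters($s), "\n";  // ı龢abc
```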

Regular expression not working as intended when I use an emoji at the beginning of a string

My code is written in PHP. I am trying to store in my database subjects of the emails that I send, only after I remove the emojis that I include in the subject lines of those emails. I created this regular expression:
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/", "", $subject);
It works when I have the emoji at the end of the string, such as:
But if the emoji is at the beginning of the string, it does not work, and the entry is not even stored in my database:
Any issues that you can identify in my regular expression to achieve what I want?
UPDATE 1: Apparently the regular expression is just fine:
Add the "u" modifier to your regular expression to make it treat strings as UTF-8.
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/u", "", $subject);
Or use a built-in function to remove the non-ASCII characters from your string, e.g. iconv, utf8_decode, mb_convert_encoding, or recode.
$cleansubject = trim(iconv('UTF-8', 'ASCII//IGNORE', $subject));
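As a rough illustration of the "u" modifier approach, the sketch below wraps the same pattern in a hypothetical helper (stripToAsciiAlnum is not a PHP built-in) and trims the leftover whitespace. It assumes the subject is valid UTF-8.

```php
<?php
// Hypothetical helper: keep only ASCII letters, digits and whitespace.
function stripToAsciiAlnum(string $subject): string
{
    // With /u, each emoji is consumed as one code point rather than
    // as several independent bytes.
    $clean = preg_replace('/[^a-zA-Z0-9\s]/u', '', $subject);

    // preg_replace returns null on failure, so fall back to an empty string.
    return trim($clean ?? '');
}

echo stripToAsciiAlnum('🖥 Learning Online'), "\n"; // Learning Online
```

Note that this pattern also removes punctuation such as ":" and ",", since only letters, digits and whitespace are kept.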
This could be an encoding problem (3v4l example):
echo utf8_encode('⌨️,🖥,🖨, Learning Online: Digital Marketing Course');
// Output: â¨ï¸,ð¥,ð¨, Learning Online: Digital Marketing Course
When you try to match using your pattern this fails (see here), but if you instead match any number of non-word characters without the global flag like here you match the whole emoji.
And using preg_replace() this becomes:
$re = '/\W*/';
$str = 'â¨ï¸,ð¥,ð¨, Learning online: Digital Marketing Course';
$subst = '';
$result = preg_replace($re, $subst, $str, 1);
echo "The result of the substitution is ".$result;
// Output: Learning online: Digital Marketing Course

Replace All Special Characters Except Language Specific

Remove everything from the string except the language-specific special signs and characters etc.
I've been using this method:
$string = preg_replace('/[^A-Za-z0-9\-]/', ' ', $string);
Now it's obvious that it's not working with the following languages:
1. Arabic
2. Hindi
3. With Spanish characters.
And all other languages besides English.
Now my question is simple: what is the best way to remove all the special characters from the string?
Try this:
$string = "abcßöäü #.,}* हिंदी عربى";
$string = preg_replace('/[^\w0-9 \-]/u', '', $string);
var_dump($string);
//string(28) "abcßöäü  हद عربى"
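Whether \w matches non-ASCII letters depends on the system configuration, in particular on whether the PCRE library was built with Unicode property support.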
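In the output shown above, the Hindi vowel signs are dropped ("हद" instead of "हिंदी") because they are combining marks rather than letters. A possible variant, sketched here under the assumption that PCRE has Unicode property support, lists the Unicode categories explicitly instead of relying on \w. The function name stripSpecialChars is hypothetical.

```php
<?php
// Hypothetical helper: drop everything except letters, combining marks,
// digits, spaces and hyphens, in any script.
function stripSpecialChars(string $string): string
{
    // \p{L} = letters, \p{M} = combining marks (e.g. Hindi vowel signs),
    // \p{N} = numbers. The /u modifier makes PCRE read the input as UTF-8.
    return preg_replace('/[^\p{L}\p{M}\p{N} \-]/u', '', $string) ?? '';
}

var_dump(stripSpecialChars("abcßöäü #.,}* हिंदी عربى"));
```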

Problems with PHP, preg_replace & regular expressions

I'm trying to run this php command:
preg_replace($regexp, $replace, $text, $maxsingle);
Where the vars are:
$regexp = '/(?!(?:[^<\\[]+[>\\]]|[^>\\]]+<\\/a>))\\b(שלום)\\b/imsU';
$replace = '<a title="$1" href="http://stackoverflow.com">$1</a>';
$text is a long post
$maxsingle = 3;
When the text I'm trying to match (in the above case "שלום") is in English, everything works. However, when the text is Hebrew, it doesn't match anything...
Any ideas how to make Hebrew work with preg_replace?
Thanks.
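Try using the /u (UTF-8) modifier. Note that it is lowercase and distinct from the /U (ungreedy) modifier already in the pattern.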
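A simplified sketch of what that might look like is below. It keeps the word-boundary match and the replacement limit from the question but leaves out the lookahead that skips existing tags; the function name linkWord is hypothetical, and \b working on Hebrew letters assumes PCRE with Unicode property support.

```php
<?php
// Hypothetical helper: wrap the first $limit whole-word occurrences of $word in a link.
function linkWord(string $text, string $word, int $limit): string
{
    // Lowercase u = UTF-8 mode; it is unrelated to uppercase U (ungreedy).
    $regexp = '/\b(' . preg_quote($word, '/') . ')\b/iu';
    $replace = '<a title="$1" href="http://stackoverflow.com">$1</a>';

    // Fall back to the unchanged text if preg_replace fails.
    return preg_replace($regexp, $replace, $text, $limit) ?? $text;
}

echo linkWord('אמרתי שלום לכולם', 'שלום', 3), "\n";
```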

PHP Regex Problem:

$string1 = preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/', ' ', $string4);
This Regex shouldn't replace the chars äöü.
In Ruby it worked as expected.
But in PHP it also replaces the ä, ö and ü.
Can someone give me a hint how to fix it?
Set the u pattern modifier (to tell php to treat the regex as a UTF-8 string).
'/[^A-Za-z0-9äöü!&_=\+-]/u'
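For illustration, a minimal sketch of that fix in use, assuming the PHP source file and the input string are both UTF-8 encoded. The function name whitelistChars is invented for this example.

```php
<?php
// Hypothetical helper reproducing the fix: replace every character outside
// the whitelist with a space, reading pattern and subject as UTF-8.
function whitelistChars(string $input): string
{
    // Fall back to the unchanged input if preg_replace fails.
    return preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/u', ' ', $input) ?? $input;
}

echo whitelistChars('schön#grün'), "\n"; // schön grün
```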
I think this should work:
$string1 = preg_replace('/[^A-Za-z0-9\pL!&_=\+-]/u', ' ', $string4);
Unicode support was one of the features promised for PHP 6, which was never released.
Currently, in PHP 5, you can use the multibyte string functions such as mb_ereg, or add the u modifier: with '/regex/u', preg_match and preg_replace treat the pattern and the subject as UTF-8 strings.
