$string1 = preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/', ' ', $string4);
This Regex shouldn't replace the chars äöü.
In Ruby it worked as expected.
But in PHP it replaces also the ä ö and ü.
Can someone give me a hint how to fix it?
Set the u pattern modifier (to tell php to treat the regex as a UTF-8 string).
'/[^A-Za-z0-9äöü!&_=\+-]/u'
i think this should work:
$string1 = preg_replace('/\[^A-Za-z0-9\pL!&_=\+-]/u', ' ', $string4 );
Unicode support is one of the features promised for PHP 6.
Currently in php5
use the multibyte string functions like mb_ereg
PHP will interpret '/regex/u' as a UTF-8 string, with preg_match,preg_replace
Related
I have some strings containing characters such as \x{1f601} which I want to replace with some text.
When I do this using preg_replace, it would be something like:
preg_replace('/\x{1f601}/u', '######', $str)
However, this doesn't seem to work with str_replace:
str_replace("\x{1f601}", '######', $str)
How can I make such replacements work with str_replace?
preg_replace is a Regex parser/replacer, which is a Perl Regular expression engine, but str_replace is NOT and replaces things with a plaintext method
The Preg_replace you have got can be seen here in regex101, stating that:
matches the character 😁 with position 0x1f601 (128513 decimal or 373001 octal) in the character set
But this could be transferable to a non-regex find and replace,by copy and pasting that face smiley symbol into the str_replace directly.
$str = str_replace("😁", '######', $str)
Or, by reading deceze's comment which gives you a clean, small solution.
Additional:
You are using a character set that is non-standard so it may be useful for you to explore Mb_Str_replace (gitHub) which is an accompanyment (but not directly from) the mb_string collection of PHP functions.
Finally:
Why do you need to do string replace whe you are already doing regex preg_replace? Also please read the manual which states all of this fairly clearly.
I am using preg_replace and preg_match with PHP, working in this charset: Cyrillic Windows 1251.
I am trying to match a word using the case-insensitive modifier.
I made these tests :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$subject = 'Am I able to find MYCyrILlicWord1?';
$res = preg_replace($pattern, 'matched', $subject);
On UTF-8 :
With the utf-8 modifier in the pattern :
$pattern = '/myCyrillicWord1|myCyrillicWord2/iu';
$output = 'Am I able to find matched or not';
Without :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
On Windows 1251 :
$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';
The regex is functionnal on utf-8 but not on Windows 1251.
Please notice that I had tested with cyrillics characters like 'х' and 'Х' (which look like latin letters 'x' and 'X').
My question is to know if that behavior is normal ?
How can I match my cyrillics words in Windows 1251 charset with the case-insensitive modifier ?
Many thanks.
I don't think PCRE supports charsets, so your options are basically
convert everything to utf8, process and then convert back, or
use manually crafted regexes for case-insensitivity, like /[Дд][Ыы][Кк]/ to match Дык, дыК etc
I've been googling for a bit, also search here but can find a solution. I'm using PHP. I'm reading a text string (part of X509 cert) and it encoded é to \xC3\xA9 (André => Andr\xC3\xA9).
I've tried MonkeyPhysics's solution:
preg_replace("#(\\\x[0-9A-F]{2})#ei", "chr(hexdec('\\1'))", $string);
but then I get André
I've played around with the replacement part;
mb_convert_encoding('&#' . hexdec('\\1') . ';', 'ISO-8859-1', 'UTF-8')
(Also the to_encoding and from_encoding)
I've also looked at How to transliterate non-latin scripts? but got no closer.
Surely this should be a standard conversion?
Use of e modifier is deprecated in PHP now. You need to use preg_replace_callback instead with /u modifier for handling unicode strings.
$string = 'His nickname was \xE2\x80\x98the Angel\xE2\x80\x99,
which is kind of a clich\xC3\xA9 in my opinion.';
$repl = preg_replace_callback("#(\\\x[0-9A-F]{2})#ui",
function ($m) { return chr(hexdec($m[1])); }, $string);
OUTPUT:
His nickname was ‘the Angel’,
which is kind of a cliché in my opinion.
how can i to remove all characters non-language ?
i want to remove characters like this below, and all other of not language characters:
i using this:
preg_replace("/[^a-z0-9A-Z\-\'\|\!\.\?\:\)\(\;\*\"]/u", " ", $text );
this is good for english,
i need to approve all language characters, like Russian,arabic,hebrew,japan...
Are there any string functions I can use to leave all language characters?
thanks
No regex will be perfect for what you want - language and writing are just too complex for this. But an approximation could be
preg_replace('/[^\p{L}\p{M}\p{Z}\p{N}\p{P}]/u', ' ', $text);
This will replace anything by a space that's not a Unicode character with one of the properties “letter”, “mark”, “separator”, “number” or “punctuation”.
Tim Pietzcker's answer not working in my case.
This works.
$after = preg_replace('/[^\w\s]+/u','' , $before);
this is my current regex code to validate english & numbers:
const CANONICAL_FMT = '[0-9a-z]{1,64}';
public static function isCanonical($str)
{
return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);
}
Pretty straight forward. Now i want to change that to validate only hebrew, underscore
and numbers. So i changed the code to:
public static function isCanonical($str)
{
return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);
}
But it doesn't work. I basically took the hebrew UTF range out of Wikipedia.
What is Wrong here?
I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:
return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);
Working example: http://ideone.com/gSlmh
If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier (quoting) :
This modifier turns on additional functionality of PCRE that is
incompatible with Perl. Pattern strings are treated as UTF-8.
In your case, instead of using the following regex :
/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i
I suppose you'd be using :
/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/iu
(Note the additionnal u at the end)
You need the /u modifier to add support for UTF-8.
Make sure you convert your hebrew input to UTF-8 if it's in some other codepage/character set.