I discovered that the u modifier sometimes helps when working with UTF-8 strings, but on my Linux server it replaces the umlaut with - instead of leaving it intact as my Windows server does.
mb_internal_encoding('UTF-8');
function clean($string) {
    return preg_replace('/[^[:alnum:]]/ui', '-', $string);
}
echo clean("Test: föG");
Linux:
Test--f-G
Windows (as it should):
Test--föG
From the PHP documentation of the PCRE module:
In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes.
This is probably for efficiency reasons: there are many Unicode characters. You can write your regular expression using Unicode character properties instead of the POSIX character class. This will be somewhat slower, though.
<?php
mb_internal_encoding('UTF-8');
function clean($string) {
    return preg_replace('/[^\\p{L}\\p{N}]/ui', '-', $string);
}
echo clean("Test: föG");
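As a quick check of the property-based pattern, here is a minimal sketch. The name cleanUnicode is hypothetical, and the input is written with byte escapes so the result does not depend on how the source file is saved. Because it no longer relies on the POSIX class, it should behave the same on both servers:

```php
<?php
// Hypothetical helper: replace everything that is not a Unicode letter or number.
function cleanUnicode($string) {
    // \p{L} = any letter, \p{N} = any number; /u treats pattern and subject as UTF-8.
    return preg_replace('/[^\p{L}\p{N}]/u', '-', $string);
}

// "Test: föG" with ö spelled as its UTF-8 bytes (0xC3 0xB6).
echo cleanUnicode("Test: f\xC3\xB6G"), "\n"; // Test--föG
```

The colon and the space each become a dash, while ö survives because it has the letter property.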
I need to trim a string of all characters except letters from any language in UTF-8. In an early test this worked fine, until I started using non-Latin UTF-8 letters:
<?php
$s = '\$5ı龢abc';
echo '<p>'.$s.'</p>';
while (!preg_match('/([\p{L}]+)/u', $s[0]))
{
    $s = substr($s, 1);
    echo '<p>'.$s.'</p>';
}
?>
This currently outputs the following:
$5ı龢abc
$5ı龢abc
5ı龢abc
ı龢abc
�龢abc
龢abc
��abc
�abc
abc
I would like the final output to be: ı龢abc. What am I missing?
Using individual character indexing doesn't work, since PHP isn't aware of "characters" in strings, and merely indexes bytes. This is obviously a problem with multi-byte characters. But you're doing it way too manually anyway; just replace all non-letter characters at the beginning of the string:
$s = preg_replace('/^\P{L}*/u', '', $s);
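To make the byte-versus-character point concrete, here is a small sketch. It assumes the mbstring extension is loaded, and stripLeadingNonLetters is a hypothetical name wrapping the one-liner above:

```php
<?php
// Hypothetical wrapper around the one-line fix; assumes UTF-8 input.
function stripLeadingNonLetters($s) {
    // \P{L} (capital P) matches anything that is NOT a letter.
    return preg_replace('/^\P{L}*/u', '', $s);
}

// "$5ı龢abc", with the non-ASCII letters spelled as UTF-8 bytes.
$s = '$5' . "\xC4\xB1\xE9\xBE\xA2" . 'abc';

$trimmed = stripLeadingNonLetters($s);
echo $trimmed, "\n";                     // ı龢abc
echo strlen($trimmed), "\n";             // 8 (bytes)
echo mb_strlen($trimmed, 'UTF-8'), "\n"; // 5 (characters)
echo bin2hex($trimmed[0]), "\n";         // c4 (only the first byte of ı)
```

The last line shows why the original loop printed broken characters: $s[0] returns a single byte, which on its own is not valid UTF-8 for a multi-byte letter.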
I've been googling for a bit and also searched here, but can't find a solution. I'm using PHP. I'm reading a text string (part of an X509 cert) in which é is encoded as \xC3\xA9 (André => Andr\xC3\xA9).
I've tried MonkeyPhysics's solution:
preg_replace("#(\\\x[0-9A-F]{2})#ei", "chr(hexdec('\\1'))", $string);
but then I get André
I've also played around with the replacement part:
mb_convert_encoding('&#' . hexdec('\\1') . ';', 'ISO-8859-1', 'UTF-8')
(and with swapping to_encoding and from_encoding).
I've also looked at How to transliterate non-latin scripts? but got no closer.
Surely this should be a standard conversion?
The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. Use preg_replace_callback instead, with the /u modifier for handling Unicode strings.
$string = 'His nickname was \xE2\x80\x98the Angel\xE2\x80\x99,
which is kind of a clich\xC3\xA9 in my opinion.';
// Capture only the two hex digits so hexdec() receives clean input.
$repl = preg_replace_callback('#\\\\x([0-9A-F]{2})#ui',
    function ($m) { return chr(hexdec($m[1])); }, $string);
OUTPUT:
His nickname was ‘the Angel’,
which is kind of a cliché in my opinion.
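The same idea can be wrapped in a reusable function. This is only a sketch: decodeHexEscapes is a hypothetical name, and it assumes every \xHH sequence in the input is meant as a raw byte:

```php
<?php
// Hypothetical helper: turn literal "\xHH" sequences into the bytes they name.
function decodeHexEscapes($string) {
    return preg_replace_callback(
        '#\\\\x([0-9A-Fa-f]{2})#',      // a backslash, "x", then two hex digits
        function ($m) {
            return chr(hexdec($m[1]));  // $m[1] holds only the hex digits
        },
        $string
    );
}

$decoded = decodeHexEscapes('Andr\xC3\xA9'); // single quotes: the escapes stay literal
echo $decoded, "\n";                         // André, when displayed as UTF-8
var_dump(preg_match('//u', $decoded) === 1); // bool(true): the result is valid UTF-8
```

If the result still shows up as André, the bytes are probably correct and the page or terminal is interpreting them as ISO-8859-1 rather than UTF-8.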
This is my current regex code to validate English letters and numbers:
const CANONICAL_FMT = '[0-9a-z]{1,64}';
public static function isCanonical($str)
{
    return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);
}
Pretty straightforward. Now I want to change it to validate only Hebrew letters, underscores
and numbers, so I changed the code to:
public static function isCanonical($str)
{
    return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);
}
But it doesn't work. I basically took the Hebrew Unicode range from Wikipedia.
What is wrong here?
I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:
return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);
Working example: http://ideone.com/gSlmh
If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier (quoting the PHP manual):
This modifier turns on additional functionality of PCRE that is
incompatible with Perl. Pattern strings are treated as UTF-8.
In your case, instead of using the following regex:
/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i
I suppose you'd be using:
/^(?:[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}]+|[\w]+)$/iu
(Note the additional u at the end. PCRE also does not understand \uXXXX escapes, so the code points are written as \x{XXXX}.)
You need the /u modifier to add support for UTF-8.
Make sure you convert your Hebrew input to UTF-8 if it's in some other code page or character set.
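Putting the answers together, a validator for "Hebrew letters, underscores and numbers" might look like the sketch below. The function names are hypothetical, the 1 to 64 length limit is carried over from the original CANONICAL_FMT, and the explicit ranges are the ones quoted in the question:

```php
<?php
// Hypothetical validator: 1 to 64 Hebrew letters, ASCII digits, or underscores.
function isHebrewCanonical($str) {
    // \p{Hebrew} needs the /u modifier; digits and "_" are listed explicitly.
    return preg_match('/^[\p{Hebrew}0-9_]{1,64}$/u', $str) === 1;
}

// Same idea with explicit code point ranges instead of the script property.
function isHebrewCanonicalByRange($str) {
    return preg_match('/^[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}0-9_]{1,64}$/u', $str) === 1;
}

$shalom = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // שלום as UTF-8 bytes

var_dump(isHebrewCanonical($shalom . '_123'));        // bool(true)
var_dump(isHebrewCanonicalByRange($shalom . '_123')); // bool(true)
var_dump(isHebrewCanonical('hello'));                 // bool(false)
```

Using a single character class rather than the \w alternative keeps Latin letters out, which seems to be what the question asks for.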
I'm having a lot of trouble matching this character: –
It's called an "en dash", U+2013 (according to http://www.fileformat.info/info/unicode/char/search.htm)
It matches - in my test environment (Windows and PHP 5.2.11) but fails on the production servers (Ubuntu and PHP 5.3.2). Even \x2013 fails there.
Any suggestions on how to match this strange character, or how to configure PHP to make it work?
You can also try using the "u" flag on the expression, which makes it UTF-8 aware (see the regex pattern modifiers),
so your expression would be "/[somepattern]/u"
if (preg_match ('~\xe2\x80\x93~', $string))
{
    echo "En Dash found";
}
I believe your string is UTF-8 encoded, isn't it?
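The two suggestions can be combined: with the u modifier, the code point can be named directly as \x{2013}. Note that \x2013 without braces is read as \x20 (a space) followed by the literal text "13", which would explain why it failed. A minimal sketch, assuming the input is UTF-8 (containsEnDash is a hypothetical name):

```php
<?php
// Hypothetical helper; assumes $string is UTF-8 encoded.
function containsEnDash($string) {
    // With /u, \x{2013} names the code point U+2013 directly.
    return preg_match('/\x{2013}/u', $string) === 1;
}

$withDash   = "2010 \xE2\x80\x93 2012"; // en dash written as its UTF-8 bytes
$withHyphen = '2010 - 2012';            // plain ASCII hyphen

var_dump(containsEnDash($withDash));   // bool(true)
var_dump(containsEnDash($withHyphen)); // bool(false)
```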
$string1 = preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/', ' ', $string4);
This regex shouldn't replace the characters ä, ö and ü.
In Ruby it works as expected.
But in PHP it also replaces ä, ö and ü.
Can someone give me a hint on how to fix it?
Set the u pattern modifier (to tell PHP to treat the regex as a UTF-8 string).
'/[^A-Za-z0-9äöü!&_=\+-]/u'
I think this should work (\pL keeps every Unicode letter, not only ä, ö and ü):
$string1 = preg_replace('/[^A-Za-z0-9\pL!&_=\+-]/u', ' ', $string4 );
Native Unicode support was one of the features promised for PHP 6, which was never released.
In PHP 5 you can either
use the multibyte string functions such as mb_ereg, or
add the u modifier ('/regex/u') so that preg_match and preg_replace treat the pattern as a UTF-8 string.
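One way to make the pattern independent of how the PHP source file itself is saved is to spell the umlauts as code points. This is a sketch under that assumption, with a hypothetical function name and UTF-8 input:

```php
<?php
// Hypothetical helper: replace every character outside the allowed set with a space.
function keepAllowedChars($string) {
    // \x{E4}, \x{F6}, \x{FC} are ä, ö, ü; the hyphen comes last so it is literal.
    return preg_replace('/[^A-Za-z0-9\x{E4}\x{F6}\x{FC}!&_=+-]/u', ' ', $string);
}

$in = "Gr\xC3\xBC\xC3\x9Fe, Welt!"; // "Grüße, Welt!" as UTF-8 bytes
echo keepAllowedChars($in), "\n";   // "Grü e  Welt!" (ß, the comma and the space become spaces)
```

If the input arrives in another encoding such as ISO-8859-1, it would need converting to UTF-8 first, since the u modifier rejects invalid UTF-8 subjects.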