Why do PHP's str_replace and many other string functions mess up strings with special characters such as 'é' and 'à'? And how can I fix this problem?
str_replace is not multibyte (Unicode) aware; it works on raw bytes. Use the corresponding mb_* function where one exists.
In your case mb_ereg_replace sounds like the right option. You could also just use the PCRE regex functions (preg_*) and specify the u modifier.
PHP wasn't developed from the ground up to natively support UTF-8. Instead of writing the character literal, it may help to specify the character by its code point, e.g. "\u{3094}" in a double-quoted string (PHP 7 and later) or \x{3094} in a PCRE pattern with the u modifier, and replace that. I think it's more consistently supported.
Though it would help to see your actual issue, with more code.
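To make the byte-versus-character point concrete, here is a minimal sketch. It assumes the file is saved as UTF-8 and the mbstring extension is loaded; the sample strings are made up for illustration.

```php
<?php
// Assumption: this source file is saved as UTF-8 and mbstring is available.
$text = 'café à la crème';

// str_replace() compares raw bytes, so a UTF-8 needle in a UTF-8 haystack
// matches cleanly: only 'é' is replaced, 'è' and 'à' are left alone.
$replaced = str_replace('é', 'e', $text);

// Byte-oriented functions count bytes, not characters.
$bytes = strlen('é');              // 2 bytes in UTF-8
$chars = mb_strlen('é', 'UTF-8');  // 1 character

// Cutting by byte offset can split a character in half...
$broken = substr('é', 0, 1);
// ...while the mb_ variant cuts per character.
$whole = mb_substr('éà', 0, 1, 'UTF-8');

// A code point escape (PHP 7+) produces the UTF-8 bytes for U+3094.
$vu = "\u{3094}";
```

The usual cause of "messed up" output is a byte-offset function such as substr() cutting through a character, or a mix of encodings between needle and haystack, rather than str_replace() itself.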
I was looking for this for a while, but was not able to find any answer. I need to change a string to lowercase in PHP.
Of course, this can be done by using strtolower(), but I was wondering if it's possible to do it via preg_replace().
I noticed that in vim one can use the \L or \U modifiers in back references to change the case to lower or upper.
Is something like that possible in PHP, i.e. in the second argument of preg_replace()? The reason I want to change the case via preg_replace() is that I heard it might work better for UTF-8 strings (not sure if it's true).
Thanks.
You should actually just use
mb_strtolower($str, 'UTF-8')
That way you specify utf-8 is the encoding, and all should work well.
Edit: sorry, I had strtoupper; changed to lower. Also, you can leave off 'UTF-8'. The function does not detect the encoding, though; it falls back to the internal encoding (see mb_internal_encoding()), so this only works if that is set to UTF-8.
Doing it with preg_replace alone is practically impossible.
This is because PHP's replacement string has no \L or \U modifiers, and preg_replace cannot call strtolower() / strtoupper() on a match by itself. You would have to pass them as a callback to preg_replace_callback().
Go with the function Dave suggested.
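If you really want a regex to decide which parts get lowercased, a callback is one way to combine the two. This is only a sketch, assuming mbstring is loaded and the file is saved as UTF-8; the input string is invented for the example.

```php
<?php
// Assumption: UTF-8 source file, mbstring extension loaded.
$input = 'ÉCOLE Über STRASSE';

$result = preg_replace_callback(
    '/\p{Lu}+/u',  // runs of uppercase letters; u = pattern and subject are UTF-8
    function (array $m): string {
        // $m[0] is the matched text; lowercase it with the multibyte function.
        return mb_strtolower($m[0], 'UTF-8');
    },
    $input
);
```

For lowercasing a whole string, plain mb_strtolower($str, 'UTF-8') is still simpler.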
I know that if I use multibyte (UTF-8) characters in the pattern, I have to use the mb_ functions or the u option for the pattern of the preg_ functions.
But when I use multibyte (UTF-8) characters only in the subject of the preg_ functions and only ASCII characters in the pattern, do the preg_ functions (without the u option) work correctly?
I know that in this case I have to use an mb_ function or add the u option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know whether this code (the u option is not used) is safe or not:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
Maybe I have found the answer myself.
But if someone knows character encodings well, please comment on this answer or post another one.
According to Wikipedia, the bytes of a multibyte UTF-8 character never coincide with ASCII codes.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means a preg_ function with an ASCII pattern and without the u option is safe for a multibyte (UTF-8) subject.
And this code (without u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
give the same result.
Both work correctly.
Am I correct?
As far as I know it is safe as long as you use the Unicode modifier (u), like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
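A quick way to check the claim from the question is to run both variants on the same subject. This is a sketch with a shortened, made-up ASCII pattern and sample strings, assuming a UTF-8 source file.

```php
<?php
// Assumption: UTF-8 source file; the pattern is a shortened stand-in.
$subject = '日本語 abc-123, déjà vu!';
$ascii_pattern = '[a-zA-Z0-9,!\-]';

// Every byte of a multibyte UTF-8 character is 0x80 or above, so an
// ASCII-only class can never match part of such a character.
$without_u = preg_replace("/$ascii_pattern/", '', $subject);
$with_u    = preg_replace("/$ascii_pattern/u", '', $subject);

// One real difference: with u, the subject must be valid UTF-8.
$invalid = "abc\xC3";  // ends with a truncated 2-byte sequence
$strict  = preg_replace("/$ascii_pattern/u", '', $invalid);  // null (error)
$lenient = preg_replace("/$ascii_pattern/", '', $invalid);   // "\xC3"
```

So for valid UTF-8 input and a pure ASCII pattern the two calls agree. The u modifier matters once the pattern contains non-ASCII characters, `.`, or quantifiers applied to a multibyte character, and it makes the call fail on malformed input instead of passing it through.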
I've been searching for UTF-8-safe alternatives to the string manipulation functions. I've found many different opinions and suggestions. I would like to ask whether the following functions can cause problems with UTF-8 and, if they do, what I should use instead. I know the list of mb_-prefixed functions in the PHP manual, but it does not cover all the functions I am using.
Functions are: implode, explode, str_replace, preg_match, preg_replace
Thank you
explode just looks for an identical byte sequence and separates the string at that point. Since UTF-8 is safely backwards compatible with ASCII, there's no concern and it will work fine. implode just assembles strings together, which works fine as well due to the properties of UTF-8. str_replace works for the same reasons. The preg_ functions work fine as long as you are using the /u modifier.
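Here is a small sketch of those properties, assuming a UTF-8 source file; the sample data is invented. The last two lines show why the u modifier matters for the preg_ functions: without it, `.` matches a single byte.

```php
<?php
// Assumption: UTF-8 source file; all strings below are UTF-8.
$csv = 'żółw,naïve,日本';

// The delimiter ',' is ASCII, and ASCII bytes never occur inside a
// multibyte UTF-8 character, so the split cannot land mid-character.
$parts  = explode(',', $csv);
$joined = implode(' | ', $parts);

// Byte-sequence search and replace; needle and haystack are both UTF-8.
$swapped = str_replace('ï', 'i', $joined);

// Without u the regex engine sees two bytes; with u it sees one character.
$byte_matches = preg_match_all('/./', 'é');
$char_matches = preg_match_all('/./u', 'é');
```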
If you need to manipulate UTF-8 characters safely, you can do it like this:
mb_internal_encoding('UTF-8');
preg_replace( '`...`u', '...', $string ); // with the u (unicode) modifier
I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent unexpected errors, or will this regex reject multibyte filenames correctly?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string", as such -- rather, one chooses to treat a native string as a multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it so long as you use UTF-8 (or similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes, 0x80 and above (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 by using the u modifier (PCRE_UTF8).
But of course, as described above PHP strings don't know their own encoding, so you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that that function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
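For the filename check from the question, one possible approach is to spell out the allowed ASCII characters instead of relying on \w, whose meaning without the u modifier can depend on the locale. The function name is_safe_filename is hypothetical, and the sketch assumes mbstring is loaded.

```php
<?php
// Hypothetical helper, not from the original post.
function is_safe_filename(string $filename): bool
{
    // Reject empty names and byte sequences that are not valid UTF-8.
    if ($filename === '' || !mb_check_encoding($filename, 'UTF-8')) {
        return false;
    }

    // Explicit ASCII class: letters, digits, underscore, dot, hyphen.
    // Any byte of a multibyte character is outside this class, so such
    // names are rejected. D stops $ from matching before a trailing newline.
    return preg_match('/^[A-Za-z0-9_.-]+$/D', $filename) === 1;
}

$ok       = is_safe_filename('photo_01.jpg');
$accented = is_safe_filename('été.jpg');
$newline  = is_safe_filename("evil.jpg\n");
```

With this pattern, multibyte filenames are simply rejected rather than causing errors. If you want to accept them, \w with the u modifier would be the place to start.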
In PHP, what is the difference between strtolower and mb_strtolower?
If I want to convert a submitted email address to lowercase, which one should I use? Is there any email address like this: Name#Domain-Test.com
If such addresses exist, should I still convert the submitted email address to lowercase?
strtolower() doesn't work for Polish characters
<?php strtolower("mĄkA"); ?>
will return: mĄka
the best solution - use mb_strtolower()
<?php mb_strtolower("mĄkA",'UTF-8'); ?>
will return: mąka
See strtolower() & mb_strtolower() in PHP Manual
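The two calls can be compared side by side. This is a sketch assuming a UTF-8 source file and the mbstring extension; the exact output of strtolower() depends on the PHP version and locale, so only the multibyte result is stated.

```php
<?php
// Assumption: UTF-8 source file, mbstring extension loaded.
$word = 'mĄkA';

// Byte-oriented: 'Ą' is two bytes (0xC4 0x84) and is never turned into 'ą'.
$byte_lower = strtolower($word);

// Character-oriented: uses Unicode case mappings for the given encoding.
$mb_lower = mb_strtolower($word, 'UTF-8');  // 'mąka'
```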
What is the difference between strtolower and mb_strtolower?
The mb_* functions work with multi-byte string. The manual says:
By contrast to strtolower(), 'alphabetic' is determined by the Unicode character properties. Thus the behaviour of this function is not affected by locale settings and it can convert any characters that have 'alphabetic' property, such as A-umlaut (Ä).
Is there any email like this : Name#Domain-Test.com
Yes, I suppose there could be email addresses like that. I've found that in general, email addresses are case-insensitive, so I don't bother changing their case.
The mb_ functions work with Multi-Byte (unicode) strings as well. E-Mail addresses shouldn't be case sensitive - there isn't much reason to convert them to lower.
If you use strtolower() on a UTF-8 string under a single-byte locale, without telling PHP that the string is Unicode, you can corrupt it. In particular, the byte 0xC3 (uppercase 'A' with tilde in Latin-1), which is the first byte of many 2-byte UTF-8 characters, is converted to 0xE3 (lowercase 'a' with tilde).
mb_strtolower() is very slow. If you have a database connection, you may want to use the database to convert your strings to lowercase instead. Even latin1/9 (ISO-8859-1/15) and other encodings are possible.