In PHP, what is the difference between strtolower and mb_strtolower?
If I want to convert submitted email address, to be converted to lower-case, which one should I use? Is there any email like this: Name#Domain-Test.com
If there are such email, should I still convert the submitted email address to lower case?
strtolower(); doesn't work for polish chars
<?php strtolower("mĄkA"); ?>
will return: mĄka;
the best solution - use mb_strtolower()
<?php mb_strtolower("mĄkA",'UTF-8'); ?>
will return: mąka
See strtolower() & mb_strtolower() in PHP Manual
whats is the different between strtolower and mb_strtolower?
The mb_* functions work with multi-byte string. The manual says:
By contrast to strtolower(), 'alphabetic' is determined by the Unicode character properties. Thus the behaviour of this function is not affected by locale settings and it can convert any characters that have 'alphabetic' property, such as A-umlaut (Ä).
-
Is there any email like this : Name#Domain-Test.com
Yes, I suppose there could be email addresses like that. I've found that in general, email addresses are case-insensitive, so I don't bother changing their case.
The mb_ functions work with Multi-Byte (unicode) strings as well. E-Mail addresses shouldn't be case sensitive - there isn't much reason to convert them to lower.
If you use this function on a unicode string without telling PHP that it is unicode, then you will corrupt your string. In particular, the uppercase 'A' with tilde, common in 2-byte UTF-8 characters, is converted to lowercase 'a' with tilde.
mb_strtolower() is very SLOW, if you have a database connection, you may want to use it to convert your strings to lower case. Even latin1/9 (iso-8859-1/15) and other encodings are possible.
Related
I know that if I use multibyte(UTF-8) characters for the pattern, I have to use mb_ functions or have to use u option for pattern of preg_ functions.
But when I use multibyte(UTF-8) characters only for the subject of preg_ functions and use only ascii characters for the pattern, do preg_ functions (without u option) work correctly?
I know that in this case I have to use mb_ function or add u option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know if this code(u option is not used) is safe or not:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
Maybe I found the answer by myself.
But someone who knows about character code well, please comment to this answer or post another answer.
According to wikipedia, UTF-8 character codes don't contain ascii code.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means preg function with ascii pattern without u option is safe for multibyte(UTF8) subject.
And this code (without u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
are the same.
Both correctly works.
Am I correct?
It is safe as far as I know as long as you use the unicode property (/u) like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8? My locale under linux is already set to UTF-8, so why doesn't functions like strlen, preg_replace and so on don't work properly by default?
All of the PHP string functions do not handle multibyte strings regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese and Chinese and Korean character take more than a single byte. Eg., a typical charactert say x is takes 1 byte in English it will take more than 1 byte in Japanese and Chinese and Korean. Now PHP's standard string functions are meant to treat a single character as 1 byte. So in case you are trying to do compare two Japanese or Chinese or Korean characters they will not work as expected. For example the length of "Hello World!" in Japanese or Chinese or Korean will have more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if, a single function like strlen() cannot return the length of the byte sequence as required for Content-Length HTTP header and at the same time the number of characters as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is use to work with string which is in other language(means not in English) format.
2) Default PHP string functions only work proper with English (or releted to it) language.
3) If you want to use strlen() or strpos() or uppercase() or strreplace() for special character,
Suppose We need to apply string functions on "Hello".
In chines (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (
नमस्ते), Gujarati (હેલો).
Different language can it's own character sets
so that mbstring introduced for communicate with various languages like (chines,Japanese etc).
Raul González is a perfect example of why:
It is about shortening too long user names for MySQL database, say we have 10 character limit and Raul González.
The unit test below is an example how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
$name = 'Raul González';
$user = factory(User::class)->create(['name' => $name]);
try {
$name1 = substr($name, 0, 10);
$user->name = $name1;
$user->save();
} catch (Exception $ex) {
}
$this->assertTrue(isset($ex));
$name2 = mb_substr($name, 0, 10);
$user->name = $name2;
$user->save();
$this->assertTrue(true);
}
PHP Laravel and PhpUnit was used for illustration.
I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works ok. But I have concerns about multibyte characters. Should I specifically handle them to prevent undetermined errors, or should this regex reject mb filenames ok?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extensionDocs (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string", as such -- rather, one chooses to treat a native string as multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it so long as you use UTF-8 (or similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes (for instance, ß is encoded as 0xcf 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character-sets are mb_eregDocs, mb_eregiDocs, mb_ereg_replaceDocs and mb_eregi_replaceDocs.
PCRE based regular expression functions like preg_matchDocs support UTF-8 by using the u-modifier (PCRE8)Docs.
But of course, as described above PHP strings don't know their own encoding, so you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that that function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
It's correct because UTF-8 multibyte characters are exclusively non-ASCII (128+ byte value) characters beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (a bytes in ASCII range), since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be safe that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
Well, I do have a counter example: I have a UTF8 encoded settings ".ini' file specifying appliation settings like email sender name. it says something like:
email_from = Märta
and I read it from there to variable $sender. Now that I replace the message body (UTF8 again)
regards
{sender}
$message = str_replace("{sender}",$sender_name,$message);
The email is absolutely correct in every respect but the sender is totally broken. There are other cases (like explode() ) when something goes wrong with a UTF string. It is healthy before the conversion but not after it. Sorry to say there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file so the problem may well lie in that very function so the str_replace() may well be innocent.
No you cannot.
From practice I am telling you if you have some multibyte symbols like ◊ etc, and others are non-multibyte it wont work correctly, because there are symbols that take 2-4 to place them,
str_replace takes fixed bytes, and replaces... In result we have something that isn't any symbols trash etc.
Yes, I think this is correct, at least I couldn't find any counter-example.
why php's str_replace and many other string functions mess up the strings with special chars such ('é' 'à' ..) ? and how to fix this problem ?
str_replace is not multi-byte (unicode) aware. use the according mb_* functions instead
in your place mb_ereg_replace sounds like the right option. you could as well just use the PCRE regex functions and specifying the X flag
PHP wasn't developed from the ground up to natively support UTF8. It may be useful to instead of specify the character literal, specify the entity reference / hex code of that in your replacement, eg \x3094 and replace that, I think it's more consistently supported.
Though it would help seeing your direct issue at hand, with more code.