The PHP strtolower() function is supposed to convert strings to lowercase. But, it says in the PHP Manual (emphasis added):
Returns string with all alphabetic characters converted to lowercase.
Note that 'alphabetic' is determined by the current locale. This means
that in i.e. the default "C" locale, characters such as umlaut-A (Ä)
will not be converted.
The manual is silent about encodings here, but it is known that strtolower() will corrupt UTF-8 strings, for which you are supposed to use mb_strtolower() instead.
I'm looking for a solution in cases where the mbstring extension is not available, and wanted to know when it is safe to use strtolower().
Thanks to pointers given to me by people commenting on this question, it seems that the relevant part of the PHP source is the call to the tolower() function from the ctype.h library. The library documentation says (emphasis added):
If the
argument of tolower() represents an uppercase letter, and there exists
a corresponding lowercase letter (as defined by character type information in the
program locale category LC_CTYPE ), the result shall be the corresponding
lowercase letter.
According to my tests, in PHP with setlocale( LC_CTYPE, 'C' ); characters such as Ä (encoded in ISO-8859-1) are left untouched. But in some other locales, the function returns the lowercase ä (again, in ISO-8859-1). In any case, changing the locale to one that uses a UTF-8 character set does not make PHP strtolower() work on the UTF-8 character Ä.
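To illustrate, here is a minimal sketch of those tests; the locale names are assumptions and which locales are installed varies per system:
setlocale(LC_CTYPE, 'C');
var_dump(strtolower("\xC4"));        // ISO-8859-1 "Ä": left untouched in the "C" locale
setlocale(LC_CTYPE, 'de_DE.ISO8859-1', 'de_DE');
var_dump(strtolower("\xC4"));        // may become "\xE4" ("ä") if such a locale is available
setlocale(LC_CTYPE, 'de_DE.UTF-8');
var_dump(strtolower("\xC3\x84"));    // UTF-8 "Ä": still returned unchanged, byte by byte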
Considering the increasing amount of I18N-related issues and multilingual environments, this information can be critically important. Many applications rely on strtolower() for a simple case-insensitive check. Consider:
$_POST['username'] = 'Michèlle';
if ( strtolower( $_POST['username'] ) == $database['username'] ) ...
Now, depending on the encoding, locales and maybe some other variables, the above code will work in some environments, but not in others.
The question is: Given that the PHP strtolower() function uses ctype.h library's tolower function, which depends on the "program locale category", when is it safe to count on this function? Can the behaviour be counted on in the following cases?
The string is ASCII
The string is encoded in ISO-8859-1
The string is encoded in some other encoding with the corresponding locale set.
(Edit: Question reworded completely on 26 Nov 2013.)
The PHP strtolower() function does use the C tolower() function in its implementation, and it operates on each single byte (octet) of the passed string parameter.
This is why setlocale(LC_CTYPE, 'C'); does not corrupt UTF-8 encoded strings: it won't change bytes > 127, that is, it only changes the case of the US-ASCII characters A-Z.
The "C" locale is set by default, so you only need to set it explicitly with setlocale() if other parts of the application have set it to a different value.
This also explains why setting LC_CTYPE to a UTF-8 locale like "de_DE.UTF-8" does not convert "Ä" to "ä": that letter is encoded as the two bytes 0xC3 0x84, each of which is passed on its own as a single character (octet) to the C tolower() function. They are therefore left unchanged, because byte-wise lowercasing of UTF-8 can only deal with characters < 128, which is again effectively A-Z only - just like the "C" locale.
So setting LC_CTYPE to "C" prevents breaking UTF-8 strings in use with strtolower().
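A small sketch of that guarantee (assuming the input is valid UTF-8):
setlocale(LC_CTYPE, 'C');
$input  = 'ÄBC';                          // UTF-8
$output = strtolower($input);             // only the ASCII letters are lowered
var_dump($output === "\xC3\x84" . 'bc');  // bool(true): "Äbc", still valid UTF-8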
It uses the C function tolower (ref: http://www.acm.uiuc.edu/webmonkeys/book/c_guide/2.2.html) from the ctype.h library.
You can view the relevant sections of the source here:
where strtolower is defined: http://lxr.php.net/xref/PHP_TRUNK/ext/standard/string.c#1393
where tolower is called in php_strtolower - http://lxr.php.net/xref/PHP_TRUNK/ext/standard/string.c#1376
Related
I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (like with the character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters - as a result, filenames on disk and filenames stored in the database don't match anymore. This conversion, which sometimes fails, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I can then remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example, I lose all "äöüÄÖÜ" etc., which are quite common in the German language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!
I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it does so from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding it to the function.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
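If mbstring is available anyway (as in the question's own code), a hedged sketch of that would be to pass an explicit candidate list plus strict mode; the list and the fallback below are assumptions for illustration:
$candidates = array('UTF-8', 'ISO-8859-15', 'Windows-1252');
$sEncoding  = mb_detect_encoding($sOriginalFilename, $candidates, true);
if ($sEncoding === false) {
    // detection failed: fall back to a documented default instead of guessing
    $sEncoding = 'Windows-1252';
}
$sTargetFilename = iconv($sEncoding, 'Windows-1252//TRANSLIT', $sOriginalFilename);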
You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic latin alphabet), but when it goes beyond that (like for Umlauts), it's either one or the other. They have different code points in UTF-8 than they have in Windows-1252.
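For illustration, a small sketch showing how the same character ends up as different bytes in the two encodings (assuming your iconv build knows Windows-1252, which is commonly the case):
$utf8   = "\xC3\xA4";                              // "ä" encoded as UTF-8
$cp1252 = iconv('UTF-8', 'Windows-1252', $utf8);   // "ä" encoded as Windows-1252
echo bin2hex($utf8);     // c3a4
echo bin2hex($cp1252);   // e4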
Keep to ASCII in the filesystem - if you need to support characters outside ASCII in a filename, there are
schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
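In PHP this could look like the following sketch, using the built-in rawurlencode()/rawurldecode() (the file name is just an example):
$original = 'äöüÄÖÜ.txt';             // UTF-8 source name
$safe     = rawurlencode($original);  // pure ASCII, as shown above
$restored = rawurldecode($safe);      // back to the original UTF-8 name
var_dump($restored === $original);    // bool(true)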
Of course this will hit the file name length limit pretty fast and is far from optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt
I'm currently trying to remove all special characters and accents from a UTF-8 string by turning them into their equivalent ASCII character if possible.
So I'm simply using this code:
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
The problem is that for example the word "début" turns into "dbut" instead of "debut".
To make it work, I need to add a call to setlocale, like this:
setlocale(LC_ALL, 'en_US.UTF8');
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
And I don't understand why. I thought UTF-8 and ASCII were always the same, whatever locale you use.
EDIT: I didn't mean UTF-8 equals ASCII, I meant UTF-8 always equals UTF-8 and ASCII always equals ASCII
The subset of UTF-8 that overlaps with ASCII (code points 0-127) is indeed identical to ASCII. However, accented Latin characters are not part of the ASCII character set, and if you don't call setlocale() yourself, the system's default locale (which evidently does not contain these accented characters) is used to determine the character set to work with.
In general, iconv can be a little iffy; this is mentioned in the introduction of the extension:
This module contains an interface to iconv character set conversion
facility. With this module, you can turn a string represented by a
local character set into the one represented by another character set,
which may be the Unicode character set. Supported character sets
depend on the iconv implementation of your system. Note that the iconv
function on some systems may not work as you expect. In such case,
it'd be a good idea to install the GNU libiconv library. It will
most likely end up with more consistent results.
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?
None of the standard PHP string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than a single byte. For example, a typical character, say x, takes 1 byte in English but will take more than 1 byte in Japanese, Chinese or Korean. PHP's standard string functions treat a single character as 1 byte, so if you try to compare two Japanese, Chinese or Korean strings, they will not work as expected. For example, the length of "Hello World!" in Japanese, Chinese or Korean will be more than 12 bytes.
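For example (the Japanese greeting below is 5 characters, but 15 bytes in UTF-8):
$greeting = 'こんにちは';                  // "Hello" in Japanese
echo strlen($greeting);                // 15 - counts bytes
echo mb_strlen($greeting, 'UTF-8');    // 5  - counts characters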
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
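Here is a small sketch of the idea, assuming every string involved is valid UTF-8:
$text = 'Grüße aus Köln';
var_dump(strpos($text, 'Köln'));            // a byte offset, but found at the right place
echo str_replace('Köln', 'Berlin', $text);  // "Grüße aus Berlin" - no character is split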
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
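With a real encoding, UTF-8, the same point can be sketched like this; the /u modifier is what tells PCRE how the characters are encoded.
$string = 'äää';                               // three identical characters, six bytes: C3 A4 C3 A4 C3 A4
var_dump(preg_match('/(.)\1\1/',  $string));   // int(0): "." matched single bytes, and no byte repeats three times
var_dump(preg_match('/(.)\1\1/u', $string));   // int(1): "." matched whole characters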
Summary:
No sane 'default' is possible, as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence (as required for the Content-Length HTTP header) and the number of characters (as useful to denote the length of a blog article).
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is used to work with strings in languages other than English.
2) The default PHP string functions only work properly with English (and closely related) languages.
3) If you want to use strlen(), strpos(), strtoupper() or str_replace() with special characters:
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Different languages can have their own character sets,
so mbstring was introduced to work with various languages (Chinese, Japanese, etc.).
Raul González is a perfect example of why:
It is about shortening user names that are too long for a MySQL database column, say we have a 10 character limit and the name Raul González.
The unit test below is an example of how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    try {
        // substr() counts bytes, so it cuts the multibyte "á" in half and
        // produces invalid UTF-8 that MySQL rejects
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so the shortened value stays valid UTF-8
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
PHP Laravel and PhpUnit was used for illustration.
I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works ok. But I have concerns about multibyte characters. Should I specifically handle them to prevent undetermined errors, or should this regex reject mb filenames ok?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string" as such -- rather, one chooses to treat a native string as a multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it so long as you use UTF-8 (or similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE based regular expression functions like preg_match support UTF-8 by using the u-modifier (PCRE8).
But of course, as described above PHP strings don't know their own encoding, so you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that that function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
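A hedged sketch of both approaches (the file name and patterns are only illustrations):
$filename = 'straße.png';                                // assumed UTF-8 input
// PCRE: /u treats the subject as UTF-8 (and fails on invalid UTF-8),
// but \w stays ASCII-only unless the pattern also starts with (*UCP)
var_dump(preg_match('!^[\w.-]*$!u', $filename));         // int(0): "ß" is not an ASCII word character
var_dump(preg_match('!(*UCP)^[\w.-]*$!u', $filename));   // int(1): \w now also covers Unicode letters
// mbstring: tell the mb_ereg* family which encoding the subject uses
mb_regex_encoding('UTF-8');
var_dump(mb_ereg('^[A-Za-z0-9_.-]*$', $filename));       // false: "ß" is not in the explicit class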
PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
It's correct because UTF-8 multibyte characters are exclusively non-ASCII (128+ byte value) characters beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (a bytes in ASCII range), since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be safe that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
Well, I do have a counter-example: I have a UTF-8 encoded settings '.ini' file specifying application settings like the email sender name. It says something like:
email_from = Märta
and I read it from there into the variable $sender. Now I do the replacement in the message body (UTF-8 again), which contains:
regards
{sender}
$message = str_replace("{sender}",$sender_name,$message);
The email is absolutely correct in every respect, but the sender name is totally broken. There are other cases (like explode()) where something goes wrong with a UTF-8 string: it is healthy before the conversion but not after it. Sorry to say, there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that very function, and str_replace() may well be innocent.
No, you cannot.
From practice I can tell you that if you have some multibyte symbols like ◊ etc. mixed with non-multibyte ones, it won't work correctly, because some symbols take 2-4 bytes each.
str_replace works with fixed bytes and replaces them... As a result we get something that isn't any symbol - just trash.
Yes, I think this is correct, at least I couldn't find any counter-example.