I've read Wikipedia's article on Windows-1252 character encoding. For characters whose byte value is < 128, it should be the same as ASCII/UTF-8.
This makes sense:
php -r "var_export(mb_detect_encoding(\"\x92\", 'windows-1252', true));"
'Windows-1252'
A right curly apostrophe (byte 0x92, U+2019 in Windows-1252) is detected properly.
php -r "var_export(mb_detect_encoding(\"a\", 'windows-1252', true));"
false
Huh? The letter "a" isn't Windows-1252?
My terminal, where I'm running this, is set to UTF-8. So that should be the same byte sequence as ASCII for the letter 'a'. To minimize the variables, I specified the Windows-1252 byte for "a" explicitly:
php -r "var_export(mb_detect_encoding(\"\x61\", 'windows-1252', true));"
false
Changing the "strict" parameter (which has pretty useless documentation) does nothing in these cases.
Encoding detection is not supported for windows-1252. According to the mb_detect_order documentation:
mbstring currently implements the following encoding detection
filters. If there is an invalid byte sequence for the following
encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII,
EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*,
mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and
UCS4, encoding detection will fail always.
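Since detection won't help here, one workaround is to test for valid UTF-8 and otherwise assume Windows-1252. This is only a sketch: the helper name to_utf8_assuming_cp1252 is made up for illustration, and the fallback is an assumption about your data, not something PHP can verify for you.

```php
<?php
// Hypothetical helper (not part of mbstring): return the input as UTF-8,
// ASSUMING that anything which is not valid UTF-8 is Windows-1252.
function to_utf8_assuming_cp1252(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (this includes plain ASCII)
    }
    // Not valid UTF-8: reinterpret the bytes as Windows-1252.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8_assuming_cp1252("\x92")), "\n"; // e28099, i.e. U+2019
echo to_utf8_assuming_cp1252("a"), "\n";             // a
```

The check works as a first step because bytes in the 0x80-0xFF range rarely form valid UTF-8 sequences by accident, but it is still a guess.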
I am struggling to understand character encoding in PHP.
Consider the following script (you can run it here):
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its Unicode code points. Up to PHP 5.5 the default internal encoding is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I change the PHP internal encoding also to UTF-8, I'd expect utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?
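The script you show doesn't contain any non-ASCII characters in its source, so its internal encoding does not make any difference. mb_internal_encoding only sets the encoding that the mbstring functions assume by default and that mbstring's optional HTTP input/output conversion uses; it does not change the bytes already stored in your strings. This question will tell you more about how it works; it will also tell you it's better not to use it.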
The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "Unicode code point" (which is U+20AC, a number that fits in two bytes, like every character in the Basic Multilingual Plane).
Does this clear up the behavior you see?
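To make the difference between bytes and code point concrete, here is a small sketch. It assumes the mbstring extension is loaded and PHP 7.2 or newer (for mb_ord); the variable names are only illustrative.

```php
<?php
$euro = "\xE2\x82\xAC"; // the three UTF-8 bytes from the question

$byteCount = strlen($euro);             // strlen counts bytes
$charCount = mb_strlen($euro, 'UTF-8'); // mb_strlen counts characters
$codePoint = mb_ord($euro, 'UTF-8');    // the code point as an integer
$utf16     = mb_convert_encoding($euro, 'UTF-16BE', 'UTF-8');

printf(
    "%d bytes, %d character, U+%04X, UTF-16BE bytes: %s\n",
    $byteCount,
    $charCount,
    $codePoint,
    bin2hex($utf16)
);
// prints: 3 bytes, 1 character, U+20AC, UTF-16BE bytes: 20ac
```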
You started with a string that is the utf-8 representation of the Euro symbol. If you run echo($string) all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
Then you make the wrong move: you call mb_convert_encoding() with only two arguments. This makes PHP use the current internal encoding of the mbstring extension as the third argument ($from_encoding). Why is that wrong?
For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is utf-8 and the call to mb_convert_encoding() is a no-op.
But for previous versions of PHP, the default value returned by mb_internal_encoding() is iso-8859-1 and it doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes them using the rules of utf-8. The outcome is obviously wrong.
Btw, if you initialize $string with '€' you get the same output on all PHP versions (even on PHP 4, iirc).
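A quick way to see both behaviours on any PHP version is to pass the source encoding explicitly instead of relying on the default. A minimal sketch (variable names are mine):

```php
<?php
$euro = "\xE2\x82\xAC"; // UTF-8 bytes of the Euro sign

// Source encoding stated correctly: the conversion changes nothing.
$unchanged = mb_convert_encoding($euro, 'UTF-8', 'UTF-8');

// Source encoding stated wrongly as ISO-8859-1: every byte is treated as
// a separate Latin-1 character and re-encoded as two UTF-8 bytes.
$mangled = mb_convert_encoding($euro, 'UTF-8', 'ISO-8859-1');

echo bin2hex($unchanged), "\n"; // e282ac
echo bin2hex($mangled), "\n";   // c3a2c282c2ac
```

The second result is what older PHP versions produce for the two-argument call in the question, because ISO-8859-1 was then the default internal encoding.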
I have a weird problem with the following code:
$str = "נסיון"; // <--- Hebrew chars
echo mb_detect_encoding ($str)."<br><br><br>";
$str = iconv (mb_detect_encoding($str),'UCS-2BE',$str);
echo mb_detect_encoding ($str)."<br><br><br>";
This outputs:
UTF-8
UTF-8
This code is written in a file that's encoded (using Notepad++) in UTF-8 without BOM. I tried other encodings, and that didn't work.
I also tried converting the string using :
$str = mb_convert_encoding($str,'UCS-2BE');
But that didn't work either. Any insights?
From the documentation for mb_detect_order, the function that establishes the order in which mb_detect_encoding tests different encodings:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*, mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
So, you can't detect the encoding of the second string with the mb functions.
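If the goal is just to confirm that the conversion worked, you can look at the bytes instead of asking mb_detect_encoding. A sketch, using mb_convert_encoding with an explicit source encoding rather than the iconv call from the question, and writing the Hebrew word as byte escapes so the example doesn't depend on how the file is saved:

```php
<?php
$str = "\xD7\xA0\xD7\xA1\xD7\x99\xD7\x95\xD7\x9F"; // "נסיון" as UTF-8 bytes

$ucs2 = mb_convert_encoding($str, 'UCS-2BE', 'UTF-8');

echo bin2hex($str), "\n";  // d7a0d7a1d799d795d79f (2 bytes per letter)
echo bin2hex($ucs2), "\n"; // 05e005e105d905d505df (code points U+05E0, U+05E1, ...)

// Converting back should give the original string again.
$roundTrip = mb_convert_encoding($ucs2, 'UTF-8', 'UCS-2BE');
var_dump($roundTrip === $str); // bool(true)
```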
ini_set('mbstring.internal_encoding','UTF-8')
What does this signify at the beginning of a PHP file, and what is it used for?
I know that a PHP manual exists, but it does not explain this in plain language.
It sets the default internal character encoding to UTF-8.
This is used to make a site multilingual by changing the mbstring.internal_encoding value.
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
And UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode.
You can set it to other encodings as well, such as ISO-8859-1 or UTF-8.
Here is a list of Unicode characters.
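In practice, the visible effect is on the mbstring functions when you don't pass an encoding argument. A small sketch using mb_internal_encoding(), the runtime equivalent of the ini setting (as far as I know, the mbstring.internal_encoding ini entry itself has been deprecated since PHP 5.6 in favour of default_charset):

```php
<?php
$word = "d\xC3\xA9but"; // "début" in UTF-8: 6 bytes, 5 characters

mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($word); // every byte counts as one character

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($word);   // the two bytes of "é" count as one character

var_dump($asLatin1, $asUtf8); // int(6), int(5)
```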
I need to handle strings in my PHP script using regular expressions. But there is a problem: different strings have different encodings. If a string contains only ASCII symbols, mb_detect_encoding returns 'ASCII'. But if a string contains Russian symbols, for example, mb_detect_encoding returns 'UTF-8'. It's not a good idea to check the encoding of each string manually, I suppose.
So the question is - is it correct to use preg_replace (with unicode modifier) for ascii strings? Is it right to write such code preg_replace ("/[^_a-z]/u","",$string); for both ascii and utf-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII (it's a superset of ASCII: the first 128 characters are the same). Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte sequences! I don't think this matters much for the preg_* functions, so it may not apply to your question, but please keep it in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without the magic of mb_detect_encoding (mb_detect_encoding is not a guarantee, just a good guess). For example, strings fetched through HTTP usually have a character set specified in the HTTP headers.
Yes, sure. You can always use the Unicode modifier on ASCII strings; it will affect neither the results nor the performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8, the characters with the leftmost bit set (values 0x80-0xFF) are not encoded the same way in UTF-8, and it would not be appropriate to use the PREG "u" modifier.
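Here is a short sketch of all three cases with the pattern from the question. The input strings are my own examples, written with byte escapes so the result doesn't depend on the file's encoding:

```php
<?php
$pattern = '/[^_a-z]/u';

// Pure ASCII input: the "u" modifier changes nothing.
$ascii = preg_replace($pattern, '', 'abc_123');

// Valid UTF-8 input ("début"): the two-byte "é" is removed as one character.
$utf8 = preg_replace($pattern, '', "d\xC3\xA9but");

// ISO-8859-1 input ("début" with "é" as the single byte 0xE9): this is not
// valid UTF-8, so preg_replace() gives up and returns null.
$latin1 = preg_replace($pattern, '', "d\xE9but");
$badUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;

var_dump($ascii, $utf8, $latin1, $badUtf8);
// string(4) "abc_", string(4) "dbut", NULL, bool(true)
```

So it's worth checking for a null return value whenever the input might not be UTF-8.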
I'm currently trying to remove all special characters and accents from a UTF-8 string by turning them into their equivalent ASCII characters where possible.
So I'm simply using this code:
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
The problem is that for example the word "début" turns into "dbut" instead of "debut".
To make it work, I need to add a call to setlocale, like this:
setlocale(LC_ALL, 'en_US.UTF8');
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
And I don't understand why. I thought UTF-8 and ASCII were always the same, whatever locale you use.
EDIT: I didn't mean UTF-8 equals ASCII, I meant UTF-8 always equals UTF-8 and ASCII always equals ASCII
The subset of UTF-8 that overlaps with ASCII (code points 0-127) is indeed identical to ASCII. However, accented Latin characters are not part of the ASCII character set, and if you don't call setlocale yourself, the system's default locale (which evidently does not contain these accented characters) is used to get a character set to work with.
In general, iconv can be a little iffy; this is mentioned in the introduction of the extension:
This module contains an interface to iconv character set conversion
facility. With this module, you can turn a string represented by a
local character set into the one represented by another character set,
which may be the Unicode character set. Supported character sets
depend on the iconv implementation of your system. Note that the iconv
function on some systems may not work as you expect. In such case,
it'd be a good idea to install the GNU libiconv library. It will
most likely end up with more consistent results.
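If you don't want the locale change to leak into the rest of the script, you can wrap the call and restore the previous setting afterwards. This is a sketch under assumptions: the helper name ascii_fold is made up, the en_US.UTF-8 locale has to be installed on the system, and I'm assuming LC_CTYPE is the category that matters for transliteration (the question used LC_ALL). The output depends on your iconv implementation, so I'm not showing an expected result.

```php
<?php
// Hypothetical helper: transliterate UTF-8 to ASCII under a given LC_CTYPE
// locale, then restore whatever locale was active before.
function ascii_fold(string $input, string $locale = 'en_US.UTF-8')
{
    $previous = setlocale(LC_CTYPE, '0'); // "0" only reads the current setting
    setlocale(LC_CTYPE, $locale);         // returns false if the locale is missing

    // Same conversion as in the question; returns false on failure.
    $result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    setlocale(LC_CTYPE, $previous);
    return $result;
}

var_dump(ascii_fold("d\xC3\xA9but")); // "début" as UTF-8 bytes
```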