Detecting the right character encoding in PHP?

Detecting the right character encoding in PHP? - php

I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to #raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.

The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.

Although strings encoded with ISO-8859-1 and CP-1252 have different byte code representation:
<?php
$str = "€ ‚ ƒ „ …" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
$new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
printf('%15s: %s detected: %10s explicitly: %10s',
$encoding,
implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
mb_detect_encoding($new),
mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
);
echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
...from what we can see here it looks like there is problem with second paramater of mb_detect_encoding. Using mb_detect_order instead of parameter yields very similar results.

Related

PHP: Convert Extended Ascii file to UTF-8

i don't have any chance to get a valid utf-8 as output...
$fx = file_get_contents("Extended Ascii file.txt"); // example only has chr(129), but could be mixed Extended Ascii + UTF8
// not working:
//$fx = html_entity_decode($fx, ENT_QUOTES, "UTF-8");
//$fx = mb_convert_encoding($fx, 'UTF-8', 'ASCII');
//$fx = utf8_encode($fx);
//$fx = iconv('ASCII', 'UTF-8//IGNORE', $fx);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(129)"=>"�"
$fx = strtr($fx, [chr(128)=>'Ç',chr(129)=>'ü',chr(130)=>'é',chr(131)=>'â',chr(132)=>'ä',chr(133)=>'à',chr(134)=>'å',chr(135)=>'ç',chr(136)=>'ê',chr(137)=>'ë',chr(138)=>'è',chr(139)=>'ï',chr(140)=>'î',chr(141)=>'ì',chr(142)=>'Ä',chr(143)=>'Å',chr(144)=>'É',chr(145)=>'æ',chr(146)=>'Æ',chr(147)=>'ô',chr(148)=>'ö',chr(149)=>'ò',chr(150)=>'û',chr(151)=>'ù',chr(152)=>'ÿ',chr(153)=>'Ö',chr(154)=>'Ü',chr(155)=>'ø',chr(156)=>'£',chr(157)=>'Ø',chr(158)=>'×',chr(159)=>'ƒ',chr(160)=>'á',chr(161)=>'í',chr(162)=>'ó',chr(163)=>'ú',chr(164)=>'ñ',chr(165)=>'Ñ',chr(166)=>'ª',chr(167)=>'º',chr(168)=>'¿',chr(169)=>'®',chr(170)=>'¬',chr(171)=>'½',chr(172)=>'¼',chr(173)=>'¡',chr(174)=>'«',chr(175)=>'»',chr(176)=>'░',chr(177)=>'▒',chr(178)=>'▓',chr(179)=>'│',chr(180)=>'┤',chr(181)=>'Á',chr(182)=>'Â',chr(183)=>'À',chr(184)=>'©',chr(185)=>'╣',chr(186)=>'║',chr(187)=>'╗',chr(188)=>'╝',chr(189)=>'¢',chr(190)=>'¥',chr(191)=>'┐',chr(192)=>'└',chr(193)=>'┴',chr(194)=>'┬',chr(195)=>'├',chr(196)=>'─',chr(197)=>'┼',chr(198)=>'ã',chr(199)=>'Ã',chr(200)=>'╚',chr(201)=>'╔',chr(202)=>'╩',chr(203)=>'╦',chr(204)=>'╠',chr(205)=>'═',chr(206)=>'╬',chr(207)=>'¤',chr(208)=>'ð',chr(209)=>'Ð',chr(210)=>'Ê',chr(211)=>'Ë',chr(212)=>'È',chr(213)=>'ı',chr(214)=>'Í',chr(215)=>'Î',chr(216)=>'Ï',chr(217)=>'┘',chr(218)=>'┌',chr(219)=>'█',chr(220)=>'▄',chr(221)=>'¦',chr(222)=>'Ì',chr(223)=>'▀',chr(224)=>'Ó',chr(225)=>'ß',chr(226)=>'Ô',chr(227)=>'Ò',chr(228)=>'õ',chr(229)=>'Õ',chr(230)=>'µ',chr(231)=>'þ',chr(232)=>'Þ',chr(233)=>'Ú',chr(234)=>'Û',chr(235)=>'Ù',chr(236)=>'ý',chr(237)=>'Ý',chr(238)=>'¯',chr(239)=>'´',chr(240)=>'≡',chr(241)=>'±',chr(242)=>'‗',chr(243)=>'¾',chr(244)=>'¶',chr(245)=>'§',chr(246)=>'÷',chr(247)=>'¸',chr(248)=>'°',chr(249)=>'¨',chr(250)=>'·',chr(251)=>'¹',chr(252)=>'³',chr(253)=>'²',chr(254)=>'■',chr(255)=>'nbsp']);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(195)"=>"�"
How to convert or remove � ?
28.05.2020 Update: Solution found, thanks to Andrea Pollini!
Some notes:
iconv('UTF-8', 'UTF-8//IGNORE', $fx); // IGNORE is broken in PHP since - https://www.php.net/manual/en/function.iconv.php#108643 - use mb_convert_encoding
Here was my real problem (i figured it out later after many tests):
$P["T"] .= $text; // here was the problem, array is converting strings... (don't know why?)
changed to:
ini_set('mbstring.substitute_character', "none"); // mb_convert_encoding set remove unknown
$P["T"] .= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Now it's working. But if somebody knows why arrays are converting strings and how to disable that, would be great. :)

first configure in order to discard extended characters
<?php
ini_set('mbstring.substitute_character', "none");
?>
next you can use mb_convert_encoding
mb_convert_encoding($fx, "UTF-8", mb_detect_encoding($fx, "UTF-8, ISO-8859-1, ISO-8859-15", true));
you can add the encoding you need in mb_detect_encoding

Converting Window-1252 to UTF-8 Issue

I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format, due to being copied to a database table from a Word Document.
Testing weird characterâ€™s correction
This seems to fix the dodgy â€™ character. However i'm not getting � in the following:
Devon�s most prominent dealerships
When passing the following through the same function:
Devon's most prominent dealerships
Below is the code which does the converting:
function Windows1252ToUTF8($text) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit:
The database can't be changed due to holding thousands of custom records. I tried the below but the mb_detect_encoding thinks characterâ€™s correction is UTF-8.
function Windows1252ToUTF8($text) {
if (mb_detect_encoding($text) == "UTF-8") {
return $text;
}
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit 2:
Just tried the example from the PHP Documentation:
$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();
but this simply outputs:
string(5) "UTF-8"
string(5) "UTF-8"
So I can't even detect the encoding of the string :S
Edit 3:
This seems to do the trick:
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit 4:
I have matched the hex values to their corresponding values. However when I get to the weird characters they don't appear to match.

Converting Testing weird characterâ€™s correction using bin2hex
gives me
54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e
This means the "â€™" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows Latin-1/1252, and then re-encoded to UTF-8.
’ (UTF-8 \xe2\x80\x99)
→ bytes interpreted as Latin-1 equal the string â€™
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
To restore the original, you need to reverse that chain of mis-encodings:
$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
This interprets the string as UTF-8, converts it to the Windows-1252 equivalent, which is then the valid UTF-8 representation of ’.
Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.

The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit:
function Windows1252ToUTF8($text) {
// http://www.fileformat.info/info/charset/UTF-8/list.htm
$illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
$match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}

PHP Default String Encoding

Could someone explain why the output is ASCII in the last three tests below?
I get the same results on my own system, PHPTester.net, and PhpFiddle.org.
echo mb_internal_encoding(); // UTF-8
$str = 'foobar';
echo mb_check_encoding($str, 'UTF-8'); // true
echo mb_detect_encoding($str); // ASCII
$encoded = utf8_encode($str);
echo mb_detect_encoding($encoded); // ASCII
$converted = mb_convert_encoding($str, 'UTF-8');
echo mb_detect_encoding($converted); // ASCII

That would be because there are no characters in foobar that cannot be represented in ASCII.
mb_check_encoding($str, 'UTF-8') works because ASCII text is innately compatible with UTF-8 (deliberately so)
But in the absence of multi-byte characters, there's no discernible difference between the two. Proof of this: 'foobar' === utf8_encode('foobar') // true

How to convert a Unicode text-block to UTF-8 (HEX) code point?

I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now, I want to convert this orginal Unicode text-block into a text-block of UTF-8 (HEX) code point (see the Hexadecimal UTF-8 column, on this page: https://en.wikipedia.org/wiki/UTF-8), by PHP; like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do it, by PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.

I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
See test at eval.in (link expires)
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.

PHP treats strings as arrays of characters, regardless of encoding. If you don't need to delimit the UTF8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF8 characters using preg_split and the u flag. Since preg_split returns the empty string before the first character and the empty string after the last character, we need to array_slice the first and last characters. This can be easily modified to return an array, for example.
Edit:
A more "correct" way to do this is this:
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');

The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code frag takes your example character in Unicode, converts them to UTF-8, and then dumps the hex representation of those characters.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90

replace multibyte utf8 character in php

I am trying to preg_replace the multibytecharacter for euro in UTF (shown as â¬ in my html) to a "$" and the * for an "#"
$orig = "2 **** reviews â¬ 19,99 price";
$orig = mb_ereg_replace(mb_convert_encoding('€', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
$orig = preg_replace("/[\$\;\?\!\{\}\(\)\[\]\/\*\>\<]/", "#", $orig);
$a = htmlentities($orig);
$b = html_entity_decode($a);
The "*" are being replaced but not the "â¬" .......
Also tried to replace it with
$orig = preg_replace("/[\xe2\x82\xac]/", "$", $orig);
Doesn't convert either....
Another plan which didnt work:
$orig= mb_ereg_replace(mb_convert_encoding('€', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
Brrr someone knows how to get rid of this utf8 euro character:
echo html_entity_decode('€');
(driving me nuts)

This could be caused by two reasons:
The actual source text is UTF8 encoded, but your PHP code not.
You can solve this by just using this line and save your file UTF8 encoded (try using notepad++).
str_replace('€', '$', $source);
The source text is corrupted: multibyte characters are converted to latin1 (wrong database charset?). You can try to convert them back to latin1:
str_replace('€', '$', utf8_decode($source))

Pasting my comment here as an answer so you can mark it!
Wouldn't
str_replace(html_entity_decode('€'), '$', $source)
work?

In your $orig string you do not have euro sign.
When I run this php file:
<?php
$orig = "â¬";
for($i=0; $i<strlen($orig); $i++)
echo '0x' . dechex(ord($orig{$i})) . ' ';
?>
If saved as utf-8 I get: 0xc3 0xa2 0xc2 0xac
If saved as latin-1 I get: 0xe2 0xac
In any case it is not € sign which is:0xE2 0x82 0xAC or unicode \u20AC ( http://www.fileformat.info/info/unicode/char/20ac/index.htm ).
0x82 is missing!!!!!
Run this program above, see what do you get and use this hex values to get rid of â¬.
For real € sign this works:
<?php
$orig = html_entity_decode('€', ENT_COMPAT, 'UTF-8');
$dest = preg_replace('~\x{20ac}~u', '$', $orig);
echo "($orig) ($dest)";
?>
BTW if UTF-8 file containing € is displayed as latin-1 you should get:
â‚¬ and not â¬ as in your example.
So in fact, you have problems with encoding and conversion between encodings. If you try to save â‚¬ in latin1 middle character will be lost (for example my Komodo will alert me and then replace ‚ with ?). In other words, you somehow damaged your € sign - and then you tried to replace it as it was complete. :D

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Detecting the right character encoding in PHP? - php

Related

PHP: Convert Extended Ascii file to UTF-8

Converting Window-1252 to UTF-8 Issue

PHP Default String Encoding

How to convert a Unicode text-block to UTF-8 (HEX) code point?

replace multibyte utf8 character in php

Categories

Resources