I am new to encoding, so please be patient.
I am working on a system where a user uploads a CSV file; I need to display its content and then save it in the database (UTF-8 encoding).
I have been asked to fix an issue with some French characters that weren't displayed correctly. I have almost solved the problem; I am now correctly displaying characters such as
ÀàÂâÆÄäÇçÉéÈèÊêËëÎîÏïÔôœÖöÙùÛûÜüÿ
However, the two mentioned in the title, Ÿ and Œ, are still not displayed correctly on the web page.
Here is my PHP code so far:
// say in the csv we have "ÖüÜߟÀàÂ"
$content = file_get_contents(addslashes($file_name));
var_dump($content); // output: string(54) "���ߟ��� "
if (!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)) {
    $data = iconv('macintosh', 'UTF-8', $content);
}
// deal with known encoding types
else if (mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'ISO-8859-1') {
    //$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)); // does not work
    $data = iconv('ISO-8859-1', 'UTF-8', $content); // does not work
} else if (mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'UTF-8') {
    $data = $content;
}
// if I print $data, "Ÿ Œ" are not printed out... they got lost somewhere
// do more stuff here
The file I am dealing with has an encoding type of ISO-8859-1 (when I print out mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true), it displays ISO-8859-1).
Does anyone have an idea how to deal with these special cases?
The characters Ÿ and Œ are not representable in ISO-8859-1. It seems that the incoming data is actually windows-1252 (Windows Latin 1) encoded, since windows-1252 has graphic characters, including Ÿ and Œ, in some code positions that are reserved for control characters in ISO-8859-1.
So you should probably add windows-1252 to the list of recognized encodings and treat recognized ISO-8859-1 as windows-1252, i.e. use iconv('windows-1252', 'UTF-8', $content) even when ISO-8859-1 has been recognized. Windows-1252 data mislabeled as ISO-8859-1 is very common.
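A minimal sketch of that advice, assuming the upload is either valid UTF-8 or a Western single-byte encoding (the 'macintosh' branch from the question is left out). The helper name toUtf8 is hypothetical and not part of the question's code:

```php
<?php
// Hypothetical helper: returns $content as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252,
// which also covers data labelled ISO-8859-1.
function toUtf8(string $content): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, leave it untouched
    }
    // Decode as Windows-1252 so that bytes in 0x80-0x9F become
    // printable characters such as 0x9F (Ÿ) and 0x8C (Œ).
    $converted = iconv('windows-1252', 'UTF-8', $content);
    // iconv() returns false if it hits a byte it cannot convert;
    // in that case this sketch simply hands back the input.
    return $converted === false ? $content : $converted;
}

echo toUtf8("\x9f \x8c"), PHP_EOL; // prints: Ÿ Œ
```

Checking for valid UTF-8 first avoids relying on mb_detect_encoding to tell ISO-8859-1 and Windows-1252 apart.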
I'm trying to convert a Cyrillic CP1251 string to UTF-8.
Given String: Íó è ÿ ñäåëàëà âûâîäû...
Expected String: Ну и я сделала выводы...
What I've tried so far:
echo iconv('CP1251', 'UTF-8', 'Íó è ÿ ñäåëàëà âûâîäû...');
or
echo mb_convert_encoding('Íó è ÿ ñäåëàëà âûâîäû...', 'UTF-8', 'CP1251');
The result I got:
Íó è ÿ ñäåëà ëà âûâîäû...
Any ideas how I could make it work?
What you have is a UTF-8 string made up of cp1252 characters, which are a misreading of the original cp1251 bytes.
The true answer is to fix what produced this mistake so that your data doesn't get corrupted like this.
The worse answer is to repeat the mis-translation in reverse to recover the original string, and then convert it properly.
$input = 'Íó è ÿ ñäåëàëà âûâîäû...';
// convert back to source string via CP1252 single-byte encoding
$out = mb_convert_encoding($input, 'CP1252', 'UTF-8');
// correctly convert source string to UTF8 using CP1251
$out = mb_convert_encoding($out, 'UTF-8', 'CP1251');
var_dump($out);
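To see why the two-step reversal works, here is a small sketch that reproduces the corruption and then undoes it. It assumes the mistake was exactly "Windows-1251 bytes read as Windows-1252", and it uses only the first word of the example so that every byte has a mapping in both code pages:

```php
<?php
$original = "\u{041D}\u{0443}"; // "Ну" as UTF-8

// Step 1: the source system held the text as Windows-1251 bytes.
$cp1251 = mb_convert_encoding($original, 'Windows-1251', 'UTF-8');

// Step 2 (the mistake): those bytes were read as Windows-1252 and saved as UTF-8.
$corrupted = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1252');

// Repair: undo step 2 to get the raw bytes back, then decode them as Windows-1251.
$bytes    = mb_convert_encoding($corrupted, 'Windows-1252', 'UTF-8');
$repaired = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');

var_dump($corrupted === "\u{00CD}\u{00F3}"); // the corrupted form is "Íó"
var_dump($repaired === $original);
```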
I'm currently trying to figure out how to convert an ASCII encoded string to ISO-8859-1 encoding to be used for utf8_encode() to display special characters like "ñ" but I can't seem to make it work. In need of help.
I've already tried this iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text); and this mb_convert_encoding($text, "ISO-8859-1"); and also this mb_convert_encoding($text, "ASCII", "ISO-8859-1"); but it doesn't work, the string is still ASCII encoded.
I've created a temporary solution for this by creating a lookup table using the string provided by reading each character of the string. But I want to use the php built-in functions, is this possible?
Here is my code:
<?php
function convertString($text) {
    $text = iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text);
    echo mb_detect_encoding($text) .'<br/>'; // to check what encoding the string is in, displays ASCII
    return utf8_encode($text);
}
echo convertString('\xc3\xb1');
?>
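One possible cause, offered as a guess from the code above rather than a confirmed diagnosis: in PHP, single-quoted strings do not interpret \x escapes, so '\xc3\xb1' is eight plain ASCII characters, not the two bytes that encode ñ in UTF-8. A small check:

```php
<?php
$single = '\xc3\xb1'; // literal backslash, x, c, 3, ... (eight ASCII characters)
$double = "\xc3\xb1"; // two bytes: the UTF-8 encoding of ñ

var_dump(strlen($single)); // int(8)
var_dump(strlen($double)); // int(2)

// The double-quoted version is already valid UTF-8, so it needs no
// utf8_encode() step; applying one would double-encode it.
var_dump(mb_check_encoding($double, 'UTF-8')); // bool(true)
```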
I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format, due to being copied to a database table from a Word Document.
Testing weird characterâ€™s correction
This seems to fix the dodgy â€™ character. However, I'm now getting � in the following:
Devon�s most prominent dealerships
When passing the following through the same function:
Devon's most prominent dealerships
Below is the code which does the converting:
function Windows1252ToUTF8($text) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit:
The database can't be changed because it holds thousands of custom records. I tried the code below, but mb_detect_encoding thinks "characterâ€™s correction" is UTF-8.
function Windows1252ToUTF8($text) {
if (mb_detect_encoding($text) == "UTF-8") {
return $text;
}
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit 2:
Just tried the example from the PHP Documentation:
$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();
but this simply outputs:
string(5) "UTF-8"
string(5) "UTF-8"
So I can't even detect the encoding of the string :S
Edit 3:
This seems to do the trick:
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit 4:
I have matched the hex values to their corresponding characters. However, when I get to the weird characters, they don't appear to match.
Converting "Testing weird characterâ€™s correction" using bin2hex
gives me
54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e
This means the "â€™" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows-1252 (Windows Latin 1) and then re-encoded to UTF-8.
’ (UTF-8 \xe2\x80\x99)
→ bytes interpreted as Windows-1252 equal the string â€™
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
To restore the original, you need to reverse that chain of mis-encodings:
$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
This interprets the string as UTF-8 and converts it to its Windows-1252 equivalent; the resulting bytes are the valid UTF-8 representation of ’.
Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.
The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit:
function Windows1252ToUTF8($text) {
// http://www.fileformat.info/info/charset/UTF-8/list.htm
$illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
$match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
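An alternative to matching a fixed list of characters is a round-trip check: attempt the reverse conversion and keep it only if the result is valid UTF-8 and converts back to exactly the input. This is a heuristic sketch, not a guaranteed detector, and the helper name repairDoubleEncoding is hypothetical:

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252, saved as UTF-8".
function repairDoubleEncoding(string $text): string
{
    // Reverse the mis-conversion: map each character back to its Windows-1252 byte.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Accept the result only if it changed something, is valid UTF-8,
    // and converts back to exactly the input (nothing was lost on the way).
    if ($candidate !== $text
        && mb_check_encoding($candidate, 'UTF-8')
        && mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text) {
        return $candidate;
    }
    return $text;
}

// The six bytes from the hex dump collapse back to the three bytes of ’.
echo bin2hex(repairDoubleEncoding("\xc3\xa2\xe2\x82\xac\xe2\x84\xa2")), PHP_EOL; // e28099
```

A correctly encoded ’ (\xe2\x80\x99) is left alone, because its Windows-1252 form is the single byte \x92, which is not valid UTF-8 on its own. Text that legitimately contains a sequence such as "Ã©" would still be altered, so this remains a stopgap until the source of the corruption is fixed.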
I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to @raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.
The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.
Strings encoded with ISO-8859-1 and CP-1252 do have different byte representations, though:
<?php
$str = "€ ‚ ƒ „ …";
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
...from what we can see here, it looks like there is a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.
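Since detection cannot reliably separate the two, one workaround is to test directly for bytes in the 0x80-0x9F range, which are printable only in Windows-1252. This sketch assumes the string is already known to be a single-byte Western encoding and not UTF-8; the helper name usesCp1252Range is hypothetical:

```php
<?php
// Hypothetical helper. Assumes $bytes is ISO-8859-1 or Windows-1252, NOT UTF-8
// (in UTF-8 the bytes 0x80-0x9F occur as continuation bytes).
function usesCp1252Range(string $bytes): bool
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(usesCp1252Range("Hello \x80\x82\x83")); // bool(true): € ‚ ƒ in Windows-1252
var_dump(usesCp1252Range("caf\xe9"));            // bool(false): é is the same byte in both
```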
I'm trying to convert a string from ISO-8859-1 to UTF-8.
But when it encounters these two characters, € and •, the function returns
a character that looks like a square with two numbers inside.
How can I solve this issue?
I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.
The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).
iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
Always check your encoding first! You should never blindly trust your encoding (even if it is from your own website!):
function convert_cp1252_to_utf8($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }
    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}
iso-8859-1 doesn't contain the € sign, so a string that contains it cannot be iso-8859-1. Use iso-8859-15 instead.
Those 2 characters are illegal in iso-8859-1 (did you mean iso-8859-15?)
$ php -r 'echo iconv("utf-8","iso-8859-1//TRANSLIT","ter € and • the");'
ter EUR and o the
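To compare the encodings suggested in these answers, the following sketch decodes single bytes under each one and prints the resulting UTF-8 as hex. The byte values are the standard positions of € and • in Windows-1252 and of € in ISO-8859-15; ISO-8859-15 has no bullet character, so only Windows-1252 covers both:

```php
<?php
// Decode one byte under each candidate encoding and show the UTF-8 result as hex.
$cases = [
    ['Windows-1252', "\x80"], // euro sign in cp1252
    ['Windows-1252', "\x95"], // bullet in cp1252
    ['ISO-8859-15',  "\xa4"], // euro sign in ISO-8859-15
];
foreach ($cases as [$encoding, $byte]) {
    $utf8 = mb_convert_encoding($byte, 'UTF-8', $encoding);
    printf("%-12s 0x%s -> %s\n", $encoding, bin2hex($byte), bin2hex($utf8));
}
```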