PHP function iconv character encoding from iso-8859-1 to utf-8 - php

I'm trying to convert a string from iso-8859-1 to utf-8.
But when I find these two charachter € and • the function returns
a charachter that is a square with two number inside.
How can I solve this issue?

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.
The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).
iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"

Always check your encoding first! You should never blindly trust your encoding (even if it is from your own website!):
function convert_cp1252_to_utf8($input, $default = '') {
if ($input === null || $input == '') {
return $default;
}
// https://en.wikipedia.org/wiki/UTF-8
// https://en.wikipedia.org/wiki/ISO/IEC_8859-1
// https://en.wikipedia.org/wiki/Windows-1252
// http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
$encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
/*
* Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
* and control characters, always convert from Windows-1252 to UTF-8.
*/
$input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
}
return $input;
}

iso-8859-1 doesn't contain the € sign so your string cannot be interpreted with iso-8859-1 if it contains it. Use iso-8859-15 instead.

Those 2 characters are illegal in iso-8859-1 (did you mean iso-8859-15?)
$ php -r 'echo iconv("utf-8","iso-8859-1//TRANSLIT","ter € and • the");'
ter EUR and o the

Related

Encoding conversions from cp1255 to UTF-8

I have the following encoded Hebrew strings in an old DB:
éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä
The ASP code that is being used to decode this string is the following:
function Get_RightHebrew(ByVal sText)
Dim i
Dim sRightText
if isNull(sText) then
sRightText = ""
else
For i = 1 To Len(sText)
If (AscW(Mid(sText, i, 1)) >= 1488 And AscW(Mid(sText, i, 1)) <= 1514) Then
sRightText = sRightText & Chr(AscW(Mid(sText, i, 1)) - 1264)
else
sRightText = sRightText & Mid(sText, i, 1)
End If
Next
end if
Get_RightHebrew = sRightText
End Function
I'm looking for an equivalent PHP function to convert the string to correct UTF-8
You've got a CP1255 encoded string but decoded with CP1252 (Latin1), so you can get your Hebrew text back by cheating.
# mis-decoded string
$str = "éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä";
# convert to CP1252 from UTF-8
$str = iconv("UTF-8", "CP1252", $str);
# convert to UTF-8 by claiming $str is encoded with CP1255
$str = iconv("CP1255", "UTF-8", $str);
echo $str;
Here's the test I made online: https://3v4l.org/7taaN
I'd like to share an example code that uses mb_* functions instead of iconv but CP1255 is not supported. Using the charset ISO-8859-8 with mb_* instead is an option but since it's a subset of CP1255 it's likely to experience data loss.

PHP Default String Encoding

Could someone explain why the output is ASCII in the last three tests below?
I get the same results on my own system, PHPTester.net, and PhpFiddle.org.
echo mb_internal_encoding(); // UTF-8
$str = 'foobar';
echo mb_check_encoding($str, 'UTF-8'); // true
echo mb_detect_encoding($str); // ASCII
$encoded = utf8_encode($str);
echo mb_detect_encoding($encoded); // ASCII
$converted = mb_convert_encoding($str, 'UTF-8');
echo mb_detect_encoding($converted); // ASCII
That would be because there are no characters in foobar that cannot be represented in ASCII.
mb_check_encoding($str, 'UTF-8') works because ASCII text is innately compatible with UTF-8 (deliberately so)
But in the absence of multi-byte characters, there's no discernible difference between the two. Proof of this: 'foobar' === utf8_encode('foobar') // true

Ÿ Œ charcters in csv don't get displayed php

I am new to encoding so please be patient.
I am working on a system where a user upload a csv, what i need to do is to display the content and then save it in the database. (utf-8 encoding)
I have been asked to fix a issue with some french alphabet characters that weren't displayed correctly. I have almost solved the problem, I am displaying characters such as
ÀàÂâÆÄäÇçÉéÈèÊêËëÎîÏïÔôœÖöÙùÛûÜüÿ
However the two mentioned in the title Ÿ Œ are not displayed correctly yet on the webpage.
Here is my php code so far:
// say in the csv we have "ÖüÜߟÀàÂ"
$content = file_get_contents(addslashes($file_name));
var_dump($content) // output: string(54) "���ߟ��� "
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
$data = iconv('macintosh', 'UTF-8', $content);
}
// deal with known encoding types
else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'ISO-8859-1'){
//$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)); // does not work
$data = iconv('ISO-8859-1', 'UTF-8', $content); //does not work
}else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'UTF-8'){
$data = $content
}
//if i print $data "Ÿ Œ " are not printed out... they got lost somewhere
//do more stuff here
the file I am dealing with has an encoding type of ISO-8859-1(when i print out mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) it displays ISO-8859-1).
Is there anyone that have an idea on how to deal with this special cases?
The characters Ÿ and Œ are not representable in ISO-8859-1. It seems that the incoming data is actually windows-1252 (Windows Latin 1) encoded, since windows-1252 has graphic characters, including Ÿ and Œ, in some code positions that are reserved for control characters in ISO-8859-1.
So you should probably add windows-1252 to the list of recognized encodings and treat recognized ISO-8859-1 as windows-1252, i.e use iconv('windows-1252', 'UTF-8', $content) even when ISO-8859-1 has bee recognized. Windows-1252 data mislabeled as ISO-8859-1 is very common.

Detecting the right character encoding in PHP?

I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to #raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.
The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.
Although strings encoded with ISO-8859-1 and CP-1252 have different byte code representation:
<?php
$str = "€ ‚ ƒ „ …" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
$new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
printf('%15s: %s detected: %10s explicitly: %10s',
$encoding,
implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
mb_detect_encoding($new),
mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
);
echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
...from what we can see here it looks like there is problem with second paramater of mb_detect_encoding. Using mb_detect_order instead of parameter yields very similar results.

Char Encoding: Changing file from MacRoman to UTF-8 breaks string

I am working on a CakePHP site saved in MacRoman char encoding. I want to change all the files to UTF-8 for internationalisation. For all the other files in the site this works fine. However, in the core.php file there is a security salt, which is a string with special characters ("!:* etc.). When I save this file as UTF-8 the salt gets corrupted. I can roll this back with git, but it's an annoyance.
Does anyone know how I can convert the string from MacRoman to UTF-8?
You don't give enough information to confirm this, but I guess the salt is used in its binary form. In that case, changing the encoding of the file will corrupt the salt if this binary stream is changed, even if the characters are correctly converted.
Since the first 128 characters are similar in UTF-8 and Mac OS Roman, you don't have to worry if the salt is written using only these characters.
Let's say the salt is somewhere:
$salt = "a!c‡Œ";
You could write instead:
$salt = "a!c\xE0\xCE";
You could map all to their hexadecimal representation, as it might be easier to automate:
$salt = "\x61\x21\x63\xE0\xCE";
See the table here.
The following snippet can automate this conversion:
$res = "";
foreach (str_split($salt) as $c) {
$res .= "\\x".dechex(ord($c));
}
echo $res;
Thanks for the input, pointed me in the right direction. The solution is:
$salt = iconv('UTF-8', 'macintosh', $string);
For those who do not have access to iconv here is a function in PHP:
http://sebastienguillon.com/test/jeux-de-caracteres/MacRoman_to_utf8.txt.php
It will properly convert MacRoman text to UTF-8 and you can even decide how you want to break ligatures.
<?php
function MacRoman_to_utf8($str, $break_ligatures='none')
{
// $break_ligatures : 'none' | 'fifl' | 'all'
// 'none' : don't break any MacRoman ligatures, transform them into their utf-8 counterparts
// 'fifl' : break only fi ("\xDE" => "fi") and fl ("\xDF"=>"fl")
// 'all' : break fi, fl and also AE ("\xAE"=>"AE"), ae ("\xBE"=>"ae"), OE ("\xCE"=>"OE") and oe ("\xCF"=>"oe")
if($break_ligatures == 'fifl')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl"));
}
if($break_ligatures == 'all')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl", "\xAE"=>"AE", "\xBE"=>"ae", "\xCE"=>"OE", "\xCF"=>"oe"));
}
$str = strtr($str, array("\x7F"=>"\x20", "\x80"=>"\xC3\x84", "\x81"=>"\xC3\x85",
"\x82"=>"\xC3\x87", "\x83"=>"\xC3\x89", "\x84"=>"\xC3\x91", "\x85"=>"\xC3\x96",
"\x86"=>"\xC3\x9C", "\x87"=>"\xC3\xA1", "\x88"=>"\xC3\xA0", "\x89"=>"\xC3\xA2",
"\x8A"=>"\xC3\xA4", "\x8B"=>"\xC3\xA3", "\x8C"=>"\xC3\xA5", "\x8D"=>"\xC3\xA7",
"\x8E"=>"\xC3\xA9", "\x8F"=>"\xC3\xA8", "\x90"=>"\xC3\xAA", "\x91"=>"\xC3\xAB",
"\x92"=>"\xC3\xAD", "\x93"=>"\xC3\xAC", "\x94"=>"\xC3\xAE", "\x95"=>"\xC3\xAF",
"\x96"=>"\xC3\xB1", "\x97"=>"\xC3\xB3", "\x98"=>"\xC3\xB2", "\x99"=>"\xC3\xB4",
"\x9A"=>"\xC3\xB6", "\x9B"=>"\xC3\xB5", "\x9C"=>"\xC3\xBA", "\x9D"=>"\xC3\xB9",
"\x9E"=>"\xC3\xBB", "\x9F"=>"\xC3\xBC", "\xA0"=>"\xE2\x80\xA0", "\xA1"=>"\xC2\xB0",
"\xA2"=>"\xC2\xA2", "\xA3"=>"\xC2\xA3", "\xA4"=>"\xC2\xA7", "\xA5"=>"\xE2\x80\xA2",
"\xA6"=>"\xC2\xB6", "\xA7"=>"\xC3\x9F", "\xA8"=>"\xC2\xAE", "\xA9"=>"\xC2\xA9",
"\xAA"=>"\xE2\x84\xA2", "\xAB"=>"\xC2\xB4", "\xAC"=>"\xC2\xA8", "\xAD"=>"\xE2\x89\xA0",
"\xAE"=>"\xC3\x86", "\xAF"=>"\xC3\x98", "\xB0"=>"\xE2\x88\x9E", "\xB1"=>"\xC2\xB1",
"\xB2"=>"\xE2\x89\xA4", "\xB3"=>"\xE2\x89\xA5", "\xB4"=>"\xC2\xA5", "\xB5"=>"\xC2\xB5",
"\xB6"=>"\xE2\x88\x82", "\xB7"=>"\xE2\x88\x91", "\xB8"=>"\xE2\x88\x8F", "\xB9"=>"\xCF\x80",
"\xBA"=>"\xE2\x88\xAB", "\xBB"=>"\xC2\xAA", "\xBC"=>"\xC2\xBA", "\xBD"=>"\xCE\xA9",
"\xBE"=>"\xC3\xA6", "\xBF"=>"\xC3\xB8", "\xC0"=>"\xC2\xBF", "\xC1"=>"\xC2\xA1",
"\xC2"=>"\xC2\xAC", "\xC3"=>"\xE2\x88\x9A", "\xC4"=>"\xC6\x92", "\xC5"=>"\xE2\x89\x88",
"\xC6"=>"\xE2\x88\x86", "\xC7"=>"\xC2\xAB", "\xC8"=>"\xC2\xBB", "\xC9"=>"\xE2\x80\xA6",
"\xCA"=>"\xC2\xA0", "\xCB"=>"\xC3\x80", "\xCC"=>"\xC3\x83", "\xCD"=>"\xC3\x95",
"\xCE"=>"\xC5\x92", "\xCF"=>"\xC5\x93", "\xD0"=>"\xE2\x80\x93", "\xD1"=>"\xE2\x80\x94",
"\xD2"=>"\xE2\x80\x9C", "\xD3"=>"\xE2\x80\x9D", "\xD4"=>"\xE2\x80\x98", "\xD5"=>"\xE2\x80\x99",
"\xD6"=>"\xC3\xB7", "\xD7"=>"\xE2\x97\x8A", "\xD8"=>"\xC3\xBF", "\xD9"=>"\xC5\xB8",
"\xDA"=>"\xE2\x81\x84", "\xDB"=>"\xE2\x82\xAC", "\xDC"=>"\xE2\x80\xB9", "\xDD"=>"\xE2\x80\xBA",
"\xDE"=>"\xEF\xAC\x81", "\xDF"=>"\xEF\xAC\x82", "\xE0"=>"\xE2\x80\xA1", "\xE1"=>"\xC2\xB7",
"\xE2"=>"\xE2\x80\x9A", "\xE3"=>"\xE2\x80\x9E", "\xE4"=>"\xE2\x80\xB0", "\xE5"=>"\xC3\x82",
"\xE6"=>"\xC3\x8A", "\xE7"=>"\xC3\x81", "\xE8"=>"\xC3\x8B", "\xE9"=>"\xC3\x88",
"\xEA"=>"\xC3\x8D", "\xEB"=>"\xC3\x8E", "\xEC"=>"\xC3\x8F", "\xED"=>"\xC3\x8C",
"\xEE"=>"\xC3\x93", "\xEF"=>"\xC3\x94", "\xF0"=>"\xEF\xA3\xBF", "\xF1"=>"\xC3\x92",
"\xF2"=>"\xC3\x9A", "\xF3"=>"\xC3\x9B", "\xF4"=>"\xC3\x99", "\xF5"=>"\xC4\xB1",
"\xF6"=>"\xCB\x86", "\xF7"=>"\xCB\x9C", "\xF8"=>"\xC2\xAF", "\xF9"=>"\xCB\x98",
"\xFA"=>"\xCB\x99", "\xFB"=>"\xCB\x9A", "\xFC"=>"\xC2\xB8", "\xFD"=>"\xCB\x9D",
"\xFE"=>"\xCB\x9B", "\xFF"=>"\xCB\x87", "\x00"=>"\x20", "\x01"=>"\x20",
"\x02"=>"\x20", "\x03"=>"\x20", "\x04"=>"\x20", "\x05"=>"\x20",
"\x06"=>"\x20", "\x07"=>"\x20", "\x08"=>"\x20", "\x0B"=>"\x20",
"\x0C"=>"\x20", "\x0E"=>"\x20", "\x0F"=>"\x20", "\x10"=>"\x20",
"\x11"=>"\x20", "\x12"=>"\x20", "\x13"=>"\x20", "\x14"=>"\x20",
"\x15"=>"\x20", "\x16"=>"\x20", "\x17"=>"\x20", "\x18"=>"\x20",
"\x19"=>"\x20", "\x1A"=>"\x20", "\x1B"=>"\x20", "\x1C"=>"\x20",
"\1D"=>"\x20", "\x1E"=>"\x20", "\x1F"=>"\x20", "\xF0"=>""));
return $str;
}
?>
Have you tried mb-convert-encoding ?
Think it would be:
$str = mb_convert_encoding($str, "macintosh", "UTF-8");
Just curious, have you tried copying the salt, saving as UTF-8 and then pasting the salt back in place and saving again?

Categories