I have the following encoded Hebrew strings in an old DB:
éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä
The ASP code that is being used to decode this string is the following:
function Get_RightHebrew(ByVal sText)
Dim i
Dim sRightText
if isNull(sText) then
sRightText = ""
else
For i = 1 To Len(sText)
If (AscW(Mid(sText, i, 1)) >= 1488 And AscW(Mid(sText, i, 1)) <= 1514) Then
sRightText = sRightText & Chr(AscW(Mid(sText, i, 1)) - 1264)
else
sRightText = sRightText & Mid(sText, i, 1)
End If
Next
end if
Get_RightHebrew = sRightText
End Function
I'm looking for an equivalent PHP function to convert the string to correct UTF-8
You've got a CP1255 encoded string but decoded with CP1252 (Latin1), so you can get your Hebrew text back by cheating.
# mis-decoded string
$str = "éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä";
# convert to CP1252 from UTF-8
$str = iconv("UTF-8", "CP1252", $str);
# convert to UTF-8 by claiming $str is encoded with CP1255
$str = iconv("CP1255", "UTF-8", $str);
echo $str;
Here's the test I made online: https://3v4l.org/7taaN
I'd like to share an example code that uses mb_* functions instead of iconv but CP1255 is not supported. Using the charset ISO-8859-8 with mb_* instead is an option but since it's a subset of CP1255 it's likely to experience data loss.
I am new to encoding so please be patient.
I am working on a system where a user upload a csv, what i need to do is to display the content and then save it in the database. (utf-8 encoding)
I have been asked to fix a issue with some french alphabet characters that weren't displayed correctly. I have almost solved the problem, I am displaying characters such as
ÀàÂâÆÄäÇçÉéÈèÊêËëÎîÏïÔôœÖöÙùÛûÜüÿ
However the two mentioned in the title Ÿ Œ are not displayed correctly yet on the webpage.
Here is my php code so far:
// say in the csv we have "ÖüÜߟÀàÂ"
$content = file_get_contents(addslashes($file_name));
var_dump($content) // output: string(54) "���ߟ��� "
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
$data = iconv('macintosh', 'UTF-8', $content);
}
// deal with known encoding types
else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'ISO-8859-1'){
//$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)); // does not work
$data = iconv('ISO-8859-1', 'UTF-8', $content); //does not work
}else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'UTF-8'){
$data = $content
}
//if i print $data "Ÿ Œ " are not printed out... they got lost somewhere
//do more stuff here
the file I am dealing with has an encoding type of ISO-8859-1(when i print out mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) it displays ISO-8859-1).
Is there anyone that have an idea on how to deal with this special cases?
The characters Ÿ and Œ are not representable in ISO-8859-1. It seems that the incoming data is actually windows-1252 (Windows Latin 1) encoded, since windows-1252 has graphic characters, including Ÿ and Œ, in some code positions that are reserved for control characters in ISO-8859-1.
So you should probably add windows-1252 to the list of recognized encodings and treat recognized ISO-8859-1 as windows-1252, i.e use iconv('windows-1252', 'UTF-8', $content) even when ISO-8859-1 has bee recognized. Windows-1252 data mislabeled as ISO-8859-1 is very common.
I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to #raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.
The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.
Although strings encoded with ISO-8859-1 and CP-1252 have different byte code representation:
<?php
$str = "€ ‚ ƒ „ …" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
$new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
printf('%15s: %s detected: %10s explicitly: %10s',
$encoding,
implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
mb_detect_encoding($new),
mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
);
echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
...from what we can see here it looks like there is problem with second paramater of mb_detect_encoding. Using mb_detect_order instead of parameter yields very similar results.
I am working on a CakePHP site saved in MacRoman char encoding. I want to change all the files to UTF-8 for internationalisation. For all the other files in the site this works fine. However, in the core.php file there is a security salt, which is a string with special characters ("!:* etc.). When I save this file as UTF-8 the salt gets corrupted. I can roll this back with git, but it's an annoyance.
Does anyone know how I can convert the string from MacRoman to UTF-8?
You don't give enough information to confirm this, but I guess the salt is used in its binary form. In that case, changing the encoding of the file will corrupt the salt if this binary stream is changed, even if the characters are correctly converted.
Since the first 128 characters are similar in UTF-8 and Mac OS Roman, you don't have to worry if the salt is written using only these characters.
Let's say the salt is somewhere:
$salt = "a!c‡Œ";
You could write instead:
$salt = "a!c\xE0\xCE";
You could map all to their hexadecimal representation, as it might be easier to automate:
$salt = "\x61\x21\x63\xE0\xCE";
See the table here.
The following snippet can automate this conversion:
$res = "";
foreach (str_split($salt) as $c) {
$res .= "\\x".dechex(ord($c));
}
echo $res;
Thanks for the input, pointed me in the right direction. The solution is:
$salt = iconv('UTF-8', 'macintosh', $string);
For those who do not have access to iconv here is a function in PHP:
http://sebastienguillon.com/test/jeux-de-caracteres/MacRoman_to_utf8.txt.php
It will properly convert MacRoman text to UTF-8 and you can even decide how you want to break ligatures.
<?php
function MacRoman_to_utf8($str, $break_ligatures='none')
{
// $break_ligatures : 'none' | 'fifl' | 'all'
// 'none' : don't break any MacRoman ligatures, transform them into their utf-8 counterparts
// 'fifl' : break only fi ("\xDE" => "fi") and fl ("\xDF"=>"fl")
// 'all' : break fi, fl and also AE ("\xAE"=>"AE"), ae ("\xBE"=>"ae"), OE ("\xCE"=>"OE") and oe ("\xCF"=>"oe")
if($break_ligatures == 'fifl')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl"));
}
if($break_ligatures == 'all')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl", "\xAE"=>"AE", "\xBE"=>"ae", "\xCE"=>"OE", "\xCF"=>"oe"));
}
$str = strtr($str, array("\x7F"=>"\x20", "\x80"=>"\xC3\x84", "\x81"=>"\xC3\x85",
"\x82"=>"\xC3\x87", "\x83"=>"\xC3\x89", "\x84"=>"\xC3\x91", "\x85"=>"\xC3\x96",
"\x86"=>"\xC3\x9C", "\x87"=>"\xC3\xA1", "\x88"=>"\xC3\xA0", "\x89"=>"\xC3\xA2",
"\x8A"=>"\xC3\xA4", "\x8B"=>"\xC3\xA3", "\x8C"=>"\xC3\xA5", "\x8D"=>"\xC3\xA7",
"\x8E"=>"\xC3\xA9", "\x8F"=>"\xC3\xA8", "\x90"=>"\xC3\xAA", "\x91"=>"\xC3\xAB",
"\x92"=>"\xC3\xAD", "\x93"=>"\xC3\xAC", "\x94"=>"\xC3\xAE", "\x95"=>"\xC3\xAF",
"\x96"=>"\xC3\xB1", "\x97"=>"\xC3\xB3", "\x98"=>"\xC3\xB2", "\x99"=>"\xC3\xB4",
"\x9A"=>"\xC3\xB6", "\x9B"=>"\xC3\xB5", "\x9C"=>"\xC3\xBA", "\x9D"=>"\xC3\xB9",
"\x9E"=>"\xC3\xBB", "\x9F"=>"\xC3\xBC", "\xA0"=>"\xE2\x80\xA0", "\xA1"=>"\xC2\xB0",
"\xA2"=>"\xC2\xA2", "\xA3"=>"\xC2\xA3", "\xA4"=>"\xC2\xA7", "\xA5"=>"\xE2\x80\xA2",
"\xA6"=>"\xC2\xB6", "\xA7"=>"\xC3\x9F", "\xA8"=>"\xC2\xAE", "\xA9"=>"\xC2\xA9",
"\xAA"=>"\xE2\x84\xA2", "\xAB"=>"\xC2\xB4", "\xAC"=>"\xC2\xA8", "\xAD"=>"\xE2\x89\xA0",
"\xAE"=>"\xC3\x86", "\xAF"=>"\xC3\x98", "\xB0"=>"\xE2\x88\x9E", "\xB1"=>"\xC2\xB1",
"\xB2"=>"\xE2\x89\xA4", "\xB3"=>"\xE2\x89\xA5", "\xB4"=>"\xC2\xA5", "\xB5"=>"\xC2\xB5",
"\xB6"=>"\xE2\x88\x82", "\xB7"=>"\xE2\x88\x91", "\xB8"=>"\xE2\x88\x8F", "\xB9"=>"\xCF\x80",
"\xBA"=>"\xE2\x88\xAB", "\xBB"=>"\xC2\xAA", "\xBC"=>"\xC2\xBA", "\xBD"=>"\xCE\xA9",
"\xBE"=>"\xC3\xA6", "\xBF"=>"\xC3\xB8", "\xC0"=>"\xC2\xBF", "\xC1"=>"\xC2\xA1",
"\xC2"=>"\xC2\xAC", "\xC3"=>"\xE2\x88\x9A", "\xC4"=>"\xC6\x92", "\xC5"=>"\xE2\x89\x88",
"\xC6"=>"\xE2\x88\x86", "\xC7"=>"\xC2\xAB", "\xC8"=>"\xC2\xBB", "\xC9"=>"\xE2\x80\xA6",
"\xCA"=>"\xC2\xA0", "\xCB"=>"\xC3\x80", "\xCC"=>"\xC3\x83", "\xCD"=>"\xC3\x95",
"\xCE"=>"\xC5\x92", "\xCF"=>"\xC5\x93", "\xD0"=>"\xE2\x80\x93", "\xD1"=>"\xE2\x80\x94",
"\xD2"=>"\xE2\x80\x9C", "\xD3"=>"\xE2\x80\x9D", "\xD4"=>"\xE2\x80\x98", "\xD5"=>"\xE2\x80\x99",
"\xD6"=>"\xC3\xB7", "\xD7"=>"\xE2\x97\x8A", "\xD8"=>"\xC3\xBF", "\xD9"=>"\xC5\xB8",
"\xDA"=>"\xE2\x81\x84", "\xDB"=>"\xE2\x82\xAC", "\xDC"=>"\xE2\x80\xB9", "\xDD"=>"\xE2\x80\xBA",
"\xDE"=>"\xEF\xAC\x81", "\xDF"=>"\xEF\xAC\x82", "\xE0"=>"\xE2\x80\xA1", "\xE1"=>"\xC2\xB7",
"\xE2"=>"\xE2\x80\x9A", "\xE3"=>"\xE2\x80\x9E", "\xE4"=>"\xE2\x80\xB0", "\xE5"=>"\xC3\x82",
"\xE6"=>"\xC3\x8A", "\xE7"=>"\xC3\x81", "\xE8"=>"\xC3\x8B", "\xE9"=>"\xC3\x88",
"\xEA"=>"\xC3\x8D", "\xEB"=>"\xC3\x8E", "\xEC"=>"\xC3\x8F", "\xED"=>"\xC3\x8C",
"\xEE"=>"\xC3\x93", "\xEF"=>"\xC3\x94", "\xF0"=>"\xEF\xA3\xBF", "\xF1"=>"\xC3\x92",
"\xF2"=>"\xC3\x9A", "\xF3"=>"\xC3\x9B", "\xF4"=>"\xC3\x99", "\xF5"=>"\xC4\xB1",
"\xF6"=>"\xCB\x86", "\xF7"=>"\xCB\x9C", "\xF8"=>"\xC2\xAF", "\xF9"=>"\xCB\x98",
"\xFA"=>"\xCB\x99", "\xFB"=>"\xCB\x9A", "\xFC"=>"\xC2\xB8", "\xFD"=>"\xCB\x9D",
"\xFE"=>"\xCB\x9B", "\xFF"=>"\xCB\x87", "\x00"=>"\x20", "\x01"=>"\x20",
"\x02"=>"\x20", "\x03"=>"\x20", "\x04"=>"\x20", "\x05"=>"\x20",
"\x06"=>"\x20", "\x07"=>"\x20", "\x08"=>"\x20", "\x0B"=>"\x20",
"\x0C"=>"\x20", "\x0E"=>"\x20", "\x0F"=>"\x20", "\x10"=>"\x20",
"\x11"=>"\x20", "\x12"=>"\x20", "\x13"=>"\x20", "\x14"=>"\x20",
"\x15"=>"\x20", "\x16"=>"\x20", "\x17"=>"\x20", "\x18"=>"\x20",
"\x19"=>"\x20", "\x1A"=>"\x20", "\x1B"=>"\x20", "\x1C"=>"\x20",
"\1D"=>"\x20", "\x1E"=>"\x20", "\x1F"=>"\x20", "\xF0"=>""));
return $str;
}
?>
Have you tried mb-convert-encoding ?
Think it would be:
$str = mb_convert_encoding($str, "macintosh", "UTF-8");
Just curious, have you tried copying the salt, saving as UTF-8 and then pasting the salt back in place and saving again?