php non latin to hex function - php

I have website that's in win-1251 encoding and it needs to stay that way. But I also need to be able to echo few links that contain non latin, non cyrillic characters like šžāņūī...
I need a function that convert this
"māja un man tā patīk"
to
"māja un man tā patīk"
and that does not touch html, so if there is <b> it needs to stay as <b>, not > or <
And please no advices about the encoding and how wrong that is.

$str = "<b>Obāchan</b> おばあちゃん";
$str = preg_replace_callback('/./u', function ($matches) {
$chr = $matches[0];
if (strlen($chr) > 1) {
$chr = mb_convert_encoding($chr, 'HTML-ENTITIES', 'UTF-8');
}
return $chr;
}, $str);
This expects the original $str to be UTF-8 encoded, i.e. your PHP file should be saved in UTF-8. It encodes all non-ASCII compatible code points to HTML entities. Since all HTML special characters are ASCII characters, they remain untouched. The resulting string is pure ASCII. Since the lower Win-1251 code points are ASCII compatible, the resulting string is also a valid Win-1251 string. The above $str converts to:
<b>Obāchan</b> おばあちゃん

The main things you probably don't want to encode are <, > and &. Those are really the only special characters. So how about encoding everything first, and then just decode <, > and & I feel you should be fine.
This is untested:
$output =
htmlspecialchars_decode(
htmlentities($input, ENT_NOQUOTES, 'CP-1251')
);
let me know

What Evert suggest looks logical to me too! If you insist this is a way to do it if there are only two letters that bother you. For more letters the scrit will not be as effective and needs to change.
<?PHP
function myConvert($str)
{
$chars['ā']='ā';
$chars['ī']='ī';
foreach ($chars as $key => $value)
$output = str_replace($key, $value, $str);
echo $str;
}
myConvert("māja un man tā patīk");
?>
==================edited==============
For many characters maybe this one can help you:
<?PHP
function myConvert($str)
{
$final=null;
$parts = preg_split("/&#[0-9]*;/i", $str);//get all text parts
preg_match_all("/&#[0-9]*;/i", $str, $delimiters );//get delimiters;
$delimiters[0][]='';//make arrays equal size
foreach($parts as $key => $value)
$final.=$value.mb_convert_encoding
($delimiters[0][$key], "UTF-8", "HTML-ENTITIES");
return $final;
}
$fh = fopen("testFile.txt", 'w') ;
fwrite($fh, myConvert("māja un man tā patīkī"));
fclose($fh);
?>
The desired output is written in the text file. This code, exactly as it is -not merged in some project- does what it claims to do. Converts codes like ā to the analogous character they present.

Related

Converting Window-1252 to UTF-8 Issue

I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format, due to being copied to a database table from a Word Document.
Testing weird character’s correction
This seems to fix the dodgy ’ character. However i'm not getting � in the following:
Devon�s most prominent dealerships
When passing the following through the same function:
Devon's most prominent dealerships
Below is the code which does the converting:
function Windows1252ToUTF8($text) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit:
The database can't be changed due to holding thousands of custom records. I tried the below but the mb_detect_encoding thinks character’s correction is UTF-8.
function Windows1252ToUTF8($text) {
if (mb_detect_encoding($text) == "UTF-8") {
return $text;
}
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit 2:
Just tried the example from the PHP Documentation:
$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();
but this simply outputs:
string(5) "UTF-8"
string(5) "UTF-8"
So I can't even detect the encoding of the string :S
Edit 3:
This seems to do the trick:
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit 4:
I have matched the hex values to their corresponding values. However when I get to the weird characters they don't appear to match.
Converting Testing weird character’s correction using bin2hex
gives me
54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e
This means the "’" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows Latin-1/1252, and then re-encoded to UTF-8.
’ (UTF-8 \xe2\x80\x99)
→ bytes interpreted as Latin-1 equal the string ’
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
To restore the original, you need to reverse that chain of mis-encodings:
$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
This interprets the string as UTF-8, converts it to the Windows-1252 equivalent, which is then the valid UTF-8 representation of ’.
Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.
The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit:
function Windows1252ToUTF8($text) {
// http://www.fileformat.info/info/charset/UTF-8/list.htm
$illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
$match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}

How to convert a Unicode text-block to UTF-8 (HEX) code point?

I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now, I want to convert this orginal Unicode text-block into a text-block of UTF-8 (HEX) code point (see the Hexadecimal UTF-8 column, on this page: https://en.wikipedia.org/wiki/UTF-8), by PHP; like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do it, by PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.
I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
See test at eval.in (link expires)
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.
PHP treats strings as arrays of characters, regardless of encoding. If you don't need to delimit the UTF8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF8 characters using preg_split and the u flag. Since preg_split returns the empty string before the first character and the empty string after the last character, we need to array_slice the first and last characters. This can be easily modified to return an array, for example.
Edit:
A more "correct" way to do this is this:
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');
The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code frag takes your example character in Unicode, converts them to UTF-8, and then dumps the hex representation of those characters.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90

PHP not have a function for XML-safe entity decode? Not have some xml_entity_decode?

THE PROBLEM: I need a XML file "full encoded" by UTF8; that is, with no entity representing symbols, all symbols enconded by UTF8, except the only 3 ones that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And, I need a build-in function that do it fast: to transform entities into real UTF8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); at PMC/journals, for example, have 2.8 MILLION of scientific articles enconded with a special XML DTD (knowed also as JATS format)... To process as "usual XML-UTF8-text" we need to change from numeric entity to UTF8 char.
THE ATTEMPTED SOLUTION: the natural function to this task is html_entity_decode, but it destroys the XML code (!), transforming the reserved 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world!    Let A<B and A=∬dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF8, and the XML-reserved lt not. The XML text will be (after transformed),
$xmlFrag = '<p>Hello world!    Let A<B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so MUST stay as A<B.
Frustrated solutions
I try to use html_entity_decode for solve (directly!) the problem... So, I updated my PHP to v5.5 to try to use the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
Perhaps another question is, "WHY there are no other option to do what I expected?" -- it is important for many other XML applications (!), not only for me.
I not need a workaround as answer... Ok, I show my ugly function, perhaps it helps you to understand the problem,
function xml_entity_decode($s) {
// here an illustration (by user-defined function)
// about how the hypothetical PHP-build-in-function MUST work
static $XENTITIES = array('&','>','<');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
//$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
} // you see? not need a benchmark:
// it is not so fast as direct use of html_entity_decode; if there
// was an XML-safe option was ideal.
PS: corrected after this answer. Must be ENT_HTML5 flag, for convert really all named entities.
This question is creating, time-by-time, a "false answer" (see answers). This is perhaps because people not pay attention, and because there are NO ANSWER: there are a lack of PHP build-in solution.
... So, lets repeat my workaround (that is NOT an answer!) to not create more confusion:
The best workaround
Pay attention:
The function xml_entity_decode() below is the best (over any other) workaround.
The function below is not an answer to the present question, it is only a workwaround.
function xml_entity_decode($s) {
// illustrating how a (hypothetical) PHP-build-in-function MUST work
static $XENTITIES = array('&','>','<');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
}
To test and to demonstrate that you have a better solution, please test first with this simple benckmark:
$countBchMk_MAX=1000;
$xml = file_get_contents('sample1.xml'); // BIG and complex XML string
$start_time = microtime(TRUE);
for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){
$A = xml_entity_decode($xml); // 0.0002
/* 0.0014
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$A = $doc->saveXML();
*/
}
$end_time = microtime(TRUE);
echo "\n<h1>END $countBchMk_MAX BENCKMARKs WITH ",
($end_time - $start_time)/$countBchMk_MAX,
" seconds</h1>";
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:
$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
I had the same problem because someone used HTML templates to create XML, instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to $amp;, however unlikely its presence in the source XML.
Note: I'm assuming default encoding is UTF-8
// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
/* <Foo>€&foo Ç</Foo> */
Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&foo Ç</Foo> */
In your case you want it the other way around. Encode numbered entities as UTF-8:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
// Encodes (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&foo Ç</Foo> */
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
Your method
php -r '$q=["&",">","<"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é & ∬</Foo>
=====
Time taken: 2.0397531986237
My method
php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 4.045273065567
My method (with unicode to numbered entity):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.4407880306244
My method (with numbered entity to unicode):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
public function entity_decode($str, $charset = NULL)
{
if (strpos($str, '&') === FALSE)
{
return $str;
}
static $_entities;
isset($charset) OR $charset = $this->charset;
$flag = is_php('5.4')
? ENT_COMPAT | ENT_HTML5
: ENT_COMPAT;
do
{
$str_compare = $str;
// Decode standard entities, avoiding false positives
if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
{
if ( ! isset($_entities))
{
$_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
// If we're not on PHP 5.4+, add the possibly dangerous HTML 5
// entities to the array manually
if ($flag === ENT_COMPAT)
{
$_entities[':'] = '&colon;';
$_entities['('] = '&lpar;';
$_entities[')'] = '&rpar';
$_entities["\n"] = '&newline;';
$_entities["\t"] = '&tab;';
}
}
$replace = array();
$matches = array_unique(array_map('strtolower', $matches[0]));
for ($i = 0; $i < $c; $i++)
{
if (($char = array_search($matches[$i].';', $_entities, TRUE)) !== FALSE)
{
$replace[$matches[$i]] = $char;
}
}
$str = str_ireplace(array_keys($replace), array_values($replace), $str);
}
// Decode numeric & UTF16 two byte entities
$str = html_entity_decode(
preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
$flag,
$charset
);
}
while ($str_compare !== $str);
return $str;
}
For those coming here because your numeric entity in the range 128 to 159 remains as numeric entity instead of being converted to a character:
echo xml_entity_decode('€');
//Output € instead expected €
This depends on PHP version (at least for PHP >=5.6 the entity remains) and on the affected characters. The reason is that the characters 128 to 159 are not printable characters in UTF-8. This can happen if the data to be converted mix up windows-1252 content (where € is the € sign).
Try this function:
function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&','>','<','"'), $s);
else
return str_replace(array('&','>','<'), array('&','>','<'), html_entity_decode($s));
}
example usage:
echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
also: https://stackoverflow.com/a/9446666/2312709
this code used in production seem that no problems happened with UTF-8

How to Convert Html Codes to Relevant Unicode Characters

Actually, I have googled a Lot, And I have explored this forum too, but this is my second day, and I could not find the solution.
My Problem is that I want to convert the Html Codes
باخ
to its equallent unicode characters
خ ا ب
Actually I do not want to convert all the html symbols to unicode characters. I only want to convert the arabic / urdu html code to unicode characters. The range of these characters is from ؛ To ۹ If there is no any PHP function then How can I replace the codes with their equallent unicode character in one go?
I think you're looking for:
html_entity_decode('باخ', ENT_QUOTES, 'UTF-8');
When you go from ب to ب, that's called decoding. Doing the opposite is called encoding.
As for replacing only characters from ؛ to ۹ maybe try something like this.
<?php
// Random set of entities, two are outside the 1563 - 1785 range.
$entities = '؛؜<لñ۸۹';
// Matches entities from 1500 to 1799, not perfect, I know.
preg_match_all('/&#1[5-7][0-9]{2};/', $entities, $matches);
$entityRegex = array(); // Will hold the entity code regular expression.
$decodedCharacters = array(); // Will hold the decoded characters.
foreach ($matches[0] as $entity)
{
// Convert the entity to human-readable character.
$unicodeCharacter = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
array_push($entityRegex, "/$entity/");
array_push($decodedCharacters, $unicodeCharacter);
}
// Replace all of the matched entities with the human-readable character.
$replaced = preg_replace($entityRegex, $decodedCharacters, $entities);
?>
That's as close as I can get to solving this. Hopefully, this helps a little. It's 5:00am where I am now, so I'm off to sleep! :)
did you try the utf-8 encoding in html head?
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
try this
<?php
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
foreach($trans_tbl as $k => $v)
{
$ttr[$v] = utf8_encode($k);
}
$text = 'بب....;&#1582';
$text = strtr($text, $ttr);
echo $text;
?>
for mysql solution you can set the character set as
$mysqli = new mysqli($host, $user, $pass, $db);
if (!$mysqli->set_charset("utf8")) {
die("error");
}

Char Encoding: Changing file from MacRoman to UTF-8 breaks string

I am working on a CakePHP site saved in MacRoman char encoding. I want to change all the files to UTF-8 for internationalisation. For all the other files in the site this works fine. However, in the core.php file there is a security salt, which is a string with special characters ("!:* etc.). When I save this file as UTF-8 the salt gets corrupted. I can roll this back with git, but it's an annoyance.
Does anyone know how I can convert the string from MacRoman to UTF-8?
You don't give enough information to confirm this, but I guess the salt is used in its binary form. In that case, changing the encoding of the file will corrupt the salt if this binary stream is changed, even if the characters are correctly converted.
Since the first 128 characters are similar in UTF-8 and Mac OS Roman, you don't have to worry if the salt is written using only these characters.
Let's say the salt is somewhere:
$salt = "a!c‡Œ";
You could write instead:
$salt = "a!c\xE0\xCE";
You could map all to their hexadecimal representation, as it might be easier to automate:
$salt = "\x61\x21\x63\xE0\xCE";
See the table here.
The following snippet can automate this conversion:
$res = "";
foreach (str_split($salt) as $c) {
$res .= "\\x".dechex(ord($c));
}
echo $res;
Thanks for the input, pointed me in the right direction. The solution is:
$salt = iconv('UTF-8', 'macintosh', $string);
For those who do not have access to iconv here is a function in PHP:
http://sebastienguillon.com/test/jeux-de-caracteres/MacRoman_to_utf8.txt.php
It will properly convert MacRoman text to UTF-8 and you can even decide how you want to break ligatures.
<?php
function MacRoman_to_utf8($str, $break_ligatures='none')
{
// $break_ligatures : 'none' | 'fifl' | 'all'
// 'none' : don't break any MacRoman ligatures, transform them into their utf-8 counterparts
// 'fifl' : break only fi ("\xDE" => "fi") and fl ("\xDF"=>"fl")
// 'all' : break fi, fl and also AE ("\xAE"=>"AE"), ae ("\xBE"=>"ae"), OE ("\xCE"=>"OE") and oe ("\xCF"=>"oe")
if($break_ligatures == 'fifl')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl"));
}
if($break_ligatures == 'all')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl", "\xAE"=>"AE", "\xBE"=>"ae", "\xCE"=>"OE", "\xCF"=>"oe"));
}
$str = strtr($str, array("\x7F"=>"\x20", "\x80"=>"\xC3\x84", "\x81"=>"\xC3\x85",
"\x82"=>"\xC3\x87", "\x83"=>"\xC3\x89", "\x84"=>"\xC3\x91", "\x85"=>"\xC3\x96",
"\x86"=>"\xC3\x9C", "\x87"=>"\xC3\xA1", "\x88"=>"\xC3\xA0", "\x89"=>"\xC3\xA2",
"\x8A"=>"\xC3\xA4", "\x8B"=>"\xC3\xA3", "\x8C"=>"\xC3\xA5", "\x8D"=>"\xC3\xA7",
"\x8E"=>"\xC3\xA9", "\x8F"=>"\xC3\xA8", "\x90"=>"\xC3\xAA", "\x91"=>"\xC3\xAB",
"\x92"=>"\xC3\xAD", "\x93"=>"\xC3\xAC", "\x94"=>"\xC3\xAE", "\x95"=>"\xC3\xAF",
"\x96"=>"\xC3\xB1", "\x97"=>"\xC3\xB3", "\x98"=>"\xC3\xB2", "\x99"=>"\xC3\xB4",
"\x9A"=>"\xC3\xB6", "\x9B"=>"\xC3\xB5", "\x9C"=>"\xC3\xBA", "\x9D"=>"\xC3\xB9",
"\x9E"=>"\xC3\xBB", "\x9F"=>"\xC3\xBC", "\xA0"=>"\xE2\x80\xA0", "\xA1"=>"\xC2\xB0",
"\xA2"=>"\xC2\xA2", "\xA3"=>"\xC2\xA3", "\xA4"=>"\xC2\xA7", "\xA5"=>"\xE2\x80\xA2",
"\xA6"=>"\xC2\xB6", "\xA7"=>"\xC3\x9F", "\xA8"=>"\xC2\xAE", "\xA9"=>"\xC2\xA9",
"\xAA"=>"\xE2\x84\xA2", "\xAB"=>"\xC2\xB4", "\xAC"=>"\xC2\xA8", "\xAD"=>"\xE2\x89\xA0",
"\xAE"=>"\xC3\x86", "\xAF"=>"\xC3\x98", "\xB0"=>"\xE2\x88\x9E", "\xB1"=>"\xC2\xB1",
"\xB2"=>"\xE2\x89\xA4", "\xB3"=>"\xE2\x89\xA5", "\xB4"=>"\xC2\xA5", "\xB5"=>"\xC2\xB5",
"\xB6"=>"\xE2\x88\x82", "\xB7"=>"\xE2\x88\x91", "\xB8"=>"\xE2\x88\x8F", "\xB9"=>"\xCF\x80",
"\xBA"=>"\xE2\x88\xAB", "\xBB"=>"\xC2\xAA", "\xBC"=>"\xC2\xBA", "\xBD"=>"\xCE\xA9",
"\xBE"=>"\xC3\xA6", "\xBF"=>"\xC3\xB8", "\xC0"=>"\xC2\xBF", "\xC1"=>"\xC2\xA1",
"\xC2"=>"\xC2\xAC", "\xC3"=>"\xE2\x88\x9A", "\xC4"=>"\xC6\x92", "\xC5"=>"\xE2\x89\x88",
"\xC6"=>"\xE2\x88\x86", "\xC7"=>"\xC2\xAB", "\xC8"=>"\xC2\xBB", "\xC9"=>"\xE2\x80\xA6",
"\xCA"=>"\xC2\xA0", "\xCB"=>"\xC3\x80", "\xCC"=>"\xC3\x83", "\xCD"=>"\xC3\x95",
"\xCE"=>"\xC5\x92", "\xCF"=>"\xC5\x93", "\xD0"=>"\xE2\x80\x93", "\xD1"=>"\xE2\x80\x94",
"\xD2"=>"\xE2\x80\x9C", "\xD3"=>"\xE2\x80\x9D", "\xD4"=>"\xE2\x80\x98", "\xD5"=>"\xE2\x80\x99",
"\xD6"=>"\xC3\xB7", "\xD7"=>"\xE2\x97\x8A", "\xD8"=>"\xC3\xBF", "\xD9"=>"\xC5\xB8",
"\xDA"=>"\xE2\x81\x84", "\xDB"=>"\xE2\x82\xAC", "\xDC"=>"\xE2\x80\xB9", "\xDD"=>"\xE2\x80\xBA",
"\xDE"=>"\xEF\xAC\x81", "\xDF"=>"\xEF\xAC\x82", "\xE0"=>"\xE2\x80\xA1", "\xE1"=>"\xC2\xB7",
"\xE2"=>"\xE2\x80\x9A", "\xE3"=>"\xE2\x80\x9E", "\xE4"=>"\xE2\x80\xB0", "\xE5"=>"\xC3\x82",
"\xE6"=>"\xC3\x8A", "\xE7"=>"\xC3\x81", "\xE8"=>"\xC3\x8B", "\xE9"=>"\xC3\x88",
"\xEA"=>"\xC3\x8D", "\xEB"=>"\xC3\x8E", "\xEC"=>"\xC3\x8F", "\xED"=>"\xC3\x8C",
"\xEE"=>"\xC3\x93", "\xEF"=>"\xC3\x94", "\xF0"=>"\xEF\xA3\xBF", "\xF1"=>"\xC3\x92",
"\xF2"=>"\xC3\x9A", "\xF3"=>"\xC3\x9B", "\xF4"=>"\xC3\x99", "\xF5"=>"\xC4\xB1",
"\xF6"=>"\xCB\x86", "\xF7"=>"\xCB\x9C", "\xF8"=>"\xC2\xAF", "\xF9"=>"\xCB\x98",
"\xFA"=>"\xCB\x99", "\xFB"=>"\xCB\x9A", "\xFC"=>"\xC2\xB8", "\xFD"=>"\xCB\x9D",
"\xFE"=>"\xCB\x9B", "\xFF"=>"\xCB\x87", "\x00"=>"\x20", "\x01"=>"\x20",
"\x02"=>"\x20", "\x03"=>"\x20", "\x04"=>"\x20", "\x05"=>"\x20",
"\x06"=>"\x20", "\x07"=>"\x20", "\x08"=>"\x20", "\x0B"=>"\x20",
"\x0C"=>"\x20", "\x0E"=>"\x20", "\x0F"=>"\x20", "\x10"=>"\x20",
"\x11"=>"\x20", "\x12"=>"\x20", "\x13"=>"\x20", "\x14"=>"\x20",
"\x15"=>"\x20", "\x16"=>"\x20", "\x17"=>"\x20", "\x18"=>"\x20",
"\x19"=>"\x20", "\x1A"=>"\x20", "\x1B"=>"\x20", "\x1C"=>"\x20",
"\1D"=>"\x20", "\x1E"=>"\x20", "\x1F"=>"\x20", "\xF0"=>""));
return $str;
}
?>
Have you tried mb-convert-encoding ?
Think it would be:
$str = mb_convert_encoding($str, "macintosh", "UTF-8");
Just curious, have you tried copying the salt, saving as UTF-8 and then pasting the salt back in place and saving again?

Categories