Json response, how to decode characters with accent? - php

I am receiving Json response from a webservice so the character Ô turns into \u00d3
How can PETR\u00d3POLIS became PETRÓPOLIS ?
I am using PHP to query the database and return JSON.
After a research from http://www.fileformat.info/info/unicode/char/00d3/index.htm i know the character is Unicode Character 'LATIN CAPITAL LETTER O WITH ACUTE' (U+00D3) .
Wich is the best way to translate these characters ?

Unicode characters are just like escape characters - you can see them in JS string, but they will be displayed properly as a text.
var o = {
text: 'PETR\u00d3POLIS \n\u00a5\u00a5\u00a5'
};
document.body.innerHTML = "<pre>" + o.text + "</pre>";

You can use the below regex
$string = "u00d3";
echo $string = preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $string)

Related

ISO 8859 1 octal back to normal characters

I'm currently converting our old project database into a new format/new database. There are some old data, which were probably escaped by a smartphone app. Now the entry looks like this:
Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat
now the real entry should look like this:
Tak hurá v posteli po práci a jde se spinkat
There are also entries like
Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca
which don't seem like ISO 8859 1, especially the \\u0161 part.
Any thoughts on any PHP function I may use to convert this back to readable version? Thanks!
Simple workaround:
The first string is only octal iso-8859-1, while the second one is double slashed iso-8859-1 with mixed utf-16 characters (why? now that is the question). The code below takes octal codes, converts to hex, packs them to binary and encodes them into utf-8. The utf-16 codes are already in hex, so they are only packed and encoded into utf-8.
For future info reference on charsets: http://www.fileformat.info/info/charset/index.htm
<?php
$string = "Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat";
$string2 = "Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca";
print decode_str($string2)."<br>";
print decode_str($string);
function decode_str($string){
return utf16_to_utf8(iso_to_utf8($string));
}
function iso_to_utf8($string){
preg_match_all('#\\\\[0-9]{3}#',$string,$matches);
foreach($matches[0] as $match){
$char = preg_replace("#(\\\)#","",$match);
$a = pack("H*" , base_convert($char,8,16));
$string = preg_replace('#(\\\\)'.$char.'#',$a,$string);
}
return mb_convert_encoding($string,"UTF-8","ISO-8859-1");
}
function utf16_to_utf8($string){
preg_match_all('#\\\u[a-z0-9]{4}#',$string,$matches);
foreach($matches[0] as $match){
$char = preg_replace("#\\\\u#","",$match);
$a = pack("H*" , $char);
$a = mb_convert_encoding($a,"UTF-8","UTF-16");
$string = preg_replace('#'.preg_quote($match).'#',$a,$string);
}
return $string;
}
?>

How to convert a Unicode text-block to UTF-8 (HEX) code point?

I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now, I want to convert this orginal Unicode text-block into a text-block of UTF-8 (HEX) code point (see the Hexadecimal UTF-8 column, on this page: https://en.wikipedia.org/wiki/UTF-8), by PHP; like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do it, by PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.
I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
See test at eval.in (link expires)
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.
PHP treats strings as arrays of characters, regardless of encoding. If you don't need to delimit the UTF8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),'0',2,STR_PAD_LEFT);
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF8 characters using preg_split and the u flag. Since preg_split returns the empty string before the first character and the empty string after the last character, we need to array_slice the first and last characters. This can be easily modified to return an array, for example.
Edit:
A more "correct" way to do this is this:
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');
The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code frag takes your example character in Unicode, converts them to UTF-8, and then dumps the hex representation of those characters.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90

Trouble with decode JSON + PHP

My php script gives out this string (for example) for JSON:
{"time":"0:38:01","kto":"\u00d3\u00e1\u00e8\u00e2\u00f6\u00e0 \u00c3\u00e5\u00ed\u00e5\u00f0\u00e0\u00eb\u00ee\u00e2","mess":"\u00c5\u00e4\u00e8\u00ed\u00fb\u00e9: *mm"}
jQuery code gets this string through JSON:
$.getJSON('chat_ajax.php?q=1',
function(result) {
alert('Time ' + result.time + ' Kto' + result.kto + ' Mess' + result.mess);
});
Browser show:
0:38:01 Óáèâöà Ãåíåðàëîâ
Åäèíûé: *mm
How can I decode this string to cyrillic?
Try use:
<META http-equiv="content-type" content="text/html; charset=windows-1251">
but nothing change
PHP Code:
$res1=mysqli_query($dbc, "SELECT * FROM chat ORDER BY id DESC LIMIT 1");
while ($row1=mysqli_fetch_array($res1)) {
$rawArray=array('time' => #date("G:i:s", ($row1['time'] + $plus)), 'kto' => $row1[kto], 'mess' => $row1[mess]);
$encodedArray = array_map(utf8_encode, $rawArray);
echo json_encode($encodedArray);
PHP ver 5.3.19
\uXXXX stands for unicode characters and in unicode 00d3 is Ó and so on. Unicode characters are unambigouos, so the character encoding of the page is ignored for them. You could use the correct unicode (i.e. \u0443 for У) or write your script so that it outputs the real characters in Windows-1251 instead of unicode sequences.
Update
I see from your comment that you fetch this data from MySQL and use json_encode() to output it. json_encode only works for UTF-8 encoded data (and d3 is Ó in UTF-8 as well, this is why you get the wrong unicode sequences).
So, you will have to convert all data from Windows-1251 to UTF-8 before passing it to json_encode, then everything else will work fine.
Converting:
$utf8Array = array_map(function($in) {
return iconv('Windows-1251', 'UTF-8', $in);
}, $rawArray);
utf8_encode will not work because it is only useful for input in ISO-8859-1 encoding.
I had similar problem when storing json datas in MySQL BDD : this solved the problem :
json_encode($json_data, JSON_UNESCAPED_UNICODE) ;

php non latin to hex function

I have website that's in win-1251 encoding and it needs to stay that way. But I also need to be able to echo few links that contain non latin, non cyrillic characters like šžāņūī...
I need a function that convert this
"māja un man tā patīk"
to
"māja un man tā patīk"
and that does not touch html, so if there is <b> it needs to stay as <b>, not > or <
And please no advices about the encoding and how wrong that is.
$str = "<b>Obāchan</b> おばあちゃん";
$str = preg_replace_callback('/./u', function ($matches) {
$chr = $matches[0];
if (strlen($chr) > 1) {
$chr = mb_convert_encoding($chr, 'HTML-ENTITIES', 'UTF-8');
}
return $chr;
}, $str);
This expects the original $str to be UTF-8 encoded, i.e. your PHP file should be saved in UTF-8. It encodes all non-ASCII compatible code points to HTML entities. Since all HTML special characters are ASCII characters, they remain untouched. The resulting string is pure ASCII. Since the lower Win-1251 code points are ASCII compatible, the resulting string is also a valid Win-1251 string. The above $str converts to:
<b>Obāchan</b> おばあちゃん
The main things you probably don't want to encode are <, > and &. Those are really the only special characters. So how about encoding everything first, and then just decode <, > and & I feel you should be fine.
This is untested:
$output =
htmlspecialchars_decode(
htmlentities($input, ENT_NOQUOTES, 'CP-1251')
);
let me know
What Evert suggest looks logical to me too! If you insist this is a way to do it if there are only two letters that bother you. For more letters the scrit will not be as effective and needs to change.
<?PHP
function myConvert($str)
{
$chars['ā']='ā';
$chars['ī']='ī';
foreach ($chars as $key => $value)
$output = str_replace($key, $value, $str);
echo $str;
}
myConvert("māja un man tā patīk");
?>
==================edited==============
For many characters maybe this one can help you:
<?PHP
function myConvert($str)
{
$final=null;
$parts = preg_split("/&#[0-9]*;/i", $str);//get all text parts
preg_match_all("/&#[0-9]*;/i", $str, $delimiters );//get delimiters;
$delimiters[0][]='';//make arrays equal size
foreach($parts as $key => $value)
$final.=$value.mb_convert_encoding
($delimiters[0][$key], "UTF-8", "HTML-ENTITIES");
return $final;
}
$fh = fopen("testFile.txt", 'w') ;
fwrite($fh, myConvert("māja un man tā patīkī"));
fclose($fh);
?>
The desired output is written in the text file. This code, exactly as it is -not merged in some project- does what it claims to do. Converts codes like ā to the analogous character they present.

How to convert some multibyte characters into its numeric html entity using PHP?

Test string:
$s = "convert this: ";
$s .= "–, —, †, ‡, •, ≤, ≥, μ, ₪, ©, ® y ™, ⅓, ⅔, ⅛, ⅜, ⅝, ⅞, ™, Ω, ℮, ∑, ⌂, ♀, ♂ ";
$s .= "but, not convert ordinary characters to entities";
$encoded = mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
asssuming your input string is UTF-8, this should encode most everything into numeric entities.
Well htmlentities doesn't work correctly. Fortunately someone has posted code on the php website that seems to do the translation of multibyte characters properly
I did work on decoding ascii into html coded text (&#xxxx). https://github.com/hellonearthis/ascii2web

Categories