echo $title gives me something like \u00ca\u00e0\u00f7\u00e5\u00eb\u00e8.
It should be a readable text instead. How do I decode it correctly?
I've tried html_entity_decode($title, 0, 'UTF-8'), but it doesn't work for non-English languages. I get something like Êà÷åëè instead of the real text.
Try echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
try this
$title = mb_convert_encoding($title,'HTML-ENTITIES','utf-8');
hope this will work for you.
Edit:
Try this if it works
$title = iconv(mb_detect_encoding($title, mb_detect_order(), true), "UTF-8", $title);
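Every escape in the question is below \u0100, which suggests that each one stands for a single byte of a legacy 8-bit encoding rather than for a real Unicode character. The following is a minimal sketch under that assumption; the helper name decodeEscapedLegacyText is hypothetical, and Windows-1251 is a guess that happens to turn the sample into a Russian word.

```php
<?php
// Hypothetical helper (not from the thread); requires the mbstring extension.
function decodeEscapedLegacyText(string $escaped, string $legacyEncoding = 'Windows-1251'): string
{
    // Step 1: let the JSON parser turn each \u00XX escape into a character.
    // Assumption: $escaped contains no raw double quotes or control characters.
    $chars = json_decode('"' . $escaped . '"');
    if (!is_string($chars)) {
        throw new InvalidArgumentException('Input is not a valid JSON string body');
    }

    // Step 2: every character is in U+0000..U+00FF, so converting to
    // ISO-8859-1 recovers the original single bytes.
    $bytes = mb_convert_encoding($chars, 'ISO-8859-1', 'UTF-8');

    // Step 3: reinterpret those bytes in the encoding they were written in
    // (assumed here, not confirmed by the question).
    return mb_convert_encoding($bytes, 'UTF-8', $legacyEncoding);
}

echo decodeEscapedLegacyText('\u00ca\u00e0\u00f7\u00e5\u00eb\u00e8'), PHP_EOL; // Качели
```

If the result is still unreadable, the legacy encoding guess is probably wrong and another one can be passed as the second argument.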
TLDR: Trying to convert the string \u00e6\u0097\u00a5\u00e6\u009c\u00ac to 日本 in php.
(Trying to get \u00e6\u0097\u00a5\u00e6\u009c\u00ac to echo out 日本)
Hi folks,
I have a json file from Instagram (downloaded my data) and many of my posts contain Japanese text which is stored encoded in UTF-8 (and please correct me if I'm wrong, especially as mb_detect_encoding("\u00e6\u0097\u00a5\u00e6\u009c\u00ac") returns "ASCII").
For example \u00e6\u0097\u00a5\u00e6\u009c\u00ac becomes 日本.
The conversions can be seen working fine on this encoder/decoder website:
https://mothereff.in/utf-8
(Note that if you put 日本 into the above site it returns \xE6\x97\xA5\xE6\x9C\xAC, so adding \xE6\x97\xA5\xE6\x9C\xAC \u00e6\u0097\u00a5\u00e6\u009c\u00ac to the encoded field will produce 日本 日本 in the decoded field)
I'm trying to convert it back to regular Japanese text but am having issues.
I've been googling and searching Stack Overflow for just over a day and have tried many different methods, but I just can't get it to convert. I'm clearly missing something. In most cases it does not change at all.
For the scope of this question, I'm simply trying to convert \u00e6\u0097\u00a5\u00e6\u009c\u00ac into 日本.
I am not trying to convert the json file (though am open to any suggestions that would need me to).
(For the record I am using the variable $str for \u00e6\u0097\u00a5\u00e6\u009c\u00ac)
The following attempts resulted in no visible change, \u00e6\u0097\u00a5\u00e6\u009c\u00ac
echo call_user_func_array('mb_convert_encoding', array(&$str,'HTML-ENTITIES','UTF-8'));
echo iconv('ASCII', 'UTF-8', $str);
echo iconv("UTF-8", "CP1252", $str);
echo iconv('UTF-8', 'ISO-8859-1', $str);
echo iconv('UTF-8', 'UTF-8//IGNORE', utf8_encode($str));
echo iconv('ISO-8859-1', 'UTF-8', $str);
echo iconv('ISO-8859-9', 'UTF-8', $str);
echo iconv(mb_detect_encoding($str, mb_detect_order(), true), "UTF-8", $str);
echo htmlentities($str);
echo mb_convert_encoding($str, 'utf-8', 'iso-8859-1');
echo mb_convert_encoding($str, "EUC-JP", "auto");
echo mb_convert_encoding($str, "utf-8", "windows-1251");
echo mb_convert_encoding($str, "windows-1251", "utf-8");
echo mb_convert_encoding($str,'HTML-ENTITIES', 'UTF-8');
echo mb_convert_encoding($str,"UTF-8","auto");
echo mb_convert_encoding($str,"UTF-8");
echo mb_convert_encoding($str, 'UTF-8', array('EUC-JP', 'SHIFT-JIS', 'AUTO'));
echo mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($str, "UTF-8", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, "ISO-8859-1", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
echo utf8_decode($str);
echo utf8_encode($str);
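A likely reason none of these conversions changes anything: the string holds the literal characters backslash, u, 0, 0, e, 6 and so on, all of which are plain ASCII, and ASCII bytes are identical in every encoding tried above. A small check, assuming $str was assigned exactly as shown in the question:

```php
<?php
// The escapes are literal text here: PHP only interprets \u{...} (with braces).
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

var_dump(strlen($str));                                              // int(36)
var_dump(mb_check_encoding($str, 'ASCII'));                          // bool(true)

// Converting pure ASCII between ASCII-compatible encodings is a no-op.
var_dump(iconv('ISO-8859-1', 'UTF-8', $str) === $str);               // bool(true)
var_dump(mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1') === $str); // bool(true)
```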
The following attempts resulted in the backslashes being doubled and double quotation marks being added, "\\u00e6\\u0097\\u00a5\\u00e6\\u009c\\u00ac"
echo json_encode($str,JSON_HEX_TAG);
echo json_encode($str,JSON_UNESCAPED_UNICODE |JSON_PRETTY_PRINT);
echo json_encode($str,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
The following attempts resulted in nothing being returned,
echo json_decode($str, JSON_HEX_TAG);
echo json_decode($str, false);
echo json_decode($str, false, 512, JSON_UNESCAPED_UNICODE);
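The json_encode and json_decode results above can be explained by how JSON treats the text: a bare run of \u escapes is not a JSON value, while the same run wrapped in double quotes is a valid JSON string literal. A short sketch of both directions, assuming the same $str as in the question:

```php
<?php
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// Bare escapes are not a JSON value, so decoding fails and returns null.
$bare = json_decode($str);
$bareFailed = json_last_error() !== JSON_ERROR_NONE;
var_dump($bare);        // NULL
var_dump($bareFailed);  // bool(true)

// Wrapped in double quotes, the same text is a valid JSON string literal.
$decoded = json_decode('"' . $str . '"');
var_dump(mb_strlen($decoded, 'UTF-8')); // int(6): six characters, all below U+0100

// json_encode goes the other way and escapes each backslash.
$encoded = json_encode($str);
echo $encoded, PHP_EOL;
```

The six decoded characters are still mojibake at this point; they only become readable once they are mapped back to single bytes.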
The following attempt resulted in the backslashes changing to an unknown character, �_u00e6�_u0097�_u00a5�_u00e6�_u009c�_u00ac
echo mb_convert_encoding($str, "SJIS");
From the PHP documentation I tried this to see if any of the combinations would work, but none did.
https://www.php.net/manual/en/function.mb-convert-encoding.php#97902
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, 'UTF-8', $chr)." : ".$chr."<br>";
}
echo "<br>--- REVERSE TRY ---<br><br>";
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, $chr, 'UTF-8')." : ".$chr."<br>";
}
I tried using the Unicode Codepoint Escape Syntax, which gave æ¥æ¬ rather than 日本
https://www.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax
echo "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";
As mentioned in the brackets earlier, \xE6\x97\xA5\xE6\x9C\xAC does convert to 日本 when echoed.
echo "\xE6\x97\xA5\xE6\x9C\xAC";
Noticing above that the two different codes had the same endings, I tried using str_replace so that they would match, but this time \xE6\x97\xA5\xE6\x9C\xAC was echoed.
echo str_replace("\U00","\x",strtoupper($str));
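The difference between the last two attempts comes down to when escapes are interpreted: \xE6 inside a double-quoted literal is converted to one byte by the PHP parser, whereas a string assembled at runtime is never re-parsed. A minimal illustration (the variable names are only for this sketch):

```php
<?php
// The parser turns each \xHH in a double-quoted literal into one byte.
$literal = "\xE6\x97\xA5\xE6\x9C\xAC";

// Built at runtime, the same characters stay as plain text.
$runtime = str_replace('\u00', '\x', '\u00e6\u0097\u00a5\u00e6\u009c\u00ac');

var_dump(strlen($literal));       // int(6)
var_dump(strlen($runtime));       // int(24)
var_dump($literal === $runtime);  // bool(false)

// Converting the hex digits explicitly does produce the same bytes.
var_dump(hex2bin(str_replace('\x', '', $runtime)) === $literal); // bool(true)
```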
I have also tried all of the above with and without the following:
header('Content-Type: text/plain; charset="UTF-8"');
Here is a segment of the original JSON file (original file is 13k lines, so here is a single element).
{
"media": [
{
"uri": "media/posts/202104/175127092_241529264421003_4026764649651789139_n_18106766305234668.jpg",
"creation_timestamp": 1619277565,
"title": "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088 #GodJesusRobot #robot #toyholiday #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window"
}
]
}
UPDATE
Based on the comments by @jerry and @yourcommonsense, hex2bin can work, provided the \u00 prefixes are dropped from the string first. hex2bin(str_replace('\u00', '', $str)); works for the string mentioned in the TLDR and the upper part of the question, but to tackle the full title string in the JSON I've come up with a very ugly and messy method.
$str = "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088 #GodJesusRobot #robot #toyholiday #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window";
$pattern = '/(\\\\u00..)+/i';
function getHex2Bin($matches) {
return hex2bin(str_replace("\U00","",strtoupper($matches[0])));
}
$result = preg_replace_callback($pattern, 'getHex2Bin', $str);
echo $result;
This does work, giving me my desired result:
Time to head back to Tokyo. Fukuoka Airport, Japan. 18 October 2020 . . . . . #japan #日本 #toyphotography #toy #おもちゃ #ロボット #GodJesusRobot #robot #toyholiday #holiday #vacation #旅行 #photography #写真 #japan_of_insta #japantravel #日本旅行 #travel #kitakyushu #北九州 #airport #空港 #fukuokaairport #福岡空港 #plane #airplane #aeroplane #飛行機 #windowseat #window
but I can't help feeling that there is a much neater solution.
Update 2
Here is a PHP Sandbox showing the results of all attempts mentioned above, including the messy working one.
You face a mojibake case (example in Python for its universal intelligibility):
print('\u00e6\u0097\u00a5\u00e6\u009c\u00ac'.encode('latin1').decode())
日本
Let's rewrite above code in PHP terms (utilizing the json_decode function):
<?php
$str_chin = "日本";
echo $str_chin . " => test: chinese to JSON => "
. json_encode($str_chin) . PHP_EOL;
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';
$str_moj = json_decode('"' . $str . '"');
echo $str . " => mojibake => "
. $str_moj . PHP_EOL;
echo $str . " => solution => "
. mb_convert_encoding($str_moj, 'iso-8859-1', 'utf-8');
?>
Output:
日本 => test: chinese to JSON => "\u65e5\u672c"
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => mojibake => æ¥æ¬
\u00e6\u0097\u00a5\u00e6\u009c\u00ac => solution => 日本
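The same two steps can be applied to a whole decoded JSON document instead of a single string. This is a sketch rather than a definitive fix: the helper name fixMojibake and the shortened sample JSON are made up for illustration, and it assumes every string in the file is affected and contains only characters up to U+00FF (anything else would be replaced during the conversion).

```php
<?php
// Hypothetical helper: map each character U+0000..U+00FF back to one byte,
// so the resulting byte sequence is the originally intended UTF-8 text.
function fixMojibake(string $text): string
{
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}

// Shortened stand-in for the export file; real code would read it with
// file_get_contents().
$json = '{"media":[{"creation_timestamp":1619277565,'
      . '"title":"#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac"}]}';

$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Repair every string leaf; numbers and other types are left untouched.
array_walk_recursive($data, function (&$value): void {
    if (is_string($value)) {
        $value = fixMojibake($value);
    }
});

echo $data['media'][0]['title'], PHP_EOL; // #japan #日本
```

Decoding the file first also handles \n and other JSON escapes, which the regex approach has to work around.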
I am trying to replace all occurrences of \/ in a string output in PHP with /, but it is not working.
Here is my code:
$output = str_replace("\\/", "/", $output);
echo json_encode($output, JSON_UNESCAPED_UNICODE );
echo json_encode($output, JSON_UNESCAPED_SLASHES);
but I am still getting such strings in the output on the webpage, like:
https:\/\/img.xxxx.com\/images\/channel-resources\/1\/def\/43\/0\/1\/defintion.png
or something like that:
https:\/\/img.yyyy.de\/images\/channel-resources\/1\/obchi\/43\/0\/1\/obchi_1.png
If I switch the order of the two functions like that:
$output = str_replace("\\/", "/", $output);
echo json_encode($output, JSON_UNESCAPED_SLASHES);
echo json_encode($output, JSON_UNESCAPED_UNICODE );
I get the slashes written correctly, but the German letters appear in a weird form, like "\u00df" or "\u00f6\u00df"; for example, the word "große" is written as "gro\u00dfe".
Does anyone have an idea how to fix that, so that both the German letters and the URIs are written correctly, like "https://img.xxxx.com/images/channel-resources/1/def/43/0/1/defintion.png"?
You're using the wrong constant.
Use JSON_UNESCAPED_SLASHES instead of JSON_UNESCAPED_UNICODE to prevent escaping the slashes in json_encode().
You can specify both using JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE.
See http://php.net/manual/en/json.constants.php
$output = str_replace("\\/", "/", $output);
echo json_encode($output, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
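To see what the two flags change, here is a small comparison using made-up sample data (the array keys are assumptions, not from the question). Both outputs are valid JSON and decode to the same value; the flags only affect how the text looks.

```php
<?php
$output = [
    'url'  => 'https://img.xxxx.com/images/channel-resources/1/def/43/0/1/defintion.png',
    'word' => 'große',
];

// Default: slashes and non-ASCII characters are escaped.
$default = json_encode($output);
echo $default, PHP_EOL;
// {"url":"https:\/\/img.xxxx.com\/images\/channel-resources\/1\/def\/43\/0\/1\/defintion.png","word":"gro\u00dfe"}

// Both flags combined with |, in a single call.
$readable = json_encode($output, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
echo $readable, PHP_EOL;
// {"url":"https://img.xxxx.com/images/channel-resources/1/def/43/0/1/defintion.png","word":"große"}

// A JSON parser sees no difference between the two.
var_dump(json_decode($default, true) === json_decode($readable, true)); // bool(true)
```

Calling json_encode twice with one flag each, as in the question, prints two separate JSON documents, each with only one of the two fixes.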
Try to echo $output and check it, I am almost sure it is the json_encode() you are using that adds the \ for you
Sequences such as \u00df are JSON escapes for Unicode characters (\u00df is ß).
Try this to convert them into HTML entities instead:
$output = 'http:\/\/ßßüüää.com\/';
$output = str_replace("\\/", "/", $output);
$output = htmlentities($output, ENT_COMPAT, "UTF-8");
echo json_encode($output, JSON_UNESCAPED_SLASHES);
I have to save a Russian-language product description in the database. For that, I converted the string to UTF-8 using the code below:
$data = 'Это русский';
$cData = iconv(mb_detect_encoding($data, mb_detect_order(), true), "UTF-8", $data);
It is working fine, but I need to get that data back and I don't know how to decode it again. I tried the code below, but it is not working:
$des = $object->getDescription("ru");
$enc = mb_detect_encoding($des, "UTF-8,ISO-8859-1");
echo iconv($enc, "UTF-8", $des);
and I tried the following, which is not working either:
utf8_decode ( $data );
Can anyone tell me how to decode this?
Update:
I tried the following to encode:
$data = 'Это русский';
$cData = htmlentities($data, ENT_COMPAT, 'UTF-8');
It's working fine, but how do I decode this?
I tried the following, but it is not working:
$des = $object->getDescription("ru");
echo $cData = htmlentities($des, ENT_COMPAT, 'UTF-8');
The encoding appears to be Windows-1251.
Encode to UTF-8 using:
$html_utf8 = mb_convert_encoding($html, "utf-8", "windows-1251");
Decode back to Windows-1251 using:
$html_1251 = mb_convert_encoding($html, "windows-1251", "utf-8");
PHP has a UTF-8 decode function, utf8_decode()
utf8_decode ( $data );
From the manual:
This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1.
What is your original encoding?
In your decoding example you're not converting back to your original encoding, but again to UTF-8. Try this:
$original_encoding = '...'; //put your original encoding here
$description = $object->getDescription("ru");
echo iconv('UTF-8', $original_encoding, $description);
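A round trip makes the direction of each conversion explicit. The sketch below assumes the source file is saved as UTF-8 and uses Windows-1251 only as an example of an "original" encoding, as suggested in one of the answers; it is not known what the asker's database actually stores.

```php
<?php
$utf8 = 'Это русский'; // assumed: this file itself is saved as UTF-8

// UTF-8 -> Windows-1251: Cyrillic letters shrink from two bytes to one.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');
var_dump(strlen($utf8));   // int(21)
var_dump(strlen($cp1251)); // int(11)

// Windows-1251 -> UTF-8 restores the original bytes exactly.
$back = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');
var_dump($back === $utf8); // bool(true)

// Text that is already valid UTF-8 needs no "decoding" before display,
// as long as the page or connection is declared as UTF-8.
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
```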
How do I convert the text below to something like "Växjö" using PHP?
V&auml;xj&ouml;
I have tried
html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $text), ENT_NOQUOTES, 'UTF-8')
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text)
Any PHP version from 5.0 onwards should be fine with...
$decoded = html_entity_decode('V&auml;xj&ouml;', ENT_COMPAT, 'UTF-8');
Demo here - http://3v4l.org/DZc59
echo html_entity_decode('V&auml;xj&ouml;', ENT_QUOTES, 'UTF-8');
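html_entity_decode handles named, decimal and hexadecimal character references alike, so it should not matter which form the input uses. A quick sketch; the three input strings are illustrative, since the exact form of the asker's text is not known:

```php
<?php
$flags = ENT_QUOTES | ENT_HTML5;

$named   = html_entity_decode('V&auml;xj&ouml;', $flags, 'UTF-8'); // named entities
$decimal = html_entity_decode('V&#228;xj&#246;', $flags, 'UTF-8'); // decimal references
$hex     = html_entity_decode('V&#xE4;xj&#xF6;', $flags, 'UTF-8'); // hexadecimal references

echo $named, PHP_EOL;   // Växjö
echo $decimal, PHP_EOL; // Växjö
echo $hex, PHP_EOL;     // Växjö
```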
I'm using Zend Framework with MongoDB. I need to convert French characters to HTML entities.
For example: Prénom -> Pr&eacute;nom. What could I do?
htmlentities ( http://php.net/htmlentities ) can do this if you call:
htmlentities('Prénom', ENT_COMPAT, 'UTF-8');
I get:
Pr&eacute;nom
as the result
Maybe you can take a look at strtr function (Read more at http://php.net/strtr)?
I think that the right way to look is either mb_convert_encoding or htmlentities
Here is an example:
$text = "Prénom";
echo mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
echo "\n";
echo htmlentities($text, ENT_COMPAT | ENT_HTML401, 'UTF-8');
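Of the two options, htmlentities is probably the safer choice today, because passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. The sketch below shows the encoding step, the reverse step, and htmlspecialchars as an alternative when the page is served as UTF-8 and only markup characters need escaping.

```php
<?php
$text = 'Prénom';

// Encode: every character with a named entity is replaced.
$encoded = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $encoded, PHP_EOL; // Pr&eacute;nom

// Decode: the original text comes back unchanged.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');
var_dump($decoded === $text); // bool(true)

// Alternative: escape only the HTML-significant characters and keep é as is.
$escaped = htmlspecialchars('<b>Prénom</b>', ENT_QUOTES, 'UTF-8');
echo $escaped, PHP_EOL; // &lt;b&gt;Prénom&lt;/b&gt;
```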