PHP: convert escaped characters to UTF-8 - php

I have emoji stored as JavaScript escape sequences in a string and need to pass it to Android as JSON. On the Android side the text has to be UTF-8 for the emoji to display properly. For example, the string contains \ud83d\ude00, the escaped form of GRINNING FACE, and I need PHP to convert it to the bytes f0 9f 98 80.
I tried mb_convert_encoding and iconv, but both output strange characters. Please help. Thanks.

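This looks like a duplicate of:
How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?
The snippet from that answer converts each \uXXXX escape on its own, which breaks surrogate pairs such as \ud83d\ude00. Decoding each run of escapes as UTF-16BE keeps the pair together:
$str = preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($match) {
    // Drop the "\u" prefixes, pack the hex digits into bytes, then decode the whole run as UTF-16BE.
    return mb_convert_encoding(pack('H*', str_replace('\\u', '', $match[0])), 'UTF-8', 'UTF-16BE');
}, $str);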
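Another option, sketched here under the assumption that the escapes follow JSON string syntax, is to let json_decode() do the work, since it combines surrogate pairs into a single code point. The helper name decode_json_escapes is hypothetical and only for illustration.

```php
<?php
// Hypothetical helper: decode JSON-style \uXXXX escapes via json_decode().
function decode_json_escapes(string $text): string
{
    // Escape bare double quotes so the text can be wrapped as a JSON string literal.
    $literal = '"' . str_replace('"', '\\"', $text) . '"';
    $decoded = json_decode($literal);
    // json_decode() returns null when the literal is not valid JSON; fall back to the input.
    return is_string($decoded) ? $decoded : $text;
}

$emoji = decode_json_escapes('\ud83d\ude00');
echo bin2hex($emoji), PHP_EOL; // f09f9880
```

The fallback means input that is not a valid JSON string body (for example, text containing raw control characters) is returned unchanged rather than lost.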

Related

Trouble decoding some special characters &#146; &#147; &#148;

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually, if I take this text directly and encode it with htmlentities, I see different entities to begin with:
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non-standard? Do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking for are not listed. When I test a few characters from this table, they decode just fine. I guess the list in PHP is incomplete by default?
If you check the Unicode chart for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
you will see that the code points in your string are control characters with no printable glyphs.
That means they are expected to render as empty squares (or dots, depending on how your renderer treats them).
If it works for someone somewhere, that is non-standard behaviour and must not be relied on.
Apparently the text was originally encoded as cp1250, so you should either treat it accordingly or re-encode the entities manually:
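$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
    // chr() yields the single byte for the entity number; iconv reinterprets that byte as cp1250.
    return iconv('cp1250', 'utf-8', chr((int) $m[1]));
}, $str);
echo $str;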
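A slightly more defensive variant is sketched below. It assumes that only numeric entities in the 128-159 range were meant as Windows-1252 bytes, and leaves every other entity to html_entity_decode(). The function name decode_legacy_entities is hypothetical.

```php
<?php
// Hypothetical helper: treat numeric entities 128-159 as Windows-1252 bytes,
// then let html_entity_decode() handle every other entity.
function decode_legacy_entities(string $html): string
{
    $html = preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            // Reinterpret the number as a single Windows-1252 byte.
            return mb_convert_encoding(chr($code), 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // leave other entities for html_entity_decode()
    }, $html);

    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_legacy_entities('Thi&#146;s a&#148;n e&#147;xample'), PHP_EOL;
```

Restricting the byte reinterpretation to 128-159 avoids passing larger entity numbers to chr(), which only works with values that fit in one byte.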

Problems with special chars encoding with an access mdb database using php

My boss is forcing me to use an Access mdb database (yes, I'm serious) on a PHP server.
I can connect to it and retrieve data from it, but as you can imagine, I have encoding problems because I want to work in UTF-8.
The thing is that I now have two "solutions" for translating Windows-1252 to UTF-8.
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not converted properly. For example, the char º is converted to \u00ba and the char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but the same thing happens: special chars are not converted correctly. The char º is converted to &ordm;
Does anybody know how to properly change the encoding, including special chars?
Or does anybody know how to convert &ordm; and \u00ba to something readable?
I did a simple test to convert code points to letters:
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit goes to this answer.
I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it, for me, is mb_convert_encoding($string, "UTF-8", "Windows-1252").
But I was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
That is why it was returning Unicode escapes like \u20ac. If I had done this instead:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I would have seen the solution from the beginning. It was json_encode() that was turning the special chars into Unicode escapes.
Thanks everybody for your help!!
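To make the json_encode() behaviour concrete, here is a small sketch. The two input bytes are an assumed sample (º and Ó in Windows-1252), and JSON_UNESCAPED_UNICODE is the standard json_encode() flag that keeps non-ASCII characters unescaped.

```php
<?php
// Bytes 0xBA and 0xD3 are "º" and "Ó" in Windows-1252.
$utf8 = mb_convert_encoding("\xBA\xD3", 'UTF-8', 'Windows-1252');

echo json_encode($utf8), PHP_EOL;                         // "\u00ba\u00d3"
echo json_encode($utf8, JSON_UNESCAPED_UNICODE), PHP_EOL; // "ºÓ"
```

Both outputs are valid JSON for the same string, so a JSON consumer decodes either one to the same UTF-8 text.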

php converting unknown symbols to the known symbols in url

I need to convert the escaped symbols in a URL
like this:
https://r4---sn-hgn7zn7r.c.docs.google.com/videoplayback?requiressl\u003dyes\u0026id\u003d376b916e4a3c65b1\u0026itag\u003d22\u0026source\u003dwebdrive\u0026app\u003dtexmex\u0026ip\u003d109.110.116.1\u0026ipbits\u003d8\u0026expire\u003d1456065477\u0026sparams\u003drequiressl%2Cid%2Citag%2Csource%2Cip%2Cipbits%2Cexpire\u0026signature\u003d5C06093099C3B4A7DE28AF323E2E15AC7DE5BEEE.758E1110B23CD41EA7E246DE2564ABE5368431FE\u0026key\u003dck2\u0026mm\u003d30\u0026mn\u003dsn-hgn7zn7r\u0026ms\u003dnxu\u0026mt\u003d1456050981\u0026mv\u003dm\u0026nh\u003dIgpwcjAyLm1yczAyKgkxMjcuMC4wLjE\u0026pl\u003d22
to the real link,
like this:
https://r4---sn-hgn7zn7r.c.docs.google.com/videoplayback?requiressl=yes&id=376b916e4a3c65b1&itag=22&source=webdrive&app=texmex&ip=109.110.116.1&ipbits=8&expire=1456065477&sparams=requiressl,id,itag,source,ip,ipbits,expire&signature=5C06093099C3B4A7DE28AF323E2E15AC7DE5BEEE.758E1110B23CD41EA7E246DE2564ABE5368431FE&key=ck2&mm=30&mn=sn-hgn7zn7r&ms=nxu&mt=1456050981&mv=m&nh=IgpwcjAyLm1yczAyKgkxMjcuMC4wLjE&pl=22
I have no idea how to convert it.
I currently use this website to convert the link:
DDecode - Hex,Octal,HTML Decode
In your case, you have to convert Unicode escape sequences like "\uXXXX" into UTF-8 characters.
Use the preg_replace_callback function to replace every matched escape sequence with the corresponding UTF-8 character.
In the callback, pack converts the four hex digits into a two-byte binary string, and mb_convert_encoding then converts those bytes from UCS-2BE to UTF-8.
$str = "https://r4---sn-hgn7zn7r.c.docs.google.com/videoplayback?requiressl\u003dyes\u0026id\u003d376b916e4a3c65b1\u0026itag\u003d22\u0026source\u003dwebdrive\u0026app\u003dtexmex\u0026ip\u003d109.110.116.1\u0026ipbits\u003d8\u0026expire\u003d1456065477\u0026sparams\u003drequiressl%2Cid%2Citag%2Csource%2Cip%2Cipbits%2Cexpire\u0026signature\u003d5C06093099C3B4A7DE28AF323E2E15AC7DE5BEEE.758E1110B23CD41EA7E246DE2564ABE5368431FE\u0026key\u003dck2\u0026mm\u003d30\u0026mn\u003dsn-hgn7zn7r\u0026ms\u003dnxu\u0026mt\u003d1456050981\u0026mv\u003dm\u0026nh\u003dIgpwcjAyLm1yczAyKgkxMjcuMC4wLjE\u0026pl\u003d22";
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, rawurldecode($str));
echo $str;
// the output:
https://r4---sn-hgn7zn7r.c.docs.google.com/videoplayback?requiressl=yes&id=376b916e4a3c65b1&itag=22&source=webdrive&app=texmex&ip=109.110.116.1&ipbits=8&expire=1456065477&sparams=requiressl,id,itag,source,ip,ipbits,expire&signature=5C06093099C3B4A7DE28AF323E2E15AC7DE5BEEE.758E1110B23CD41EA7E246DE2564ABE5368431FE&key=ck2&mm=30&mn=sn-hgn7zn7r&ms=nxu&mt=1456050981&mv=m&nh=IgpwcjAyLm1yczAyKgkxMjcuMC4wLjE&pl=22
http://php.net/manual/en/function.preg-replace-callback.php
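To see what the callback does for a single escape, the steps can be run by hand. This sketch uses the digits from \u003d as its sample input; the variable names are only for illustration.

```php
<?php
$hex = '003d';             // the four digits captured from \u003d
$bytes = pack('H*', $hex); // two raw bytes: 0x00 0x3D
$char = mb_convert_encoding($bytes, 'UTF-8', 'UCS-2BE');

echo bin2hex($bytes), ' -> ', $char, PHP_EOL; // 003d -> =
```

The same steps turn \u0026 into &, which is why the decoded URL ends up with ordinary query separators.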
It appears to be "Unicode Escape Sequences for Latin 1 Characters" (see http://archive.oreilly.com/pub/a/actionscript/excerpts/as3-cookbook/appendix.html).
A quick search didn't find any native library for decoding this in PHP, but it should be straightforward to decode the characters you're most likely to encounter that need decoding (& and = specifically).
Here's a SO solution to doing it from 5 years ago: How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to run utf8_decode() on this string before I can write it to a PDF file (using the domPDF lib).
However, the output of utf8_decode() for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
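The limitation is easy to reproduce. The sketch below assumes mbstring's default substitute character ("?") and uses ISO-8859-2 (Latin-2) purely as a contrasting example of a single-byte encoding that does cover Czech.

```php
<?php
$city = "M\u{011B}sto"; // "Město", written with an escape so the file encoding does not matter

// Latin-1 has no slot for U+011B, so mbstring substitutes "?".
echo mb_convert_encoding($city, 'ISO-8859-1', 'UTF-8'), PHP_EOL; // M?sto

// Latin-2 (ISO-8859-2) covers Czech; "ě" becomes the single byte 0xEC.
echo bin2hex(mb_convert_encoding($city, 'ISO-8859-2', 'UTF-8')), PHP_EOL; // 4dec73746f
```

Whether a single-byte target encoding is usable at all depends on what the PDF library and its fonts accept, which this sketch does not address.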
I wound up using a home-grown UTF-8 / UTF-16 decoding function that converts to &#number; representations. I haven't found any pattern to why UTF-8 isn't detected; I suspect it's because the "encoded-as" sequence isn't always in exactly the same position in the returned string. You might do some additional checking on that.
Three-byte UTF-8 indicator: $startutf8 = chr(0xEF).chr(0xBB).chr(0xBF); (if you see this ANYWHERE, not just in the first three bytes, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* ereg() and the /e modifier were removed in PHP 7, so this uses preg_match() and callbacks */
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // decode three-byte UTF-8 sequences into &#number; entities
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
    }, $string);
    // decode two-byte UTF-8 sequences into &#number; entities
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    }, $string);
    return $string;
}
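If mbstring is available, its built-in mb_encode_numericentity() produces the same kind of &#number; output without hand-written byte arithmetic. This is a swapped-in alternative rather than the answer's own method, and the conversion map below is an assumption chosen to cover every code point from U+0080 upward.

```php
<?php
// Convert every code point from U+0080 upward into a decimal &#NNN; entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF]; // [first, last, offset, mask]

echo mb_encode_numericentity("M\u{011B}sto", $map, 'UTF-8'), PHP_EOL; // M&#283;sto
```

The value 283 matches the two-byte formula above: "ě" is the bytes 0xC4 0x9B, and (0xC4 - 192) * 64 + (0x9B - 128) = 283.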
The problem is your PHP file's encoding. Save the file as UTF-8 and you won't even need utf8_decode. If you get data such as 'Praha 5, Staré Město,' from a database, it is better to change its charset to UTF-8.
You don't need that (@Rajeev: this string is automatically detected as UTF-8 encoded;
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8).
You should rather look at:
https://code.google.com/p/dompdf/wiki/CPDFUnicode

Convert UTF8 characters returned from Facebook Graph API

The characters come back as escaped code points, like..
"\u676f\u845b"
How to convert it back to normal UTF8 string in PHP?
The simple approach would be to wrap your string in double quotes and let json_decode convert the \uXXXX escapes. (These happen to be JavaScript string syntax.)
$str = json_decode("\"$str\"");
Seems to be asian letters: 杯葛 (It's already UTF-8 when json_decode returns it.)
(Source)
http://webarto.com/83/php-unicode_decode-5.3
demo: http://ideone.com/AtY0v
function decode_encoded_utf8($string){
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function ($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
echo decode_encoded_utf8('\u676f\u845b'); # 杯葛
