Convert Unicode from JSON string with PHP - php

I've been reading up on a few solutions but have not managed to get anything to work as yet.
I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.
I'd like to use PHP to convert these into either £ or &pound;.
I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:
$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
The output is Â£.
Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?
UPDATE
It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:
That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a3 (pound symbol)

It is not UTF-16 encoding. It looks like bogus encoding, because a \uXXXX escape names a Unicode code point and is independent of any UTF or UCS encoding. \u00c2\u00a3 really does map to the string Â£.
What you should have is \u00a3, which is the Unicode code point for £.
{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.
If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of escaped code points back to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.
function fixBadUnicode($str) {
return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}
Example here: http://phpfiddle.org/main/code/6sq-rkn
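The same repair can also be sketched without regular expressions: let json_decode turn each \u00XX escape into a code point, then map every code point back to a single byte. This is only a minimal sketch. It assumes the mbstring extension is available and that every escape in the string is a byte-wise escaped UTF-8 byte, and the helper name decodeByteEscapedJsonString is made up for illustration.

```php
<?php
// Hypothetical helper: takes a JSON string literal (including its quotes)
// whose original UTF-8 bytes were each escaped as \u00XX.
function decodeByteEscapedJsonString(string $jsonLiteral): string
{
    // Step 1: standard JSON decoding. "\u00c2\u00a3" becomes the two
    // code points U+00C2 U+00A3, stored as UTF-8 ("Â£").
    $decoded = json_decode($jsonLiteral);
    if (!is_string($decoded)) {
        throw new InvalidArgumentException('Expected a JSON string literal');
    }

    // Step 2: converting UTF-8 to ISO-8859-1 maps each code point below
    // U+0100 to one byte, which restores the original UTF-8 byte sequence.
    return mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
}

// Single quotes keep the backslashes literal, as they would arrive from the API.
$pound = decodeByteEscapedJsonString('"\u00c2\u00a3"');
echo bin2hex($pound), "\n"; // c2a3, the UTF-8 bytes of £
```

Code points above U+00FF cannot be mapped to a single byte, so a string that mixes correct and byte-wise escapes would be damaged by this approach.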
Edit:
If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:
function fixBadUnicodeForJson($str) {
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
return $str;
}
Edit 2: fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.
Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.

The output is correct.
\u00c2 == Â
\u00a3 == £
So nothing is wrong here. And converting to HTML entities is easy:
htmlentities($title);

Here is an updated version of the function using preg_replace_callback instead of preg_replace with the /e modifier. Inside a callback the captured groups must be read from $matches; "$1"-style placeholders are not substituted there.
function fixBadUnicodeForJson($str) {
    // Longest sequences first: 4-byte, then 3-byte, 2-byte and single bytes.
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
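A more defensive variant of the same idea can be sketched in a single pass: collect each run of \u00XX escapes for high bytes (0x80 to 0xFF), rebuild the bytes, and substitute them only if the result is valid UTF-8. This is a sketch under assumptions, not a drop-in replacement. The function name fixEscapedUtf8Bytes is hypothetical, and escapes below \u0080 are deliberately left alone so that characters such as \u0022 (a double quote) cannot break the surrounding JSON.

```php
<?php
// Hypothetical helper: repairs runs of byte-wise escaped UTF-8 in raw JSON text.
function fixEscapedUtf8Bytes(string $str): string
{
    return preg_replace_callback(
        // One or more consecutive \u0080 .. \u00ff escapes.
        '/(?:\\\\u00[89a-fA-F][0-9a-fA-F])+/',
        function (array $m): string {
            $bytes = '';
            // Each escape is exactly six characters: \u00XX.
            foreach (str_split($m[0], 6) as $escape) {
                $bytes .= chr(hexdec(substr($escape, 4)));
            }
            // preg_match with the /u flag fails on invalid UTF-8, so this
            // keeps the original escapes when the bytes do not form UTF-8.
            return preg_match('//u', $bytes) === 1 ? $bytes : $m[0];
        },
        $str
    );
}

echo fixEscapedUtf8Bytes('That\u00e2\u0080\u0099s'), "\n";
```

A correctly escaped character such as \u00e9 (é) is left untouched because the lone byte 0xE9 is not valid UTF-8. The trade-off is that a run mixing correct and byte-wise escapes is left unchanged as a whole.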

Related

HTML Entities to hex

How to convert html entities to hex?
I used this code
$username = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $username);
But it doesn't convert chars like < and others.
If you look at the regex used, [\x{80}-\x{10FFFF}], you'll see that it matches only characters whose code point lies between 0x80 and 0x10FFFF, that is, everything outside the ASCII range.
If you take a look at an ASCII chart, you'll see that the hex values of < (0x3C) and > (0x3E) are lower than 0x80, so the regex never matches them. Assuming you got the regex from the API publishers, they probably only want non-ASCII characters converted so that those cause no problems. But you can edit the regex to make it cover other characters as well.
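As a sketch of that edit, the character class can be widened to include the HTML-special ASCII characters alongside the non-ASCII range. The function name toHexEntities is made up for illustration, and mb_ord (PHP 7.2+, mbstring extension) is swapped in for the iconv/bin2hex approach from the question.

```php
<?php
// Hypothetical helper: converts HTML-special and non-ASCII characters
// to hexadecimal numeric character references.
function toHexEntities(string $text): string
{
    return preg_replace_callback(
        // < > & " ' plus every code point from U+0080 upwards.
        '/[<>&"\']|[\x{80}-\x{10FFFF}]/u',
        function (array $m): string {
            // mb_ord returns the Unicode code point of the matched character.
            return sprintf('&#x%X;', mb_ord($m[0], 'UTF-8'));
        },
        $text
    );
}

echo toHexEntities("<b>\u{e9}</b>"), "\n"; // &#x3C;b&#x3E;&#xE9;&#x3C;/b&#x3E;
```

Because & is in the class, text that already contains entities would be encoded a second time, so this is only suitable for raw, unescaped input.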

how to transform japanese english character to normal english character?

I have some Japanese-style (fullwidth) English characters.
They are not a normal English string.
Characters: Ｇａｍｅ
How do I transform them into normal English characters in PHP?
Subtract 65248 from the ordinal value of each character. In other words:
$str = "Ｇａｍｅ some other text by ヴィックサ";
$str = preg_replace_callback(
    "/[\x{ff01}-\x{ff5e}]/u",
    function($c) {
        // decode the 3-byte UTF-8 sequence into its code point
        $code = ((ord($c[0][0])&0xf)<<12)|((ord($c[0][1])&0x3f)<<6)|(ord($c[0][2])&0x3f);
        // fullwidth forms sit 65248 (0xfee0) above their ASCII equivalents
        return chr($code-0xfee0);
    },
    $str);
This will replace all of the "Fullwidth" characters with their normal width equivalents.
It would be easier to use mb_convert_kana:
$string = 'Characters: Ｇａｍｅ';
$newString = mb_convert_kana($string,'a');
I'm sure there is a much easier answer, but couldn't you make a dictionary object with the special character as the key and the character you want as the value,
then just do a simple find and replace?
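The dictionary idea can be sketched with an array and strtr. This is a minimal sketch that assumes the mbstring extension (for mb_chr, PHP 7.2+); the function name fullwidthToAscii is made up for illustration.

```php
<?php
// Hypothetical helper: maps fullwidth ASCII variants (U+FF01..U+FF5E)
// to their normal-width equivalents with a lookup table.
function fullwidthToAscii(string $text): string
{
    $map = [];
    for ($cp = 0xFF01; $cp <= 0xFF5E; $cp++) {
        // Each fullwidth form is 0xFEE0 (65248) above its ASCII counterpart.
        $map[mb_chr($cp, 'UTF-8')] = chr($cp - 0xFEE0);
    }
    // strtr performs all replacements in one pass over the string.
    return strtr($text, $map);
}

echo fullwidthToAscii("\u{FF27}\u{FF41}\u{FF4D}\u{FF45}"), "\n"; // Game
```

The table is rebuilt on every call here for brevity; in real code it could be built once and reused.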

Removing Various symbols like Â é

OK I have read many threads and have found some options that work but now I am just more curious than anything...
I am trying to remove characters like Â and é, as Google does not like them in the XML product feed.
Why does this work:
But neither of these 2 do?
$string = preg_replace("/[^[:print:]]+/", ' ', $string);
$string = preg_replace("/[^[:print:]]/", ' ', $string);
To put it all in context here is the full function:
// Remove all unprintable characters
$string = ereg_replace("[^[:print:]]", ' ', $string);
// Convert back into HTML entities after printable characters removed
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
// Decode back
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
// Remove slashes and strip any HTML tags
$string = strip_tags(stripslashes($string));
// Return the UTF-8 encoded string
return utf8_encode($string);
}
The reason that code doesn't work is that it only removes characters that are not in the POSIX [:print:] character class, which consists of printable characters. á, É, etc. are all printable.
You can find more about POSIX sets here.
Also, removing accented characters might not always be the best option... Check out this question for alternatives.

PHP code explanation question.

I don't know if this is the place to ask this question, so be kind if I am wrong.
I was wondering if someone can explain to me in detail what the following 3 code snippets below do.
Snippet 1
if($str !== mb_convert_encoding(mb_convert_encoding($str, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')){
$str = mb_convert_encoding($str, 'UTF-8');
}
Snippet 2
$str = preg_replace('`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i', '\\1', $str);
Snippet 3
$str = preg_replace(array('`[^a-z0-9]`i','`[-]+`'), '-', $str);
Here is the full code below for reference.
function to_permalink($str){
if($str !== mb_convert_encoding(mb_convert_encoding($str, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')){
$str = mb_convert_encoding($str, 'UTF-8');
}
$str = htmlentities($str, ENT_NOQUOTES, 'UTF-8');
$str = preg_replace('`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i', '\\1', $str);
$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
$str = preg_replace(array('`[^a-z0-9]`i','`[-]+`'), '-', $str);
$str = strtolower(trim($str, '-'));
return $str;
}
Snippet 1 makes sure the string is in UTF-8 encoding.
Snippet 2 converts all special characters to their base form (ie, 'é' -> 'e').
Snippet 3 replaces every character that is not a letter or digit (spaces included) with a hyphen (-) and collapses runs of hyphens into one.
All in all, taking into account the function's name and content, I'd say it is used to make URL friendly links, for example, convert
I discovered a new french word: église
to
i-discovered-a-new-french-word-eglise
Usually used for SEO.
Many of your questions can be answered by looking up what the functions do in your code.
Go here to get started: http://php.net/docs.php
Snippet #1: Checking if the string is valid UTF-8 data by round-trip converting it from source-> UTF-32 -> UTF-8. If the result is NOT the same as the input, then try to let the MB library determine the input encoding and output as UTF-8 regardless. Seems to be rather much work for little gain.
Snippet #2: Looks for named character entities for accented letters and ligatures (such as &eacute; or &AElig;) and replaces each one with just its base letters. The '\\1' replacement is a backreference to the first capture group, not a literal backslash, so &AElig; becomes AE.
Snippet #3: Replaces every character that is not a-z or 0-9 with a -, then collapses any run of one or more - into a single -.
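To see what each step contributes, the function's stages can be run one at a time on the example sentence from the other answer. This is just a walk-through sketch; the intermediate variable names are made up, and the encoding check from snippet 1 is skipped because the input is already UTF-8.

```php
<?php
$str = "I discovered a new french word: \u{e9}glise";

// htmlentities turns the accented letter into a named entity: &eacute;glise
$entities = htmlentities($str, ENT_NOQUOTES, 'UTF-8');

// Snippet 2: keep only the base letter(s) captured by the first group.
$pattern  = '`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i';
$stripped = preg_replace($pattern, '\\1', $entities);

// Snippet 3: non-alphanumerics become hyphens, then runs of hyphens collapse.
$hyphened = preg_replace(array('`[^a-z0-9]`i', '`[-]+`'), '-', $stripped);

$slug = strtolower(trim($hyphened, '-'));
echo $slug, "\n"; // i-discovered-a-new-french-word-eglise
```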

Converting Hex Codes into Characters

Does PHP have a function that searches for hex codes in a string and converts them into their char equivalents?
For example - I have a string that contains the following
"Hello World\x19s"
And I want to convert it to
"Hello World's"
Thanks in advance.
This code will convert "Hello World\x27s" into "Hello World's". It will convert "\x19" into the "end of medium" character, since that's what 0x19 represents in ASCII.
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec($1))', $str);
Correct me if I'm wrong, but I think you should change the replacement like so:
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec(\'$1\'))', $str);
With the single quotes added, characters like '=' (\x3d) are converted correctly too.
The /e modifier will generate an error in current PHP advising you to use preg_replace_callback. Try this:
preg_replace_callback('/\\\\x([0-9a-f]{2})/', function ($m) { return chr(hexdec($m[1])); }, $str );
The /e modifier causes errors in current PHP: it was deprecated in PHP 5.5 and removed in PHP 7. If the codes in your string are HTML or XML entities rather than \xNN escapes, decode them with:
$str = html_entity_decode($str, ENT_QUOTES | ENT_XML1, 'UTF-8');
This will turn &apos; into ' and &amp; into &, etc.
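For the \xNN escapes from the question itself, the callback approach can be wrapped into a small function, sketched below. The name decodeHexEscapes is made up for illustration. PHP's built-in stripcslashes also understands hexadecimal escapes, but it additionally converts other C-style escapes such as \n and octal sequences, which may or may not be wanted.

```php
<?php
// Hypothetical helper: replaces literal \xNN escapes with the byte they name.
function decodeHexEscapes(string $str): string
{
    return preg_replace_callback(
        '/\\\\x([0-9a-fA-F]{2})/',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $str
    );
}

// Single quotes keep the backslash literal, as in data read from a file.
echo decodeHexEscapes('Hello World\x27s'), "\n"; // Hello World's
echo stripcslashes('Hello World\x27s'), "\n";    // Hello World's
```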
