I use this table of Emoji and try this code:
<?php print json_decode('"\u2600"'); // This converts to ☀ (black sun with rays) ?>
If I try to convert this \u1F600 (grinning face) through json_decode, I see this symbol — ὠ0.
What's wrong? How do I get the right emoji?
PHP 5
JSON's \u can only handle one UTF-16 code unit at a time, so you need to write the surrogate pair instead. For U+1F600 this is \uD83D\uDE00, which works:
echo json_decode('"\uD83D\uDE00"');
😀
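The arithmetic behind that surrogate pair can be sketched as follows. This is only an illustration: the helper name surrogatePair is made up for this example, and it assumes the input is a code point in the range U+10000 to U+10FFFF.

```php
<?php
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its two UTF-16 surrogate code units.
function surrogatePair($cp)
{
    $offset = $cp - 0x10000;            // 20-bit value
    $high = 0xD800 + ($offset >> 10);   // top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF); // bottom 10 bits
    return [$high, $low];
}

list($high, $low) = surrogatePair(0x1F600);
$json = sprintf('"\u%04X\u%04X"', $high, $low);
echo $json, "\n";              // "\uD83D\uDE00"
echo json_decode($json), "\n"; // 😀
```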
PHP 7
You no longer need json_decode; you can use the \u{...} Unicode code point escape directly in a double-quoted string:
echo "\u{1F30F}";
🌏
In addition to Tino's answer, here is code that converts a hexadecimal code point such as 0x1F63C to a Unicode symbol in PHP 5 by splitting it into a surrogate pair:
function codeToSymbol($em) {
    if ($em >= 0x10000) {
        // Supplementary plane: encode as a UTF-16 surrogate pair
        $first = (($em - 0x10000) >> 10) + 0xD800;
        $second = (($em - 0x10000) % 0x400) + 0xDC00;
        return json_decode('"' . sprintf("\\u%04X\\u%04X", $first, $second) . '"');
    } else {
        // BMP: JSON requires exactly four hex digits after \u
        return json_decode('"' . sprintf("\\u%04X", $em) . '"');
    }
}
echo codeToSymbol(0x1F63C); // outputs 😼
Example of parsing a string that contains emoji in \UXXXXXXXX escape format:
$str = 'Test emoji \U0001F607 \U0001F63C';
echo preg_replace_callback(
'/\\\U([A-F0-9]+)/',
function ($matches) {
return mb_convert_encoding(hex2bin($matches[1]), 'UTF-8', 'UTF-32');
},
$str
);
Output: Test emoji 😇 😼
https://3v4l.org/63dUR
Related
I'm trying to write my own mb_ucwords() function as a quick wrapper around mb_convert_case so that it works with multibyte strings, since the base ucwords() function does not.
I have run into an issue where a string that starts with the µ character (U+00B5 MICRO SIGN) comes back starting with "Μ" (U+039C GREEK CAPITAL LETTER MU) instead of being left alone, as I assumed it would be.
I wrote a quick test script to verify some information:
function testUtf8($letter) {
echo "CHAR: " . $letter . "\n";
echo "Detected Encoding: " . mb_detect_encoding($letter) . "\n";
echo "IS VALID UTF-8? " . (mb_check_encoding($letter, 'UTF-8') ? 'YES' : 'NO') . "\n";
$lower = mb_strtolower($letter);
$upper = mb_strtoupper($letter);
$conv = mb_convert_case($letter, MB_CASE_TITLE, 'UTF-8');
echo "mb_strtolower(): " . $lower . "(" . mb_ord($lower) . ")\n";
echo "mb_strtoupper(): " . $upper . "(" . mb_ord($upper) . ")\n";
echo "mb_convert_case(): " . $conv . "(" . mb_ord($conv) . ")\n";
echo "\n";
echo "Matches RegEx /\p{L}/u: " . (preg_match('/\p{L}/u', $letter) ? 'YES' : 'NO') . "\n";
echo "Matches RegEx /\p{N}/u: " . (preg_match('/\p{N}/u', $letter) ? 'YES' : 'NO') . "\n";
echo "Matches RegEx /\p{Xan}/u: " . (preg_match('/\p{Xan}/u', $letter) ? 'YES' : 'NO') . "\n";
}
testUtf8('µ');
And the output I get is:
CHAR: µ
Detected Encoding: UTF-8
IS VALID UTF-8? YES
mb_strtolower(): µ(181)
mb_strtoupper(): Μ(924)
mb_convert_case(): Μ(924)
Matches RegEx /\p{L}/u: YES
Matches RegEx /\p{N}/u: NO
Matches RegEx /\p{Xan}/u: YES
Can someone explain to me why PHP thinks µ is a "letter" and why the MB uppercase version is "Μ"? I was going to work around this by testing the first letter of each word and verifying that it was a valid Unicode "letter" before running the conversion, but as you can see that won't work for this character, since /\p{L}/u matches it :(
Any idea how I can work around this?
Here is the rough draft of my function:
/**
 * @param string $string The string to convert
 * @param string $encoding Default is UTF-8
 * @param string $delim_pattern Pattern used to break $string into words
 * @return string
 */
public static function mb_ucwords(
string $string,
string $encoding = 'UTF-8',
string $delim_pattern = '/([\/\-\s\v"\'\\\]+)/u'
): string {
$words = preg_split($delim_pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
$output = "";
foreach($words as $word) {
$output .= mb_convert_case($word, MB_CASE_TITLE, $encoding);
}
return $output;
}
Currently testing this code against PHP 7.4.
EDIT:
Apparently this is a Greek letter as well as the symbol for micro, and Μ is the capital version of said Greek letter. I'm not sure how to handle this...
In Unicode 2, µ (U+00B5 MICRO SIGN) was changed to have a compatibility decomposition of μ (U+03BC GREEK SMALL LETTER MU). At the same time, its category was changed from symbol to letter, to match μ (U+03BC GREEK SMALL LETTER MU). This means that U+00B5 should not be used in new text; it is only to be used for compatibility with non-Unicode character sets. Under certain normalization forms, these are considered to be the same character.
In Unicode 3.0, it was updated to have Μ (U+039C GREEK CAPITAL LETTER MU) as its uppercase mapping, giving the result that you see now.
Unfortunately, since µ (U+00B5 MICRO SIGN) is basically deprecated, you're on your own if you use it. You could compare the first character of the string with µ (U+00B5 MICRO SIGN) before calling mb_convert_case. However, there's no guarantee that some system won't silently convert it to μ (U+03BC GREEK SMALL LETTER MU), for example if it normalizes the string. If you will never otherwise use μ (U+03BC GREEK SMALL LETTER MU), you could special-case that character as well.
The fail-safe way to handle this without breaking support for Greek text would be to use some sort of markup language or rich text to indicate that the character is used as a symbol instead of a letter, and then parse that when performing the case conversion. But that would obviously be a larger undertaking.
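The simpler special-casing approach mentioned above might look roughly like this. It is a minimal sketch, not a complete fix: the function name titleCaseUnlessMicro is hypothetical, it assumes UTF-8 input, and it deliberately does not special-case μ (U+03BC), so a normalized string would still be uppercased.

```php
<?php
// Hypothetical helper: title-case one word, but leave words that begin
// with U+00B5 MICRO SIGN untouched. Assumes $word is UTF-8.
function titleCaseUnlessMicro(string $word): string
{
    $micro = "\u{00B5}"; // MICRO SIGN, not U+03BC GREEK SMALL LETTER MU
    if (mb_substr($word, 0, 1, 'UTF-8') === $micro) {
        return $word;
    }
    return mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');
}

echo titleCaseUnlessMicro("\u{00B5}m"), "\n"; // µm
echo titleCaseUnlessMicro('metre'), "\n";     // Metre
```

Inside the mb_ucwords() draft from the question, a check like this could replace the direct mb_convert_case call in the foreach loop.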
You could go as simple as this
function mb_ucfirst($string)
{
$main_encoding = "cp1250";
$inner_encoding = "utf-8";
$string = iconv($main_encoding, $inner_encoding, $string);
$strlen = mb_strlen($string);
$firstChar = mb_substr($string, 0, 1, $inner_encoding);
$then = mb_substr($string, 1, $strlen - 1, $inner_encoding);
return iconv($inner_encoding, $main_encoding , mb_strtoupper($firstChar, $inner_encoding) . $then );
}
Keeps the µ while I was testing it.
I am making a dynamic Unicode icon in PHP. I want the UTF-8 code of the Unicode icon.
So far I have done:
$value = "1F600";
$emoIcon = "\u{$value}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output 26237831463630303b
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output 26\x23\x78\x31\x46\x36\x30\x30\x3b\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \x26\x23\x78\x31\x46\x36\x30\x30\x3b
But when I put the value directly, it prints the correct data:
$emoIcon = "\u{1F600}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output f09f9880
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output f0\x9f\x98\x80\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \xf0\x9f\x98\x80
\u{1F600} is a Unicode escape sequence for double-quoted strings, and it must contain a literal hexadecimal value. Trying to use "\u{$value}", as you've seen, doesn't work (for a couple of reasons, but that doesn't matter so much).
If you want to start with "1F600" and end up with 😀, use hexdec to turn it into an integer and feed that to IntlChar::chr (from the intl extension) to encode that code point as UTF-8. E.g.:
$value = "1F600";
echo IntlChar::chr(hexdec($value));
Outputs:
😀
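If the intl extension is not available, mb_chr can do the same job. This is a small sketch that assumes PHP 7.2 or later with mbstring enabled, and that $value contains only hex digits:

```php
<?php
// Assumes PHP 7.2+ with mbstring; $value holds hex digits only.
$value = '1F600';
$emoji = mb_chr(hexdec($value), 'UTF-8');
echo $emoji, "\n";          // 😀
echo bin2hex($emoji), "\n"; // f09f9880
```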
I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now I want to convert this original Unicode text block into a block of UTF-8 byte sequences in hex (see the "Hexadecimal UTF-8" column on this page: https://en.wikipedia.org/wiki/UTF-8) using PHP, like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do this in PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.
I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
Output:
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example:
$str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.
PHP treats strings as arrays of bytes, regardless of encoding. If you don't need to delimit the UTF-8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT);
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF-8 characters using preg_split and the u flag. Since preg_split returns an empty string before the first character and another after the last character, we use array_slice to drop the first and last elements. This can be easily modified to return an array, for example.
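On PHP 7.4 or later, mb_str_split avoids the empty leading and trailing elements altogether. The following is a sketch of the array-returning variant under that assumption; the function name utf8HexLines is invented for this example.

```php
<?php
// Hypothetical helper: one '\xNN...' string per UTF-8 character.
// Assumes PHP 7.4+ (mb_str_split) and valid UTF-8 input.
function utf8HexLines(string $str): array
{
    $lines = [];
    foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
        // bin2hex gives two hex digits per byte, so no padding is needed
        $lines[] = '\x' . implode('\x', str_split(bin2hex($char), 2));
    }
    return $lines;
}

echo implode("\n", utf8HexLines('ụưứỲỶỴĐ')), "\n";
```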
Edit:
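A shorter alternative relies on json_encode. Note that it prints each byte as \u00XX (for example \u00e1\u00bb\u00a5) rather than \xXX, and that utf8_encode is deprecated as of PHP 8.2: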
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');
The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code fragment takes your example characters as 16-bit big-endian Unicode (UCS-2BE), converts them to UTF-8, and then dumps the hex representation of each character.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90
I'm trying to convert characters, like À, to their escaped form, such as \u00c0. I know this can be done with json_encode, but the function adds backslashes to special characters. (I'm not actually hoping to get a json object, just string conversion):
$str = 'À ß \ Ć " Ď < Ĕ';
For the string above, it'll return
$str = '\u00c0 \u00df \\ \u0106 \" \u010e < \u0114';
and if I use stripslashes, it will also strip the backslash before each uXXXX.
Is there a function for this particular conversion? Or what is the simplest way to do it?
You can use the following code to convert back and forth:
Code :
if (!function_exists('codepoint_encode')) {
function codepoint_encode($str) {
return substr(json_encode($str), 1, -1);
}
}
if (!function_exists('codepoint_decode')) {
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
}
How to use :
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));
Output :
Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
$str = 'À ß \ Ć " Ď < Ĕ';
echo trim(preg_replace('/\\\\([^u])/', "$1", json_encode($str)), '"');
// outputs: \u00c0 \u00df \ \u0106 " \u010e < \u0114
I know it uses json_encode(), but it's the easiest way to convert to \uXXXX
Slight modification to @cryptic's answer:
Script:
$str = 'À ß \ Ć " Ď < Ĕ \\\\uxxx';
echo trim(preg_replace('/\\\\([^u])/', "$1", json_encode($str, JSON_UNESCAPED_SLASHES)), '"');
Output:
\u00c0 \u00df \ \u0106 " \u010e < \u0114 \\uxxx
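If json_encode's extra escaping is the only obstacle, another option is to skip it and escape just the non-ASCII characters directly. The sketch below is an assumption-laden alternative rather than anything from the answers above: the name escapeNonAscii is hypothetical, it requires PHP 7.2+ for mb_ord, and it expects valid UTF-8 input.

```php
<?php
// Hypothetical helper: replaces every non-ASCII character with \uXXXX and
// leaves ASCII, including \ and ", untouched.
// Assumes valid UTF-8 input and PHP 7.2+ (mb_ord).
function escapeNonAscii($str)
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        $cp = mb_ord($m[0], 'UTF-8');
        if ($cp >= 0x10000) {
            // Outside the BMP: emit a UTF-16 surrogate pair, as JSON does
            $cp -= 0x10000;
            return sprintf('\u%04x\u%04x', 0xD800 + ($cp >> 10), 0xDC00 + ($cp & 0x3FF));
        }
        return sprintf('\u%04x', $cp);
    }, $str);
}

echo escapeNonAscii('À ß \ Ć " Ď < Ĕ'), "\n";
// \u00c0 \u00df \ \u0106 " \u010e < \u0114
```

Because backslashes and quotes are never escaped in the first place, no cleanup regex is needed afterwards.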
function convertChars($str) {
return json_decode("\"$str\"");
}
I need to convert Arabic text to and from hexadecimal using PHP, as in the following example:
مرحبا
06450631062D06280627
Regards,
Eco
If you just need to have the Arabic text written in the HTML document being generated, I think the simplest way is to convert the sequence to character references, turning e.g. 0645 into &#x0645;. This could be done as follows:
<?php
$str = '06450631062D06280627';
for($i = 0; $i < strlen($str)/4; $i++) {
echo "&#x", substr($str, 4*$i, 4), ";";
}
?>
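For an actual round trip between the text and the hex string, one possible approach is to treat the hex digits as UTF-16BE code units and convert with mbstring. This is a sketch under those assumptions: the function names are hypothetical, the text is UTF-8, and the hex string has an even number of digits.

```php
<?php
// Hypothetical helpers: the hex string is read as UTF-16BE code units
// (four hex digits per BMP character), the text is UTF-8.
function hexToText($hex)
{
    return mb_convert_encoding(hex2bin($hex), 'UTF-8', 'UTF-16BE');
}

function textToHex($text)
{
    return strtoupper(bin2hex(mb_convert_encoding($text, 'UTF-16BE', 'UTF-8')));
}

echo hexToText('06450631062D06280627'), "\n"; // مرحبا
echo textToHex('مرحبا'), "\n";                // 06450631062D06280627
```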
I get a Unicode string with the following code:
$str = "Some Hexa String";
$replacedString = preg_replace("/\\\\u([0-9abcdef]{4})/", "&#x$1;", $str);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
bin2hex($str); // Bin to Hex
pack("H*", $hexStr); // Hex to Bin