PHP Unicode codepoint to character - php

I would like to convert a Unicode code point to a character. Here is what I have tried:
$point = dechex(127468); // 1f1ec
echo "\u{1f1ec}"; // this works
echo "\u{$point}"; // this outputs '\u1f1ec'
echo "\u{{$point}}"; // Parse error: Invalid UTF-8 codepoint escape sequence
echo "\u\{{$point}\}"; // outputs \u\{1f1ec\}
echo "\u{". $point ."}"; // Parse error; same as above

You don't need to convert the integer to a hexadecimal string; instead, use IntlChar::chr (part of the intl extension):
echo IntlChar::chr(127468);
Directly from docs of IntlChar::chr:
Return Unicode character by code point value

A similar problem occurs when you try to build a floating-point literal, say 12e-4, by concatenating pieces. Literals and escape sequences are parsed at compile time, before the variable has a value, so the pieces cannot be assembled at run time. You can probably use eval() to do so, however. Yuck.

I actually found a solution after several hours:
$unicode = '1F605'; //😅
$uni = '{' . $unicode; // First bracket needs to be separated, otherwise you get '\u1F605'
$str = "\u$uni}";
eval("\$str = \"$str\";"); // Evaluates the escape sequence and stores the resulting character in $str
echo $str;
Thanks @Rick James for the idea with the eval() function

PHP 7+ solution snippet:
function charFromCodePoint($codepoint) {
eval('$ch = "\u{'.dechex($codepoint).'}";');
return $ch;
}
Note that PHP 5 doesn't support the "\u{}" syntax.
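If the code point is only known at run time, the eval() step can be avoided altogether. Below is a minimal sketch that does the UTF-8 encoding by hand with chr(); the helper name codePointToUtf8 is hypothetical (it is not part of PHP or of the answers above), and it assumes the output should be UTF-8. On PHP 7.2+ with mbstring loaded, mb_chr() should give the same result.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 string.
function codePointToUtf8(int $cp): string
{
    // Reject values outside the Unicode range and the UTF-16 surrogate block.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a valid code point: $cp");
    }
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo codePointToUtf8(127468), "\n"; // same bytes as "\u{1f1ec}"
```

Because nothing is evaluated as code, this also works when the integer comes from user input.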

Related

Regular expression error disallowed Unicode code point

I use this regular expression to remove all possible emojis from a string.
/(\x{00a9}|\x{00ae}|[\x{2000}-\x{3300}]|\x{d83c}[\x{d000}-\x{dfff}]|\x{d83d}[\x{d000}-\x{dfff}]|\x{d83e}[\x{d000}-\x{dfff}])/u
but it throws this exception:
preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 46
I googled this problem but couldn't find an accurate answer. I would appreciate it if someone could tell me what exactly this error means and what the solution is.
Also what is this:
>= 0xd800 && <= 0xdfff
Above regex is PCRE version of this source:
https://www.regextester.com/106421
Emojis are specified in UTS #51. The property \p{Emoji} should work, but doesn't.
Do it the hard way. Parse emoji-*.txt:
perl -C -lne'
if (my ($c) = $_ =~ /^((?:(?:[[:xdigit:]]+ )|[[:xdigit:]]+\.\.)[[:xdigit:]]+)/) {
if ($c =~ /\.\./) { # ranges
my ($f, $t) = map { hex } split /\.\./, $c;
print for map { chr } $f..$t;
} else { # sequences
print join "", map { chr hex } split /\s+/, $c;
}
}
' emoji-*.txt
This gives us a newline-separated list of all emojis. Assembling that list with Regexp::Assemble::Compressed gives:
(?:[\x{23EB}\x{23EC}\x{23F0}\x{2605}\x{2607}-\x{260D}\x{260F}\x{2610}\x{2612}\x{2616}\x{2617}\x{261A}-\x{261C}\x{261E}\x{261F}\x{2621}\x{2624}\x{2625}\x{2627}-\x{2629}\x{262B}-\x{262D}\x{2630}-\x{2637}\x{263B}-\x{263F}\x{2641}\x{2643}-\x{2647}\x{2654}-\x{265E}\x{2661}\x{2662}\x{2664}\x{2667}\x{2669}-\x{267A}\x{267C}\x{267D}\x{2680}-\x{2685}\x{2690}\x{2691}\x{2698}\x{269A}\x{269E}\x{269F}\x{26A2}-\x{26A9}\x{26AC}-\x{26AF}\x{26B3}-\x{26BC}\x{26BF}-\x{26C3}\x{26C6}\x{26C7}\x{26C9}-\x{26CD}\x{26D0}\x{26D2}\x{26D5}-\x{26E1}\x{26E4}-\x{26E8}\x{26EB}-\x{26EF}\x{26F6}\x{26FB}\x{26FC}\x{26FE}\x{26FF}\x{2701}\x{2703}\x{2704}\x{270E}\x{2710}\x{2711}\x{2754}\x{2755}\x{2765}-\x{2767}\x{2795}-\x{2797}\x{1F000}-\x{1F003}\x{1F005}-\x{1F0BE}\x{1F0C1}-\x{1F0CF}\x{1F0D1}-\x{1F0FF}\x{1F10D}-\x{1F10F}\x{1F16D}-\x{1F16F}\x{1F191}-\x{1F19A}\x{1F1AD}-\x{1F1E5}\x{1F201}\x{1F203}-\x{1F20F}\x{1F232}-\x{1F236}\x{1F238}-\x{1F23A}\x{1F23C}-\x{1F23F}\x{1F249}-\x{1F30C}\x{1F310}-\x{1F314}\x{1F316}-\x{1F31B}\x{1F31D}-\x{1F320}\x{1F322}\x{1F323}\x{1F32D}-\x{1F335}\x{1F337}-\x{1F377}\x{1F379}-\x{1F37C}\x{1F37E}-\x{1F384}\x{1F386}-\x{1F392}\x{1F394}\x{1F395}\x{1F398}\x{1F39C}\x{1F39D}\x{1F3A0}-\x{1F3A6}\x{1F3A8}-\x{1F3AB}\x{1F3AF}-\x{1F3C1}\x{1F3C8}\x{1F3C9}\x{1F3CF}-\x{1F3D3}\x{1F3E1}-\x{1F3EC}\x{1F3EE}-\x{1F3F2}\x{1F3F6}\x{1F3F8}-\x{1F407}\x{1F409}-\x{1F414}\x{1F416}-\x{1F41E}\x{1F420}-\x{1F425}\x{1F427}-\x{1F43E}\x{1F444}\x{1F445}\x{1F451}\x{1F452}\x{1F454}-\x{1F465}\x{1F479}-\x{1F47B}\x{1F47E}-\x{1F480}\x{1F484}\x{1F488}-\x{1F4A2}\x{1F4A4}-\x{1F4A9}\x{1F4AB}-\x{1F4AF}\x{1F4B1}\x{1F4B2}\x{1F4B4}-\x{1F4BA}\x{1F4BC}-\x{1F4BE}\x{1F4C0}-\x{1F4CA}\x{1F4CC}-\x{1F4D9}\x{1F4DB}-\x{1F4DE}\x{1F4E0}-\x{1F4E3}\x{1F4E7}-\x{1F4E9}\x{1F4EE}-\x{1F4F6}\x{1F4FC}\x{1F4FE}\x{1F500}-\x{1F507}\x{1F509}-\x{1F50C}\x{1F50E}-\x{1F511}\x{1F514}-\x{1F53D}\x{1F546}-\x{1F548}\x{1F54B}-\x{1F54F}\x{1F568}-\x{1F56E}\x{1F571}\x{1F572}\x{1F57B}-\x{1F586}\x{1F588}\x{1F589}\x{1F58E}\x{1F58F}\x{1F591}-\x{1F594}\x{1F597}-\x{1F5A3}\x{1F5
A6}\x{1F5A7}\x{1F5A9}-\x{1F5B0}\x{1F5B3}-\x{1F5BB}\x{1F5BD}-\x{1F5C1}\x{1F5C5}-\x{1F5D0}\x{1F5D4}-\x{1F5DB}\x{1F5DF}\x{1F5E0}\x{1F5E2}\x{1F5E4}-\x{1F5E7}\x{1F5E9}-\x{1F5EE}\x{1F5F0}-\x{1F5F2}\x{1F5F4}-\x{1F5F9}\x{1F5FB}-\x{1F5FF}\x{1F601}-\x{1F60F}\x{1F612}-\x{1F614}\x{1F61C}-\x{1F61E}\x{1F620}-\x{1F62B}\x{1F62E}-\x{1F633}\x{1F635}-\x{1F644}\x{1F648}-\x{1F64A}\x{1F680}-\x{1F686}\x{1F688}-\x{1F68C}\x{1F68E}-\x{1F690}\x{1F692}\x{1F693}\x{1F695}-\x{1F697}\x{1F699}-\x{1F6A2}\x{1F6A4}-\x{1F6AC}\x{1F6AE}-\x{1F6B1}\x{1F6B3}\x{1F6B7}\x{1F6B8}\x{1F6BB}\x{1F6BD}-\x{1F6BF}\x{1F6C1}-\x{1F6CA}\x{1F6D1}-\x{1F6D4}\x{1F6D6}-\x{1F6DF}\x{1F6E6}-\x{1F6E8}\x{1F6EA}-\x{1F6EF}\x{1F6F1}\x{1F6F2}\x{1F6F4}-\x{1F6F8}\x{1F6FB}-\x{1F6FF}\x{1F774}-\x{1F77F}\x{1F7D5}-\x{1F7FF}\x{1F80C}-\x{1F80F}\x{1F848}-\x{1F84F}\x{1F85A}-\x{1F85F}\x{1F888}-\x{1F88F}\x{1F8AE}-\x{1F8FF}\x{1F90D}\x{1F90E}\x{1F910}-\x{1F917}\x{1F91D}\x{1F920}-\x{1F925}\x{1F927}-\x{1F92F}\x{1F93A}\x{1F940}-\x{1F945}\x{1F947}-\x{1F94B}\x{1F94D}-\x{1F970}\x{1F973}-\x{1F979}\x{1F97C}-\x{1F9B4}\x{1F9B7}\x{1F9BA}\x{1F9BC}-\x{1F9BF}\x{1F9C1}-\x{1F9CC}\x{1F9D0}\x{1F9E0}-\x{1FFFD}\x{E0020}-\x{E007F}]|\x{1F1F2}[\x{1F1E6}\x{1F1E8}-\x{1F1ED}\x{1F1F0}-\x{1F1FF}]?|\x{1F1E7}[\x{1F1E6}\x{1F1E7}\x{1F1E9}-\x{1F1EF}\x{1F1F1}-\x{1F1F4}\x{1F1F6}-\x{1F1F9}\x{1F1FB}\x{1F1FC}\x{1F1FE}\x{1F1FF}]?|\x{1F1F8}[\x{1F1E6}-\x{1F1EA}\x{1F1EC}-\x{1F1F4}\x{1F1F7}-\x{1F1F9}\x{1F1FB}\x{1F1FD}-\x{1F1FF}]?|\x{1F1E8}[\x{1F1E6}\x{1F1E8}\x{1F1E9}\x{1F1EB}-\x{1F1EE}\x{1F1F0}-\x{1F1F5}\x{1F1F7}\x{1F1FA}-\x{1F1FF}]?|\x{1F1EC}[\x{1F1E6}\x{1F1E7}\x{1F1E9}-\x{1F1EE}\x{1F1F1}-\x{1F1F3}\x{1F1F5}-\x{1F1FA}\x{1F1FC}\x{1F1FE}]?|\x{1F1E6}[\x{1F1E8}-\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F4}\x{1F1F6}-\x{1F1FA}\x{1F1FC}\x{1F1FD}\x{1F1FF}]?|\x{1F1F9}[\x{1F1E6}\x{1F1E8}\x{1F1E9}\x{1F1EB}-\x{1F1ED}\x{1F1EF}-\x{1F1F4}\x{1F1F7}\x{1F1F9}\x{1F1FB}\x{1F1FC}\x{1F1FF}]?|\x{1F1F5}[\x{1F1E6}\x{1F1EA}-\x{1F1ED}\x{1F1F0}-\x{1F1F3}\x{1F1F7}-\x{1F1F9}\x{1F1FC}\x{1F1FE}]?|\x{1F1F3}[\x{1F1E6}\x{1F1E8
}\x{1F1EA}-\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F4}\x{1F1F5}\x{1F1F7}\x{1F1FA}\x{1F1FF}]?|\x{1F1EE}[\x{1F1E8}-\x{1F1EA}\x{1F1F1}-\x{1F1F4}\x{1F1F6}-\x{1F1F9}]?|\x{1F1F0}[\x{1F1EA}\x{1F1EC}-\x{1F1EE}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F7}\x{1F1FC}\x{1F1FE}\x{1F1FF}]?|\x{1F1F1}[\x{1F1E6}-\x{1F1E8}\x{1F1EE}\x{1F1F0}\x{1F1F7}-\x{1F1FB}\x{1F1FE}]?|\x{1F1EA}[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1F7}-\x{1F1FA}]?|\x{26F9}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3C4}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CA}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CB}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CC}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F575}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{261D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{270C}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{270D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F1E9}[\x{1F1EA}\x{1F1EC}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1FF}]?|\x{1F1FA}[\x{1F1E6}\x{1F1EC}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1FE}\x{1F1FF}]?|\x{1F1FB}[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1F3}\x{1F1FA}]?|\x{1F3C2}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F442}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F446}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F447}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F448}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F449}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F44D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F44E}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F574}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F590}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F1EB}[\x{1F1EE}-\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1F7}]?|\x{1F1ED}[\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1F9}\x{1F1FA}]?|\x{1F3C3}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F468}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F469}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F46E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F471}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F473}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F477}[\x{200D}\x{1F3FB}-\x{1F3F
F}]?|\x{1F481}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F482}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F486}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F487}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F645}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F646}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F647}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64B}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64D}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6A3}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B4}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B5}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B6}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F926}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F937}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F938}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F939}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F93D}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F93E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9B8}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9B9}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CD}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CE}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CF}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D1}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D6}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D7}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D8}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D9}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DA}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DB}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DC}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DD}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{270A}[\x{1F3FB}-\x{1F3FF}]?|\x{270B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F1F7}[\x{1F1EA}\x{1F1F4}\x{1F1F8}\x{1F1FA}\x{1F1FC}]?|\x{1F385}[\x{1F3FB}-\x{1F3FF}]?|\x{1F3C7}[\x{1F3FB}-\x{1F3FF}]?|\x{1F443}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44A}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F450}[\x{1F3FB}-\x{1F3FF}]?|\x{1F466}[\x{1F3FB}-\x{1F3FF}]?|\x{1F467}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46D}[\x{1F3FB}-\x{1F3FF}]?|\x{1F470}[\x{1F3FB}-\x{1F3FF}]?|\x{1F472}[\x{1F3FB}-\x{1F3FF}]?|\x{1F474}[\x{1F3FB}-\x{1F3FF}]?|\x{
1F475}[\x{1F3FB}-\x{1F3FF}]?|\x{1F476}[\x{1F3FB}-\x{1F3FF}]?|\x{1F478}[\x{1F3FB}-\x{1F3FF}]?|\x{1F47C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F483}[\x{1F3FB}-\x{1F3FF}]?|\x{1F485}[\x{1F3FB}-\x{1F3FF}]?|\x{1F4AA}[\x{1F3FB}-\x{1F3FF}]?|\x{1F595}[\x{1F3FB}-\x{1F3FF}]?|\x{1F596}[\x{1F3FB}-\x{1F3FF}]?|\x{1F64C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F64F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F6C0}[\x{1F3FB}-\x{1F3FF}]?|\x{1F6CC}[\x{1F3FB}-\x{1F3FF}]?|\x{1F90F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F918}[\x{1F3FB}-\x{1F3FF}]?|\x{1F919}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91A}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91E}[\x{1F3FB}-\x{1F3FF}]?|\x{1F931}[\x{1F3FB}-\x{1F3FF}]?|\x{1F932}[\x{1F3FB}-\x{1F3FF}]?|\x{1F933}[\x{1F3FB}-\x{1F3FF}]?|\x{1F934}[\x{1F3FB}-\x{1F3FF}]?|\x{1F935}[\x{1F3FB}-\x{1F3FF}]?|\x{1F936}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9B5}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9B6}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9BB}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D2}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D3}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D4}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D5}[\x{1F3FB}-\x{1F3FF}]?|\x{1F1EF}[\x{1F1EA}\x{1F1F2}\x{1F1F4}\x{1F1F5}]?|\x{1F57A}[\x{1F3FB}-\x{1F3FF}]|\x{1F91F}[\x{1F3FB}-\x{1F3FF}]|\x{1F930}[\x{1F3FB}-\x{1F3FF}]|0[\x{20E3}\x{FE0E}\x{FE0F}]?|1[\x{20E3}\x{FE0E}\x{FE0F}]?|2[\x{20E3}\x{FE0E}\x{FE0F}]?|3[\x{20E3}\x{FE0E}\x{FE0F}]?|4[\x{20E3}\x{FE0E}\x{FE0F}]?|5[\x{20E3}\x{FE0E}\x{FE0F}]?|6[\x{20E3}\x{FE0E}\x{FE0F}]?|7[\x{20E3}\x{FE0E}\x{FE0F}]?|8[\x{20E3}\x{FE0E}\x{FE0F}]?|9[\x{20E3}\x{FE0E}\x{FE0F}]?|\\*[\x{20E3}\x{FE0E}\x{FE0F}]|\x{1F1FF}[\x{1F1E6}\x{1F1F2}\x{1F1FC}]?|\x{1F3F3}[\x{200D}\x{FE0E}\x{FE0F}]?|\x{1F415}[\x{200D}\x{FE0E}\x{FE0F}]?|#[\x{20E3}\x{FE0E}\x{FE0F}]|\x{2194}[\x{FE0E}\x{FE0F}]?|\x{2195}[\x{FE0E}\x{FE0F}]?|\x{2196}[\x{FE0E}\x{FE0F}]?|\x{2197}[\x{FE0E}\x{FE0F}]?|\x{2198}[\x{FE0E}\x{FE0F}]?|\x{2199}[\x{FE0E}\x{FE0F}]?|\x{21A9}[\x{FE0E}\x{FE0F}]?|\x{21AA}[\x{FE0E}\x{FE0F}]?|\x{231A}[\x{FE0E}\x{FE0F}]?|\x{231B}[\x{FE0E}\x{FE0F}]?|\x{23E9}[\x{FE0E}\x{FE0F}]?|\x{23EA}[\x{FE0E}\x{FE0F}]?|\x{23ED
}[\x{FE0E}\x{FE0F}]?|\x{23EE}[\x{FE0E}\x{FE0F}]?|\x{23EF}[\x{FE0E}\x{FE0F}]?|\x{23F1}[\x{FE0E}\x{FE0F}]?|\x{23F2}[\x{FE0E}\x{FE0F}]?|\x{23F3}[\x{FE0E}\x{FE0F}]?|\x{23F8}[\x{FE0E}\x{FE0F}]?|\x{23F9}[\x{FE0E}\x{FE0F}]?|\x{23FA}[\x{FE0E}\x{FE0F}]?|\x{25AA}[\x{FE0E}\x{FE0F}]?|\x{25AB}[\x{FE0E}\x{FE0F}]?|\x{25FB}[\x{FE0E}\x{FE0F}]?|\x{25FC}[\x{FE0E}\x{FE0F}]?|\x{25FD}[\x{FE0E}\x{FE0F}]?|\x{25FE}[\x{FE0E}\x{FE0F}]?|\x{2600}[\x{FE0E}\x{FE0F}]?|\x{2601}[\x{FE0E}\x{FE0F}]?|\x{2602}[\x{FE0E}\x{FE0F}]?|\x{2603}[\x{FE0E}\x{FE0F}]?|\x{2604}[\x{FE0E}\x{FE0F}]?|\x{260E}[\x{FE0E}\x{FE0F}]?|\x{2611}[\x{FE0E}\x{FE0F}]?|\x{2614}[\x{FE0E}\x{FE0F}]?|\x{2615}[\x{FE0E}\x{FE0F}]?|\x{2620}[\x{FE0E}\x{FE0F}]?|\x{2622}[\x{FE0E}\x{FE0F}]?|\x{2623}[\x{FE0E}\x{FE0F}]?|\x{2626}[\x{FE0E}\x{FE0F}]?|\x{262A}[\x{FE0E}\x{FE0F}]?|\x{262E}[\x{FE0E}\x{FE0F}]?|\x{262F}[\x{FE0E}\x{FE0F}]?|\x{2638}[\x{FE0E}\x{FE0F}]?|\x{2639}[\x{FE0E}\x{FE0F}]?|\x{263A}[\x{FE0E}\x{FE0F}]?|\x{2640}[\x{FE0E}\x{FE0F}]?|\x{2642}[\x{FE0E}\x{FE0F}]?|\x{2648}[\x{FE0E}\x{FE0F}]?|\x{2649}[\x{FE0E}\x{FE0F}]?|\x{264A}[\x{FE0E}\x{FE0F}]?|\x{264B}[\x{FE0E}\x{FE0F}]?|\x{264C}[\x{FE0E}\x{FE0F}]?|\x{264D}[\x{FE0E}\x{FE0F}]?|\x{264E}[\x{FE0E}\x{FE0F}]?|\x{264F}[\x{FE0E}\x{FE0F}]?|\x{2650}[\x{FE0E}\x{FE0F}]?|\x{2651}[\x{FE0E}\x{FE0F}]?|\x{2652}[\x{FE0E}\x{FE0F}]?|\x{2653}[\x{FE0E}\x{FE0F}]?|\x{265F}[\x{FE0E}\x{FE0F}]?|\x{2660}[\x{FE0E}\x{FE0F}]?|\x{2663}[\x{FE0E}\x{FE0F}]?|\x{2665}[\x{FE0E}\x{FE0F}]?|\x{2666}[\x{FE0E}\x{FE0F}]?|\x{2668}[\x{FE0E}\x{FE0F}]?|\x{267B}[\x{FE0E}\x{FE0F}]?|\x{267E}[\x{FE0E}\x{FE0F}]?|\x{267F}[\x{FE0E}\x{FE0F}]?|\x{2692}[\x{FE0E}\x{FE0F}]?|\x{2693}[\x{FE0E}\x{FE0F}]?|\x{2694}[\x{FE0E}\x{FE0F}]?|\x{2695}[\x{FE0E}\x{FE0F}]?|\x{2696}[\x{FE0E}\x{FE0F}]?|\x{2697}[\x{FE0E}\x{FE0F}]?|\x{2699}[\x{FE0E}\x{FE0F}]?|\x{269B}[\x{FE0E}\x{FE0F}]?|\x{269C}[\x{FE0E}\x{FE0F}]?|\x{26A0}[\x{FE0E}\x{FE0F}]?|\x{26A1}[\x{FE0E}\x{FE0F}]?|\x{26AA}[\x{FE0E}\x{FE0F}]?|\x{26AB}[\x{FE0E}\x{FE0F}]?|\x{26B0}[\x{FE0E}\x{FE0F}]?|\x{26B1}[\x{FE0E}\x
{FE0F}]?|\x{26BD}[\x{FE0E}\x{FE0F}]?|\x{26BE}[\x{FE0E}\x{FE0F}]?|\x{26C4}[\x{FE0E}\x{FE0F}]?|\x{26C5}[\x{FE0E}\x{FE0F}]?|\x{26C8}[\x{FE0E}\x{FE0F}]?|\x{26CF}[\x{FE0E}\x{FE0F}]?|\x{26D1}[\x{FE0E}\x{FE0F}]?|\x{26D3}[\x{FE0E}\x{FE0F}]?|\x{26D4}[\x{FE0E}\x{FE0F}]?|\x{26E9}[\x{FE0E}\x{FE0F}]?|\x{26EA}[\x{FE0E}\x{FE0F}]?|\x{26F0}[\x{FE0E}\x{FE0F}]?|\x{26F1}[\x{FE0E}\x{FE0F}]?|\x{26F2}[\x{FE0E}\x{FE0F}]?|\x{26F3}[\x{FE0E}\x{FE0F}]?|\x{26F4}[\x{FE0E}\x{FE0F}]?|\x{26F5}[\x{FE0E}\x{FE0F}]?|\x{26F7}[\x{FE0E}\x{FE0F}]?|\x{26F8}[\x{FE0E}\x{FE0F}]?|\x{26FA}[\x{FE0E}\x{FE0F}]?|\x{26FD}[\x{FE0E}\x{FE0F}]?|\x{2702}[\x{FE0E}\x{FE0F}]?|\x{2708}[\x{FE0E}\x{FE0F}]?|\x{2709}[\x{FE0E}\x{FE0F}]?|\x{270F}[\x{FE0E}\x{FE0F}]?|\x{2712}[\x{FE0E}\x{FE0F}]?|\x{2733}[\x{FE0E}\x{FE0F}]?|\x{2734}[\x{FE0E}\x{FE0F}]?|\x{2753}[\x{FE0E}\x{FE0F}]?|\x{2763}[\x{FE0E}\x{FE0F}]?|\x{2764}[\x{FE0E}\x{FE0F}]?|\x{2934}[\x{FE0E}\x{FE0F}]?|\x{2935}[\x{FE0E}\x{FE0F}]?|\x{2B05}[\x{FE0E}\x{FE0F}]?|\x{2B06}[\x{FE0E}\x{FE0F}]?|\x{2B07}[\x{FE0E}\x{FE0F}]?|\x{2B1B}[\x{FE0E}\x{FE0F}]?|\x{2B1C}[\x{FE0E}\x{FE0F}]?|\x{1F004}[\x{FE0E}\x{FE0F}]?|\x{1F170}[\x{FE0E}\x{FE0F}]?|\x{1F171}[\x{FE0E}\x{FE0F}]?|\x{1F1FC}[\x{1F1EB}\x{1F1F8}]?|\x{1F1FE}[\x{1F1EA}\x{1F1F9}]?|\x{1F202}[\x{FE0E}\x{FE0F}]?|\x{1F237}[\x{FE0E}\x{FE0F}]?|\x{1F30D}[\x{FE0E}\x{FE0F}]?|\x{1F30E}[\x{FE0E}\x{FE0F}]?|\x{1F30F}[\x{FE0E}\x{FE0F}]?|\x{1F315}[\x{FE0E}\x{FE0F}]?|\x{1F31C}[\x{FE0E}\x{FE0F}]?|\x{1F321}[\x{FE0E}\x{FE0F}]?|\x{1F324}[\x{FE0E}\x{FE0F}]?|\x{1F325}[\x{FE0E}\x{FE0F}]?|\x{1F326}[\x{FE0E}\x{FE0F}]?|\x{1F327}[\x{FE0E}\x{FE0F}]?|\x{1F328}[\x{FE0E}\x{FE0F}]?|\x{1F329}[\x{FE0E}\x{FE0F}]?|\x{1F32A}[\x{FE0E}\x{FE0F}]?|\x{1F32B}[\x{FE0E}\x{FE0F}]?|\x{1F32C}[\x{FE0E}\x{FE0F}]?|\x{1F378}[\x{FE0E}\x{FE0F}]?|\x{1F393}[\x{FE0E}\x{FE0F}]?|\x{1F396}[\x{FE0E}\x{FE0F}]?|\x{1F397}[\x{FE0E}\x{FE0F}]?|\x{1F399}[\x{FE0E}\x{FE0F}]?|\x{1F39A}[\x{FE0E}\x{FE0F}]?|\x{1F39B}[\x{FE0E}\x{FE0F}]?|\x{1F39E}[\x{FE0E}\x{FE0F}]?|\x{1F39F}[\x{FE0E}\x{FE0F}]?|\x{1F3A7}[\x{FE0E}\x{FE0
F}]?|\x{1F3AC}[\x{FE0E}\x{FE0F}]?|\x{1F3AD}[\x{FE0E}\x{FE0F}]?|\x{1F3AE}[\x{FE0E}\x{FE0F}]?|\x{1F3C6}[\x{FE0E}\x{FE0F}]?|\x{1F3CD}[\x{FE0E}\x{FE0F}]?|\x{1F3CE}[\x{FE0E}\x{FE0F}]?|\x{1F3D4}[\x{FE0E}\x{FE0F}]?|\x{1F3D5}[\x{FE0E}\x{FE0F}]?|\x{1F3D6}[\x{FE0E}\x{FE0F}]?|\x{1F3D7}[\x{FE0E}\x{FE0F}]?|\x{1F3D8}[\x{FE0E}\x{FE0F}]?|\x{1F3D9}[\x{FE0E}\x{FE0F}]?|\x{1F3DA}[\x{FE0E}\x{FE0F}]?|\x{1F3DB}[\x{FE0E}\x{FE0F}]?|\x{1F3DC}[\x{FE0E}\x{FE0F}]?|\x{1F3DD}[\x{FE0E}\x{FE0F}]?|\x{1F3DE}[\x{FE0E}\x{FE0F}]?|\x{1F3DF}[\x{FE0E}\x{FE0F}]?|\x{1F3E0}[\x{FE0E}\x{FE0F}]?|\x{1F3ED}[\x{FE0E}\x{FE0F}]?|\x{1F3F4}[\x{200D}\x{E0067}]?|\x{1F3F5}[\x{FE0E}\x{FE0F}]?|\x{1F3F7}[\x{FE0E}\x{FE0F}]?|\x{1F408}[\x{FE0E}\x{FE0F}]?|\x{1F41F}[\x{FE0E}\x{FE0F}]?|\x{1F426}[\x{FE0E}\x{FE0F}]?|\x{1F441}[\x{200D}\x{FE0E}\x{FE0F}]|\x{1F453}[\x{FE0E}\x{FE0F}]?|\x{1F46A}[\x{FE0E}\x{FE0F}]?|\x{1F47D}[\x{FE0E}\x{FE0F}]?|\x{1F4A3}[\x{FE0E}\x{FE0F}]?|\x{1F4B0}[\x{FE0E}\x{FE0F}]?|\x{1F4B3}[\x{FE0E}\x{FE0F}]?|\x{1F4BB}[\x{FE0E}\x{FE0F}]?|\x{1F4BF}[\x{FE0E}\x{FE0F}]?|\x{1F4CB}[\x{FE0E}\x{FE0F}]?|\x{1F4DA}[\x{FE0E}\x{FE0F}]?|\x{1F4DF}[\x{FE0E}\x{FE0F}]?|\x{1F4E4}[\x{FE0E}\x{FE0F}]?|\x{1F4E5}[\x{FE0E}\x{FE0F}]?|\x{1F4E6}[\x{FE0E}\x{FE0F}]?|\x{1F4EA}[\x{FE0E}\x{FE0F}]?|\x{1F4EB}[\x{FE0E}\x{FE0F}]?|\x{1F4EC}[\x{FE0E}\x{FE0F}]?|\x{1F4ED}[\x{FE0E}\x{FE0F}]?|\x{1F4F7}[\x{FE0E}\x{FE0F}]?|\x{1F4F9}[\x{FE0E}\x{FE0F}]?|\x{1F4FA}[\x{FE0E}\x{FE0F}]?|\x{1F4FB}[\x{FE0E}\x{FE0F}]?|\x{1F4FD}[\x{FE0E}\x{FE0F}]?|\x{1F508}[\x{FE0E}\x{FE0F}]?|\x{1F50D}[\x{FE0E}\x{FE0F}]?|\x{1F512}[\x{FE0E}\x{FE0F}]?|\x{1F513}[\x{FE0E}\x{FE0F}]?|\x{1F549}[\x{FE0E}\x{FE0F}]?|\x{1F54A}[\x{FE0E}\x{FE0F}]?|\x{1F550}[\x{FE0E}\x{FE0F}]?|\x{1F551}[\x{FE0E}\x{FE0F}]?|\x{1F552}[\x{FE0E}\x{FE0F}]?|\x{1F553}[\x{FE0E}\x{FE0F}]?|\x{1F554}[\x{FE0E}\x{FE0F}]?|\x{1F555}[\x{FE0E}\x{FE0F}]?|\x{1F556}[\x{FE0E}\x{FE0F}]?|\x{1F557}[\x{FE0E}\x{FE0F}]?|\x{1F558}[\x{FE0E}\x{FE0F}]?|\x{1F559}[\x{FE0E}\x{FE0F}]?|\x{1F55A}[\x{FE0E}\x{FE0F}]?|\x{1F55B}[\x{FE0E}\x{FE0F}]?|\x{1F55C}[\x{FE
0E}\x{FE0F}]?|\x{1F55D}[\x{FE0E}\x{FE0F}]?|\x{1F55E}[\x{FE0E}\x{FE0F}]?|\x{1F55F}[\x{FE0E}\x{FE0F}]?|\x{1F560}[\x{FE0E}\x{FE0F}]?|\x{1F561}[\x{FE0E}\x{FE0F}]?|\x{1F562}[\x{FE0E}\x{FE0F}]?|\x{1F563}[\x{FE0E}\x{FE0F}]?|\x{1F564}[\x{FE0E}\x{FE0F}]?|\x{1F565}[\x{FE0E}\x{FE0F}]?|\x{1F566}[\x{FE0E}\x{FE0F}]?|\x{1F567}[\x{FE0E}\x{FE0F}]?|\x{1F56F}[\x{FE0E}\x{FE0F}]?|\x{1F570}[\x{FE0E}\x{FE0F}]?|\x{1F573}[\x{FE0E}\x{FE0F}]?|\x{1F576}[\x{FE0E}\x{FE0F}]?|\x{1F577}[\x{FE0E}\x{FE0F}]?|\x{1F578}[\x{FE0E}\x{FE0F}]?|\x{1F579}[\x{FE0E}\x{FE0F}]?|\x{1F587}[\x{FE0E}\x{FE0F}]?|\x{1F58A}[\x{FE0E}\x{FE0F}]?|\x{1F58B}[\x{FE0E}\x{FE0F}]?|\x{1F58C}[\x{FE0E}\x{FE0F}]?|\x{1F58D}[\x{FE0E}\x{FE0F}]?|\x{1F5A5}[\x{FE0E}\x{FE0F}]?|\x{1F5A8}[\x{FE0E}\x{FE0F}]?|\x{1F5B1}[\x{FE0E}\x{FE0F}]?|\x{1F5B2}[\x{FE0E}\x{FE0F}]?|\x{1F5BC}[\x{FE0E}\x{FE0F}]?|\x{1F5C2}[\x{FE0E}\x{FE0F}]?|\x{1F5C3}[\x{FE0E}\x{FE0F}]?|\x{1F5C4}[\x{FE0E}\x{FE0F}]?|\x{1F5D1}[\x{FE0E}\x{FE0F}]?|\x{1F5D2}[\x{FE0E}\x{FE0F}]?|\x{1F5D3}[\x{FE0E}\x{FE0F}]?|\x{1F5DC}[\x{FE0E}\x{FE0F}]?|\x{1F5DD}[\x{FE0E}\x{FE0F}]?|\x{1F5DE}[\x{FE0E}\x{FE0F}]?|\x{1F5E1}[\x{FE0E}\x{FE0F}]?|\x{1F5E3}[\x{FE0E}\x{FE0F}]?|\x{1F5E8}[\x{FE0E}\x{FE0F}]?|\x{1F5EF}[\x{FE0E}\x{FE0F}]?|\x{1F5F3}[\x{FE0E}\x{FE0F}]?|\x{1F5FA}[\x{FE0E}\x{FE0F}]?|\x{1F610}[\x{FE0E}\x{FE0F}]?|\x{1F687}[\x{FE0E}\x{FE0F}]?|\x{1F68D}[\x{FE0E}\x{FE0F}]?|\x{1F691}[\x{FE0E}\x{FE0F}]?|\x{1F694}[\x{FE0E}\x{FE0F}]?|\x{1F698}[\x{FE0E}\x{FE0F}]?|\x{1F6AD}[\x{FE0E}\x{FE0F}]?|\x{1F6B2}[\x{FE0E}\x{FE0F}]?|\x{1F6B9}[\x{FE0E}\x{FE0F}]?|\x{1F6BA}[\x{FE0E}\x{FE0F}]?|\x{1F6BC}[\x{FE0E}\x{FE0F}]?|\x{1F6CB}[\x{FE0E}\x{FE0F}]?|\x{1F6CD}[\x{FE0E}\x{FE0F}]?|\x{1F6CE}[\x{FE0E}\x{FE0F}]?|\x{1F6CF}[\x{FE0E}\x{FE0F}]?|\x{1F6E0}[\x{FE0E}\x{FE0F}]?|\x{1F6E1}[\x{FE0E}\x{FE0F}]?|\x{1F6E2}[\x{FE0E}\x{FE0F}]?|\x{1F6E3}[\x{FE0E}\x{FE0F}]?|\x{1F6E4}[\x{FE0E}\x{FE0F}]?|\x{1F6E5}[\x{FE0E}\x{FE0F}]?|\x{1F6E9}[\x{FE0E}\x{FE0F}]?|\x{1F6F0}[\x{FE0E}\x{FE0F}]?|\x{1F6F3}[\x{FE0E}\x{FE0F}]?|\xA9[\x{FE0E}\x{FE0F}]|\xAE[\x{FE0E}\x{FE0F
}]|\x{203C}[\x{FE0E}\x{FE0F}]|\x{2049}[\x{FE0E}\x{FE0F}]|\x{2122}[\x{FE0E}\x{FE0F}]|\x{2139}[\x{FE0E}\x{FE0F}]|\x{2328}[\x{FE0E}\x{FE0F}]|\x{23CF}[\x{FE0E}\x{FE0F}]|\x{24C2}[\x{FE0E}\x{FE0F}]|\x{25B6}[\x{FE0E}\x{FE0F}]|\x{25C0}[\x{FE0E}\x{FE0F}]|\x{2618}[\x{FE0E}\x{FE0F}]|\x{2714}[\x{FE0E}\x{FE0F}]|\x{2716}[\x{FE0E}\x{FE0F}]|\x{271D}[\x{FE0E}\x{FE0F}]|\x{2721}[\x{FE0E}\x{FE0F}]|\x{2744}[\x{FE0E}\x{FE0F}]|\x{2747}[\x{FE0E}\x{FE0F}]|\x{2757}[\x{FE0E}\x{FE0F}]|\x{27A1}[\x{FE0E}\x{FE0F}]|\x{2B50}[\x{FE0E}\x{FE0F}]|\x{2B55}[\x{FE0E}\x{FE0F}]|\x{3030}[\x{FE0E}\x{FE0F}]|\x{303D}[\x{FE0E}\x{FE0F}]|\x{3297}[\x{FE0E}\x{FE0F}]|\x{3299}[\x{FE0E}\x{FE0F}]|\x{1F17E}[\x{FE0E}\x{FE0F}]|\x{1F17F}[\x{FE0E}\x{FE0F}]|\x{1F21A}[\x{FE0E}\x{FE0F}]|\x{1F22F}[\x{FE0E}\x{FE0F}]|\x{1F336}[\x{FE0E}\x{FE0F}]|\x{1F37D}[\x{FE0E}\x{FE0F}]|\x{1F43F}[\x{FE0E}\x{FE0F}]|\x{1F1F4}\x{1F1F2}?|\x{1F1F6}\x{1F1E6}?|\x{1F1FD}\x{1F1F0}?|\x{1F46F}\x{200D}?|\x{1F93C}\x{200D}?|\x{1F9DE}\x{200D}?|\x{1F9DF}\x{200D}?)
This is the equivalent version in PHP:
preg_replace("/\u{00a9}|\u{00ae}|[\u{2000}-\u{3300}]|[\u{1e400}-\u{1f3ff}]|[\u{1e800}-\u{1f7ff}]|[\u{1ec00}-\u{1fbff}]/u",'', $value);
To create it, I converted the surrogate ranges to integer code points, thanks to: How to convert between a Unicode/UCS codepoint and a UTF16 surrogate pair?
// PHP equivalent
function combine($surrogateHigh, $surrogateLow){
return (($surrogateHigh - 0xd800) * 0x400) + ($surrogateLow - 0xdc00) + 0x10000;
}
Then I converted the ranges:
echo dechex(combine(0xd83c, 0xd000)). "\n";
echo dechex(combine(0xd83c, 0xdfff)). "\n";
echo dechex(combine(0xd83d, 0xd000)). "\n";
echo dechex(combine(0xd83d, 0xdfff)). "\n";
echo dechex(combine(0xd83e, 0xd000)). "\n";
echo dechex(combine(0xd83e, 0xdfff)). "\n";
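As a sanity check on the arithmetic, here is a small sketch that converts a well-formed surrogate pair to its code point and back. The function names are hypothetical; the formulas are the standard UTF-16 ones and assume a high surrogate in D800..DBFF and a low surrogate in DC00..DFFF.

```php
<?php
// Hypothetical helpers; both assume valid input and do no range checking.
function surrogatesToCodePoint(int $high, int $low): int
{
    // Each surrogate carries 10 bits of the value (code point - 0x10000).
    return (($high - 0xD800) << 10) + ($low - 0xDC00) + 0x10000;
}

function codePointToSurrogates(int $cp): array
{
    $v = $cp - 0x10000;                  // 20-bit value
    return [0xD800 + ($v >> 10),         // high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF)];      // low surrogate: bottom 10 bits
}

printf("U+%X\n", surrogatesToCodePoint(0xD83D, 0xDEAB)); // U+1F6AB
[$hi, $lo] = codePointToSurrogates(0x1F6AB);
printf("%X %X\n", $hi, $lo);                             // D83D DEAB
```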
As far as PHP goes, you can json_encode() the string you want to run the "illegal" regex pattern against. By default this escapes every non-ASCII character as a \uXXXX sequence, and characters outside the Basic Multilingual Plane are written as a UTF-16 surrogate pair.
From there you can just check for the literal escape sequence:
$value = "Sup 🚫";
$res = json_decode(preg_replace('/\\\ud83d\\\udeab/i', 'REPLACED', json_encode($value)));
// $res is now "Sup REPLACED"; yes, some emojis are encoded as two UTF-16 code units :\
Note: I wrapped it in a json_decode() to get the original string back.
Also, >= 0xd800 && <= 0xdfff means that any code point in that hexadecimal range triggers this error. That range is reserved for UTF-16 surrogates, which are not valid code points on their own in a UTF-8 pattern. The emoji in my example above is encoded in UTF-16 with a surrogate pair from that range.
Downside: you can't use hex ranges with this solution; you have to know exactly which emojis are problematic and handle each one explicitly (e.g. '/' . implode('|', EmojiClass::BAD_EMOJI_HEXES_ARRAY) . '/i').
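To make the difference concrete, here is a short sketch showing what json_encode() produces for the example string and, for comparison, a pattern that names the real code point instead of its surrogates. The range 1F300..1FAFF is only an assumption chosen for illustration; it is not a complete emoji definition.

```php
<?php
$value = "Sup \u{1F6AB}"; // "Sup 🚫"

// json_encode() escapes the emoji as a surrogate pair by default.
echo json_encode($value), "\n"; // "Sup \ud83d\udeab"

// With the /u modifier, PCRE works on code points, so the emoji is
// addressed by its single code point U+1F6AB rather than by surrogates.
$res = preg_replace('/\x{1F6AB}/u', 'REPLACED', $value);
echo $res, "\n"; // Sup REPLACED

// Ranges work the same way (illustrative range, not a full emoji list).
$clean = preg_replace('/[\x{1F300}-\x{1FAFF}]/u', '', "a\u{1F6AB}b\u{1F605}c");
echo $clean, "\n"; // abc
```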

How to convert MS dot character to Unicode [duplicate]

This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding():
echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES'); // note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
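For reference, a minimal sketch of the pack() route for the same character. It assumes a code point in the Basic Multilingual Plane (at most 0xFFFF and not a surrogate), where the UTF-16 code unit equals the code point; no claim is made here about its speed.

```php
<?php
$cp = 0x30A2; // KATAKANA LETTER A

$utf16be = pack('n', $cp); // 16-bit unsigned, big endian    -> "\x30\xA2"
$utf16le = pack('v', $cp); // 16-bit unsigned, little endian -> "\xA2\x30"

echo bin2hex($utf16be), ' ', bin2hex($utf16le), "\n"; // 30a2 a230
```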
Before PHP 7, PHP did not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
    // This variant only runs before PHP 8.0: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0
    // mb_internal_encoding() replaces the deprecated mbstring.internal_encoding ini setting
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function (required on PHP 8.0 and later):
function unicodeString($str, $encoding=null) {
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // Each \uXXXX match is read as one UTF-16BE code unit and converted to the target encoding
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
        return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
    }, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('&#x30a8;', 0, 'UTF-8');
This works too. However, the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#x200A;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
    /** @author Krinkle 2018 */
    $output = '';
    foreach (str_split($str) as $octet) {
        $ordInt = ord($octet);
        // Convert from int (base 10) to two-digit hex (base 16), for PHP \x syntax;
        // zero-padding keeps bytes below 0x10 unambiguous
        $ordHex = sprintf('%02x', $ordInt);
        $output .= '\x' . $ordHex;
    }
    return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#x200A;') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
    // Turn the hex digits into raw bytes, then reinterpret them as UTF-16BE
    $rawstr = pack('H*', $str);
    $newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
    return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($msg);

How to get the "rendered length" of Unicode string containing combining characters in PHP?

Given that not every Unicode combining sequence has a precomposed equivalent (NFC), is there a way to get the string's "rendered" length in PHP, if that is possible and makes sense semantically?
http://3v4l.org/L1kPl (using php7 escape syntax)
<?php
echo $s = "\u{0071}\u{0307}\u{0323}";
echo "\n";
echo mb_strlen(Normalizer::normalize($s, Normalizer::FORM_C), "UTF-8");
// Shows 3 because there is no precomposed equivalent
// for such glyph. I want to get 1 instead
What I achieved so far: http://3v4l.org/4NSCi
<?php
echo $s = "\u{0071}\u{0307}\u{0323}";
$r = Normalizer::normalize($s, Normalizer::FORM_C);
echo mb_strlen(preg_replace("#\p{Mn}#u", "", $r), "UTF-8");
You are probably looking for:
grapheme_strlen()
It takes one argument, which must be a valid UTF-8 string.
Here's the reference: Grapheme cluster boundaries
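grapheme_strlen() comes with the intl extension. If that extension is not available, a rough alternative is to count matches of PCRE's \X, which matches one extended grapheme cluster. This is a sketch of a different technique, not the intl implementation, and results can differ between PCRE versions for complex sequences.

```php
<?php
$s = "\u{0071}\u{0307}\u{0323}"; // q + combining dot above + combining dot below

// Each \X match is one user-perceived character (grapheme cluster).
$count = preg_match_all('/\X/u', $s);

echo $count, "\n"; // 1
```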

Echo new line in PHP

<?php
include 'db_connect.php';
$q = mysql_real_escape_string($_GET['q']);
$arr = explode('+', $q);
foreach($arr as $ing)
{
echo $ing;
echo "<br/>";
}
mysql_close($db);
?>
Calling:
findByIncredients.php?q=Hans+Wurst+Wurstel
Source code HTML:
Hans Wurst Wurstel<br/>
Why is there only one newline?
+s in a URL are URL-encoded spaces. So what PHP sees in the variable is "Hans Wurst Wurstel". You need to split by space ' ', not +:
$arr = explode(' ', $q);
"+" gets converted to SPACE on URL decoding.
You may want to pass your string as str1-str2-str3 in the GET parameter.
Try:
<?php
include 'db_connect.php';
$q = mysql_real_escape_string($_GET['q']);
$arr = explode (' ',$q);
foreach($arr as $ing)
{
echo $ing;
echo "<br/>";
}
mysql_close($db);
?>
Hans+Wurst+Wurstel is the URL-escaped query string. The PHP page processes it after unescaping (in this case, all +s are translated into spaces). You should choose a delimiter for explode according to what the string looks like at that moment. You can use print_r() for a raw print if you don't know what the string (or any kind of variable) looks like.
Easy. While the standard RFC 3986 url encoding would encode the space " " as "%20", due to historical reasons, it can also be encoded as "+". When PHP parses the query string, it will convert the "+" character to a space.
This is also illustrated by the existence of both:
urlencode: equivalent of what PHP uses internally, will convert " " to "+".
rawurlencode: RFC-conformant encoder, will convert " " to "%20".
I'm assuming you want to explode by space. If you really wanted to encode a "+" character, you could use "%2B", which is the rawurlencode version and will always work.
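The behaviour described above can be checked with a few lines. In this sketch, parse_str() stands in for the decoding PHP applies when it fills $_GET; the variable names are only for illustration.

```php
<?php
$raw = 'Hans Wurst';

echo urlencode($raw), "\n";    // Hans+Wurst
echo rawurlencode($raw), "\n"; // Hans%20Wurst
echo urlencode('a+b'), "\n";   // a%2Bb  (a literal plus sign)

// Decode a query string the way PHP does for $_GET: "+" becomes a space.
parse_str('q=Hans+Wurst+Wurstel', $params);
$arr = explode(' ', $params['q']);
print_r($arr); // three elements: Hans, Wurst, Wurstel
```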
(EDIT)
Related questions:
When to encode space to plus (+) or %20?
PHP - Plus sign with GET query

Unicode character in PHP string

This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
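As a rough illustration of what the escape produces: PHP strings are byte strings, so "\u{1000}" becomes the UTF-8 bytes of U+1000. The sketch below also uses mb_chr() (PHP 7.2+, mbstring extension) as one runtime alternative for a codepoint held in a variable, since the escape itself only accepts a literal; that pairing is a suggestion, not part of the original answer.

```php
<?php
// Requires PHP 7.0+ for the "\u{...}" escape.
$unicodeChar = "\u{1000}";

// The escape is compiled into the UTF-8 encoding of the codepoint.
echo bin2hex($unicodeChar), "\n"; // e18080
echo strlen($unicodeChar), "\n";  // 3 (bytes, not characters)

// The escape needs a literal codepoint. For a codepoint stored in a
// variable, mb_chr() (PHP 7.2+, mbstring) builds the same string at runtime.
$codepoint = 0x1000;
$fromVariable = mb_chr($codepoint, 'UTF-8');
var_dump($fromVariable === $unicodeChar); // bool(true)
```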
Because JSON directly supports the \uxxxx syntax, the first thing that comes to mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
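The two-byte UTF-16BE mapping only covers codepoints up to U+FFFF. A minimal sketch of the same idea for any codepoint, assuming the mbstring extension is available, goes through UTF-32BE instead, where every codepoint is exactly four big-endian bytes. The helper name codepointToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: convert a codepoint to a UTF-8 string (needs mbstring).
function codepointToUtf8(int $codepoint): string
{
    // pack('N') writes an unsigned 32-bit big-endian integer, which is
    // exactly the UTF-32BE encoding of the codepoint.
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}

echo bin2hex(codepointToUtf8(0x1000)), "\n";  // e18080
echo bin2hex(codepointToUtf8(0x1F605)), "\n"; // f09f9885
```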
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need to alter the bytes accordingly (usually done with a library, though it is possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
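The byte sequences used in the three examples above do not have to be worked out by hand. As a sketch, assuming the mbstring extension is available, mb_convert_encoding() plus bin2hex() shows the bytes for each encoding, and pack() builds the UTF-16 forms straight from the codepoint.

```php
<?php
// U+30A2 (KATAKANA LETTER A) written as UTF-8 bytes.
$utf8 = "\xE3\x82\xA2";

foreach (['UTF-16BE', 'UTF-16LE', 'UTF-8'] as $encoding) {
    $bytes = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    printf("%-8s %s\n", $encoding, bin2hex($bytes));
}
// UTF-16BE 30a2
// UTF-16LE a230
// UTF-8    e382a2

// pack() can build the UTF-16 bytes from the codepoint directly:
// 'n' is 16-bit big-endian, 'v' is 16-bit little-endian.
var_dump(pack('n', 0x30A2) === "\x30\xA2"); // bool(true)
var_dump(pack('v', 0x30A2) === "\xA2\x30"); // bool(true)
```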
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function (which was deprecated in PHP 7.2 and removed in PHP 8.0):
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('&#x30a8;', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#8202;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** @author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Convert from int (base 10) to hex (base 16), for PHP \x syntax
$ordHex = base_convert($ordInt, 10, 16);
$output .= '\x' . $ordHex;
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#8202;') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
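The step from U+200A to \xE2\x80\x8A can also be computed directly from the codepoint number, without going through JSON or HTML first. The following is a sketch of the standard UTF-8 bit layout, not part of the original script; the function name utf8EscapeFromCodepoint is hypothetical.

```php
<?php
// Hypothetical helper: build the PHP "\x.." escape sequence for the UTF-8
// encoding of a single codepoint.
function utf8EscapeFromCodepoint(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }

    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        $bytes = [$cp];
    } elseif ($cp < 0x800) {     // 2 bytes: 110xxxxx 10xxxxxx
        $bytes = [0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)];
    } elseif ($cp < 0x10000) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $bytes = [0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
    } else {                     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $bytes = [
            0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F),
        ];
    }

    $output = '';
    foreach ($bytes as $byte) {
        // Single quotes keep "\x" literal; %02x prints two lowercase hex digits.
        $output .= sprintf('\x%02x', $byte);
    }
    return $output;
}

echo utf8EscapeFromCodepoint(0x200A), "\n"; // \xe2\x80\x8a
echo utf8EscapeFromCodepoint(0x221E), "\n"; // \xe2\x88\x9e
```

Padding each byte to two hex digits keeps the output unambiguous for bytes below 0x10.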
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($msg);
