I'm trying to detect the emoji that I get through e.g. a POST (the source ist not necessary).
As an example I'm using this emoji: ✊🏾 (I hope it's visible)
The code for it is U+270A U+1F3FE (I'm using http://unicode.org/emoji/charts/full-emoji-list.html for the codes)
Now I converted the emoji with json_encode and I get: \u270a\ud83c\udffe
Here the only part that is equal is 270a. \ud83c\udffe is not equal to U+1F3FE, not even if I add them together (1B83A)
How do I get from ✊🏾 to U+270A U+1F3FE with e.g. php?
Use mb_convert_encoding and convert from UTF-8 to UTF-32. Then do some additional formatting:
// Strips leading zeros
// And returns str in UPPERCASE letters with a U+ prefix
function format($str) {
$copy = false;
$len = strlen($str);
$res = '';
for ($i = 0; $i < $len; ++$i) {
$ch = $str[$i];
if (!$copy) {
if ($ch != '0') {
$copy = true;
}
// Prevent format("0") from returning ""
else if (($i + 1) == $len) {
$res = '0';
}
}
if ($copy) {
$res .= $ch;
}
}
return 'U+'.strtoupper($res);
}
function convert_emoji($emoji) {
// ✊🏾 --> 0000270a0001f3fe
$emoji = mb_convert_encoding($emoji, 'UTF-32', 'UTF-8');
$hex = bin2hex($emoji);
// Split the UTF-32 hex representation into chunks
$hex_len = strlen($hex) / 8;
$chunks = array();
for ($i = 0; $i < $hex_len; ++$i) {
$tmp = substr($hex, $i * 8, 8);
// Format each chunk
$chunks[$i] = format($tmp);
}
// Convert chunks array back to a string
return implode($chunks, ' ');
}
echo convert_emoji('✊🏾'); // U+270A U+1F3FE
Simple function, inspired by #d3L answer above
function emoji_to_unicode($emoji) {
$emoji = mb_convert_encoding($emoji, 'UTF-32', 'UTF-8');
$unicode = strtoupper(preg_replace("/^[0]+/","U+",bin2hex($emoji)));
return $unicode;
}
Exmaple
emoji_to_unicode("💵");//returns U+1F4B5
You can do like this, consider the emoji a normal character.
$emoji = "✊🏾";
$str = str_replace('"', "", json_encode($emoji, JSON_HEX_APOS));
$myInput = $str;
$myHexString = str_replace('\\u', '', $myInput);
$myBinString = hex2bin($myHexString);
print iconv("UTF-16BE", "UTF-8", $myBinString);
Related
I know that Bulgarian-MIK character set can be converted with the ord function and adding 64, and the bulgarian-MIK characters are from 127 to 191 but i cant get the letter "а"(ord - 127).I tried a lot of ways but it seems that php is processing "а" with a blank symbol and i cant get it.
define("PHP_NL", "<br>");
$string = '-------------- 1 --------------'.PHP_NL;
$string .= '413 …±Ї°Ґ±® €¶® X1.000'.PHP_NL;
$string .= '358 ЊЁ ‚®¤ 0.5 X1.000'.PHP_NL;
$string .= '--------------------------------'.PHP_NL;
$string .= '1 -Ђ¤°Ё - ЊЂ‘Ђ: 1 - 6'.PHP_NL;
$string .= '17-08-2018 09:05:32'.PHP_NL;
$string .= '--------------------------------';
That is my string with Bulgarian-MIK encoding.I tried to convert it and i every letter is converted fine, but only "а" i cant get.
My function
function ConvertDosToWin($string) {
$chr = null;
for ($i = 1;$i<strlen($string);$i++) {
$chr = mb_convert_encoding($string[$i],'utf-8','windows-1251');
if((ord($chr) >= 127) && (ord($chr)<=(127+64)) ) {
echo 'inside if';
$string[$i] = chr(ord($chr)+64);
}
}
return $string;
}
I fixed the problem using iconv.
function ConvertWinToDos($string) {
$chr = null;
for ($i = 1;$i<strlen($string);$i++) {
$string = iconv(mb_detect_encoding($string,mb_detect_order(),true),'windows-1251',$string);
$chr = $string[$i];
if ((ord($chr) >= 192) && (ord($chr) <= 255)) {
$string[$i] = chr(ord($chr) - 64);
}
}
return $string;
}
I think that this approach may help. I used this in an old project and next is working example. PHP file is Windows-1251 encoded. If your text is in different encoding, you need to convert text using mb_convert_encoding() or iconv(), because ord() returns the binary value of the first byte of text as an unsigned integer between 0 and 255.
Test.php:
<?php
// Functions
function ConvertDosToWin($string) {
$chr = null;
for ($i = 0; $i<strlen($string); $i++) {
if ((ord($chr) >= 128) && (ord($chr) <= 191)) {
$string[$i] = chr(ord($chr) + 64);
}
}
return $string;
}
function ConvertWinToDos($string) {
$chr = null;
for ($i = 0; $i<strlen($string); $i++) {
$chr = $string[$i];
if ((ord($chr) >= 192) && (ord($chr) <= 255)) {
$string[$i] = chr(ord($chr) - 64);
}
}
return $string;
}
// Output
$text = 'АБВГДЕЖЗИЙ';
$text = ConvertWinToDos($text);
file_put_contents('dos.txt', $text);
?>
I wrote this function that creates a random string of UTF-8 characters. It works well, but the regular expression [^\p{L}] is not filtering all non-letter characters it seems. I can't think of a better way to generate the full range of unicode without non-letter characters.. short of manually searching for and defining the decimal letter ranges between 65 and 65533.
function rand_str($max_length, $min_length = 1, $utf8 = true) {
static $utf8_chars = array();
if ($utf8 && !$utf8_chars) {
for ($i = 1; $i <= 65533; $i++) {
$utf8_chars[] = mb_convert_encoding("&#$i;", 'UTF-8', 'HTML-ENTITIES');
}
$utf8_chars = preg_replace('/[^\p{L}]/u', '', $utf8_chars);
foreach ($utf8_chars as $i => $char) {
if (trim($utf8_chars[$i])) {
$chars[] = $char;
}
}
$utf8_chars = $chars;
}
$chars = $utf8 ? $utf8_chars : str_split('abcdefghijklmnopqrstuvwxyz');
$num_chars = count($chars);
$string = '';
$length = mt_rand($min_length, $max_length);
for ($i = 0; $i < $length; $i++) {
$string .= $chars[mt_rand(1, $num_chars) - 1];
}
return $string;
}
\p{L} might be catching too much. Try to limit to {Ll} and {LU} -- {L} includes {Lo} -- others.
With PHP7 and IntlChar there is now a better way:
function utf8_random_string(int $length) : string {
$r = "";
for ($i = 0; $i < $length; $i++) {
$codePoint = mt_rand(0x80, 0xffff);
$char = \IntlChar::chr($codePoint);
if ($char !== null && \IntlChar::isprint($char)) {
$r .= $char;
} else {
$i--;
}
}
return $r;
}
I need to scramble/encode all e-mail addresses in a string, turn them into links and leave the rest of the string intact?
I'm using
$withlinks = preg_replace("/([\w-?&;#~=\.\/]+\#(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","$1",$nolinks);
to make links out of e-mails and
function encode_email($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8*(3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
to encode the addresses but I can't figure out how to put them together, so that only e-mails would be encoded and turned into links.
You can use preg_replace_callback so that you can manipulate the replacement text to be exactly what you want...
<?php
// test string
$nolinks = "amy#winehous.com is an email for bobby#fisher.com plays chess";
// your original function
function encode_email($str)
{
$str = mb_convert_encoding($str, 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8 * (3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
// function used for callback
function encode_email_and_add_link($in)
{
// get encoded email address (don't actually know what this function does)
$encoded = encode_email($in[1]);
// return a hyperlink string built with encoded email address
return "$encoded";
}
// do the regex with callback
$withlinks = preg_replace_callback("/([\w-?&;#~=\.\/]+\#(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i", 'encode_email_and_add_link', $nolinks);
// output the results
echo $withlinks;
I have a problem when using the wordwrap() function in php with for example chinese characters. When the $cut parameter in the wordwrap function is set to true, it ruins the string by inserting question marks.
Is there a solution to this?
The native wordwrap function is not safe to use for unicode. Here's a mb_wordwrap by Sam B.:
<?php
/**
* Multibyte capable wordwrap
*
* #param string $str
* #param int $width
* #param string $break
* #return string
*/
function mb_wordwrap($str, $width=74, $break="\r\n")
{
// Return short or empty strings untouched
if(empty($str) || mb_strlen($str, 'UTF-8') <= $width)
return $str;
$br_width = mb_strlen($break, 'UTF-8');
$str_width = mb_strlen($str, 'UTF-8');
$return = '';
$last_space = false;
for($i=0, $count=0; $i < $str_width; $i++, $count++)
{
// If we're at a break
if (mb_substr($str, $i, $br_width, 'UTF-8') == $break)
{
$count = 0;
$return .= mb_substr($str, $i, $br_width, 'UTF-8');
$i += $br_width - 1;
continue;
}
// Keep a track of the most recent possible break point
if(mb_substr($str, $i, 1, 'UTF-8') == " ")
{
$last_space = $i;
}
// It's time to wrap
if ($count > $width)
{
// There are no spaces to break on! Going to truncate :(
if(!$last_space)
{
$return .= $break;
$count = 0;
}
else
{
// Work out how far back the last space was
$drop = $i - $last_space;
// Cutting zero chars results in an empty string, so don't do that
if($drop > 0)
{
$return = mb_substr($return, 0, -$drop);
}
// Add a break
$return .= $break;
// Update pointers
$i = $last_space + ($br_width - 1);
$last_space = false;
$count = 0;
}
}
// Add character from the input string to the output
$return .= mb_substr($str, $i, 1, 'UTF-8');
}
return $return;
}
?>
I want to convert this hello#domain.com to
hello#domain.com
I have tried:
url_encode($string)
this provides the same string I entered, returned with the # symbol converted to %40
also tried:
htmlentities($string)
this provides the same string right back.
I am using a UTF8 charset. not sure if this makes a difference....
Here it goes (assumes UTF-8, but it's trivial to change):
function encode($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8*(3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
EDIT Recommended alternative using unpack:
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
Much easier way to do this:
function convertToNumericEntities($string) {
$convmap = array(0x80, 0x10ffff, 0, 0xffffff);
return mb_encode_numericentity($string, $convmap, "UTF-8");
}
You can change the encoding if you are using anything different.
Fixed map range. Thanks to Artefacto.
function uniord($char) {
$k=mb_convert_encoding($char , 'UTF-32', 'UTF-8');
$k1=ord(substr($k,0,1));
$k2=ord(substr($k,1,1));
$value=(string)($k2*256+$k1);
return $value;
}
the above function works for 1 character but if you have a string you can do like this
$string="anytext";
$arr=preg_split(//u,$string,-1,PREG_SPLIT_NO_EMPTY);
$temp=" ";
foreach($arr as $v){
$temp="&#".uniord($v);//prints the equivalent html entity of string
}