How to convert all characters to their html entity equivalent using PHP - php

I want to convert this hello@domain.com to
&#104;&#101;&#108;&#108;&#111;&#64;&#100;&#111;&#109;&#97;&#105;&#110;&#46;&#99;&#111;&#109;
I have tried:
urlencode($string)
this returns the same string I entered, with only the @ symbol converted to %40
also tried:
htmlentities($string)
this provides the same string right back.
I am using the UTF-8 charset; I'm not sure whether this makes a difference.

Here it goes (assumes UTF-8, but it's trivial to change):
function encode($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8*(3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
EDIT Recommended alternative using unpack:
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}

Much easier way to do this:
function convertToNumericEntities($string) {
$convmap = array(0x80, 0x10ffff, 0, 0xffffff);
return mb_encode_numericentity($string, $convmap, "UTF-8");
}
You can change the encoding if you are using anything different.
Fixed map range. Thanks to Artefacto.
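Note that a conversion map starting at 0x80 leaves ASCII characters as they are, so hello@domain.com comes back unchanged. A minimal sketch, assuming the goal is to encode every character including ASCII; the helper name encode_all_entities and the 0x1FFFFF mask are choices made for this example, not part of the answer above:

```php
<?php
// Hypothetical helper: encodes every code point, ASCII included.
function encode_all_entities(string $s): string
{
    // Convmap layout: [start, end, offset, mask].
    // 0x0..0x10FFFF covers the whole Unicode range.
    $convmap = [0x0, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($s, $convmap, 'UTF-8');
}

$encoded = encode_all_entities('hello@domain.com');
echo $encoded, "\n";
// Decoding the entities should give back the original string.
echo html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n";
```

Keeping the start at 0x80 is still a reasonable choice when only non-ASCII characters need escaping.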

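function uniord($char) {
// UTF-32LE stores the code point as 4 bytes, least significant byte first
$k=mb_convert_encoding($char , 'UTF-32LE', 'UTF-8');
$k1=ord(substr($k,0,1));
$k2=ord(substr($k,1,1));
$k3=ord(substr($k,2,1));
$value=(string)($k3*65536+$k2*256+$k1);
return $value;
}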
The function above works for a single character; for a whole string you can do this:
$string="anytext";
$arr=preg_split('//u',$string,-1,PREG_SPLIT_NO_EMPTY);
$temp="";
foreach($arr as $v){
$temp.="&#".uniord($v).";";//appends the numeric HTML entity for each character
}
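On newer PHP versions the per-character code point lookup can be left to mbstring. A minimal sketch, assuming PHP 7.2 or later (for mb_ord) with the mbstring extension enabled; the function name entities_via_mb_ord is made up for this example:

```php
<?php
// Sketch assuming PHP 7.2+ (mb_ord) and the mbstring extension.
function entities_via_mb_ord(string $s): string
{
    // Split into UTF-8 characters, the same way the preg_split call above does.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    foreach ($chars as $ch) {
        // mb_ord returns the Unicode code point of the first character.
        $out .= '&#' . mb_ord($ch, 'UTF-8') . ';';
    }
    return $out;
}

echo entities_via_mb_ord('anytext'), "\n";
```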

Related

Encode all characters to entities

I would like to convert all characters to character entities as spam protection for an email address. I need entities in this format:
&#121;&#111;&#117; ...
just like on this website (done there with JS):
character entities encoding
Is there a simple way to do that in PHP, e.g. with built-in functions?
function encode2($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8');
$t = unpack("N*", $str);
$t = array_map(function($n) { return "&#$n;"; }, $t);
return implode("", $t);
}
or
function encode($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8*(3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}

convert emoji to their hex code

I'm trying to detect the emoji that I get through e.g. a POST (the source is not important).
As an example I'm using this emoji: ✊🏾 (I hope it's visible)
The code for it is U+270A U+1F3FE (I'm using http://unicode.org/emoji/charts/full-emoji-list.html for the codes)
Now I converted the emoji with json_encode and I get: \u270a\ud83c\udffe
Here the only part that is equal is 270a. \ud83c\udffe is not equal to U+1F3FE, not even if I add them together (1B83A)
How do I get from ✊🏾 to U+270A U+1F3FE with e.g. php?
Use mb_convert_encoding and convert from UTF-8 to UTF-32. Then do some additional formatting:
// Strips leading zeros
// And returns str in UPPERCASE letters with a U+ prefix
function format($str) {
$copy = false;
$len = strlen($str);
$res = '';
for ($i = 0; $i < $len; ++$i) {
$ch = $str[$i];
if (!$copy) {
if ($ch != '0') {
$copy = true;
}
// Prevent format("0") from returning ""
else if (($i + 1) == $len) {
$res = '0';
}
}
if ($copy) {
$res .= $ch;
}
}
return 'U+'.strtoupper($res);
}
function convert_emoji($emoji) {
// ✊🏾 --> 0000270a0001f3fe
$emoji = mb_convert_encoding($emoji, 'UTF-32', 'UTF-8');
$hex = bin2hex($emoji);
// Split the UTF-32 hex representation into chunks
$hex_len = strlen($hex) / 8;
$chunks = array();
for ($i = 0; $i < $hex_len; ++$i) {
$tmp = substr($hex, $i * 8, 8);
// Format each chunk
$chunks[$i] = format($tmp);
}
// Convert chunks array back to a string
// Separator first: the legacy (array, separator) order was removed in PHP 8.0
return implode(' ', $chunks);
}
echo convert_emoji('✊🏾'); // U+270A U+1F3FE
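As for why \ud83c\udffe does not look like U+1F3FE: json_encode emits UTF-16 escapes, and code points above U+FFFF are written as a surrogate pair. The two halves are combined by the UTF-16 decoding formula rather than by addition. A small sketch of that arithmetic (the function name is made up for this example):

```php
<?php
// Combine a UTF-16 surrogate pair into a single Unicode code point.
function surrogates_to_codepoint(int $high, int $low): int
{
    // The high surrogate carries the top 10 bits, the low surrogate the
    // bottom 10 bits; 0x10000 is the offset of the supplementary planes.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", surrogates_to_codepoint(0xD83C, 0xDFFE)); // U+1F3FE
```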
A simple function, inspired by @d3L's answer above:
function emoji_to_unicode($emoji) {
$emoji = mb_convert_encoding($emoji, 'UTF-32', 'UTF-8');
$unicode = strtoupper(preg_replace("/^[0]+/","U+",bin2hex($emoji)));
return $unicode;
}
Example:
emoji_to_unicode("💵");//returns U+1F4B5
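Because emoji_to_unicode only strips the zeros at the very start of the hex string, it is limited to a single code point: for ✊🏾 it would return U+270A0001F3FE. A possible variant for sequences, sketched with unpack; the name codepoints_of is hypothetical, and it pads to at least four hex digits (U+0041), which differs slightly from format() above:

```php
<?php
// Hypothetical helper: lists every code point of a UTF-8 string.
function codepoints_of(string $s): string
{
    // UTF-32BE stores each code point as one 32-bit big-endian integer.
    $utf32 = mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
    // 'N*' unpacks all unsigned 32-bit big-endian integers.
    $points = unpack('N*', $utf32);
    return implode(' ', array_map(function ($n) {
        return sprintf('U+%04X', $n);
    }, $points));
}

echo codepoints_of("\u{270A}\u{1F3FE}"), "\n"; // U+270A U+1F3FE
```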
You can do it like this, treating the emoji as a normal character:
$emoji = "✊🏾";
$str = str_replace('"', "", json_encode($emoji, JSON_HEX_APOS));
$myInput = $str;
$myHexString = str_replace('\\u', '', $myInput);
$myBinString = hex2bin($myHexString);
print iconv("UTF-16BE", "UTF-8", $myBinString);

mb_stripos() in PHP won't work correctly

This code:
setlocale(LC_ALL, 'pl_PL', 'pl', 'Polish_Poland.28592');
$result = mb_stripos("ĘÓĄŚŁŻŹĆŃ",'ęóąśłżźćń');
returns false;
How to fix that?
P.S. The question "stripos returns false when special characters is used" does not contain a correct answer for this case.
UPDATE: I made a test:
function test() {
$search = "zawór"; $searchlen=strlen($search);
$opentag="<valve>"; $opentaglen=strlen($opentag);
$closetag="</valve>"; $closetaglen=strlen($closetag);
$test[0]['input']="test ZAWÓR test"; //normal test
$test[1]['input']="X\nX\nX ZAWÓR X\nX\nX"; //white char test
$test[2]['input']="<br> ZAWÓR <br>"; //html newline test
$test[3]['input']="ĄąĄą ZAWÓR ĄąĄą"; //polish diacritical test
$test[4]['input']="テスト ZAWÓR テスト"; //japanese katakana test
foreach ($test as $key => $val) {
$position = mb_stripos($val['input'],$search,0,'UTF-8');
if($position!=false) {
$output = $val['input'];
$output = substr_replace($output, $opentag, $position, 0);
$output = substr_replace($output, $closetag, $position+$opentaglen+$searchlen, 0);
$test[$key]['output'] = $output;
}
else {
$test[$key]['output'] = null;
}
}
return $test;
}
FIREFOX OUTPUT:
$test[0]['output'] == "test <valve>ZAWÓR</valve> test" // ok
$test[1]['output'] == "X\nX\nX <valve>ZAWÓR</valve> X\nX\nX" // ok
$test[2]['output'] == "<br> <valve>ZAWÓR</valve> <br>" // ok
$test[3]['output'] == "Ąą�<valve>�ą ZA</valve>WÓR ĄąĄą" // WTF??
$test[4]['output'] == "テ�<valve>��ト </valve>ZAWÓR テスト" // WTF??
Solution https://drupal.org/node/1107268 does not change anything.
The function works fine when told what encoding your strings are in:
var_dump(mb_stripos("ĘÓĄŚŁŻŹĆŃ",'ęóąśłżźćń', 0, 'UTF-8')); // 0
                                                ^^^^^^^
Without the explicit encoding argument, it may assume the wrong encoding and cannot treat your string correctly.
The problem with your test code is that you're mixing character-based indices with byte-offset-based indices. mb_strpos returns offsets in characters, while substr_replace works with byte offsets. Read about the topic here: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you want to wrap a certain word in tags in a multi-byte string, I'd rather suggest this approach:
preg_replace('/zawór/iu', '<valve>$0</valve>', $text)
Note that $text must be UTF-8 encoded, /u regular expressions only work with UTF-8.
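If the position-based approach is preferred over a regular expression, the indices have to stay in characters throughout. A minimal sketch using only mb_* functions; wrap_first_match is a hypothetical name, and it assumes case folding does not change the length of the matched text (true for the Polish examples here, but not guaranteed for every script):

```php
<?php
// Hypothetical helper: wraps the first case-insensitive match in tags,
// using character offsets for every step.
function wrap_first_match(string $text, string $search, string $open, string $close): string
{
    $pos = mb_stripos($text, $search, 0, 'UTF-8');
    if ($pos === false) { // strict check, so a match at offset 0 still counts
        return $text;
    }
    $len = mb_strlen($search, 'UTF-8');
    return mb_substr($text, 0, $pos, 'UTF-8')
        . $open . mb_substr($text, $pos, $len, 'UTF-8') . $close
        . mb_substr($text, $pos + $len, null, 'UTF-8');
}

// Byte length and character length differ for multi-byte text:
var_dump(strlen('ĄąĄą'), mb_strlen('ĄąĄą', 'UTF-8')); // int(8) int(4)
echo wrap_first_match('ĄąĄą ZAWÓR ĄąĄą', 'zawór', '<valve>', '</valve>'), "\n";
```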
I'm not sure why the mb_stripos function did not work, but the workaround below will work:
$str = mb_convert_case("ęóąśłżźćń", MB_CASE_UPPER, "UTF-8");
$result = mb_strrichr($str,"ĘÓĄŚŁŻŹĆŃ");
var_dump($result);
Using your tip, dear Rikesh, I wrote that:
function patched_mb_stripos($content,$search) {
$content=mb_convert_case($content, MB_CASE_LOWER, "UTF-8");
$search=mb_convert_case($search, MB_CASE_LOWER, "UTF-8");
return mb_stripos($content,$search,0,"UTF-8");
}
and it seems to work :)
Solution from https://gist.github.com/stemar/8287074 :
function mb_substr_replace($string, $replacement, $start, $length=NULL) {
if (is_array($string)) {
$num = count($string);
// $replacement
$replacement = is_array($replacement) ? array_slice($replacement, 0, $num) : array_pad(array($replacement), $num, $replacement);
// $start
if (is_array($start)) {
$start = array_slice($start, 0, $num);
foreach ($start as $key => $value)
$start[$key] = is_int($value) ? $value : 0;
}
else {
$start = array_pad(array($start), $num, $start);
}
// $length
if (!isset($length)) {
$length = array_fill(0, $num, 0);
}
elseif (is_array($length)) {
$length = array_slice($length, 0, $num);
foreach ($length as $key => $value)
$length[$key] = isset($value) ? (is_int($value) ? $value : $num) : 0;
}
else {
$length = array_pad(array($length), $num, $length);
}
// Recursive call
return array_map(__FUNCTION__, $string, $replacement, $start, $length);
}
preg_match_all('/./us', (string)$string, $smatches);
preg_match_all('/./us', (string)$replacement, $rmatches);
if ($length === NULL) $length = mb_strlen($string);
array_splice($smatches[0], $start, $length, $rmatches[0]);
return join("",$smatches[0]);
}
solves the problem with function test()

PHP Calling a function on matches in regexp

I need to scramble/encode all e-mail addresses in a string, turn them into links and leave the rest of the string intact.
I'm using
$withlinks = preg_replace("/([\w\-?&;#~=\.\/]+@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<a href=\"mailto:$1\">$1</a>",$nolinks);
to make links out of e-mails and
function encode_email($str) {
$str = mb_convert_encoding($str , 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8*(3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
to encode the addresses but I can't figure out how to put them together, so that only e-mails would be encoded and turned into links.
You can use preg_replace_callback so that you can manipulate the replacement text to be exactly what you want...
<?php
// test string
$nolinks = "amy@winehous.com is an email for bobby@fisher.com plays chess";
// your original function
function encode_email($str)
{
$str = mb_convert_encoding($str, 'UTF-32', 'UTF-8'); //big endian
$split = str_split($str, 4);
$res = "";
foreach ($split as $c) {
$cur = 0;
for ($i = 0; $i < 4; $i++) {
$cur |= ord($c[$i]) << (8 * (3 - $i));
}
$res .= "&#" . $cur . ";";
}
return $res;
}
// function used for callback
function encode_email_and_add_link($in)
{
// get the address as numeric HTML entities (one &#NNN; per character)
$encoded = encode_email($in[1]);
// return a hyperlink string built with encoded email address
return "<a href=\"mailto:$encoded\">$encoded</a>";
}
// do the regex with callback (the hyphen after \w is escaped so PCRE2 does not read it as a range)
$withlinks = preg_replace_callback("/([\w\-?&;#~=\.\/]+@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i", 'encode_email_and_add_link', $nolinks);
// output the results
echo $withlinks;
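The same idea can be written more compactly with a closure. This is only a sketch under a few assumptions: the address pattern is a deliberately simplified, hypothetical one (not a full RFC 5322 matcher), the result is assumed to be a mailto link, and obfuscate_emails is a made-up name:

```php
<?php
// Hypothetical helper: entity-encodes each e-mail address and links it.
function obfuscate_emails(string $text): string
{
    // Simplified address pattern, chosen for the example only.
    $pattern = '/[\w.+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/i';
    return preg_replace_callback($pattern, function (array $m): string {
        // Encode every character of the match as a numeric entity.
        $enc = mb_encode_numericentity($m[0], [0x0, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
        return '<a href="mailto:' . $enc . '">' . $enc . '</a>';
    }, $text);
}

echo obfuscate_emails('amy@winehous.com is an email'), "\n";
```

Text outside the matches is passed through untouched, which is what keeps the rest of the string intact.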

How to convert text to \x codes?

I want to convert normal text to \x codes, e.g. \x14\x65\x60
For example :
normal text = "base64_decode"
converted \x codes for above text = "\x62\141\x73\145\x36\64\x5f\144\x65\143\x6f\144\x65"
How to do this? Thanks in advance.
PHP 5.3 one-liner:
echo preg_replace_callback("/./", function($matched) {
return '\x'.dechex(ord($matched[0]));
}, 'base64_decode');
Outputs \x62\x61\x73\x65\x36\x34\x5f\x64\x65\x63\x6f\x64\x65
The ord() function gives you the decimal value for a single byte. dechex() converts it to hex. So to do this, loop through every character in the string and apply both functions.
$str = 'base64_decode';
$length = strlen($str);
$result = '';
for ($i = 0; $i < $length; $i++) $result .= '\\x'.str_pad(dechex(ord($str[$i])),2,'0',STR_PAD_LEFT);
print($result);
Here's working code:
function make_hexcodes($text) {
$retval = '';
for($i = 0; $i < strlen($text); ++$i) {
$retval .= '\x'.dechex(ord($text[$i]));
}
return $retval;
}
echo make_hexcodes('base64_decode');
See it in action.
For an alternative to dechex(ord()) you can also use bin2hex($char), sprintf('\x%02X') or unpack('H*', $char). Additionally instead of using preg_replace_callback, you can use array_map with str_split.
Hexadecimal Encoding: https://3v4l.org/Ai3HZ
bin2hex
$word = 'base64_decode';
echo implode(array_map(function($char) {
return '\x' . bin2hex($char);
}, (array) str_split($word)));
unpack
$word = 'base64_decode';
echo implode(array_map(function($char) {
return '\x' . implode(unpack('H*', $char));
}, (array) str_split($word)));
sprintf
$word = 'base64_decode';
echo implode(array_map(function($char) {
return sprintf('\x%02x', ord($char)); // lowercase %02x matches the result shown below
}, (array) str_split($word)));
Result
\x62\x61\x73\x65\x36\x34\x5f\x64\x65\x63\x6f\x64\x65
Hexadecimal Decoding
To decode the encoded string back to the plain-text, use one of the following methods.
$encoded = '\x62\x61\x73\x65\x36\x34\x5f\x64\x65\x63\x6f\x64\x65';
$hexadecimal = str_replace('\x', '', $encoded);
hex2bin
echo hex2bin($hexadecimal);
pack
echo pack('H*', $hexadecimal);
sscanf + vprintf
vprintf(str_repeat('%c', count($f = sscanf($hexadecimal, str_repeat('%02X', substr_count($encoded , '\x'))))), $f);
Result
base64_decode
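One detail worth keeping in mind: dechex() without padding yields a single digit for bytes below 0x10 (a newline becomes \xa), which the decoders above cannot reverse reliably. A round-trip sketch that always emits two digits per byte; both function names are made up for this example:

```php
<?php
// Hypothetical helpers for a lossless \xNN round trip.
function to_hex_escapes(string $s): string
{
    if ($s === '') {
        return '';
    }
    // bin2hex always produces two lowercase hex digits per byte.
    return '\x' . implode('\x', str_split(bin2hex($s), 2));
}

function from_hex_escapes(string $s): string
{
    // Drop the \x markers, then turn the hex digits back into bytes.
    return hex2bin(str_replace('\x', '', $s));
}

echo to_hex_escapes('base64_decode'), "\n";
echo from_hex_escapes(to_hex_escapes('base64_decode')), "\n"; // base64_decode
```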
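I can't decode this code: \ud83d\udc33 should become 🐳
function unicode_decode(string $str)
{
// Match each run of consecutive \uXXXX escapes so surrogate pairs stay together
return preg_replace_callback('/(?:\\\\u[0-9a-f]{4})+/i', function ($match) {
// Strip the \u markers and decode the remaining hex digits as UTF-16BE
$hex = str_ireplace('\\u', '', $match[0]);
return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
}, $str);
}
echo unicode_decode('Learn Docker in 12 Minutes \ud83d\udc33'); // Learn Docker in 12 Minutes 🐳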
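Since \uXXXX is the JSON escape syntax, json_decode can also do this decoding, surrogate pairs included. A minimal sketch, assuming the input contains no unescaped double quotes or other characters that are invalid inside a JSON string; decode_json_escapes is a hypothetical name:

```php
<?php
// Hypothetical helper: decodes \uXXXX escapes by treating the text as a JSON string.
function decode_json_escapes(string $s): string
{
    // Wrap the text in quotes so it parses as a JSON string literal.
    $decoded = json_decode('"' . $s . '"');
    // json_decode returns null on invalid input; fall back to the original text.
    return is_string($decoded) ? $decoded : $s;
}

echo decode_json_escapes('Learn Docker in 12 Minutes \ud83d\udc33'), "\n";
```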
