Get hexcode of html entities - php

I have a string such as "&euro;".
I want to convert it to hex to get the value "\u20AC" so that I can send it to Flash.
The same goes for all currency symbols:
&pound; -> \u00A3
&dollar; -> \u0024
etc.

First, note that &dollar; is not a known entity in HTML 4.01. It is, however, in HTML 5, and from PHP 5.4 on you can call html_entity_decode with ENT_QUOTES | ENT_HTML5 to decode it.
You have to decode the entity first and only then convert it:
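// assumes $str is in UTF-8 (or ASCII)
function foo($str) {
    // decode the entities into real UTF-8 characters first
    $dec = html_entity_decode($str, ENT_QUOTES, "UTF-8");
    // convert to UTF-16BE so that every code unit is exactly two bytes
    $enc = mb_convert_encoding($dec, "UTF-16BE", "UTF-8");
    $out = "";
    foreach (str_split($enc, 2) as $f) {
        // combine the two bytes into one 16-bit code unit and print it as \uXXXX
        $out .= "\\u" . sprintf("%04X", ord($f[0]) << 8 | ord($f[1]));
    }
    return $out;
}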
If you want to replace only the entities, you can use preg_replace_callback to match the entities and then use foo as a callback.
function repl_only_ent($str) {
    return preg_replace_callback('/&[^;]+;/',
        function($m) { return foo($m[0]); },
        $str);
}
echo repl_only_ent("&euro;foobar &acute;");
gives:
\u20ACfoobar \u00B4
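As a possible shortcut, json_encode already writes non-ASCII characters as \uXXXX escapes, so decoding the entities and JSON-encoding the result gets close to the same output. This is only a sketch: the helper name entities_to_u_escapes is made up for illustration, the hex digits come out in lowercase, ASCII characters such as $ are left as they are, and JSON escaping also applies to quotes and slashes.

```php
<?php
// Hypothetical helper (the name is not from the answers above).
function entities_to_u_escapes(string $str): string
{
    // ENT_HTML5 lets html_entity_decode recognise HTML 5 names such as &dollar;.
    $decoded = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // json_encode writes non-ASCII characters as \uXXXX and wraps the string in quotes.
    $json = json_encode($decoded);
    // Drop the surrounding double quotes.
    return substr($json, 1, -1);
}

echo entities_to_u_escapes('&euro; and &pound;'), "\n"; // \u20ac and \u00a3
```

If Flash needs uppercase hex digits, the foo() approach above is the safer choice.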

You might try the following function for string-to-hex conversion:
function strToHex($string) {
    $hex = '';
    for ($i = 0; $i < strlen($string); $i++) {
        // pad each byte to two hex digits so bytes below 0x10 are not shortened
        $hex .= sprintf('%02x', ord($string[$i]));
    }
    return $hex;
}
From Greg Winiarski, which is the fourth hit on Google.
Use it in combination with html_entity_decode(), so something like this:
$currency_symbol = "&euro;";
$hex = strToHex(html_entity_decode($currency_symbol, ENT_QUOTES, "UTF-8"));
This code is untested and may require further modification to return the exact result you require. As written, it returns the hex of the UTF-8 bytes (e282ac for the euro sign), not the \u20AC code point form.

Related

Encoding smileys in a string with mb_convert_encoding [duplicate]

How to convert a Unicode string to HTML entities? (HEX not decimal)
For example, convert Français to Fran&#xE7;ais.
For the missing hex-encoding in the related question:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
This is similar to @Baba's answer: it converts to UTF-32BE and then uses unpack and vsprintf for the formatting.
If you prefer iconv over mb_convert_encoding, it's similar:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
I find this string manipulation a bit clearer than the one in Get hexcode of html entities.
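On PHP 7.2 or later, the native mb_ord can stand in for the UTF-32BE conversion and unpack step. The following is a minimal sketch under that assumption; the function name to_hex_entities is hypothetical and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper; assumes PHP 7.2+ (native mb_ord) and valid UTF-8 input.
function to_hex_entities(string $input): string
{
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function (array $m): string {
        // mb_ord returns the Unicode code point of the matched character.
        return sprintf('&#x%X;', mb_ord($m[0], 'UTF-8'));
    }, $input);
}

echo to_hex_entities("Fran\u{E7}ais"), "\n"; // Fran&#xE7;ais
```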
Your string looks like UCS-4 encoding; you can try:
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
Output
string 'Fran&#xE7;ais' (length=13)
Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 symbols are left alone rather than converted to entities.
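The difference between the two functions can be seen in a small comparison. This is just an illustration with a made-up sample string:

```php
<?php
// Sample text: one non-ASCII letter plus markup that must be escaped.
$text = "Fran\u{E7}ais <b>";

// htmlspecialchars only touches &, <, > and quotes.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
// htmlentities also turns every character that has a named entity into that entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // Français &lt;b&gt;
echo $entities, "\n"; // Fran&ccedil;ais &lt;b&gt;
```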
Would like to document an alternative solution because it solved a similar problem for me.
I was using PHP's utf8_encode() to escape 'special' characters.
I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!):
function unicode2html($string) {
    // match \uXXXX escapes with exactly four hex digits, in either case
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
}
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
Hope this helps somebody in need :-)
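See How to get the character from unicode code point in PHP? for some code that allows you to do the following. The helper functions used below (such as mb_htmlentities, mb_html_entity_decode, codepoint_encode and codepoint_decode) come from that answer.
Example use: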
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Output :
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
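Since PHP 7.2 the mbstring extension ships its own mb_chr and mb_ord, so the basic code point conversions no longer need custom helpers. A short sketch using only the native functions (the second argument is the string encoding):

```php
<?php
// Native functions, available since PHP 7.2 with the mbstring extension.
$char = mb_chr(0x010F, 'UTF-8'); // the character U+010F, two bytes in UTF-8
$code = mb_ord($char, 'UTF-8');  // back to the code point

var_dump(strlen($char));  // int(2)
var_dump($code);          // int(271)
var_dump(dechex($code));  // string(3) "10f"
```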
You can also use mb_encode_numericentity, which is supported by PHP 4.0.6+ (the hex argument used below requires PHP 5.4+).
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.
Usage example:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
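foreach ($input as $str)
    $output[] = unicode2html($str);
Result:
$output = array(
    '&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
    '&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
    '&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
    '&#x22;&#x60;'
);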
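For comparison, here is a sketch with a much simpler convmap that only encodes non-ASCII characters and leaves all ASCII untouched. The convmap values are my own assumption for illustration (start U+0080, end U+10FFFF, no offset, full mask); the round trip through mb_decode_numericentity is a quick way to check that a convmap behaves as intended.

```php
<?php
// Assumed convmap: every code point from U+0080 up, no offset, full mask.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$original = "Meno pi\u{F9}";

// The fourth argument (true) asks for hexadecimal entities.
$encoded = mb_encode_numericentity($original, $convmap, 'UTF-8', true);
// Decoding with the same convmap should give the original string back.
$decoded = mb_decode_numericentity($encoded, $convmap, 'UTF-8');

echo $encoded, "\n";
var_dump($decoded === $original);
```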
This is a solution like @hakre's (Nov 8, 2012 at 0:35), but it produces named HTML entities where possible:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
// while both @hakre's and @Baba's code give:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"
But there is always a problem when the input contains invalid UTF-8, e.g.:
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Ob&oacute;z w&eogon;drowny Ko&lstrok;a - ok&lstrok;adka" in HTML ("\xB3" is ISO-8859-2/windows-1250 "ł")
but here
// => $output: (empty)
and the same happens with @hakre's code... :(
It was hard to find the cause. This is the only solution I know (does anyone know a simpler one?):
function utf_entities($input) {
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
if (empty($output) && (!empty($input))) { // Trouble... Maybe not UTF-8 code inside UTF-8 string...
/* Processing string against not UTF-8 chars... */
$output = ''; // New - repaired
for ($i=0; $i<strlen($input); $i++) {
if (($char = $input[$i])<"\x80") {
$output .= $char;
} else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (i.e. 0b11111111 etc.)
$j = 0; // how many chars more in UTF-8
$char = ord($char);
do { // checking first UTF-8 code char bits
$char = ($char << 1) % 0x100;
$j++;
} while (($j<4 /* 6 before RFC 3629 */)&& (($char & 0b11000000) === 0b11000000));
$k = $i+1;
if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
for ($k=$i+$j; $k>$i && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ; // ...checking next bytes for valid UTF-8 codes
}
if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
$output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
} else { // UTF=8 !
$output .= substr($input, $i, 1+$j);
$i += $j;
}
}
}
return utf_entities($output); // recursively after repairing
}
return $output;
}
I.e.:
echo utf_entities("o\xC5\x82a - k\xB3a"); // o&lstrok;a - k³a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oñ¸¸¸¸¸a - invalid UTF-8 (6-bytes UTF-8 valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// o񸸸a - k³a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o񸸸a - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// oña¸¸a - invalid UTF-8, fixed
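A possibly simpler variant of the repair step is to let a byte-mode regex separate well-formed UTF-8 sequences from stray bytes and to re-read only the stray bytes in a fallback encoding. This is a sketch, not a drop-in replacement for the function above: the name repair_utf8 is hypothetical, and ISO-8859-2 as the fallback is an assumption based on the "\xB3" example. The repaired string could then be passed on to the entity conversion.

```php
<?php
// Hypothetical helper; the ISO-8859-2 fallback is an assumption about where stray bytes come from.
function repair_utf8(string $input, string $fallback = 'ISO-8859-2'): string
{
    // Group 1: one well-formed UTF-8 sequence. Group 2: any other single byte.
    // No /u modifier, so the pattern works on raw bytes.
    $pattern = '/([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})|(.)/s';

    return preg_replace_callback($pattern, function (array $m) use ($fallback): string {
        // A stray byte is re-read in the fallback encoding; valid sequences pass through unchanged.
        return isset($m[2]) ? mb_convert_encoding($m[2], 'UTF-8', $fallback) : $m[1];
    }, $input);
}

echo repair_utf8("Ko\xC5\x82a - ok\xB3adka"), "\n"; // Koła - okładka
```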

How can I pad each multibyte character / emoji with spaces around it in a string?

I'd like to pad each multibyte character with spaces on either side. I can strip them out just fine, but I'd like to leave them in and just pad them.
For example: 👉😀👈 to 👉 😀 👈.
Using underscores to represent spaces: 👉😀👈 to _👉__😀__👈_
Use this monstrous, ready-made regex:
$regex = "[\\x{fe00}-\\x{fe0f}\\x{2712}\\x{2714}\\x{2716}\\x{271d}\\x{2721}\\x{2728}\\x{2733}\\x{2734}\\x{2744}\\x{2747}\\x{274c}\\x{274e}\\x{2753}-\\x{2755}\\x{2757}\\x{2763}\\x{2764}\\x{2795}-\\x{2797}\\x{27a1}\\x{27b0}\\x{27bf}\\x{2934}\\x{2935}\\x{2b05}-\\x{2b07}\\x{2b1b}\\x{2b1c}\\x{2b50}\\x{2b55}\\x{3030}\\x{303d}\\x{1f004}\\x{1f0cf}\\x{1f170}\\x{1f171}\\x{1f17e}\\x{1f17f}\\x{1f18e}\\x{1f191}-\\x{1f19a}\\x{1f201}\\x{1f202}\\x{1f21a}\\x{1f22f}\\x{1f232}-\\x{1f23a}\\x{1f250}\\x{1f251}\\x{1f300}-\\x{1f321}\\x{1f324}-\\x{1f393}\\x{1f396}\\x{1f397}\\x{1f399}-\\x{1f39b}\\x{1f39e}-\\x{1f3f0}\\x{1f3f3}-\\x{1f3f5}\\x{1f3f7}-\\x{1f4fd}\\x{1f4ff}-\\x{1f53d}\\x{1f549}-\\x{1f54e}\\x{1f550}-\\x{1f567}\\x{1f56f}\\x{1f570}\\x{1f573}-\\x{1f579}\\x{1f587}\\x{1f58a}-\\x{1f58d}\\x{1f590}\\x{1f595}\\x{1f596}\\x{1f5a5}\\x{1f5a8}\\x{1f5b1}\\x{1f5b2}\\x{1f5bc}\\x{1f5c2}-\\x{1f5c4}\\x{1f5d1}-\\x{1f5d3}\\x{1f5dc}-\\x{1f5de}\\x{1f5e1}\\x{1f5e3}\\x{1f5ef}\\x{1f5f3}\\x{1f5fa}-\\x{1f64f}\\x{1f680}-\\x{1f6c5}\\x{1f6cb}-\\x{1f6d0}\\x{1f6e0}-\\x{1f6e5}\\x{1f6e9}\\x{1f6eb}\\x{1f6ec}\\x{1f6f0}\\x{1f6f3}\\x{1f910}-\\x{1f918}\\x{1f980}-\\x{1f984}\\x{1f9c0}\\x{3297}\\x{3299}\\x{a9}\\x{ae}\\x{203c}\\x{2049}\\x{2122}\\x{2139}\\x{2194}-\\x{2199}\\x{21a9}\\x{21aa}\\x{231a}\\x{231b}\\x{2328}\\x{2388}\\x{23cf}\\x{23e9}-\\x{23f3}\\x{23f8}-\\x{23fa}\\x{24c2}\\x{25aa}\\x{25ab}\\x{25b6}\\x{25c0}\\x{25fb}-\\x{25fe}\\x{2600}-\\x{2604}\\x{260e}\\x{2611}\\x{2614}\\x{2615}\\x{2618}\\x{261d}\\x{2620}\\x{2622}\\x{2623}\\x{2626}\\x{262a}\\x{262e}\\x{262f}\\x{2638}-\\x{263a}\\x{2648}-\\x{2653}\\x{2660}\\x{2663}\\x{2665}\\x{2666}\\x{2668}\\x{267b}\\x{267f}\\x{2692}-\\x{2694}\\x{2696}\\x{2697}\\x{2699}\\x{269b}\\x{269c}\\x{26a0}\\x{26a1}\\x{26aa}\\x{26ab}\\x{26b0}\\x{26b1}\\x{26bd}\\x{26be}\\x{26c4}\\x{26c5}\\x{26c8}\\x{26ce}\\x{26cf}\\x{26d1}\\x{26d3}\\x{26d4}\\x{26e9}\\x{26ea}\\x{26f0}-\\x{26f5}\\x{26f7}-\\x{26fa}\\x{26fd}\\x{2702}\\x{2705}\\x{2708}-\\x{270d}\\x{270f}]|\\x{23}\\x{20e3}|\\x{2a}\\x{20e3}|\\x{30}\\x{20
e3}|\\x{31}\\x{20e3}|\\x{32}\\x{20e3}|\\x{33}\\x{20e3}|\\x{34}\\x{20e3}|\\x{35}\\x{20e3}|\\x{36}\\x{20e3}|\\x{37}\\x{20e3}|\\x{38}\\x{20e3}|\\x{39}\\x{20e3}|\\x{1f1e6}[\\x{1f1e8}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f2}\\x{1f1f4}\\x{1f1f6}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fd}\\x{1f1ff}]|\\x{1f1e7}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ef}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1e8}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ee}\\x{1f1f0}-\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}-\\x{1f1ff}]|\\x{1f1e9}[\\x{1f1ea}\\x{1f1ec}\\x{1f1ef}\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1ff}]|\\x{1f1ea}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ed}\\x{1f1f7}-\\x{1f1fa}]|\\x{1f1eb}[\\x{1f1ee}-\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1f7}]|\\x{1f1ec}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ee}\\x{1f1f1}-\\x{1f1f3}\\x{1f1f5}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1ed}[\\x{1f1f0}\\x{1f1f2}\\x{1f1f3}\\x{1f1f7}\\x{1f1f9}\\x{1f1fa}]|\\x{1f1ee}[\\x{1f1e8}-\\x{1f1ea}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}]|\\x{1f1ef}[\\x{1f1ea}\\x{1f1f2}\\x{1f1f4}\\x{1f1f5}]|\\x{1f1f0}[\\x{1f1ea}\\x{1f1ec}-\\x{1f1ee}\\x{1f1f2}\\x{1f1f3}\\x{1f1f5}\\x{1f1f7}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1f1}[\\x{1f1e6}-\\x{1f1e8}\\x{1f1ee}\\x{1f1f0}\\x{1f1f7}-\\x{1f1fb}\\x{1f1fe}]|\\x{1f1f2}[\\x{1f1e6}\\x{1f1e8}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1ff}]|\\x{1f1f3}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f4}\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}\\x{1f1ff}]|\\x{1f1f4}\\x{1f1f2}|\\x{1f1f5}[\\x{1f1e6}\\x{1f1ea}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1f3}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1f6}\\x{1f1e6}|\\x{1f1f7}[\\x{1f1ea}\\x{1f1f4}\\x{1f1f8}\\x{1f1fa}\\x{1f1fc}]|\\x{1f1f8}[\\x{1f1e6}-\\x{1f1ea}\\x{1f1ec}-\\x{1f1f4}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fd}-\\x{1f1ff}]|\\x{1f1f9}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ed}\\x{1f1ef}-\\x{1f1f4}\\x{1f1f7}\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1ff}]|\\x{1f1fa}[\\x{1f1e6}\\x{1f1ec}\\x{1f1f2}\\x{1f1f8}\\x{1f1fe}\\x{1f1ff}]|\\x
{1f1fb}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ee}\\x{1f1f3}\\x{1f1fa}]|\\x{1f1fc}[\\x{1f1eb}\\x{1f1f8}]|\\x{1f1fd}\\x{1f1f0}|\\x{1f1fe}[\\x{1f1ea}\\x{1f1f9}]|\\x{1f1ff}[\\x{1f1e6}\\x{1f1f2}\\x{1f1fc}]";
Inside a preg_replace_callback():
var_dump(preg_replace_callback("#$regex#u", function($match) {
    return " ".$match[0]." ";
}, '👉😀👈'));
Outputs:
string(18) " 👉  😀  👈 "
I found this function that someone had added in the PHP docs that splits a multibyte string into an array of characters (like str_split) and modified it.
function addSpaces($string) {
$strlen = mb_strlen($string);
$new_string = '';
while ($strlen) {
$char = mb_substr($string,0,1,"UTF-8");
if (strlen($char) > 1) {
$new_string .= " $char ";
} else {
$new_string .= $char;
}
$string = mb_substr($string,1,$strlen,"UTF-8");
$strlen = mb_strlen($string);
}
return $new_string;
}
This question has other ways to do that split that could be similarly modified. The modification is, if strlen of one of the split characters is greater than 1, then it's multibyte, so add the spaces.
Simple regex replace could work as well...
mb_regex_encoding("UTF-8");
echo mb_ereg_replace(
'([^\p{L}\s])',
' \\1 ',
'text 👉😀👈 other text 👉😀👈'
);
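outputs: "text  👉  😀  👈  other text  👉  😀  👈 " (note that this pattern pads every character that is neither a letter nor whitespace, so digits and punctuation get padded as well)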
function pad_emojis($string) {
$default_encoding = mb_regex_encoding();
mb_regex_encoding("UTF-8");
$string = mb_ereg_replace('([^\p{L}\s])', ' \\1 ', $string);
mb_regex_encoding($default_encoding);
return $string;
}
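If the emoji in question all live above U+FFFF (as 👉, 😀 and 👈 do), a much shorter pattern may be enough. The sketch below rests on that assumption: it treats every supplementary-plane character as an emoji, so it misses emoji in the Basic Multilingual Plane (for example ☀) and also pads non-emoji characters above U+FFFF. The function name pad_astral is made up for illustration.

```php
<?php
// Hypothetical helper; treats every character above U+FFFF as an "emoji".
function pad_astral(string $s): string
{
    // $0 is the whole match; wrap it in one space on each side.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', ' $0 ', $s);
}

var_dump(pad_astral("\u{1F449}\u{1F600}\u{1F448}")); // string(18) " 👉  😀  👈 "
```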

How can I split html value and normal string into different array in php?

Say I have string such as below:
"b<a=2<sup>2</sup>"
It is actually a formula. I need to display this formula on a web page, but everything after b is hidden because it is parsed as a broken anchor tag. I tried htmlspecialchars, but it turns the complete string into plain text, including the <sup> tags. I have tried some regexes, but I can only get the text between some tags.
UPDATE:
This seems to work with this formula:
"(c<a) = (b<a) = 2<sup>2</sup>"
And even with this formula:
"b<a=2<sup>2</sup>"
HERE'S THE MAGIC:
<?php
$_string = "b<a=2<sup>2</sup>";
$string = "(c<a) = (b<a) = 2<sup>2</sup>";
// positions of the real <sup> and </sup> tags, which must stay unescaped
$open_sup = strpos($string, "<sup>");
$close_sup = strpos($string, "</sup>");
$chars_array = str_split($string);
foreach ($chars_array as $index => $char) {
    if ($index != $open_sup && $index != $close_sup) {
        // any other "<" is part of the formula, so escape it
        if ($char == "<") {
            echo "&lt;";
        } else {
            echo $char;
        }
    } else {
        echo $char;
    }
}
OLD SOLUTION (DOESN'T WORK)
Maybe this can help:
I tried escaping the characters with backslashes, but that doesn't work as expected.
Then I tried this one:
<?php
$string = "b&lt;a=2<sup>2</sup>";
echo $string;
?>
Using the &lt; HTML entity seems to work, if I understood your problem...
Let me know.
You could probably add spaces, such as:
b < a = 2<sup>2</sup>
The tag does not disappear, and it looks much more readable.
You could try this regex approach, which should skip elements.
$regex = '/<(.*?)\h*.*>.+<\/\1>(*SKIP)(*FAIL)|(<|>)/';
$string = 'b<a=2<sup>2</sup>';
$string = preg_replace_callback($regex, function($match) {
return htmlentities($match[2]);
}, $string);
echo $string;
Output:
b&lt;a=2<sup>2</sup>
PHP Demo: https://eval.in/507605
Regex101: https://regex101.com/r/kD0iM0/1
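Another way to look at the problem is to escape everything first and then restore a small whitelist of tags. The sketch below assumes that only attribute-free tags such as <sup> and <sub> need to survive; the function name escape_except_tags and the default whitelist are made up for illustration.

```php
<?php
// Hypothetical helper; assumes the allowed tags never carry attributes.
function escape_except_tags(string $s, array $allowed = ['sup', 'sub']): string
{
    // Step 1: escape every <, > and & in the string.
    $escaped = htmlspecialchars($s, ENT_NOQUOTES, 'UTF-8');
    // Step 2: turn the escaped form of whitelisted tags back into real tags.
    $names = implode('|', array_map('preg_quote', $allowed));
    return preg_replace('~&lt;(/?(?:' . $names . '))&gt;~i', '<$1>', $escaped);
}

echo escape_except_tags('b<a=2<sup>2</sup>'), "\n"; // b&lt;a=2<sup>2</sup>
```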

Encode the url including hyphen(-) and dot(.) in php

I need the encoded URL for processing in one of the API, but it requires the full encoded URL. For example, the URL from:
http://test.site-raj.co/999999?lpp=1&px2=IjN
has to become an encoded URL, like:
http%3a%2f%2ftest%2esite%2draj%2eco%2f999999%3flpp%3d1%26px2%3dIjN
I need every symbol to be encoded, even the dot(.) and hyphen(-) like above.
Try this. Inside a function maybe if you are using it more than once...
$str = 'http://test.site.co/999999?lpp=1&p---x2=IjN';
$str = urlencode($str);
$str = str_replace('.', '%2E', $str);
$str = str_replace('-', '%2D', $str);
echo $str;
This will encode all characters that are not plain letters or numbers. You can still decode this with the standard urldecode or rawurldecode:
function urlencodeall($x) {
$out = '';
for ($i = 0; isset($x[$i]); $i++) {
$c = $x[$i];
if (!ctype_alnum($c)) $c = '%' . sprintf('%02X', ord($c));
$out .= $c;
}
return $out;
}
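Why don't you use rawurlencode?
For example: rawurlencode("http://test.site-raj.co/999999?lpp=1&px2=IjN"). Note, however, that rawurlencode leaves -, _, . and ~ unencoded, so the dot and hyphen would still need to be replaced separately.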
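Since rawurlencode leaves the four unreserved punctuation characters alone, combining it with strtr covers the remaining cases. A minimal sketch (the function name encode_everything is made up for illustration):

```php
<?php
// Hypothetical helper: percent-encode everything except letters and digits.
function encode_everything(string $url): string
{
    // rawurlencode handles all reserved characters but leaves - . _ ~ alone.
    $encoded = rawurlencode($url);
    // Encode those four characters by hand.
    return strtr($encoded, ['-' => '%2D', '.' => '%2E', '_' => '%5F', '~' => '%7E']);
}

echo encode_everything('http://test.site-raj.co/999999?lpp=1&px2=IjN'), "\n";
// http%3A%2F%2Ftest%2Esite%2Draj%2Eco%2F999999%3Flpp%3D1%26px2%3DIjN
```

The result still decodes with the standard rawurldecode.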

How to remove html special chars? [duplicate]

I am creating an RSS feed file for my application, in which I want to remove HTML tags, which is done by strip_tags. But strip_tags does not remove HTML special character codes such as:
&nbsp; &amp; &copy;
etc.
Please tell me a function I can use to remove these special character codes from my string.
Either decode them using html_entity_decode or remove them using preg_replace:
$Content = preg_replace("/&#?[a-z0-9]+;/i","",$Content);
EDIT: Alternative according to Jacco's comment:
"It might be nice to replace the '+' with {2,8} or something. This will limit the chance of replacing entire sentences when an unencoded '&' is present."
$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content);
Use html_entity_decode to convert HTML entities.
You'll need to set charset to make it work correctly.
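As a small illustration of passing the charset explicitly, here is a sketch with a made-up sample string. Note that &nbsp; decodes to the non-breaking space U+00A0, not to a plain space, so it is replaced separately here.

```php
<?php
// Sample input containing the three entities from the question.
$raw = 'Tom &amp; Jerry&nbsp;&copy; 2020';

// Decode with an explicit charset so the result is UTF-8.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// &nbsp; became U+00A0; swap it for an ordinary space.
$plain = str_replace("\u{A0}", ' ', $decoded);

echo $plain, "\n"; // Tom & Jerry © 2020
```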
In addition to the good answers above, PHP also has a built-in filter function that is quite useful: filter_var.
To remove HTML characters, use the following (note that FILTER_SANITIZE_STRING is deprecated as of PHP 8.1):
$cleanString = filter_var($dirtyString, FILTER_SANITIZE_STRING);
More info:
function.filter-var
filter_sanitize_string
You may want to take a look at htmlentities() and html_entity_decode():
$orig = "I'll \"walk\" the <b>dog</b> now";
$a = htmlentities($orig);
$b = html_entity_decode($a);
echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $b; // I'll "walk" the <b>dog</b> now
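This might work well to remove special characters (everything except letters, digits, underscore, dot, hyphen and whitespace). The hyphen goes last in the character class so that it is not read as a range:
$modifiedString = preg_replace("/[^a-zA-Z0-9_.\s-]/", "", $content);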
If you want to convert the HTML special characters rather than just remove them, as well as strip things down and prepare for plain text, this was the solution that worked for me...
function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);
    return $str;
}
$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';
htmlToPlainText($string);
// "this is ( ) a test. Yes this is! & does it get processed?"
html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like &#39;
htmlspecialchars_decode converts things like &amp;
html_entity_decode converts things like '&lt;
and strip_tags removes any HTML tags left over.
EDIT - Added str_replace('&nbsp;', ' ', $str); and several other html_entity_decode() calls, as continued testing has shown a need for them.
A plain vanilla strings way to do it without engaging the preg regex engine:
function remEntities($str) {
if(substr_count($str, '&') && substr_count($str, ';')) {
// Find amper
$amp_pos = strpos($str, '&');
//Find the ;
$semi_pos = strpos($str, ';');
// Only if the ; is after the &
if($semi_pos > $amp_pos) {
//is a HTML entity, try to remove
$tmp = substr($str, 0, $amp_pos);
$tmp = $tmp. substr($str, $semi_pos + 1, strlen($str));
$str = $tmp;
//Has another entity in it?
if(substr_count($str, '&') && substr_count($str, ';'))
$str = remEntities($tmp);
}
}
return $str;
}
What I did was use html_entity_decode and then strip_tags to remove them.
try this
<?php
$str = "\x8F!!!";
// Outputs an empty string
echo htmlentities($str, ENT_QUOTES, "UTF-8");
// Outputs "!!!"
echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
?>
It looks like what you really want is:
function xmlEntities($string) {
    $from = array();
    $to = array();
    $translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
    foreach ($translationTable as $char => $entity) {
        $from[] = $entity;
        // mb_ord (PHP 7.2+) returns the full code point; ord() would only see the first byte of a multibyte character
        $to[] = '&#'.mb_ord($char, 'UTF-8').';';
    }
    return str_replace($from, $to, $string);
}
It replaces the named entities with their numeric equivalents.
<?php
function strip_only($str, $tags, $stripContent = false) {
$content = '';
if(!is_array($tags)) {
$tags = (strpos($str, '>') !== false
? explode('>', str_replace('<', '', $tags))
: array($tags));
if(end($tags) == '') array_pop($tags);
}
foreach($tags as $tag) {
if ($stripContent)
$content = '(.+</'.$tag.'[^>]*>|)';
$str = preg_replace('#</?'.$tag.'[^>]*>'.$content.'#is', '', $str);
}
return $str;
}
$str = '<font color="red">red</font> text';
$tags = 'font';
$a = strip_only($str, $tags); // red text
$b = strip_only($str, $tags, true); // text
?>
The function I used to perform the task, incorporating the improvement made by schnaader, is:
mysql_real_escape_string(
    preg_replace_callback("/&#?[a-z0-9]+;/i", function($m) {
        // $m[0] is the whole matched entity (the pattern has no capture groups)
        return mb_convert_encoding($m[0], "UTF-8", "HTML-ENTITIES");
    }, strip_tags($row['cuerpo'])))
This removes every HTML tag and converts every HTML entity to UTF-8, ready to save in MySQL.
You can try htmlspecialchars_decode($string). It works for me.
http://www.w3schools.com/php/func_string_htmlspecialchars_decode.asp
If you are working in WordPress and are like me and simply need to check for an empty field (and there are a copious amount of random html entities in what seems like a blank string) then take a look at:
sanitize_title_with_dashes( string $title, string $raw_title = '', string $context = 'display' )
For people not working on WordPress: I found this function REALLY useful for creating my own sanitizer. Take a look at the full code; it is really in depth!
$string = "äáčé";
$convert = Array(
'ä'=>'a',
'Ä'=>'A',
'á'=>'a',
'Á'=>'A',
'à'=>'a',
'À'=>'A',
'ã'=>'a',
'Ã'=>'A',
'â'=>'a',
'Â'=>'A',
'č'=>'c',
'Č'=>'C',
'ć'=>'c',
'Ć'=>'C',
'ď'=>'d',
'Ď'=>'D',
'ě'=>'e',
'Ě'=>'E',
'é'=>'e',
'É'=>'E',
'ë'=>'e',
);
$string = strtr($string , $convert );
echo $string; //aace
What If By "Remove HTML Special Chars" You Meant "Replace Appropriately"?
After all, just look at your example...
&nbsp; &amp; &copy;
If you're stripping this for an RSS feed, shouldn't you want the equivalents?
" ", &, ©
Or maybe you don't exactly want the equivalents. Maybe you'd want &nbsp; to just be ignored (to prevent too much space), but then have &copy; actually get replaced. Let's work out a solution that solves anyone's version of this problem...
How to SELECTIVELY-REPLACE HTML Special Chars
The logic is simple: preg_match_all('/(&#[0-9]+;)/' grabs all of the matches, and then we simply build a list of matchables and replaceables, such as str_replace([searchlist], [replacelist], $term). Before we do this, we also need to convert named entities to their numeric counterparts, i.e., "&nbsp;" is unacceptable, but "&#160;" is fine. (Thanks to it-alien's solution to this part of the problem.)
Working Demo
In this demo, I replace &#123; with "HTML Entity #123". Of course, you can fine-tune this to any kind of find-replace you want for your case.
Why did I make this? I use it for generating Rich Text Format from UTF-8-encoded HTML.
function FixUTF8($args) {
$output = $args['input'];
$output = convertNamedHTMLEntitiesToNumeric(['input'=>$output]);
preg_match_all('/(&#[0-9]+;)/', $output, $matches, PREG_OFFSET_CAPTURE);
$full_matches = $matches[0];
$found = [];
$search = [];
$replace = [];
for($i = 0; $i < count($full_matches); $i++) {
$match = $full_matches[$i];
$word = $match[0];
if(empty($found[$word])) {
$found[$word] = TRUE;
$search[] = $word;
$replacement = str_replace(['&#', ';'], ['HTML Entity #', ''], $word);
$replace[] = $replacement;
}
}
$new_output = str_replace($search, $replace, $output);
return $new_output;
}
function convertNamedHTMLEntitiesToNumeric($args) {
$input = $args['input'];
return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
$c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
# return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below
$convmap = array(0x80, 0xffff, 0, 0xffff);
return mb_encode_numericentity($c, $convmap, 'UTF-8');
}, $input);
}
print(FixUTF8(['input'=>"Oggi &egrave; un bel&nbsp;giorno"]));
Input:
"Oggi &egrave; un bel&nbsp;giorno"
Output:
Oggi HTML Entity #232 un belHTML Entity #160giorno
