convert encoded html entities to utf-8


How do I convert this string into UTF-8:
&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041
I want to also convert this one:
&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29
I want to prevent XSS attacks and I am using this article as a cheat sheet https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
My strategy is to convert the above string to UTF-8 and check if it contains javascript.

I wrote two simple functions to decode the entities. Check:
$decimalHTML = '&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041';
$hexHTML = '&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29';
function getDecimalHTML($str) {
    // Turn each run of digits into its character, then drop the leftover "&#"
    return str_replace(
        '&#',
        '',
        preg_replace_callback(
            '/\d+/',
            function($v) {
                return str_replace(';', '', implode(array_map('chr', $v)));
            },
            $str
        )
    );
}
function getHexDecimalHTML($str) {
    // Turn the hex digits after each "x" into a character, then drop "&#" and "x"
    return str_replace(
        array('&#', 'x'),
        '',
        preg_replace_callback(
            '/(?<=x)\w+/',
            function($v) {
                return str_replace(';', '', implode(array_map('hex2bin', $v)));
            },
            $str
        )
    );
}
echo getDecimalHTML($decimalHTML) . "\n";
echo getHexDecimalHTML($hexHTML);
Output:
javascript:alert('XSS')
javascript:alert('XSS')
I used chr() to get the character for each ASCII code and hex2bin() to get the string from the hexadecimal code.
I recommend not reinventing the wheel: use a library that covers all aspects of this problem, such as AntiXSS.
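As a possible alternative to the two helpers, here is a minimal sketch that decodes decimal and hexadecimal references in a single pass, with or without the trailing semicolon. The name decodeNumericEntities is made up for illustration and the mbstring extension is assumed. Decoding alone is not an XSS defense. It only normalizes the input before whatever check you run on it.

```php
<?php
// Hypothetical helper (not from the original answer): decodes decimal and
// hexadecimal numeric character references to UTF-8 in one pass.
function decodeNumericEntities(string $str): string
{
    return preg_replace_callback(
        '/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?/',
        function (array $m): string {
            $isHex  = $m[1] !== '';
            $digits = ltrim($isHex ? $m[1] : $m[2], '0');
            if (strlen($digits) > 7) {
                return $m[0]; // far outside the Unicode range, leave untouched
            }
            $codePoint = $isHex ? hexdec($digits) : (int) $digits;
            $char = mb_chr($codePoint, 'UTF-8');
            return $char === false ? $m[0] : $char; // invalid code point: keep as-is
        },
        $str
    );
}

$decimalSample = '&#0000106&#0000097&#0000118&#0000097';
$hexSample     = '&#x6A&#x61&#x76&#x61;';

echo decodeNumericEntities($decimalSample), "\n"; // java
echo decodeNumericEntities($hexSample), "\n";     // java
```

Unlike the str_replace() approach, this leaves ordinary digits and the letter "x" elsewhere in the string untouched.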

Related

PHP escape unicode characters only

in Facebook validation documentation
Please note that we generate the signature using an escaped unicode
version of the payload, with lowercase hex digits. If you just
calculate against the decoded bytes, you will end up with a different
signature. For example, the string äöå should be escaped to
\u00e4\u00f6\u00e5.
I'm trying to write a unit test for my validation, but I can't produce the signature because I can't escape the payload the same way. I've tried
mb_convert_encoding($payload, 'unicode')
But this encodes all the payload, and not just the needed string, as Facebook does.
My full code:
// in the unit test
$content = file_get_contents(__DIR__.'/../Responses/whatsapp_webhook.json');
// trim whitespace at the end of the file
$content = trim($content);
$secret = config('externals.meta.config.app_secret');
$signature = hash_hmac(
    'sha256',
    mb_convert_encoding($content, 'unicode'),
    $secret
);
$response = $this->postJson(
    route('whatsapp.webhook.message'),
    json_decode($content, true),
    [
        'CONTENT_TYPE' => 'text/plain',
        'X-Hub-Signature-256' => $signature,
    ]
);
$response->assertOk();
// in the request validation
/**
 * @var string $signature
 */
$signature = $request->header('X-Hub-Signature-256');
if (!$signature) {
    abort(Response::HTTP_FORBIDDEN);
}
$signature = Str::after($signature, '=');
$secret = config('externals.meta.config.app_secret');
/**
 * @var string $content
 */
$content = $request->getContent();
$payloadSignature = hash_hmac(
    'sha256',
    $content,
    $secret
);
if ($payloadSignature !== $signature) {
    abort(Response::HTTP_FORBIDDEN);
}

For one, mb_convert_encoding($payload, 'unicode') converts the input to UTF-16BE, not UTF-8. You would want mb_convert_encoding($payload, 'UTF-8').
For two, using mb_convert_encoding() without specifying the source encoding causes the function to assume that the input is using the system's default encoding, which is frequently incorrect and will cause your data to be mangled. You would want mb_convert_encoding($payload, 'UTF-8', $source_encoding). [Also, you cannot reliably detect string encoding, you need to know what it is.]
For three, mb_convert_encoding() is entirely the wrong function to use to apply the desired escape sequences to the data. [and good lord are the google results for "php escape UTF-8" awful]
Unfortunately, PHP doesn't have a UTF-8 escape function that isn't baked into another function, but it's not terribly difficult to write in userland.
function utf8_escape($input) {
    $output = '';
    for( $i=0,$l=mb_strlen($input); $i<$l; ++$i ) {
        $cur = mb_substr($input, $i, 1);
        if( strlen($cur) === 1 ) {
            // single-byte (ASCII) character: copy through unchanged
            $output .= $cur;
        } else {
            // multi-byte character: emit \uXXXX with lowercase hex digits
            $output .= sprintf('\\u%04x', mb_ord($cur));
        }
    }
    return $output;
}
$in = "asdf äöå";
var_dump(
    utf8_escape($in),
);
Output:
string(23) "asdf \u00e4\u00f6\u00e5"
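One case the function above does not cover is a code point above U+FFFF (most emoji, for example), for which sprintf('\\u%04x', ...) produces five hex digits. JSON-style escaping conventionally uses a UTF-16 surrogate pair there. The quoted Facebook documentation does not say which form it uses, so the variant below is a sketch under that assumption. The function name is made up, and mb_str_split() needs PHP 7.4 or later.

```php
<?php
// Hypothetical variant: escapes non-ASCII characters as \uXXXX and uses
// UTF-16 surrogate pairs for code points above U+FFFF (an assumption about
// the sender's format). Expects valid UTF-8 input.
function utf8_escape_with_surrogates(string $input): string
{
    $output = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp < 0x80) {
            $output .= $char;                       // ASCII: unchanged
        } elseif ($cp < 0x10000) {
            $output .= sprintf('\\u%04x', $cp);     // BMP: one escape
        } else {
            $cp -= 0x10000;                         // astral: surrogate pair
            $output .= sprintf(
                '\\u%04x\\u%04x',
                0xD800 | ($cp >> 10),
                0xDC00 | ($cp & 0x3FF)
            );
        }
    }
    return $output;
}

echo utf8_escape_with_surrogates("asdf äöå"), "\n";    // asdf \u00e4\u00f6\u00e5
echo utf8_escape_with_surrogates("\u{1F600}"), "\n";   // \ud83d\ude00
```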

Instead of trying to re-assemble the payload from the already decoded JSON, you should take the data directly as you received it.
Facebook sends Content-Type: application/json, which means PHP will not populate $_POST to begin with - but you can read the entire request body using file_get_contents('php://input').
Try and calculate the signature based on that, that should work without having to deal with any hassles of encoding & escaping.
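A minimal sketch of that idea, outside any framework: compute the HMAC over the raw bytes exactly as received and compare in constant time. The function name and the "sha256=" header prefix are assumptions for illustration (the question's code strips everything up to "=", which suggests such a prefix).

```php
<?php
// Hypothetical helper: checks an "X-Hub-Signature-256"-style header against
// the raw, untouched request body. The "sha256=" prefix is an assumption.
function isValidSignature(string $rawBody, string $header, string $secret): bool
{
    $prefix = 'sha256=';
    if (strncmp($header, $prefix, strlen($prefix)) !== 0) {
        return false;
    }
    $expected = hash_hmac('sha256', $rawBody, $secret);
    // hash_equals() avoids leaking timing information during comparison
    return hash_equals($expected, substr($header, strlen($prefix)));
}

// Usage with made-up values; in a real request the body would come from
// file_get_contents('php://input') or $request->getContent().
$secret  = 'example-secret';
$rawBody = '{"text":"\u00e4\u00f6\u00e5"}'; // escapes stay exactly as sent
$header  = 'sha256=' . hash_hmac('sha256', $rawBody, $secret);

var_dump(isValidSignature($rawBody, $header, $secret)); // bool(true)
```

In a test, the same rule applies in reverse: sign the exact string that will be sent as the body, not a re-encoded copy of it.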

DOMDocument->saveHTML() vs urlencode with commercial at symbol (@)

Using DOMDocument, I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" URL-encoded: [@MERGEID] becomes %5B@MERGEID%5D.
Later in my code I need to replace [@MERGEID] with an ID, so I search for urlencode('[@MERGEID]'). However, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() leaves it alone, so there is no match: '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'.
Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
My question is: which RFC is DOMDocument following, and why does it differ from urlencode() or even rawurlencode()? Is there anything I can do to avoid the str_replace()?
Demo code:
$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {
    $link = $element->getAttribute('href'); //http://www.google.com?ref=abc
    $tag = $element->getAttribute('data-tag'); //thebottomlink
    if ($link) {
        $newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
        if ($tag) {
            $newlink .= '&tag=' . $tag;
        }
        $element->setAttribute('href', $newlink);
    }
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge);
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D

I believe those two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while $element->setAttribute('href', $newlink); encodes a complete URL to be used as a URL.
For example:
urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com
This is convenient for encoding the query part, but it cannot be used on <a href='...'>.
However:
$element->setAttribute('href', $newlink); // -> http://www.google.com
will properly encode the string so that it is still usable in href. It cannot encode @ because it cannot tell whether the @ is part of the query or part of the userinfo or an email URL (for example: mailto:invisal@google.com or invisal@127.0.0.1).
Solution
Instead of using [@MERGEID], you can use @@MERGEID@@ and replace that with your ID later. This solution does not require urlencode() at all.
If you insist on using urlencode(), you can write %40 instead of @, so your code becomes $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
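To make the last suggestion concrete, here is a small sketch that encodes the placeholder once, inserts it already encoded, and later searches for the same encoded string. The ID 12345 and the variable names are made up, and the DOM extension is assumed to be available.

```php
<?php
// Encode the placeholder once and reuse the encoded form for both
// insertion and lookup, so both sides always agree.
$placeholder = rawurlencode('[@MERGEID]'); // %5B%40MERGEID%5D

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about the HTML fragment
$dom->loadHTML('<a href="http://www.google.com?ref=abc">Google</a>');

foreach ($dom->getElementsByTagName('a') as $a) {
    $target = $a->getAttribute('href');
    $a->setAttribute(
        'href',
        'http://www.example.com/click/' . $placeholder . '?url=' . rawurlencode($target)
    );
}

$html  = $dom->saveHTML();
// Later: swap the encoded placeholder for a (hypothetical) ID.
$final = str_replace($placeholder, '12345', $html);

echo $final;
```

Because the placeholder is already percent-encoded when it goes in, the serializer has nothing left to escape in it, and the later str_replace() finds it without any %40 fix-up.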

urlencode() and rawurlencode() were originally based on RFC 1738. Since PHP 5.3, rawurlencode() follows RFC 3986, which has been the current URI standard since 2005.
The DOM extension, on the other hand, works with UTF-8, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with text in ISO-8859-1 (both are deprecated as of PHP 8.2), or iconv for other encodings.
The generic URI syntax mandates that new URI schemes that provide for
the representation of character data in a URI must, in effect,
represent characters from the unreserved set without translation, and
should convert all other characters to bytes according to UTF-8, and
then percent-encode those values.
Here is a function to decode URLs according to RFC 3986.
<?php
function myUrlEncode($string) {
    $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
    $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
    return str_replace($entities, $replacements, urldecode($string));
}
?>
Update:
Since UTF8 has been used to encode $message:
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))
Use urldecode($message) when returning the URL without percents.
die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);

The root cause of your problem has been very well explained from a technical point of view.
In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.
By processing your input $message through a DOMDocument object, you have moved to a higher level of abstraction. It is a mistake to keep manipulating as a single plain string something that has been "promoted" to an HTML document.
Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:
$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';
$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document
// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);
// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);
echo $dom->saveHTML();
As illustrated above, today we are fixing the case where the token is inside an href, but one day we may want to search for and replace the token elsewhere in the document. To cover that case, do not bother making your low-level code HTML-aware.
(An alternative would be not to load a DOMDocument until all low-level replacements are done, but I am guessing this is not practical.)
Complete proof of concept:
function searchAndReplace(DOMNode $node, $search, $replace) {
    // replace inside every attribute value of this node
    if($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $input = $attribute->nodeValue;
            $output = str_replace($search, $replace, $input);
            $attribute->nodeValue = $output;
        }
    }
    // replace inside non-element nodes (e.g. text nodes)
    if(!$node instanceof DOMElement) { // this test needs double-checking
        $input = $node->nodeValue;
        $output = str_replace($search, $replace, $input);
        $node->nodeValue = $output;
    }
    // recurse into children
    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            searchAndReplace($child, $search, $replace);
        }
    }
}
$token = '<>&;[@MERGEID]';
$message = '<a/>';
$dom = new DOMDocument();
$dom->loadHTML($message);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo@$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);
echo $dom->saveHTML();
searchAndReplace($dom, $token, '*replaced*');
echo $dom->saveHTML();

If you use saveXML() it won't mess with the encoding the way saveHTML() does:
PHP
//your code...
$message = $dom_document->saveXML();
EDIT: also remove the XML tag:
//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);
echo $message;
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Google</body></html>
Notice that both still correctly convert & to &amp;

Would it not make sense to urlencode the original [@MERGEID] when saving it in the first place as well? Your search should then match without the need for the str_replace().
$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;
I know this does not answer the first part of the question, but you cannot post code in comments as far as I can tell.

Decoding base64 strings in PHP

I am trying to decode base64 string in PHP.
For example, I can do this in a python script by doing:
s = "0CC0QFjAA"
base64.b64decode(str(s)+'=====', '_-');
How can I decode base64 strings in PHP?

I think this would do it:
$decoded = base64_decode(str_replace(array('_', '-'), array('+', '/'), $s));
If you need the more general form:
function b64decode($str, $altchars = null) {
    if ($altchars) {
        // map the two alternative characters back to '+' and '/'
        $altarray = array($altchars[0], $altchars[1]);
        $str = str_replace($altarray, array('+', '/'), $str);
    }
    return base64_decode($str);
}
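If the input is in the common URL-safe alphabet from RFC 4648 ('-' in place of '+', '_' in place of '/'), a sketch like the following may be closer to what is needed. Note that this mapping is the reverse of the '_-' order passed in the question's Python call, so check which alphabet your data actually uses. The function name is made up. Padding is restored explicitly because strict mode is used to reject invalid characters.

```php
<?php
// Hypothetical helper: decodes URL-safe base64 (RFC 4648 alphabet) and
// returns false on invalid input.
function base64UrlDecode($input)
{
    // map the URL-safe characters back to the standard alphabet
    $standard = strtr($input, '-_', '+/');
    // restore any "=" padding that was stripped
    $remainder = strlen($standard) % 4;
    if ($remainder !== 0) {
        $standard .= str_repeat('=', 4 - $remainder);
    }
    return base64_decode($standard, true); // strict: false on bad characters
}

var_dump(base64UrlDecode('aGVsbG8'));  // string(5) "hello"
var_dump(base64UrlDecode('a*b'));      // bool(false)
```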

The PHP function is called base64_decode. You could call it like base64_decode($s);

CakePHP 1.3 warning array_merge sanitize

I'm using CakePHP 1.3.7 and ran into a very specific issue.
The Sanitize core class method used in my application is the one of version 1.2. When I want to save particular data, it gives me a warning :
Warning: array_merge(): Argument #2 is not an array in
/usr/share/php/cake/libs/sanitize.php on line 113
but it does save, and with the right encoding/format.
Here's the method that causes this warning (version 1.2, where the call is NOT on line 113, but I'll come to that later)
function html($string, $remove = false) {
    if ($remove) {
        $string = strip_tags($string);
    } else {
        $patterns = array("/\&/", "/%/", "/</", "/>/", '/"/', "/'/", "/\(/", "/\)/", "/\+/", "/-/");
        $replacements = array("&amp;", "&#37;", "&lt;", "&gt;", "&quot;", "&#39;", "&#40;", "&#41;", "&#43;", "&#45;");
        $string = preg_replace($patterns, $replacements, $string);
    }
    return $string;
}
And here's how this method is called
$value = Sanitize::html($value,true);
Now as you can see, array_merge() is not called in this method, but if I replace the html() method by the 1.3 version
function html($string, $options = array()) {
    static $defaultCharset = false;
    if ($defaultCharset === false) {
        $defaultCharset = Configure::read('App.encoding');
        if ($defaultCharset === null) {
            $defaultCharset = 'UTF-8';
        }
    }
    $default = array(
        'remove' => false,
        'charset' => $defaultCharset,
        'quotes' => ENT_QUOTES
    );
    $options = array_merge($default, $options);
    if ($options['remove']) {
        $string = strip_tags($string);
    }
    return htmlentities($string, $options['quotes'], $options['charset']);
}
array_merge() falls exactly on line 113.
If I now call html() this way
$value = Sanitize::html($value,array('remove' => true));
I don't get the warning anymore. However, my data doesn't save with the right encoding/format anymore.
Here's an example of text I need to save (it is french and needs UTF-8 encoding)
L'envoi d'une communication & à la fenêtre
I can't overcome this doing
$value = Sanitize::html($value,array('remove' => true, 'quotes' => ENT_HTML401));
because I'm using PHP 5.3.6 thus I can't use the constant ENT_HTML401
If I use another constant like ENT_NOQUOTES, it leaves the quotes alone (obviously) but still converts the French accents and other special characters. That is how it is intended to work, but I want to save the text exactly as quoted above (or at least be able to read it back that way).
I'm guessing I don't strictly need htmlentities(), but I think it is safer to use it, and updating the core method is the only way I found to get rid of the warning. I also suppose I should not really modify these files other than to update them?
So, briefly, I want to :
Get rid of the warning
Save/read data in the right format
I might have forgotten some infos, thanks
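On the accent problem described above: htmlentities() converts every character that has a named entity, including é and à, whereas htmlspecialchars() only touches the HTML-significant characters. A small sketch of the difference, assuming the source file and the data are UTF-8 (the sample string is made up):

```php
<?php
$text = "à la fenêtre & <b>";

// htmlentities(): accents are turned into named entities as well
$entities = htmlentities($text, ENT_NOQUOTES, 'UTF-8');

// htmlspecialchars(): only &, < and > are escaped with ENT_NOQUOTES
$special = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');

echo $entities, "\n"; // &agrave; la fen&ecirc;tre &amp; &lt;b&gt;
echo $special, "\n";  // à la fenêtre &amp; &lt;b&gt;
```

Whether htmlspecialchars() is an acceptable substitute depends on how the rest of the application reads the stored data.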

I ended up updating the html() method of the Sanitize class to match version 1.3, as follows
function html($string, $options = array()) {
    static $defaultCharset = false;
    if ($defaultCharset === false) {
        $defaultCharset = Configure::read('App.encoding');
        if ($defaultCharset === null) {
            $defaultCharset = 'UTF-8';
        }
    }
    $default = array(
        'remove' => false,
        'charset' => $defaultCharset,
        'quotes' => ENT_QUOTES
    );
    $options = array_merge($default, $options);
    if ($options['remove']) {
        $string = strip_tags($string);
    }
    return htmlentities($string, $options['quotes'], $options['charset']);
}
I call it like this
$value = Sanitize::html($value, array('remove'=>true,'quotes'=>ENT_NOQUOTES));
And I simply decode the text fields this way whenever I read their value from database
$data['Model']['field'] = html_entity_decode($data['Model']['field'], ENT_NOQUOTES, "UTF-8");
EDIT: I had to undo what I described above, because the way the 1.3 version of the function encodes data meant we would have had to decode it everywhere in the application when reading it.
Also, I am NOT using CakePHP 1.3.7 (I got confused by the cake console); I'm using 1.2.4, so updating the function was not appropriate after all.
I kept version 1.2 and this time simply changed the second parameter to an array, as follows. It seems to do the trick, as I'm not getting the warning anymore.
function html($string, $options = array()) {
    // !empty() avoids an undefined-index notice when 'remove' is not passed
    if (!empty($options['remove'])) {
        $string = strip_tags($string);
    } else {
        $patterns = array("/\&/", "/%/", "/</", "/>/", '/"/', "/'/", "/\(/", "/\)/", "/\+/", "/-/");
        $replacements = array("&amp;", "&#37;", "&lt;", "&gt;", "&quot;", "&#39;", "&#40;", "&#41;", "&#43;", "&#45;");
        $string = preg_replace($patterns, $replacements, $string);
    }
    return $string;
}
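For context, the warning in the question comes from passing a boolean (the 1.2 calling style) to code that hands its second argument straight to array_merge(). One way to tolerate both calling styles is to normalize the argument first. This is a standalone sketch with a made-up helper name, not CakePHP's actual code:

```php
<?php
// Hypothetical helper: accepts either the old boolean "$remove" flag or an
// options array, and always returns a complete options array.
function normalizeSanitizeOptions($options, array $defaults)
{
    if (is_bool($options)) {
        // old style: html($string, true) meant "strip tags"
        $options = array('remove' => $options);
    }
    return array_merge($defaults, (array) $options);
}

$defaults = array('remove' => false, 'quotes' => ENT_QUOTES);

print_r(normalizeSanitizeOptions(true, $defaults));
print_r(normalizeSanitizeOptions(array('quotes' => ENT_NOQUOTES), $defaults));
```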

Best way to convert title into url compatible mode in PHP?

http://domain.name/1-As Low As 10% Downpayment, Free Golf Membership!!!
The above url will report 400 bad request,
how to convert such title to user friendly good request?

You may want to use a "slug" instead. Rather than using the verbatim title as the URL, you strtolower() and replace all non-alphanumeric characters with hyphens, then remove duplicate hyphens. If you feel like extra credit, you can strip out stopwords, too.
So "As Low As 10% Downpayment, Free Golf Membership!!!" becomes:
as-low-as-10-downpayment-free-golf-membership
Something like this:
function sluggify($url)
{
    # Prep string with some basic normalization
    $url = strtolower($url);
    $url = strip_tags($url);
    $url = stripslashes($url);
    $url = html_entity_decode($url);
    # Remove quotes (can't, etc.)
    $url = str_replace('\'', '', $url);
    # Replace non-alpha numeric with hyphens
    $match = '/[^a-z0-9]+/';
    $replace = '-';
    $url = preg_replace($match, $replace, $url);
    $url = trim($url, '-');
    return $url;
}
You could probably shorten it with longer regexps but it's pretty straightforward as-is. The bonus is that you can use the same function to validate the query parameter before you run a query on the database to match the title, so someone can't stick silly things into your database.
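The validation idea can be sketched as follows: a value is accepted only if running it through the slug function leaves it unchanged. The block repeats a compact copy of sluggify() so that it runs on its own. isCanonicalSlug is a made-up name.

```php
<?php
// Compact copy of the sluggify() function shown above.
function sluggify($url)
{
    $url = strtolower($url);
    $url = strip_tags($url);
    $url = stripslashes($url);
    $url = html_entity_decode($url);
    $url = str_replace('\'', '', $url);
    $url = preg_replace('/[^a-z0-9]+/', '-', $url);
    return trim($url, '-');
}

// Hypothetical check: a valid slug is one that sluggify() does not change.
function isCanonicalSlug($candidate)
{
    return $candidate !== '' && sluggify($candidate) === $candidate;
}

var_dump(isCanonicalSlug('as-low-as-10-downpayment-free-golf-membership')); // bool(true)
var_dump(isCanonicalSlug("x'; DROP TABLE"));                                // bool(false)
```

This rejects odd input before the lookup, but it is not a replacement for parameterized queries.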

See the first answer here URL Friendly Username in PHP?:
function Slug($string)
{
    return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8')), '-'));
}
$user = 'Alix Axel';
echo Slug($user); // alix-axel
$user = 'Álix Ãxel';
echo Slug($user); // alix-axel
$user = 'Álix----_Ãxel!?!?';
echo Slug($user); // alix-axel

You can use urlencode() or rawurlencode(); Wikipedia, for example, does that. See this link:
http://en.wikipedia.org/wiki/Ichigo_100%25
Here %25 is the percent-encoded form of %.
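Applied to the title from the question, percent-encoding keeps the text intact but makes it safe to put in a path segment. A short sketch:

```php
<?php
$title   = '1-As Low As 10% Downpayment, Free Golf Membership!!!';
$encoded = rawurlencode($title);

echo 'http://domain.name/' . $encoded, "\n";
// http://domain.name/1-As%20Low%20As%2010%25%20Downpayment%2C%20Free%20Golf%20Membership%21%21%21

// Decoding gives the original title back
var_dump(rawurldecode($encoded) === $title); // bool(true)
```

The result is a valid URL, though considerably less readable than a slug.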

I just created a gist with a useful slug function:
https://gist.github.com/ninjagab/11244087
You can use it to convert title to seo friendly url.
<?php
class SanitizeUrl {
    public static function slug($string, $space="-") {
        // Convert to UTF-8 only when the input is not already valid UTF-8
        // (assumed to be ISO-8859-1 in that case); this avoids double-encoding.
        if (!mb_check_encoding($string, 'UTF-8')) {
            $string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
        }
        // Transliterate accented characters to ASCII where iconv is available
        if (function_exists('iconv')) {
            $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
        }
        $string = preg_replace("/[^a-zA-Z0-9 \-]/", "", $string);
        $string = trim(preg_replace("/\\s+/", " ", $string));
        $string = strtolower($string);
        $string = str_replace(" ", $space, $string);
        return $string;
    }
}
$title = 'Thi is a test string with some "strange" chars ò à ù...';
echo SanitizeUrl::slug($title);
//this will output:
//thi-is-a-test-string-with-some-strange-chars-o-a-u

You could use the rawurlencode() function

To keep it simple, just fill in the $to_change and $change_to arrays:
<?php
// Extend both arrays to make the replacement complete.
// Here a space becomes _, and à becomes a plain a.
$to_change = [
    ' ', 'à', 'à', 'â','é', 'è', 'ê', 'ç', 'ù', 'ô', 'ö' // and so on
];
$change_to = [
    '_', 'a', 'a', 'a', 'e', 'e', 'e','c', 'u', 'o', 'o' // and so on
];
$texts = 'This is my slug in êlàb élaboré par';
$page_id = str_replace($to_change, $change_to, $texts);
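A variation on the same idea, sketched with a single map and strtr() so that each character and its replacement stay side by side. The map is deliberately incomplete, and simpleSlug is a made-up name.

```php
<?php
// Character map: extend as needed (only a few entries shown).
$map = [
    'à' => 'a', 'â' => 'a',
    'é' => 'e', 'è' => 'e', 'ê' => 'e',
    'ç' => 'c', 'ù' => 'u', 'ô' => 'o', 'ö' => 'o',
];

// Hypothetical helper: transliterate via the map, then collapse every run
// of remaining non-alphanumeric characters into a single underscore.
function simpleSlug(string $text, array $map): string
{
    $text = strtr($text, $map);
    return trim(preg_replace('/[^A-Za-z0-9]+/', '_', $text), '_');
}

echo simpleSlug('This is my slug in êlàb élaboré par', $map), "\n";
// This_is_my_slug_in_elab_elabore_par
```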
