PHP: Remove javascript events from html


Is there any way to remove JS event attributes such as 'onload' and 'onclick' from HTML elements in PHP?
For example, given <a onclick="alert('hi')">Link</a>, the desired output is <a>Link</a>.
I did it this way:
$dom = new DOMDocument;
$dom->loadHTML($request->request->get('description'));
$nodes = $dom->getElementsByTagName('*');
foreach($nodes as $node)
{
if ($node->hasAttribute('onload'))
{
$node->removeAttribute('onload');
}
if ($node->hasAttribute('onclick'))
{
$node->removeAttribute('onclick');
}
}
$dom->saveHTML();
However, I'm not sure this is a safe way to do that: if a new JS event is introduced later, there is a real chance I'll forget to blacklist it.

function filterText($value)
{
if(!$value) return $value;
return escapeJsEvent(removeScriptTag($value));
}
function escapeJsEvent($value){
return preg_replace('/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i', "$1 $3", $value);
}
function removeScriptTag($text)
{
$search = array("'<script[^>]*?>.*?</script>'si",
"'<iframe[^>]*?>.*?</iframe>'si");
$replace = array('','');
$text = preg_replace($search, $replace, $text);
return preg_replace_callback("'&#(\d+);'", function ($m) {
return chr($m[1]);
}, $text);
}
echo filterText('<img src=1 href=1 onerror="javascript:alert(1)"></img>');
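A DOM-based alternative to the regex above, as a rough sketch: rather than blacklisting individual events, it drops every attribute whose name starts with "on", so events added to HTML later are covered as well. The helper name stripEventAttributes() and the wrapper <div> are assumptions made for this illustration, not part of the answer above. It does not handle javascript: URLs in href or src, and non-ASCII input may need an explicit encoding hint before loadHTML().

```php
<?php
// Sketch: strip every attribute whose name starts with "on" (onclick, onload, ...).
// stripEventAttributes() is a hypothetical helper name used only for this example.
function stripEventAttributes(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // Wrap in a <div> so a fragment loads without an implied <html><body>.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $node) {
        // Collect the names first: removing while iterating the live map can skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (stripos($name, 'on') === 0) {
                $node->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo stripEventAttributes('<a onclick="alert(\'hi\')" href="#">Link</a>');
```

Matching on the "on" prefix is deliberately broad: it may also remove a harmless custom attribute that happens to start with "on", which seems an acceptable trade-off for untrusted input.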

You should build a JavaScript method that does this for you and apply it after the body loads, because PHP code executes at page load and can't check later whether there are other events in the document until the page loads again.

Related

set tags in html using domdocument and preg_replace_callback

I am trying to replace words that are in my dictionary of terminology with an (HTML) anchor so that each gets a tooltip. I get the replace part done, but I just can't get the result back into the DOMDocument object.
I've made a recursive function that iterates the DOM: it visits every child node, searches for the words in my dictionary and replaces them with an anchor.
I had been using an ordinary preg_match on the HTML, but that runs into problems when the HTML gets complex.
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text

I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. Also, I'm not sure that preg_replace_callback is a suitable function for this case. I prefer to use preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent any HTML parse errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine whether the html string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should be done each time the HTML is edited (and not each time the HTML is displayed).

Remove <div> innerHTML with php

I try to change a html page through php. The idea is to reinvent the "contenteditable" attribute and change text on the fly. But I want to save it in the original html.
For this I have some initial text in a div element. This I convert to a form with a textarea, reload the page and then I can play with the text. Next I want to return the content of the textarea into the original div. It should replace the old text. It seems to work, except that the old text is always appended and I cannot get rid of it. The problem is probably in the setInnerHTML function. I tried:
$element->parentNode->removeChild($element);
but it did not work for some reason.
Thanks!
<?php
$text = $_POST["text"];
$id = $_GET["id"];
$ref = $_GET["ref"];
$html = new DOMDocument();
$html->loadHTMLFile($ref.".html");
$html->preserveWhiteSpace = false;
$html->formatOutput = true;
$elem = $html->getElementById($id);
function setInnerHTML($DOM, $element, $innerHTML)
{
$DOM->deleteTextNode($innerHTML);
$element->parentNode->removeChild($element);
$node = $DOM->createTextNode($innerHTML);
$element->appendChild($node);
}
setInnerHTML($html, $elem, $text);
$html->saveHTMLFile($ref.".html");
?>

Try changing your setInnerHTML to look like this:
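function setInnerHTML($DOM, $element, $innerHTML) {
    // childNodes is a live list, so removing nodes inside a foreach over it skips some of them;
    // remove the first child until none are left instead
    while ($element->firstChild) {
        $element->removeChild($element->firstChild);
    }
    $node = $DOM->createTextNode($innerHTML);
    $element->appendChild($node);
}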
Tell me if it is the result you desired.

How to get plain text inside body tag using dom..and get the words into an array?

I want to get the contents inside the body tag, separate them into words, and get the words into an array. I am using PHP.
This is what I have done:
$content=file_get_contents($_REQUEST['url']);
$content=html_entity_decode($content);
$content = preg_replace("/&#?[a-z0-9]+;/i"," ",$content);
$dom = new DOMDocument;
#$dom->loadHTML($content);
$tags=$dom->getElementsByTagName('body');
foreach($tags as $h)
{
echo "<li>".$h->tagName;
getChilds2($h);
function getChilds2($node)
{
if($node->hasChildNodes())
{
foreach($node->childNodes as $c)
{
if($c->nodeType==3)
{
$nodeValue=$c->nodeValue;
$words=feature_node($c,$nodeValue,true);
if($words!=false)
{
$_ENV["words"][]=$words;
}
else if($c->tagName!="")
{
getChilds2($c);
}
}
}
}
else
{
return;
}
}
function feature_node($node,$content,$display)
{
if(strlen($content)<=0)
{
return;
}
$content=strtolower($content);
$content=mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
$content= drop_script_tags($content);
$temp=$content;
$content=strip_punctuation($content);
$content=strip_symbols($content);
$content=strip_numbers($content);
$words_after_noise_removal=mb_split( ' +',$content);
$words_after_stop_words_removal=remove_stop_words($words_after_noise_removal);
if(count($words_after_stop_words_removal)==0)
return(false);
$i=0;
foreach($words_after_stop_words_removal as $w)
{
$words['word'][$i]=$w;
$i++;
}
for($i=0;$i<sizeof($words['word']);$i++)
{
$words['stemmed'][$i]= PorterStemmer::Stem($words['word'][$i],true)."<br/>";
}
return($words);
}
Here I have used some functions like strip_punctuation, strip_symbols, strip_numbers, remove_stop_words and PorterStemmer for preprocessing the page. They are working fine, but I am not getting the contents into the array, and print_r() or echo gives nothing. Help, please?

You don't have to iterate over the nodes.
$tags = $dom->getElementsByTagName('body');
will give you just one result in the DOMNodeList. So all you need to do to get the text is
$plainText = $tags->item(0)->nodeValue;
or
$plainText = $tags->item(0)->textContent;
To get the separate words into an array, you can then use str_word_count() (which returns information about the words used in a string) on the resulting $plainText.
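Put together, the approach might look like the sketch below. It collects the text nodes with XPath instead of reading textContent directly, because textContent also includes script and style contents and glues the text of adjacent elements together. The helper name bodyWords() is an assumption for this example, and str_word_count() is ASCII-oriented, so non-English text may need a different splitter.

```php
<?php
// Sketch: collect the visible text under <body> and split it into words.
// bodyWords() is a hypothetical helper name used only for this example.
function bodyWords(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $parts = [];
    // Skip text inside <script> and <style>, which textContent would include.
    $query = '//body//text()[not(ancestor::script)][not(ancestor::style)]';
    foreach ($xpath->query($query) as $textNode) {
        $parts[] = $textNode->nodeValue;
    }
    // Join with spaces so words from adjacent elements do not run together.
    $plainText = implode(' ', $parts);

    // Format 1 makes str_word_count() return the words as an array.
    return str_word_count(strtolower($plainText), 1);
}

print_r(bodyWords('<html><body><h1>Hello world</h1><script>var x;</script><p>Some example text.</p></body></html>'));
```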

Get text contained within a specific html element using php

I need to get all of the text contained between a specific div. In the following example I want to get everything between the div with class name "st" :
<div class="title">This is a title</div>
<div class="st">Some example <em>text</em> here.</div>
<div class="footer">Footer text</div>
So the result would be
Some example <em>text</em> here.
or even just
Some example text here.
Does anyone know how to accomplish this?

Server-side in PHP
A very basic way would be something like this:
$data = ''; // your HTML data from the question
preg_match( '/<div class="st">(.*?)<\/div>/s', $data, $match );
The captured content is then in $match[1]. However, this could return bad data if your .st DIV has another DIV inside it.
A more proper way would be:
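function getData( $data )
{
    $dom = new DOMDocument;
    $dom -> loadHTML( $data );
    $divs = $dom -> getElementsByTagName('div');
    foreach ( $divs as $div )
    {
        // compare whole class names: a plain strpos() for 'st' would also match e.g. class="first"
        $classes = preg_split( '/\s+/', trim( $div -> getAttribute('class') ) );
        if ( in_array( 'st', $classes, true ) )
        {
            return $div -> nodeValue;
        }
    }
    return null; // no matching DIV found
}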
Client-side
If you're using jQuery, it would be easy like this:
$('.st').text();
or
$('.st').html();
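If you're using plain JavaScript, it is a little more complicated because you have to check all DIV elements until you find the one with your desired CSS class:
function foo()
{
    var divs = document.getElementsByTagName('div'), i;
    // use an index loop: for...in over an HTMLCollection also visits "length" and "item"
    for (i = 0; i < divs.length; i++)
    {
        // pad with spaces so that 'st' does not match e.g. class="first"
        if ((' ' + divs[i].className + ' ').indexOf(' st ') > -1)
        {
            return divs[i].innerHTML;
        }
    }
}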

Use DOM. Example:
$html_str = "<html><body><div class='st'>Some example <em>text</em> here.</div></body></html>";
$dom = new DOMDocument('1.0', 'iso-8859-1');
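$dom->loadHTML($html_str); // just one method of loading html.
// alternatively, load from a file instead (calling both would overwrite the string loaded above):
// $dom->loadHTMLFile("some_url_to_html_file");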
$divs = getElementsByClassName($dom,"st");
$div = $divs[0];
$str = '';
foreach ($div->childNodes as $node) {
$str .= $dom->saveHTML($node);
}
print_r($str);
The below function is not mine, but this user's. If you find this function useful, go to the previously linked answer and vote it up.
function getElementsByClassName(DOMDocument $domNode, $className) {
$elements = $domNode->getElementsByTagName('*');
$matches = array();
foreach($elements as $element) {
if (!$element->hasAttribute('class')) {
continue;
}
$classes = preg_split('/\s+/', $element->getAttribute('class'));
if (!in_array($className, $classes)) {
continue;
}
$matches[] = $element;
}
return $matches;
}
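An XPath query can do the same class lookup without a separate helper loop. The following is a sketch under assumptions: innerHtmlByClass() is a hypothetical name for this example, and the class name is assumed to be a plain token without quote characters, since it is interpolated into the query.

```php
<?php
// Sketch: inner HTML of the first element carrying a given class, via XPath.
// innerHtmlByClass() is a hypothetical helper name used only for this example.
function innerHtmlByClass(string $html, string $className): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad with spaces so "st" matches class="st big" but not class="first".
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]";
    $node = $xpath->query($query)->item(0);
    if ($node === null) {
        return null; // no element with that class
    }

    // Serialize the children only, which yields the element's inner HTML.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

echo innerHtmlByClass(
    '<div class="title">This is a title</div><div class="st">Some example <em>text</em> here.</div>',
    'st'
);
```

For the text-only variant from the question, returning $node->textContent instead of the serialized children should be enough.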

PHP is a server-side language; to do this you should use a client-side language like JavaScript (and possibly a library like jQuery for easy and fast cross-browser coding). Then use JavaScript to send the data you need to the backend for processing (Ajax).
jQuery example:
var myText = jQuery(".st").text();
jQuery.ajax({
    type: 'POST',
    url: 'myBackendUrl',
    // POST fields go in the data option
    data: { myTextParam: myText },
    success: function(){
        alert('done!');
    }
});
Then, in php:
<?php
$text = $_POST['myTextParam'];
// do something with text

Using an XML parser (this requires $htmlSource to be well-formed XML):
$htmlDom = simplexml_load_string($htmlSource);
$results = $htmlDom->xpath("//div[@class='st']/text()");
foreach ($results as $node) {
    echo $node, "\n";
}

use jquery/ajax
then do something like:
<script>
$(document).ready(function() {
$.ajax({
type: "POST",
url: "urltothepageyouneed the info",
data: { ajax: "ajax", divcontent:$(".st").html()}
})
});
</script>
Basically
$(".st").html()
will return the HTML
and
$(".st").text()
Will return the text
Hope that helps

remove script tag from HTML content

I am using HTML Purifier (http://htmlpurifier.org/)
I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.
How can I achieve this?
One more thing: is there any other way to remove script tags from HTML?

Because this question is tagged with regex I'm going to answer with poor man's solution in this situation:
$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
However, regular expressions are not for parsing HTML/XML, even if you write the perfect expression it will break eventually, it's not worth it, although, in some cases it's useful to quickly fix some markup, and as it is with quick fixes, forget about security. Use regex only on content/markup you trust.
Remember, anything that user inputs should be considered not safe.
Better solution here would be to use DOMDocument which is designed for this.
Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:
<?php
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item)
{
$remove[] = $item;
}
foreach ($remove as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
I have removed the HTML intentionally because even this can bork.

Use the PHP DOMDocument parser.
$doc = new DOMDocument();
// load the HTML string we want to strip
$doc->loadHTML($html);
// get all the script tags
$script_tags = $doc->getElementsByTagName('script');
// for each tag, remove it from the DOM; the node list is live, so iterate backwards
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
    $tag = $script_tags->item($i);
    $tag->parentNode->removeChild($tag);
}
// get the HTML string back
$no_script_html_string = $doc->saveHTML();
This worked for me using the following HTML document:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>
Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
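foreach($tags_to_remove as $tag){
    // copy the live node list to an array first; removing nodes while iterating it skips some
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }
}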
$html = $dom->saveHTML();

A simple way by manipulating string.
function stripStr($str, $ini, $fin)
{
while (($pos = mb_stripos($str, $ini)) !== false) {
$aux = mb_substr($str, $pos + mb_strlen($ini));
$str = mb_substr($str, 0, $pos);
if (($pos2 = mb_stripos($aux, $fin)) !== false) {
$str .= mb_substr($aux, $pos2 + mb_strlen($fin));
}
}
return $str;
}

Shorter:
$html = preg_replace("/<script.*?\/script>/s", "", $html);
When doing regex things might go wrong, so it's safer to do like this:
$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;
So that when the "accident" happen, we get the original $html instead of empty string.
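One caveat with the short ternary, shown as a sketch: preg_replace() returns null on error, but an empty string is also falsy, so with ?: an input that consists only of a script block would fall back to the original, unstripped HTML. Checking for null specifically avoids that. The function name stripScriptsRegex() is an assumption for this example, and the usual warning about regex on untrusted HTML still applies.

```php
<?php
// Sketch: regex-based script removal that falls back to the input only on a real error.
// stripScriptsRegex() is a hypothetical helper name used only for this example.
function stripScriptsRegex(string $html): string
{
    // "i" makes the match case-insensitive, "s" lets "." span newlines.
    $result = preg_replace('/<script.*?\/script>/is', '', $html);

    // preg_replace() returns null on error; an empty string is a valid result.
    return $result ?? $html;
}

var_dump(stripScriptsRegex('<script>alert(1)</script>'));
var_dump(stripScriptsRegex("<p>ok</p><SCRIPT>\nx();\n</SCRIPT>"));
```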

this is a merge of both ClandestineCoder & Binh WPO.
the problem with the script tag arrows is that they can have more than one variant
ex. (< = &lt; = &amp;lt;) & (> = &gt; = &amp;gt;)
so instead of creating a pattern array with like a bazillion variant,
imho a better solution would be
return preg_replace('/script.*?\/script/ius', '', $text)
? preg_replace('/script.*?\/script/ius', '', $text)
: $text;
this will remove anything that look like script.../script regardless of the arrow code/variant and u can test it in here https://regex101.com/r/lK6vS8/1

Try this complete and flexible solution. It works perfectly, and is based in-part by some previous answers, but contains additional validation checks, and gets rid of additional implied HTML from the loadHTML(...) function. It is divided into two separate functions (one with a previous dependency so don't re-order/rearrange) so you can use it with multiple HTML tags that you would like to remove simultaneously (i.e. not just 'script' tags). For example removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado here is the code:
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */
/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */
if (!function_exists('removeAllInstancesOfTag'))
{
function removeAllInstancesOfTag($html, $tag_nm)
{
if (!empty($html))
{
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
$doc = new DOMDocument();
$doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);
if (!empty($tag_nm))
{
if (is_array($tag_nm))
{
$tag_nms = $tag_nm;
unset($tag_nm);
foreach ($tag_nms as $tag_nm)
{
$rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
else if (is_string($tag_nm))
{
$rmvbl_itms = $doc->getElementsByTagName($tag_nm);
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
return $doc->saveHTML();
}
else
{
return '';
}
}
}
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */
/* Prerequisites: 'removeAllInstancesOfTag(...)' */
if (!function_exists('removeAllScriptTags'))
{
function removeAllScriptTags($html)
{
return removeAllInstancesOfTag($html, 'script');
}
}
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */
And here is a test usage example:
$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);
I hope my answer really helps someone. Enjoy!

An example modifying ctf0's answer. This should only do the preg_replace once but also check for errors and block the char codes for the forward slash.
$str = '<script> var a - 1; </script>';
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str);
return ($replace !== null)? $replace : $str;
If you are using PHP 7 you can use the null coalescing operator to simplify it even more.
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
return (preg_replace($pattern, '', $str) ?? $str);

function remove_script_tags($html){
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item){
$remove[] = $item;
}
foreach ($remove as $item){
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
$html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
$html = str_replace('</p></body></html>', '', $html);
return $html;
}
Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags, this should get rid of it. See https://3v4l.org/82FNP
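Instead of stripping the added wrapper with string replacements afterwards, the doctype and implied tags can be avoided at load time. This is a sketch under assumptions: it wraps the fragment in a <div> and uses the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags, the same trick used in the preg_split answer earlier on this page. The function name removeScriptsFromFragment() is hypothetical.

```php
<?php
// Sketch: remove <script> elements from an HTML fragment without getting a
// doctype or <html><body> wrapper back.
// removeScriptsFromFragment() is a hypothetical helper name used only for this example.
function removeScriptsFromFragment(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The wrapper <div> gives the fragment a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from the document.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo removeScriptsFromFragment('<p>Hello</p><script>alert(1)</script><p>World</p>');
```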

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.
Don't try to do it with regexps. That way lies madness.

I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:
$html = file_get_contents('http://some_page.html');
$h = explode('>', $html);
foreach($h as $k => $v){
$v = trim($v);//clean it up a bit
if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable
$counter = $k;//match opening tag and start counter for backtrace
}elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done
$script_length = $k - $counter;
$counter = 0;
for($i = $script_length; $i >= 0; $i--){
$h[$k-$i] = '';//backtrace and clear everything in between
}
}
}
for($i = 0; $i <= count($h); $i++){
if($h[$i] != ''){
$ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
}
}
$html = implode('>', $ht);//all scripts stripped.
echo $html;
I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.
I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

This is a simplified variant of Dejan Marjanovic's answer:
function removeTags($html, $tag) {
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
$item->parentNode->removeChild($item);
}
return $dom->saveHTML();
}
Can be used to remove any kind of tag, including <script>:
$scriptlessHtml = removeTags($html, 'script');

use the str_replace function to replace them with empty space or something
$query = '<script>console.log("I should be banned")</script>';
$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);
echo $query;
//this echoes console.log("I should be banned")
?>
