PHP DOM challenge - parsing problematic javascript - php

This is a weird problem, and I can't see an easy solution.
If you use DOMDocument to parse a document that has a </head> tag inside a JavaScript string, it doesn't work correctly: the parser treats the </head> inside the script as the document's closing </head> tag.
I have been wrestling with this for hours - any ideas?
<?php
$contents =
<<<EOF
<!DOCTYPE html>
<html><head>
<script>function myFunc() { var myVar = "<head></head>"; } </script>
</head>
<body><p>This is a test</p></body>
</html>
EOF;
//GET CONTENT & LOAD INTO DOM
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($contents);
//STRIP OUT THE JAVASCRIPT
$scripts = $doc->getElementsByTagName('script');
$length = $scripts->length;
for ($i = 0; $i < $length; $i++) {
$scripts->item(0)->parentNode->removeChild($scripts->item(0));
}
echo htmlentities($doc->saveHTML());

This is a common issue with JavaScript embedded in HTML. Escape the slash so the parser never sees a closing tag inside the script:
var myVar = "<head><\/head>";

You can escape characters that you don't want interpreted. For example:
var myVar = "\x3chead\x3e\x3c/head\x3e";
console.log(myVar);
This produces the string "<head></head>" at runtime without any literal < or > characters in the source.
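If the markup comes from a source you cannot edit, the same escaping could be applied on the PHP side before the document reaches DOMDocument. The following is only a sketch under assumptions: the helper name escape_script_close_tags is made up for illustration, and the regex assumes that no script block contains a literal "</script>" inside a string.

```php
<?php
// Hypothetical helper (not part of PHP or the original post): rewrites "</"
// as "<\/" inside every <script> block so an HTML parser cannot mistake
// markup in a JavaScript string for a real closing tag.
function escape_script_close_tags(string $html): string
{
    return preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        function (array $m): string {
            // $m[1] = opening tag, $m[2] = script body, $m[3] = closing tag
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    ) ?? $html; // fall back to the input if the regex engine fails
}

$contents = '<html><head>'
    . '<script>function myFunc() { var myVar = "<head></head>"; } </script>'
    . '</head><body><p>This is a test</p></body></html>';

$fixed = escape_script_close_tags($contents);
echo $fixed, PHP_EOL;
```

In JavaScript, "<\/head>" and "</head>" are the same string, so the script's behaviour should be unchanged; only the HTML parser's view of it differs.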

Related

Get <script> in head HTML using DOM Parser

I'm currently using a DOM parser for my project, and I'm using cURL in PHP to scrape the website. I want to get a value from a script tag in the head of the HTML I receive, but I'm confused about how to do that. If I run the code below:
$data_dom = new simple_html_dom();
$data_dom->load($html);
foreach($data_dom->find('script') as $script){
echo $script->plaintext."<br>";
}
The result is empty; when I inspect it, only the br tags appear. I want to get everything inside the script tags. Here is the head content:
<head>
I will give you the script I want to get
.....
<script type="text/javascript">
var keysearch = {"departureLabel":"Surabaya (SUB : Juanda) Jawa Timur Indonesia","arrivalLabel":"Palangkaraya (PKY : Tjilik Riwut | Panarung) Kalimantan Tengah Indonesia","adultNum":"1","childNum":"0","infantNum":"0","departure":"SUB","arrival":"PKY","departDate":"20181115","roundTrip":0,"cabinType":-1,"departureCode":"ID-Surabaya-SUB","arrivalCode":"ID-Palangkaraya-PKY"};
(function(window, _gtm, keysearch){
if (window.gtmInstance){
var departureExp = keysearch.departureCode.split("-");
var arrivalExp = keysearch.arrivalCode.split("-");
gtmInstance.setFlightData({
'ITEM_TYPE': 'flight',
'FLY_OUTB_CODE': departureExp[2],
'FLY_OUTB_CITY': departureExp[1],
'FLY_OUTB_COUNTRYCODE': departureExp[0],
'FLY_OUTB_DATE': keysearch.departDate,
'FLY_INB_CODE': arrivalExp[2],
'FLY_INB_CITY': arrivalExp[1],
'FLY_INB_COUNTRYCODE': arrivalExp[0],
'FLY_INB_DATE': keysearch.returnDate,
'FLY_NBPAX_ADL': keysearch.adultNum,
'FLY_NBPAX_CHL': keysearch.childNum,
'FLY_NBPAX_INF': keysearch.infantNum,
});
gtmInstance.pushFlightSearchEvent();
}
}(window, gtmInstance, keysearch));
var key = "rkey=10fe7b6fd1f7fa1ef0f4fa538f917811dbc7f4628a791ba69962f2ed305fb72d061b67737afd843aaaeeee946f1442bb";
var staticRoot = 'http://sta.nusatrip.net';
$(function() {
$("#currencySelector").nusaCurrencyOptions({
selected: getCookie("curCode"),
});
});
</script>
</head>
I want to get the key variable. I will use it to get the data from the website. Thanks
Depending on what the rest of the markup looks like, you may be able to just use DOMDocument and XPath, then parse out the value of the var with preg_match. This example will echo the key.
<?php
$html = <<<END
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<script type="text/javascript">
var keysearch = {"departureLabel":"Surabaya (SUB : Juanda) Jawa Timur Indonesia","arrivalLabel":"Palangkaraya (PKY : Tjilik Riwut | Panarung) Kalimantan Tengah Indonesia","adultNum":"1","childNum":"0","infantNum":"0","departure":"SUB","arrival":"PKY","departDate":"20181115","roundTrip":0,"cabinType":-1,"departureCode":"ID-Surabaya-SUB","arrivalCode":"ID-Palangkaraya-PKY"};
(function(window, _gtm, keysearch){
if (window.gtmInstance){
var departureExp = keysearch.departureCode.split("-");
var arrivalExp = keysearch.arrivalCode.split("-");
gtmInstance.setFlightData({
'ITEM_TYPE': 'flight',
'FLY_OUTB_CODE': departureExp[2],
'FLY_OUTB_CITY': departureExp[1],
'FLY_OUTB_COUNTRYCODE': departureExp[0],
'FLY_OUTB_DATE': keysearch.departDate,
'FLY_INB_CODE': arrivalExp[2],
'FLY_INB_CITY': arrivalExp[1],
'FLY_INB_COUNTRYCODE': arrivalExp[0],
'FLY_INB_DATE': keysearch.returnDate,
'FLY_NBPAX_ADL': keysearch.adultNum,
'FLY_NBPAX_CHL': keysearch.childNum,
'FLY_NBPAX_INF': keysearch.infantNum,
});
gtmInstance.pushFlightSearchEvent();
}
}(window, gtmInstance, keysearch));
var key = "rkey=10fe7b6fd1f7fa1ef0f4fa538f917811dbc7f4628a791ba69962f2ed305fb72d061b67737afd843aaaeeee946f1442bb";
var staticRoot = 'http://sta.nusatrip.net';
$(function() {
$("#currencySelector").nusaCurrencyOptions({
selected: getCookie("curCode"),
});
});
</script>
</head>
<body>foo</body>
</html>
END;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = $xpath->query('//script');
foreach($result as $currScriptTag)
{
$currScriptContent = $currScriptTag->nodeValue;
// [^"]* stops at the first closing quote instead of matching greedily
$matchFound = preg_match('/var key = "([^"]*)"/', $currScriptContent, $matches);
if($matchFound)
{
/*
* $matches[0] will contain the whole line like var key = "..."
* $matches[1] just contains the value of the var
*/
$key = $matches[1];
echo $key.PHP_EOL;
}
}
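The same approach could be extended to the keysearch object, which happens to be valid JSON in the page above. A minimal sketch, assuming the object literal ends with "};" as it does in the question; the shortened $script sample is illustrative and not the full page:

```php
<?php
// Shortened stand-in for the <script> content from the question.
$script = <<<'JS'
var keysearch = {"departure":"SUB","arrival":"PKY","departDate":"20181115","roundTrip":0};
JS;

$keysearch = null;
// Capture everything from the first "{" up to the first "};".
if (preg_match('/var keysearch = (\{.*?\});/s', $script, $matches)) {
    // json_decode returns null if the captured text is not valid JSON.
    $keysearch = json_decode($matches[1], true);
}

if (is_array($keysearch)) {
    echo $keysearch['departure'], ' -> ', $keysearch['arrival'], PHP_EOL;
}
```

Decoding to an array avoids writing one regex per field, but it only works while the site keeps emitting the object as strict JSON.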

how to find http from saved file in php

I wrote a PHP program using cURL that fetches the content of any site and displays it in the browser. Another part of the program saves that content to a file and, after saving, should find all the http links within the body tag of the saved file. My code shows the fetched site in the browser, but it does not find the http links, and some unwanted code also appears in the output, as in this screenshot:
https://www.screencast.com/t/Nwaz93oU
PHP Code:
<!DOCTYPE html>
<html>
<?php
function get_all_links(){
$html = file_get_contents('http://www.ucertify.com');
$dom = new DOMDocument();
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url.'<br />';
}
}
function get_site_data($uc_url){
$get_uc = curl_init();
curl_setopt($get_uc,CURLOPT_URL,$uc_url);
curl_setopt($get_uc,CURLOPT_RETURNTRANSFER,true);
$output=curl_exec($get_uc);
curl_close($get_uc);
$fp=fopen("mohit.txt","w");
fputs($fp,$output);
return $output;
}
?>
<body>
<div>
<?php
$site_content = get_site_data("http://www.ucertify.com");
echo $site_content;
?>
</div>
<div >
<?php
echo get_all_links("http://www.ucertify.com");
?>
</div>
</body>
</html>
In the get_all_links function, check whether $url is a valid URL, because on some pages the href may hold a JavaScript handler instead of a link. To validate a URL you can use a regex with PHP's preg_match. See also "What is a good regular expression to match a URL?" for a suitable regex.
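Note also that in the code above the loadHTML call is commented out, so the XPath query runs against an empty document. A minimal sketch of the filtering idea, assuming that "http links" means hrefs starting with http:// or https://; the function name extract_http_links and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: returns only the absolute http/https links
// found inside <body>.
function extract_http_links(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('/html/body//a[@href]') as $anchor) {
        $url = trim($anchor->getAttribute('href'));
        // Keep http:// and https:// URLs; skip javascript:, mailto:, "#", etc.
        if (preg_match('~^https?://~i', $url)) {
            $links[] = $url;
        }
    }
    return $links;
}

$sample = '<html><body>'
    . '<a href="http://www.example.com/a">A</a>'
    . '<a href="javascript:void(0)">B</a>'
    . '<a href="https://www.example.com/c">C</a>'
    . '<a href="#top">D</a>'
    . '</body></html>';

$links = extract_http_links($sample);
print_r($links);
```

The prefix check is deliberately loose; a stricter validation regex could replace it if malformed URLs need to be rejected as well.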

Trying to remove script tags in HTML

I am trying to remove script tags from HTML using PHP but it doesn't work if there's HTML inside the javascript.
For example, if the script tags contain something like this:
function tip(content) {
$('<div id="tip">' + content + '</div>').css
It will stop at </div> and the rest of the script will still be taken into account.
This is what I have been using to remove the script tags:
foreach ($doc->getElementsByTagName('script') as $node)
{
$node->parentNode->removeChild($node);
}
How about some regex-based pre-processing?
Example input.html:
<html>
<head>
<title>My example</title>
</head>
<body>
<h1>Test</h1>
<div id="foo"> </div>
<script type="text/javascript">
document.getElementById('foo').innerHTML = '<span style="color:red;">Hello World!</span>';
</script>
</body>
</html>
Script tag removing php script:
<?php
// unformatted source output:
header("Content-Type: text/plain");
// read the example input file given above into a string:
$input = file_get_contents('input.html');
echo "Before:\r\n";
echo $input;
echo "\r\n\r\n-----------------------\r\n\r\n";
// replace script tags including their contents by ""
$output = preg_replace("~<script[^<>]*>.*</script>~Uis", "", $input);
echo "After:\r\n";
echo $output;
echo "\r\n\r\n-----------------------\r\n\r\n";
?>
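You can use the strip_tags function, whose second argument lists the tags you want to keep. Note that it removes only the tags themselves, not the text between them, so the script's source code would remain in the output.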
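I think this is a one-off problem and you don't need anything special. Just do something like this:
$text = file_get_contents('index.html');
// Use !== false: a tag at position 0 would otherwise count as "not found"
while (($startPosition = mb_strpos($text, '<script')) !== false) {
    // Look for the closing tag only after the opening tag
    $endPosition = mb_strpos($text, '</script>', $startPosition);
    if ($endPosition === false) {
        break; // unterminated script block; leave the rest untouched
    }
    // Skip the full 9-character closing tag
    $text = mb_substr($text, 0, $startPosition) . mb_substr($text, $endPosition + 9);
}
echo $text;
You only need to set the encoding for the mb_* functions.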
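If you stay with the DOM approach from the question, one more thing to watch for is that getElementsByTagName returns a live list, so removing nodes while iterating over it in a foreach can skip elements. A minimal sketch that copies the list first, assuming the document itself parses cleanly; the sample markup is made up:

```php
<?php
$html = '<html><head><script>var a = 1;</script></head>'
    . '<body><p>Test</p><script>var b = 2;</script><script>var c = 3;</script></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live DOMNodeList into a plain array so removals
// do not shift the nodes that are still to be visited.
$scripts = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($scripts as $node) {
    $node->parentNode->removeChild($node);
}

$remaining = $doc->getElementsByTagName('script')->length;
echo $doc->saveHTML();
```

This does not change how the parser treats markup inside a script string; it only makes the removal loop itself reliable.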

how to access DOM in php that will echo out everything between <html></html> [duplicate]

Possible Duplicate:
Export particular element in DOMDocument to string
I know how to access individual elements by id, but I don't know how to get everything between the opening and closing html tags. Can anyone please help me?
Thanks.
If you would like to parse an HTML page with PHP, you could use PHP's DOMDocument class, like this:
// a new DOM object
$dom = new DOMDocument;
// keep white space (set options before loading so they can take effect)
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
// load the HTML into the object
$dom->loadHTML($html);
// get the root <html> element by tag name
$htmlRootElement = $dom->getElementsByTagName('html')->item(0);
// serialize only that element and escape it for display
echo htmlspecialchars($dom->saveHTML($htmlRootElement), ENT_QUOTES);
Or you could do this with JavaScript on the client side:
// getElementsByTagName returns a collection, so take its first element
var htmlRootElement = document.getElementsByTagName("html")[0];
alert(htmlRootElement.innerHTML);
You can access each element in the <html> tag with the DOMDocument class.
Example
$htmlDoc = new DOMDocument;
$html = <<<HTML
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>My Site</title>
<meta name="description" content="DOM test">
</head>
<body>
<h1>Hello</h1>
<p>This is a DOM test</p>
</body>
</html>
HTML;
$htmlDoc->loadHTML($html);
$htmlElement = $htmlDoc->getElementsByTagName("html");
foreach ($htmlElement->item(0)->childNodes as $element) {
echo 'Element name: ' . $element->nodeName . PHP_EOL;
echo 'Element value: '. $element->nodeValue . PHP_EOL;
}
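To get the markup between the html tags as a single string rather than node by node, the children of the root element could be serialized one at a time. A minimal sketch with a shortened sample document; the variable names are illustrative:

```php
<?php
$html = '<!doctype html><html><head><title>My Site</title></head>'
    . '<body><h1>Hello</h1><p>This is a DOM test</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// documentElement is the root <html> element.
$inner = '';
foreach ($doc->documentElement->childNodes as $child) {
    // saveHTML() accepts a node and serializes just that subtree.
    $inner .= $doc->saveHTML($child);
}
echo $inner, PHP_EOL;
```

The result holds the head and body markup without the surrounding html tags or the doctype.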

get a text file using php and send the string to javascript

OK, what I am trying to do is make a JavaScript loop of images, but first I have to get a list of the images. In JavaScript there is no way to directly grab this text file... http://www.ssd.noaa.gov/goes/east/tatl/txtfiles/ft_names.txt but it can be done easily in PHP. I am currently getting the txt file using PHP, but the JavaScript cannot read the variable. How can I make JavaScript able to read this variable? Here is what I have...
<?php
$file = "http://www.ssd.noaa.gov/goes/east/tatl/txtfiles/ft_names.txt"; //Path to your *.txt file
$contents = file($file);
$string = implode($contents);
echo $string;
?>
<script type="text/javascript">
function prnt() {
var whatever = "<?= $string ?>";
alert(whatever);
}
</script>
You can use echo or print to write to the page in PHP.
var whatever = "<?php echo $string; ?>";
Although, if the file has line breaks in it, you will need to remove those.
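Rather than stripping line breaks by hand, json_encode can turn the PHP string into a JavaScript string literal with quotes and line breaks escaped. A minimal sketch; the $string value is a made-up stand-in for the file contents:

```php
<?php
// Stand-in for the downloaded text; the real data comes from the remote file.
$string = "first.jpg\nsecond.jpg \"quoted\" </script>";

// json_encode adds the surrounding quotes and escapes newlines and quotes.
// JSON_HEX_TAG additionally encodes < and > so the text cannot close the
// surrounding <script> element early.
$js = json_encode($string, JSON_HEX_TAG);

echo "<script type=\"text/javascript\">\n"
    . "var whatever = {$js};\n"
    . "</script>\n";
```

Because $js already contains its own quotes, it is written into the script without extra quotation marks around it.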
To make it a bit more interesting, split the fields and use JSON encoding. The result can be read directly as a JavaScript literal, without calling JSON.parse() on the client.
<?php
$lines = file_get_contents('http://...');
$lines = explode("\n",trim($lines));
foreach ($lines as &$line) {
$line = preg_split('/,? /',$line);
}
$js = json_encode($lines);
?>
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<script type="text/javascript">
var dar = <?php echo $js; ?>;
</script>
</body>
</html>
You should also consider using a local proxy to cache the results of that file if you plan to run this frequently, and especially if you are going to serve it up on a public web server somewhere. Store the file locally as "noaa_data.txt" and have a second script refresh it from a cron job (every 12 hours or so):
<?php
file_put_contents("/var/www/noaa_data.txt",file_get_contents("http://www.ssd.noaa.gov/goes/east/tatl/txtfiles/ft_names.txt"));
?>
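If a cron job is not available, a similar effect could be had by checking the age of the cached file on each request. A minimal sketch, assuming a 12-hour lifetime; the function name cached_fetch is hypothetical, and local temporary files stand in for the remote URL so the example runs without network access:

```php
<?php
// Hypothetical helper: returns the cached copy if it is fresh enough,
// otherwise re-reads the source and refreshes the cache file.
function cached_fetch(string $source, string $cacheFile, int $maxAgeSeconds): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }
    $data = @file_get_contents($source);
    if ($data === false) {
        // Source unavailable: fall back to a stale copy if one exists.
        return is_file($cacheFile) ? file_get_contents($cacheFile) : '';
    }
    file_put_contents($cacheFile, $data);
    return $data;
}

// Demonstration with local temporary files standing in for the remote URL.
$source = tempnam(sys_get_temp_dir(), 'src');
$cache  = sys_get_temp_dir() . '/noaa_cache_' . uniqid() . '.txt';
file_put_contents($source, "a.jpg\nb.jpg\n");

$first = cached_fetch($source, $cache, 12 * 3600);   // fills the cache
file_put_contents($source, "changed\n");
$second = cached_fetch($source, $cache, 12 * 3600);  // served from the cache
echo $second;

unlink($source);
unlink($cache);
```

In real use, $source would be the NOAA URL and $cacheFile a path such as the noaa_data.txt mentioned above.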
