Replace <script src="url"> with <script>contents</script> - php

I'm using simple_html_dom [ http://sourceforge.net/projects/simplehtmldom/ ] to parse through HTML.
I'm trying to get all of the <script> urls, grab the contents, and then replace it in the $html variable... I have this and it almost works like I want:
$html_elements = str_get_html( $html );
$current_src = array( );
$new_src = array( );
foreach($html_elements->find('script') as $element) {
if( $element->src != '' )
{
$script_url = $element->src;
$script_data = get_script( $script_url );
$current_src[] = $element->outertext;
$new_src[] = "<script>" . $element->innertext . "\n" . $script_data . "</script>";
}
}
$html = str_replace( $current_src, $new_src, $html );
function get_script( $url )
{
$data = file_get_contents( $url );
return $data;
}
The problem is that the plus signs in the JavaScript files seem to end up converted to spaces in the final output.

After further debugging, I found the cause: later in the code I was passing the data through urldecode() one too many times, and urldecode() converts plus signs to spaces.
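A minimal sketch of that effect, assuming a made-up one-line script in place of the real downloaded file. It only illustrates how urldecode() and rawurldecode() treat a literal plus sign; it is not the asker's actual code.

```php
<?php
// Hypothetical JavaScript snippet standing in for a downloaded script file.
$script = 'var total = a + b;';

// urldecode() treats "+" as an encoded space, so decoding text that was
// never URL-encoded silently corrupts it.
$broken = urldecode($script);

// rawurldecode() decodes %XX sequences but leaves "+" untouched.
$intact = rawurldecode($script);

echo $broken, PHP_EOL; // var total = a   b;
echo $intact, PHP_EOL; // var total = a + b;
```

The simplest fix is still the one described above: decode each piece of data exactly once, and never decode script contents that were not URL-encoded in the first place.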

Related

Simple DOM HTML returns wrong URL

I have the following code:
<?php
include("simple_html_dom.php");
crawl('http://www.google.com/search?hl=en&output=search&q=inurl:https://website.com/folder/');
function crawl($url){
$html = file_get_html($url);
$links = $html->find('a');
foreach($links as $link)
{
$new_link = str_replace("url?q=", "/" ,$link->href);
$new_link = $newstr = substr( $new_link, 0, strpos( $new_link, '&' ) );
echo "<a href='".$new_link."'>".$link->plaintext."</a><br />";
}
}
?>
It returns URLs like this: http//website.com/folder/stuff
The colon after http is missing, which makes the URL inaccessible.
I don't think there is anything wrong with your code. Here is my approach using DOMDocument:
$xml = new DOMDocument();
@$xml->loadHTMLFile("http://www.google.com/search?hl=en&output=search&q=inurl:https://github");
$links = array();
foreach($xml->getElementsByTagName('a') as $link) {
// skip links whose href does not contain /url?q
if (false === strpos($link->getAttribute('href'), '/url?q')) continue;
$href = str_replace("url?q=", "/" ,$link->getAttribute('href'));
$href = substr( $href, 0, strpos( $href, '&' ) );
$links[] = array('url' => str_replace("//","", $href), 'text' => $link->nodeValue);
}
print_r($links);
What if you take out the "http://" altogether? Wouldn't it put you on the correct website? I don't know PHP, but I'm going to take a guess based on what I know about HTML and how browsers work.
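One plausible explanation for the missing colon: after the replacement, the href starts with "//", which browsers read as a protocol-relative URL, so "http:" is taken as the host. An alternative that avoids string surgery is to read the "q" query parameter directly. This is a sketch only; the sample href and the helper name extractTarget() are assumptions, not part of the original code.

```php
<?php
// Hypothetical href in the shape the question describes: the target URL
// sits in the "q" query parameter, followed by other parameters.
$href = '/url?q=http://website.com/folder/stuff&sa=U&ved=abc';

/**
 * Return the value of the "q" query parameter, or null if it is absent.
 * (Helper name is made up for this sketch.)
 */
function extractTarget(string $href): ?string
{
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params); // also URL-decodes each value
    return isset($params['q']) && is_string($params['q']) ? $params['q'] : null;
}

$target = extractTarget($href);
echo $target, PHP_EOL; // http://website.com/folder/stuff
```

Links without a "q" parameter return null and can be skipped in the loop.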

Simple HTML Dom PHP RECURSION Error in return value

I am using Simple HTML Dom, trying to get strings from a website. When I print out $title[0] within the function it shows just one string, but when I save it in the return array and print out the return value, I receive a never-ending text with RECURSION.
I don't understand why it would work with the second variable $oTitle.
<?php
include 'scripts/simple_html_dom.php';
function getDetails($id) {
$url = "http://www.something.com";
$html = file_get_html ( $url );
$title = $html->find('span[itemprop=name]');
print_r($title[0] . PHP_EOL); //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title[0], "Original Title" => $oTitle);
return $details;
flush ();
}
$values = getDetails($number);
print_r($values); // code breaks here
?>
Take a look at this page: http://simplehtmldom.sourceforge.net/
As I can see, you're using this parser.
In order to get HTML content you should use something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
In order to dump content, you should use something like this:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Try this code:
<?php
include 'simple_html_dom.php';
function getDetails() {
$url = "http://www.godaddy.com";
$html = file_get_html ( $url );
$title = getTitle($url);
echo $title; //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title, "Original Title" => $oTitle);
return $details;
flush ();
}
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
$values = getDetails();
print_r($values); // code breaks here
?>
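A note on why the original code printed RECURSION: find() returns node objects, not strings, and those nodes hold references back to their parent and document. Concatenating a node with PHP_EOL converts it to a string, but storing the node itself in the array keeps the object, and print_r() then walks its circular references. The sketch below reproduces this with a made-up Node class (an assumption standing in for the parser's node type), so it runs without the library.

```php
<?php
// Hypothetical stand-in for a parser node: it knows its parent and children
// (a circular reference) and converts to its text when used as a string.
class Node
{
    public $parent = null;
    public $children = [];
    public $text = '';

    public function __toString(): string
    {
        return $this->text;
    }
}

$root = new Node();
$title = new Node();
$title->text = 'My Title';
$title->parent = $root;
$root->children[] = $title;

// Storing the object: print_r() follows parent -> children -> same object.
$dump = print_r(['Title' => $title], true);
echo (strpos($dump, '*RECURSION*') !== false ? 'recursion marker found' : 'no marker'), PHP_EOL;

// Storing a string instead: cast (or read a text property) before returning.
$details = ['Title' => (string) $title, 'Original Title' => 'Something'];
print_r($details);
```

With Simple HTML Dom itself, the equivalent would be to store $title[0]->plaintext (the property used elsewhere on this page) rather than $title[0].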

PHP: How to find an element with particular name attribute in html (from url)

I am currently using PHP's file_get_contents($url) to fetch content from a URL. After getting the contents I need to inspect the HTML, find a 'select' that has a given name attribute, and extract its options along with their values and text. I am not sure how to go about this. I can use PHP's simplehtmldom class to parse the HTML, but how do I get a particular 'select' with the name 'union'?
<span class="d3-box">
<select name='union' class="blockInput" >
<option value="">Select a option</option> ..
Page can have multiple 'select' boxes and hence I need to specifically look by name attribute
<?php
include_once("simple_html_dom.php");
$htmlContent = file_get_contents($url);
foreach($htmlContent->find(byname['union']) as $element)
echo 'option : value';
?>
Any sort of help is appreciated. Thank you in advance.
Try this PHP code:
<?php
require_once dirname(__FILE__) . "/simple_html_dom.php";
$url = "Your link here";
$htmlContent = str_get_html(file_get_contents($url));
foreach ($htmlContent->find("select[name='union'] option") as $element) {
$option = $element->plaintext;
$value = $element->getAttribute("value");
echo $option . ":" . $value . "<br>";
}
?>
How about this:
$htmlContent = file_get_html('your url');
$htmlContent->find('select[name=union]');
Or, in an object-oriented way:
$html = new simple_html_dom();
// load_file() fills $html itself, so call find() on $html, not on its return value
$html->load_file('your url');
$html->find('select[name=union]');
From DOMDocument documentation: http://www.php.net/manual/en/class.domdocument.php
$html = file_get_contents( $url );
$dom = new DOMDocument();
$dom->loadHTML( $html );
$selects = $dom->getElementsByTagName( 'select' );
$select = $selects->item(0);
// Assuming all children are options.
$children = $select->childNodes;
$options_values = array();
for ( $i = 0; $i < $children->length; $i++ )
{
$item = $children->item( $i );
$options_values[] = $item->nodeValue;
}
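The DOMDocument snippet above takes the first select on the page, whatever its name. Since the question asks for the one named 'union', a DOMXPath query can target it directly. This is a sketch under assumptions: the inline markup is made up to resemble the fragment in the question and replaces the real page fetched from $url.

```php
<?php
// Hypothetical markup modelled on the fragment in the question.
$html = '<span class="d3-box">'
      . '<select name="other"><option value="x">Ignore me</option></select>'
      . '<select name="union" class="blockInput">'
      . '<option value="">Select a option</option>'
      . '<option value="1">First union</option>'
      . '<option value="2">Second union</option>'
      . '</select></span>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

// Select only the <option> children of the <select> whose name is "union".
$xpath = new DOMXPath($dom);
$options = [];
foreach ($xpath->query("//select[@name='union']/option") as $option) {
    $options[] = [
        'value' => $option->getAttribute('value'),
        'text'  => trim($option->textContent),
    ];
}

foreach ($options as $o) {
    echo $o['text'], ' : ', $o['value'], PHP_EOL;
}
```

Because the query names the attribute, other select boxes on the page are ignored, and no assumption is needed about child nodes all being options.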

using str_replace before simple_html_dom

I'm using Simple HTML DOM to grab scraped data and it's been working well. However, one of my sources doesn't have any unique fields, so I'm trying to use str_replace to rename some elements and then grab the renamed elements with simple_html_dom.
However, it doesn't work. My code is:
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.url.com');
$html = str_replace('<strong>','',$html);
$html = str_replace('</strong>','',$html);
$html = str_replace('<span class="pound">£</span>','',$html);
$html = str_replace('<td>','<td class="myclass">',$html);
foreach($html->find('td.myclass') as $element)
$price = $element->innertext;
$price = preg_replace('/[^(\x20-\x7F)]*/','', $price);
echo $price;
Try:
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html( 'http://www.url.com' );
foreach( $html->find( 'td' ) as $element ) {
$price = trim( str_replace( "£", "", $element->plaintext ) );
}
$price = preg_replace('/[^(\x20-\x7F)]*/','', $price);
echo $price;
?>
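For context on why the original attempt fails: file_get_html() returns a parser object, and str_replace() returns a plain string, so after the first replacement $html is a string and $html->find() can no longer be called on it. If the replacements are wanted, one option is to run them on the raw markup and parse afterwards. The sketch below shows that order of operations with DOMDocument so that it runs without the library; the sample markup is an assumption. With Simple HTML Dom, the parsing step would be str_get_html() on the modified string.

```php
<?php
// Hypothetical page fragment; in practice this string would come from
// file_get_contents($url) rather than file_get_html($url).
$raw = '<table><tr><td><strong><span class="pound">£</span>12.50</strong></td></tr></table>';

// Do the string replacements while the markup is still a plain string.
$raw = str_replace(['<strong>', '</strong>', '<span class="pound">£</span>'], '', $raw);
$raw = str_replace('<td>', '<td class="myclass">', $raw);

// Only now parse the result.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($raw);
libxml_clear_errors();

// Collect the text of every cell that received the new class.
$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query("//td[@class='myclass']") as $cell) {
    $prices[] = trim($cell->textContent);
}
print_r($prices);
```

Collecting the values into an array also avoids keeping only the last cell, which is what happens when a single $price variable is overwritten inside the loop.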

extracting and printing an html element by its id using DOMDocument

I want to extract a couple of tables from a web page and show them on my page. I was going to use regex to extract them, but then I saw the DOMDocument class, which seems cleaner. I've looked on Stack Overflow, and all the questions seem to be about getting inner text or looping over an element's inner nodes. I want to know how I can extract and print an HTML element by its id.
$html = file_get_contents("www.site.com");
$xml = new DOMDocument();
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table = $xpath->query("//*[@id='myid']");
$table->saveHTML(); // this obviously doesn't work
How can I show or echo the $table as an actual HTML table on my page?
Firstly, DOMDocument has a getElementById() method, so your XPath is unnecessary, although I suspect that is how it works underneath.
Secondly, in order to get fragments of markup rather than a whole document, you use DOMNode::C14N(), so your code would look like this:
<?php
// Load the HTML into a DOMDocument
// Don't forget you could just pass the URL to loadHTMLFile() instead
// (the URL needs a scheme, otherwise it is treated as a local file name)
$html = file_get_contents("http://www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
// Get the target element
$element = $dom->getElementById('myid');
// Get the HTML as a string
$string = $element->C14N();
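Another option, if HTML rather than canonical XML output is preferred: DOMDocument::saveHTML() accepts an optional node argument (PHP 5.3.6 and later) and then serializes only that subtree. Note also that query() returns a DOMNodeList, so a single node has to be picked with item(). A minimal sketch, assuming made-up inline markup in place of the fetched page:

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<html><body><p>intro</p>'
      . '<table id="myid"><tr><td>cell</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

// query() returns a DOMNodeList; take the first match (null if none).
$xpath = new DOMXPath($dom);
$table = $xpath->query("//*[@id='myid']")->item(0);

$fragment = '';
if ($table !== null) {
    // Passing a node makes saveHTML() serialize just that element.
    $fragment = $dom->saveHTML($table);
}
echo $fragment, PHP_EOL;
```

The resulting string can be echoed straight into the page, which is what the original $table->saveHTML() call was trying to do.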
You can use DOMElement::C14N() to get the canonicalized HTML(XML) representation of a DOMElement, or if you like a bit more control so that you can filter certain elements and attributes you can use something like this:
function toHTML($nodeList, $tagsToStrip=array('script','object','noscript','form','style'),$attributesToSkip=array('on*')) {
$html = '';
foreach($nodeList as $subIndex => $values) {
if(!in_array(strtolower($values->nodeName), $tagsToStrip)) {
if(substr($values->nodeName,0,1) != '#') {
$html .= ' <'.$values->nodeName;
if($values->attributes) {
for($i=0;$values->attributes->item($i);$i++) {
if( !in_array( strtolower($values->attributes->item($i)->nodeName) , $attributesToSkip ) && (in_array('on*',$attributesToSkip) && substr( strtolower($values->attributes->item($i)->nodeName) ,0 , 2) != 'on') ) {
$vvv = $values->attributes->item($i)->nodeValue;
if( in_array( strtolower($values->attributes->item($i)->nodeName) , array('src','href') ) ) {
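// NOTE: resolve_href() and $this->url are not defined in this snippet; they presumably come from the author's own class
$vvv = resolve_href( $this->url , $vvv );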
}
$html .= ' '.$values->attributes->item($i)->nodeName.'="'.$vvv.'"';
}
}
}
if(in_array(strtolower($values->nodeName), array('br','img'))) {
$html .= ' />';
} else {
$html .= '> ';
if(!$values->firstChild) {
$html .= htmlspecialchars( $values->textContent , ENT_COMPAT , 'UTF-8' , true );
} else {
$html .= toHTML($values->childNodes,$tagsToStrip,$attributesToSkip);
}
$html .= ' </'.$values->nodeName.'> ';
}
} elseif(substr($values->nodeName,1,1) == 't') {
$inner = htmlspecialchars( $values->textContent , ENT_COMPAT , 'UTF-8' , true );
$html .= $inner;
}
}
}
return $html;
}
echo toHTML($table);
