Get the HTML of a webpage from another machine - PHP

I am trying to get the HTML of this page: http://213.177.10.50:6060/itn/default.asp and from this page go to 'Drumuri', where the cars are listed.
In short, I am trying to get that table from the 'Drumuri' page.
I have tried this code:
<?php
$DOM = new DOMDocument;
// note: loadHTML() expects an HTML string, not a URL; loadHTMLFile() loads from a URL
$DOM->loadHTML('http://213.177.10.50:6060/itn/default.asp');
$items = $DOM->getElementsByTagName('a');
print_r($items);
?>
I also tried with cURL, but with no results. This is my county's website and I think it is very well secured, which may be why I cannot get its HTML. Can you please try it and tell me whether this is possible or not, and how or why?

You cannot load http://213.177.10.50:6060/itn/default.asp directly because it contains an iframe. The iframe source is http://213.177.10.50:6060/itn/dreapta.asp.
Here is how to go through the links and find the DRUMURI link:
<?php
$baseUrl = 'http://213.177.10.50:6060/itn/';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($baseUrl.'dreapta.asp');
foreach ($DOM->getElementsByTagName('a') as $link) {
    if ($link->nodeValue == 'DRUMURI') {
        echo "Label -> ".$link->nodeValue."\n";
        echo "Link -> ".$baseUrl.$link->getAttribute('href')."\n";
    }
}
Output:
Label -> DRUMURI
Link -> http://213.177.10.50:6060/itn/drumuri.asp
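From there, a minimal sketch of pulling the table itself, assuming drumuri.asp serves a plain HTML table and needs no authentication, could load that page the same way and walk the rows:
<?php
$baseUrl = 'http://213.177.10.50:6060/itn/';

$DOM = new DOMDocument;
// suppress warnings from imperfect markup, since real-world ASP pages are rarely valid HTML
@$DOM->loadHTMLFile($baseUrl.'drumuri.asp');

// walk every row of every table and print the cell texts, separated by " | "
foreach ($DOM->getElementsByTagName('tr') as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->nodeValue);
    }
    if (!empty($cells)) {
        echo implode(' | ', $cells)."\n";
    }
}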

Related

CSS is not working for my proxy request

I'm trying to proxy a request to a different domain from my own and make some changes to the code before outputting the HTML to be displayed. Everything works well except that my CSS file doesn't seem to take effect.
<?php
if (isset($_GET['url']))
{
    $url  = $_GET['url'];
    $html = file_get_contents($url);
    $dom  = new DOMDocument();
    @$dom->loadHTML($html);
    // collect the href of every <link> element (stylesheets)
    $a = array();
    foreach ($dom->getElementsByTagName('link') as $href)
    {
        $a[] = $href->getAttribute('href');
    }
    // prefix the first stylesheet href with the proxied domain
    echo str_replace($a[0], $url."/".$a[0], $html);
}
?>
The result is an HTML document, but without CSS styling. If I check the source code in my browser, the link to the CSS file looks okay and clicking on it takes me to that CSS file, but it is not taking effect in styling the output.
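One likely culprit (not confirmed in the question) is that the stylesheet hrefs are relative, so the browser resolves them against the proxy's own domain rather than the original one. A minimal sketch of rewriting every link href to an absolute URL, assuming $_GET['url'] holds the proxied page's base URL, might look like this:
<?php
if (isset($_GET['url'])) {
    $base = rtrim($_GET['url'], '/');   // e.g. http://example.com
    $html = file_get_contents($base);
    $dom  = new DOMDocument();
    @$dom->loadHTML($html);

    // rewrite every relative stylesheet href to an absolute URL on the original domain
    foreach ($dom->getElementsByTagName('link') as $link) {
        $href = $link->getAttribute('href');
        if ($href !== '' && !preg_match('#^https?://#i', $href)) {
            $link->setAttribute('href', $base.'/'.ltrim($href, '/'));
        }
    }

    // output the modified document instead of the raw $html
    echo $dom->saveHTML();
}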

DOMDocument->saveHTML isn't working

An API returns me a chunk of HTML (only part of the body, not a full document) and I want to replace all of the images' src attributes with other values.
I get and set the attributes, and if I echo them inside the foreach loop I see the old and new values. But when I try to save with saveHTML and then dump the full HTML block returned from the API, I don't see the replaced paths.
$page = json_decode($page);
$page = (array) $page->rows;
$page = $page[0]->_->content;
$dom = new \DOMDocument();
$dom->loadHTML($page);
$tags = $dom->getElementsByTagName('img');
foreach ($tags as $t)
{
    echo $t->getAttribute('src').'<br>'; // shows the old src
    $t->setAttribute('src', 'bla');
    echo $t->getAttribute('src').'<br>'; // shows the new src
}
$dom->saveHTML();
var_dump($page); // nothing is changed
My friend, this is not how it works.
Your edited HTML is in the return value of saveHTML(), so:
$editedHtml = $dom->saveHTML();
var_dump($editedHtml);
Now you should see your changed HTML.
The explanation is that $page is a completely separate variable that has nothing to do with the $dom object; modifying the DOM does not write the changes back into the original string.
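For reference, a minimal end-to-end sketch of the corrected flow (the rows/_/content field names are taken from the question):
$page = json_decode($page);
$page = (array) $page->rows;
$page = $page[0]->_->content;

$dom = new \DOMDocument();
@$dom->loadHTML($page);

// rewrite every image source on the parsed document
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'bla');
}

// saveHTML() returns the serialized, modified markup; it does not touch $page
$editedHtml = $dom->saveHTML();
var_dump($editedHtml);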
Cheers!

Scraping Thumbnail from NYtimes

My scraping code works for just about every site I've come across while testing... except for nytimes.com articles. I use AJAX with the following PHP code (I've left out some details to focus on my specific problem):
$link = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$article = new DOMDocument;
$article->loadHTMLFile($link);
// generate image array
$images = $article->getElementsByTagName("img");
foreach ($images as $image) {
    $source = $image->getAttribute("src");
    echo '<img src="' . $source . '" alt="alt"><br><br>';
}
My problem is that the main images on nytimes.com pages don't even seem to get picked up by getElementsByTagName. Pinterest finds a way to scrape the main images from this site, for example: http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp, whereas I cannot. Any suggestions?
OK. This is what I have tried so far, as I found your question interesting.
When I do this in the browser console using jQuery, I do get image results. My query was:
var a= new Array();
$('img[src]').each(function(){ a.push($(this).attr('src'));});
console.log(a);
Also see the screenshot of the results.
Note that console.log(arrayname) works in the Chrome browser.
So ideally your code should work. Please consider adding an is_null check like I've done.
Below is the code where I load the URL using a different (perhaps better) approach and get to the root cause of why you only get the single NYT logo image.
The resulting HTML screenshot is attached.
<?php
$html = file_get_contents("http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp");
echo $html;
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXpath($doc);
$images = $xpath->query("//*/img");
if (!is_null($images)) {
    echo $images->length; // number of <img> nodes found
    foreach ($images as $image) {
        $source = $image->getAttribute('src');
        echo '<img src="' . $source . '" alt="alt"><br><br>';
    }
}
?>
You can't get the content via the feed unless you are authenticated.
You can try:
- using the context parameter of file_get_contents (see the sketch below);
- consuming the RSS/ATOM feeds of the article;
- downloading the page as HTML and then loading the local file with file_get_contents. PS: this works.
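A minimal sketch of the first suggestion, passing a stream context with a browser-like User-Agent to file_get_contents (the header values are illustrative, not taken from the question):
<?php
// build a stream context so the request looks like a regular browser visit
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: Mozilla/5.0 (compatible; MyScraper/1.0)\r\n" .
                    "Accept: text/html\r\n",
    ),
));

$url  = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$html = file_get_contents($url, false, $context);

if ($html !== false) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('img') as $image) {
        echo $image->getAttribute('src') . "\n";
    }
}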

Missing html content when using dom->saveHTML in PHP

I am getting data from a website using DOM. I've tested my code on my local server and it works perfectly; however, when I uploaded it to a server and ran the code, the script returned HTML tags without any content. My code looks something like this:
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    if ($div->getAttribute('class') == "content1") {
        $dom = new DOMDocument();
        $dom->appendChild($dom->importNode($div, true));
        $content1 = $dom->saveHTML();
        echo "content:".$content1;
    }
}
On my localhost it returns something like this:
<div class="content1">This is my content</div>
However, on the server I strangely get empty HTML tags like this:
<div class="content1"></div>
What are the possible causes of this problem? Is there any way I can fix it? Please advise.
For PHP versions under 5.3.6:
create a variable that contains a clone of the current node with all its sub-nodes,
append it as a child,
and echo the returned value.
foreach ($divs as $div) {
    if ($div->getAttribute('class') == "content1") {
        $dom = new DOMDocument();
        $cloned = $div->cloneNode(TRUE);
        $dom->appendChild($dom->importNode($cloned, TRUE));
        $content1 = $dom->saveHTML();
        echo "content:".$content1;
    }
}
EDIT: I made a mistake, it was not
$cloned = $element->cloneNode(TRUE);
but
$cloned = $div->cloneNode(TRUE);
Sorry ^^ (hope it will work.)
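On PHP 5.3.6 and later, DOMDocument::saveHTML() accepts a node argument, so the second document isn't needed at all. A minimal sketch of that variant, assuming $dom is the document that $divs came from:
foreach ($divs as $div) {
    if ($div->getAttribute('class') == "content1") {
        // PHP >= 5.3.6: serialize just this node directly from the original document
        $content1 = $dom->saveHTML($div);
        echo "content:".$content1;
    }
}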

PHP DOMDocument error handling

I'm having trouble writing an if statement for the DOM that checks whether $html is blank. Whenever the HTML page does end up blank, everything below the DOM code stops being output (including the check I added for whether it was blank).
$html = file_get_contents("http://example.com/");
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementById('dividhere')->getElementsByTagName('img');
foreach ($links as $link)
{
    echo $link->getAttribute('src');
}
All this does is grab an image URL from the specified div, which works perfectly until the page is a blank HTML page.
I've tried using SimpleHTMLDOM, which didn't work either (it didn't even fetch the image on working pages). Did I miss something with this one, or am I just missing something in both?
include_once('simple_html_dom.php');
$html = file_get_html("http://example.com/");
foreach ($html->find('div[id="dividhere"]') as $div)
{
    if (empty($div->src))
    {
        continue;
    }
    echo $div->src;
}
Get rid of the $html variable and just load the file into $dom by doing @$dom->loadHTMLFile("http://example.com/");, then have an if statement below that to check whether $dom is empty.
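A minimal sketch of that idea, with a null check on getElementById so a blank page no longer kills the rest of the output (the div id and URL are the placeholders from the question):
$dom = new DOMDocument;
// load straight from the URL; suppress parser warnings from malformed markup
@$dom->loadHTMLFile("http://example.com/");

$container = $dom->getElementById('dividhere');
if ($container === null) {
    // blank or unexpected page: skip instead of calling a method on null
    echo "no content found";
} else {
    foreach ($container->getElementsByTagName('img') as $img) {
        echo $img->getAttribute('src');
    }
}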
