Get remote page's complete contents and display on site - PHP

I simply want to load the complete contents of the page at the URL below and display them using PHP.
I tried the method below, but it did not work.
$url = "http://www.officialcerts.com/exams.asp?examcode=101";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
$DOM->loadHTML($output);
How do I walk the Document that loadHTML produced?

If you'd like to display the page as is, use a frame to load it; this is the simplest way:
<frame src="URL">
Frame tag
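As for walking the document that loadHTML() produced: DOMXPath is the usual tool for querying it. A minimal sketch, assuming you wanted to list the link targets on that page (the //a[@href] query is only an example):

$url = "http://www.officialcerts.com/exams.asp?examcode=101";

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);

// Silence warnings caused by real-world, non-well-formed HTML
libxml_use_internal_errors(true);

$DOM = new DOMDocument;
$DOM->loadHTML($output);

// Walk the parsed tree with an XPath query
$xpath = new DOMXPath($DOM);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href'), "\n";
}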

Related

XPath: No elements found in grabbed website and html seems incomplete

Using XPath in a PHP (v8.1) environment, I am trying to fetch all IMG tags from a dummy website:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.someurl.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($response);
$xpath = new DOMXPath($doc);
$images = $xpath->evaluate("//img");
echo serialize($images); //gives me: O:11:"DOMNodeList":0:{}
echo $doc->saveHTML(); //outputs entire website in series with wiped <HTML>,<HEAD>,<BODY> tags
I don't understand why I don't get any results for whatever tags I try to address with XPath (in this case all img tags, but I've tried a bunch of variations!).
The second issue I am having: when looking at the output of the second echo statement (which outputs the entire grabbed HTML), I notice that the HTML page is not complete. What I get is everything except the <HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> tags (but the actual contents still exist!), as if everything was appended in series. Is it supposed to be this way?
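Note that serialize() is not a reliable way to inspect a DOMNodeList; checking its length property and iterating it is more informative. A minimal sketch, assuming $doc has been loaded from the curl $response as above:

$xpath = new DOMXPath($doc);
$images = $xpath->evaluate('//img');

// A node list reports how many matches it holds and can be iterated directly
echo $images->length . " images found\n";
foreach ($images as $img) {
    echo $img->getAttribute('src'), "\n";
}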

Why are curl links redirecting through localhost?

Right now, I have the following PHP code:
<?php
include('simple_html_dom.php');
# set up the request parameters
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.google.com/search?q=sport+news');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_MAXREDIRS, 0);
$result = curl_exec($curl);
curl_close($curl);
echo $result;
?>
When this code is run, it returns a Google page with the search results for "sport news". However, when you try to click on any of these links, it redirects you to 'localhost:/--url--'. How do I stop the links from pointing to localhost and have them go to the actual site instead?
I am currently using WampServer for testing.
This happens because Google's result page uses relative URLs in its links.
<a href="/url?q=https://www.bbc.co.uk/sport/43634915&sa=U&ved=2ahUKEwjX (...)
Notice that the href starts with /, not with a domain such as href="https://foobar.com/url?q=.
Therefore, the links use the hostname of the page that served the results.
The reason you get localhost when clicking them is that you are serving this code from localhost.
One solution would be to use the DOMDocument PHP extension to parse the links and prepend a hostname, so that the result links are absolute rather than relative.
For example:
// Ignore HTML parse errors
libxml_use_internal_errors(true);

// Instantiate parser
$dom = new DOMDocument;

// Load the HTML into the DOM parser
$dom->loadHTML($result);

// Select anchor tags
$links = $dom->getElementsByTagName('a');

// Iterate through all links
foreach ($links as $link) {
    // Get the (possibly relative) link value
    $relativePath = $link->getAttribute('href');

    // Check if this is a relative link
    if (substr($relativePath, 0, 1) === '/') {
        // Prepend the Google domain
        $link->setAttribute('href', "https://google.com" . $relativePath);
    }
}

echo $dom->saveHTML();
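Alternatively, if you don't want to rewrite each link, you could emit a <base> tag in front of the returned HTML so the browser resolves the relative links against Google's domain instead of localhost; a one-line sketch:

echo "<base href='https://google.com/'>" . $result;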

Get processed content of URL

I am trying to retrieve the content of web pages and check whether each page contains certain error keywords I am monitoring. (Instead of manually loading each URL every time to check on the sites, I hope to do this programmatically and flag errors when they occur.)
I have tried XMLHttpRequest. I am able to get the HTML content, like what I see when I "view source" on the page. But the pages I monitor run on SharePoint and the web parts are dynamically generated. I believe that if an error occurs when loading these parts, I would not be able to flag it, as the HTML I pull will not contain the errors but just the usual paths to the web parts.
cURL seems to do the same. I just read about DOMDocument and was wondering whether DOMDocument processes the code or just breaks the HTML into a hierarchical structure.
I only wish to have the content of the URL (like what you get when you save a website as txt in IE, not the HTML). Or if I can further process the HTML, that would be good too. How can I do that? Any help will be really appreciated. :)
Why do you want to strip the HTML? It's better to use it!
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);

// libxml_use_internal_errors(true);
$oDom = new DOMDocument();
$oDom->loadHTML($data);

// Go through the DOM and look for the error (it's similar whether it's
// <p class="error">error message</p> or whatever)
$errors = $oDom->getElementsByTagName("error"); // or however you get errors
foreach ($errors as $error) {
    if (strstr($error->nodeValue, 'SOME ERROR')) {
        echo 'SOME ERROR occurred';
    }
}
If you don't want to do that, you can just do:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
if (strstr($data, 'SOME_ERROR')) {
    echo 'SOME ERROR occurred';
}
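If you really do want just the visible text rather than the markup, a minimal sketch is to load the fetched HTML into DOMDocument and read its textContent (this assumes $data already holds the HTML fetched with curl as above; note that, like any curl-based approach, it still won't execute the JavaScript that builds dynamic web parts):

// Suppress warnings from real-world HTML, then parse it
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);

// textContent returns the concatenated text of the document, without tags
$text = $dom->textContent;

if (strstr($text, 'SOME ERROR')) {
    echo 'SOME ERROR occurred';
}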

Load website in div using curl

I'm using curl to retrieve an external website and display it in a div...
function Get_Domain_Contents($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
But how do I get it to return the CSS and images of that webpage too? Right now it returns everything except the images and CSS. Thanks in advance.
$url = 'http://example.com';
$html = Get_Domain_Contents($url);
$html = "<base href='{$url}' />" . $html;
echo $html;
Rather than downloading all these extra files, you could parse the content you downloaded and modify any relative URLs to make them absolute. The page would then use them in place, and the CSS and images would show up in the rendered output. You could use something like the PHP Simple HTML DOM parser to make the parsing part of the task easier.
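A minimal sketch of that rewriting step using the built-in DOMDocument rather than Simple HTML DOM (the helper name make_urls_absolute is just for illustration, and it only handles root-relative href/src values that start with a single /):

function make_urls_absolute($html, $baseUrl) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Rewrite root-relative href and src attributes to absolute URLs
    foreach (['a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src'] as $tag => $attr) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '' && $value[0] === '/' && substr($value, 0, 2) !== '//') {
                $node->setAttribute($attr, rtrim($baseUrl, '/') . $value);
            }
        }
    }
    return $dom->saveHTML();
}

echo make_urls_absolute(Get_Domain_Contents($url), $url);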

Selecting a specific div from an external webpage using cURL

Hi, can anyone help me with how to select a specific div from the content of a webpage?
Let's say I want to get the div with id="wrapper_content" from the webpage http://www.test.com/page3.php.
My current code looks something like this (not working):
// REG EXP.
$s_searchFor = '#^/.dont know what to put here..#ui';

// CURL
$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, 'http://www.test.com/page3.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

if (!preg_match($s_searchFor, $ch)) {
    $file_contents = curl_exec($ch);
}
curl_close($ch);

// display file
echo $file_contents;
So I'd like to know how I can use regular expressions to find a specific div, and how to discard the rest of the webpage so that $file_contents only contains that div.
HTML isn't regular, so you shouldn't use regex. Instead, I would recommend an HTML parser such as Simple HTML DOM or DOM.
If you were going to use Simple HTML DOM, you would do something like the following:
$html = str_get_html($file_contents);
$elem = $html->find('div[id=wrapper_content]', 0);
Even if you used regex, your code still wouldn't work correctly: you need to get the contents of the page before you can run the regex against them.

// wrong
if (!preg_match($s_searchFor, $ch)) {
    $file_contents = curl_exec($ch);
}

// right
$file_contents = curl_exec($ch);                    // get the page contents
preg_match($s_searchFor, $file_contents, $matches); // match the element
$file_contents = $matches[0];                       // set the file_contents var to the matched element

include('simple_html_dom.php');
$html = str_get_html($file_contents);
$elem = $html->find('div[id=wrapper_content]', 0);
Download simple_html_dom.php
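If you'd rather use the built-in DOM extension than Simple HTML DOM, here is a minimal sketch of the same selection with DOMXPath (it reuses the curl handle set up in the question):

$file_contents = curl_exec($ch);

// Parse the fetched page, ignoring HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);

// Select the div with id="wrapper_content" and output only its markup
$xpath = new DOMXPath($dom);
$wrapper = $xpath->query('//div[@id="wrapper_content"]')->item(0);
if ($wrapper !== null) {
    echo $dom->saveHTML($wrapper);
}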
Check out Hpricot, it lets you elegantly select sections.
First you would use curl to get the document, then use Hpricot to get the part you need.
