Using XPath in a PHP (v8.1) environment, I am trying to fetch all IMG tags from a dummy website:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.someurl.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);
$images = $xpath->evaluate("//img");
echo serialize($images); //gives me: O:11:"DOMNodeList":0:{}
echo $doc->saveHTML(); //outputs entire website in series with wiped <HTML>,<HEAD>,<BODY> tags
I don't understand why I don't get any results for whatever tags I try to address with XPath (in this case all img tags, but I've tried a bunch of variations!).
The second issue I'm having: looking at the output of the second echo statement (which prints the entire grabbed HTML), I see that the page is not complete. I get everything except the <HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> tags (the actual contents still exist!), as if everything had been appended in series. Is it supposed to be this way?
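For reference, here is a minimal corrected sketch of the same flow. The snippet above fetches the page into $response but then passes $htmlString to loadHTML(); assuming $htmlString is not assigned anywhere else, the parser receives nothing to search, which would explain the empty DOMNodeList:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.someurl.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($response); // parse the string curl actually returned
libxml_clear_errors();
$xpath = new DOMXPath($doc);
foreach ($xpath->evaluate('//img') as $img) {
    echo $img->getAttribute('src'), "\n";
}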
Related
I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
.... using PHP and XPath to get values like the 0 under the string "CODICE BIANCO".
(NOTE: you could see different values on that page if you browse it ... it doesn't matter ... they change dynamically ...)
I'm using this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
// Set cURL parameters: pay attention to the PROXY config!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // '@' suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
I've extracted the XPath using both the Chrome and Firefox web consoles ...
Suggestions / examples?
Both Chrome and Firefox most probably "improve" the original HTML by adding <tbody> elements inside <table>, because the original HTML does not contain them. cURL does not do this, and that's why your XPath fails. Try this one instead:
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth anchoring on something near the data you're looking for. I've only sketched the XPath, but it navigates from the text "CODICE BIANCO" and finds the data relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This still breaks if the site's coders change the page format, but it localises the dependency as much as possible.
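If the cell text ever carries stray whitespace, a normalize-space() comparison is a slightly more forgiving variant of the same idea (an untested sketch; it assumes the label text is otherwise identical):
$xpath_for_parsing = '//*[normalize-space(text())="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';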
I simply want to load the complete contents of the page below and display them using PHP.
I tried the method below, but it did not work.
$url = "http://www.officialcerts.com/exams.asp?examcode=101";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
$DOM->loadHTML($output);
How do I walk the Document that loadHTML produced?
If you want to display the page as-is, use a frame to load it; this is the simplest way:
<iframe src="URL"></iframe>
(The <frame> element only works inside a <frameset>, so an iframe is the practical choice here.)
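If, on the other hand, you want to walk the parsed document rather than just display the page, DOMXPath works on the $DOM you already built (a minimal sketch; the //td query is only an example, not the actual structure of that page):
$xpath = new DOMXPath($DOM);
foreach ($xpath->query('//td') as $cell) {
    echo trim($cell->nodeValue), "\n"; // print the text of every table cell
}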
I see a lot of answers on SO that pertain to the question, but either there are slight differences I couldn't overcome or I just couldn't reproduce the processes shown.
What I am trying to accomplish is to use cURL to get the HTML of a Google+ business page, iterate over the HTML, and for each review of the business scrape the review's HTML for display on that business's non-Google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus I am trying to do a foreach loop that finds and grabs the div and inner HTML for each div with the attribute guidedhelpid="userreviews".
I am successfully getting the HTML back via cURL and can parse it when targeting a standard tag name like "a", or when an element has an ID, but iterating over the HTML with PHP's default parser when looking for an attribute name is problematic.
How can I take the working code below and make it do what the second snippet (which is, of course, wrong) intends?
WORKING CODE (finds, gets, and echoes all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach ($DOM->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->getAttribute('href');
    echo "<br />";
}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
echo $review;
echo "<br />"; }
Any help in correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.
I suggest using DOMXPath in this case too. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[#guidedhelpid="userreviews"]');
if($review->length > 0) { // if it exists
echo $review->item(0)->nodeValue;
// echoes
// John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}
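To handle every review rather than only the first, iterate the node list; and if you want each review's markup instead of its plain text, DOMDocument::saveHTML() accepts a single node (a sketch building on the $review list above):
foreach ($review as $node) {
    echo $node->nodeValue;      // plain text of one review
    echo $dom->saveHTML($node); // or the full HTML of the review div
    echo "<br />";
}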
I am trying to learn cURL usage, but I do not fully understand how it works yet. How can I use cURL (or other functions) to access only one (the top) data entry of a table? So far I am only able to retrieve the entire website. How can I echo just the table, and specifically its first entry? My code is:
<?php
$ch = curl_init("http://www.w3schools.com/html/html_tables.asp");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
Using cURL is a good start, but it's not going to be enough; as hanky suggested, you also need DOMDocument, and you can include DOMXPath as well.
Sample Code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.w3schools.com/html/html_tables.asp');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
libxml_use_internal_errors(true);
$html = curl_exec($ch); // the whole document (in string) goes in here
$dom = new DOMDocument();
$dom->loadHTML($html); // load it
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// point it to the particular table
// table with a class named 'reference', second row (first data), get the td
$table_row = $xpath->query('//table[@class="reference"]/tr[2]/td');
foreach($table_row as $td) {
echo $td->nodeValue . ' ';
}
Should output:
Jill Smith 50
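If only the very first cell is wanted, the DOMNodeList returned by query() can also be indexed directly instead of looped (a sketch using the same $table_row):
$first = $table_row->item(0); // first <td> of the first data row
echo $first !== null ? $first->nodeValue : 'not found';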
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
Output is:
array(0) { }
However, in the page source I see this:
<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?
It appears $html's contents stop at the <body> tag for this page. Any idea why?
Yes, you must provide this page with a valid user agent.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);
outputs everything up to the ending </html>, including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
A simple wget or curl without the user agent returns only up to the <body> tag.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
EDIT: My first post stated that there was still an issue with XPath... I was just not doing my due diligence, and the updated code above works great. I forgot to force curl to output to a string rather than print to the screen (as it does by default).
Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which already has a perfectly good XPath query engine.
[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img'); // DOMXPath has query(), not xpath()
[...snip...]
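Iterating the resulting DOMNodeList then looks like this (a short sketch; each entry is a DOMElement):
foreach ($images as $img) {
    echo $img->getAttribute('src'), "\n";
}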
The IMG tag is generated by JavaScript.
If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.
Update #1
I believe it is because of the user-agent string.
If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.
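Applied to the curl setup above, that is a one-line change (using the same user-agent string quoted here):
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0');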