Get rigth Xpath for HTML elements

Get rigth Xpath for HTML elements - php

I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
.... using PHP and XPath to get the values like 0 under the string "CODICE BIANCO"
(NOTE: you could see different values in that page if you try to browse it ... it doesn't matter ..,, they changing dinamically .... )
I'm using this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[#id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
I've extracted the xpath using both the Chrome and Firefox web consoles ...
Suggestions / examples?

Both Chrome and Firefox most probably improve the original HTML by adding <tbody> elements inside <table> because the original HTML does not contain them. CURL does not do this and that's why your XPATH fails. Try this one instead:
$xpath_for_parsing = '//*[#id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';

Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth looking for something relatively near the data your looking for. I've just done the XPath, but it basically navigates from the text "CODICE BIANCO" and finds the data relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This is still breakable when the coders change the page format, but it tries to localise the code as much as possible.

Related

Parsing values from a ASP web page using PHP and XPath

I'm trying to scrape this web page ...
http://prontosoccorso.usl4.toscana.it/attesa/home.asp
using PHP and XPath to get the number values under the red, yellow, green and white colored circles.
(NOTE: you could see different value in that page if you try to browse it ... it doesn't matter ..,, it change dinamically .... )
I'm trying to use this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
$xpath_for_parsing = '[#id="prontosoccorso"]/tbody/tr[2]/td[2]';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
The code works fine but the result is always 0 !!
I've notice that if you use
$xpath_for_parsing = '[#id="prontosoccorso"]';
the result is
Situazione aggiornata al giorno 30/12/2017 alle ore 14:09 Rosso Giallo Verde Azzurro Bianco Pazienti in attesa (totale 0) 0 0 0 0 0 Pazienti in visita (totale 0) 0 0 0 0 0 Pazienti trattati nelle ultime ore 0 0 0 0 0
so the result 0 for my values is coherent (and also if you try the following curl http://prontosoccorso.usl4.toscana.it/attesa/home.aspfrom command line you note that the values are all zero .... )
Analyzing with browser console I can't found the request that get tha real values ..... Any help / suggestions?
Thank you in advance .. .

One thing to notice is that even if you go to that web page, you start off with 0's in all the fields, which is why I tried with loading the page twice. This still didn't work, so I then made it store the cookies between calls and the values start to turn up.
The code is mainly what you have, there are extra curl_setopt() calls to create a cookie file (may be able to do this once and that will always work - don't quote me on that).
The XPath, will only fetch the first row of fields, but this can be easily adapted for the other rows.
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$cookies = "./cookie.txt";
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookies);
$data = curl_exec($ch);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$xpath_for_parsing = '//table[#id="prontosoccorso"]/tbody/tr[2]/td';
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
echo $theValue = $node->nodeValue.PHP_EOL;
}
You may be able to add some logic that checks if all values are 0 to reload the page. But this code just calls curl_exec() twice.

Right Xpath for HTML elements?

I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
.... using PHP and XPath to get the value 7 near the string "CODICE GIALLO"
(NOTE: you could see different value in that page if you try to browse it ... it doesn't matter ..,, it change dinamically .... )
I'm using this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[2]/table/tbody/tr[4]/td[2]/table/tbody/tr[2]/td[2]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
In this way I obtain "N.D." as output not "7" as I suppose.
Reading this Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? I've seen that the problem coud be about the <tbody> tag so I've tried to eliminate it form my original xpath and I tried my code using:
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table/tr[2]/td[2]/b'
but the result is still "N.D." instead of "7".
Using
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table'
the result is "Codice GIALLO 7"
How may I obtain only the "7" value?
Any suggestions / example?

This one should do the trick:
//td[.="Codice GIALLO"]/following-sibling::td/b

Retrieving a single line from a html table from another website?

I am try to learn curl usage, but I do not understand how it works fully yet. How can I use curl (or other functions) to access on one (the top) data entry of a table. So far I am only able to retrieve the entire website. How can I only echo the whole table and specifically the first entry. My code is:
<?php
$ch = curl_init("http://www.w3schools.com/html/html_tables.asp");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>

Using curl is a good start, but its not going to be enough, as hanky suggested, you need to also use DOMDocument and also you can include DOMXpath.
Sample Code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.w3schools.com/html/html_tables.asp');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
libxml_use_internal_errors(true);
$html = curl_exec($ch); // the whole document (in string) goes in here
$dom = new DOMDocument();
$dom->loadHTML($html); // load it
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// point it to the particular table
// table with a class named 'reference', second row (first data), get the td
$table_row = $xpath->query('//table[#class="reference"]/tr[2]/td');
foreach($table_row as $td) {
echo $td->nodeValue . ' ';
}
Should output:
Jill Smith 50

How do you detect if a remote website uses flash?

I am trying to write a tool that detects if a remote website uses flash using php. So far I have written a script that detects if embed or objects exist which give an indicator that there is a possibility of it being installed but some sites encrypt their code so renders this function useless.
include_once('simple_html_dom.php');
$flashTotalCount = 0;
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
foreach($html->find('embed') as $pageEmbed){
$flashTotalCount++;
}
foreach($html->find('object') as $pageObject){
$flashTotalCount++;
}
if($flashTotalCount == 0){
echo "NO FLASH";
}
else{
echo "FLASH";
}
Would anyone one know of a way to check to see if a website uses flash or if possible get header information that flash is being used etc.
Any advise would be helpful.

As far as I understand, flash can be loaded by javascript. So you should execute the web page. For this purposes you'll have to use tool like this:
http://seleniumhq.org/docs/02_selenium_ide.html#the-waitfor-commands-in-ajax-applications
I don't think that it is usable from php.

Get processed content of URL

I am trying to retrieve the content of web pages and check if the page contain certain error keywords I am monitoring. (instead of manually loading each URL everytime to check on the sites, I hope to do this programmatically and flag out errors when they occur)
I have tried XMLHttpRequest. I am able to get the HTML content, like what I see when I "view source" on the page. But the pages I monitor runs on Sharepoint and the webparts are dynamically generated. I believe if error occurs when loading these parts I would not be able to flag them out as the HTML I pull will not contain the errors but just usual paths to the webparts.
cURL seems to do the same. I just read about DOMDocument and I was wondering if DOMDocument process the codes or does it just break the HTML into a hierarchical structure.
I only wish to have the content of the URL. (like what you get when you save website as txt in IE, not the HTML). Or if I can further process the HTML then it would be good too. How can I do that? Any help will be really appreciated. :)

Why do you want to strip the HTML? It's better to use it!
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
// libxml_use_internal_errors(true);
$oDom = new DomDocument();
$oDom->loadHTML($data);
// Go through DOM and look for error (it's similar if it'd be
// <p class="error">error message</p> or whatever)
$errors = $oDom->getElementsByTagName( "error" ); // or however you get errors
foreach( $errors as $error ) {
if(strstr($error->nodeValue, 'SOME ERROR')) {
echo 'SOME ERROR occurred';
}
}
If you don't want to do that, you can just do:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
if(strstr($data, 'SOME_ERROR')) {
echo 'SOME ERROR occurred';
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Get rigth Xpath for HTML elements - php

Related

Parsing values from a ASP web page using PHP and XPath

Right Xpath for HTML elements?

Retrieving a single line from a html table from another website?

How do you detect if a remote website uses flash?

Get processed content of URL

Categories

Resources