I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
.... using PHP and XPath to get the value 7 next to the string "CODICE GIALLO".
(NOTE: you may see a different value if you browse that page yourself; it doesn't matter, the value changes dynamically.)
I'm using this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[2]/table/tbody/tr[4]/td[2]/table/tbody/tr[2]/td[2]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach ($colorWaitingNumber as $node) {
    $theValue = $node->nodeValue;
}
print $theValue;
?>
In this way I obtain "N.D." as output, not "7" as I expected.
Reading Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? I saw that the problem could be the <tbody> tags, so I tried eliminating them from my original XPath and ran my code using:
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table/tr[2]/td[2]/b'
but the result is still "N.D." instead of "7".
Using
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table'
the result is "Codice GIALLO 7"
How can I obtain only the "7" value?
Any suggestions / examples?
This one should do the trick:
//td[.="Codice GIALLO"]/following-sibling::td/b
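For example, plugged into the script from the question (a sketch: the cURL setup is trimmed down, and the XPath assumes the label cell contains exactly "Codice GIALLO", as in the output you quoted):
<?php
// Sketch: same cURL fetch as in the question, then the label-anchored XPath.
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($data);                 // the page is not valid markup, so silence the parser warnings
$xpath = new DOMXPath($dom);

// Anchor on the label cell instead of a long positional path.
$nodes = $xpath->query('//td[.="Codice GIALLO"]/following-sibling::td/b');

$theValue = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->nodeValue)
    : 'N.D.';
print $theValue;   // e.g. "7"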
Related
I'm trying to scrape this web page ...
http://prontosoccorso.usl4.toscana.it/attesa/home.asp
using PHP and XPath to get the number values under the red, yellow, green and white colored circles.
(NOTE: you may see different values if you browse that page yourself; it doesn't matter, they change dynamically.)
I'm trying to use this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
$xpath_for_parsing = '//*[@id="prontosoccorso"]/tbody/tr[2]/td[2]';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach ($colorWaitingNumber as $node) {
    $theValue = $node->nodeValue;
}
print $theValue;
?>
The code works fine, but the result is always 0!
I've noticed that if you use
$xpath_for_parsing = '//*[@id="prontosoccorso"]';
the result is
Situazione aggiornata al giorno 30/12/2017 alle ore 14:09 Rosso Giallo Verde Azzurro Bianco Pazienti in attesa (totale 0) 0 0 0 0 0 Pazienti in visita (totale 0) 0 0 0 0 0 Pazienti trattati nelle ultime ore 0 0 0 0 0
so the result 0 for my values is consistent (and if you run curl http://prontosoccorso.usl4.toscana.it/attesa/home.asp from the command line, you'll see that the values are all zero ...)
Analyzing the page with the browser console, I can't find the request that fetches the real values. Any help / suggestions?
Thank you in advance.
One thing to notice is that even if you go to that web page in a browser, you start off with 0's in all the fields, which is why I first tried loading the page twice. That alone didn't work, so I then made it store cookies between calls, and the values started to turn up.
The code is mostly what you have; there are extra curl_setopt() calls to create a cookie file (you may be able to do this once and it will keep working, but don't quote me on that).
The XPath will only fetch the first row of fields, but this can easily be adapted for the other rows.
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$cookies = "./cookie.txt";
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookies);
$data = curl_exec($ch);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$xpath_for_parsing = '//table[@id="prontosoccorso"]/tbody/tr[2]/td';
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach ($colorWaitingNumber as $node) {
    echo $theValue = $node->nodeValue . PHP_EOL;
}
You may be able to add some logic that checks whether all the values are 0 and reloads the page; this code simply calls curl_exec() twice.
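For instance, a rough sketch of that retry logic, reusing the cookie-enabled $ch handle and the XPath from the code above (the retry count and pause are arbitrary):
// Sketch: re-run the request until real (non-zero) values show up, or give up.
function all_zero(DOMXPath $xpath) {
    $cells = $xpath->query('//table[@id="prontosoccorso"]/tbody/tr[2]/td');
    foreach ($cells as $td) {
        $value = trim($td->nodeValue);
        if ($value !== '' && $value !== '0') {
            return false;              // found a real value
        }
    }
    return true;                       // every cell is empty or "0"
}

$maxTries = 3;
for ($try = 0; $try < $maxTries; $try++) {
    $data = curl_exec($ch);            // $ch still has the cookie jar configured
    $dom  = new DOMDocument();
    @$dom->loadHTML($data);
    $xpath = new DOMXPath($dom);
    if (!all_zero($xpath)) {
        break;                         // the session cookie is set and real values arrived
    }
    sleep(1);                          // small pause before retrying
}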
I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
.... using PHP and XPath to get the values, like the 0 under the string "CODICE BIANCO".
(NOTE: you may see different values if you browse that page yourself; it doesn't matter, they change dynamically.)
I'm using this PHP code sample to print the value ...
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach ($colorWaitingNumber as $node) {
    $theValue = $node->nodeValue;
}
print $theValue;
?>
I've extracted the XPath using both the Chrome and Firefox web consoles.
Suggestions / examples?
Both Chrome and Firefox most probably "improve" the original HTML by adding <tbody> elements inside <table>, because the original HTML does not contain them. cURL does not do this, and that's why your XPath fails. Try this one instead:
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth anchoring on something close to the data you're looking for. I've only written the XPath here, but it basically navigates from the text "CODICE BIANCO" and finds the data relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This will still break if the site's developers change the page layout, but it keeps the lookup as local as possible.
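Dropped into the script from the question, that looks roughly like this (a sketch; only the XPath and the result handling change, and $xpath is the DOMXPath built above):
// Sketch: only the parts of the question's script that change.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';

$nodes = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
if ($nodes !== false && $nodes->length > 0) {
    // keep only the digits, in case the cell carries extra whitespace or markup
    $theValue = preg_replace('/\D+/', '', $nodes->item(0)->nodeValue);
}
print $theValue;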
I'm making an anime search using PHP and the MyAnimeList API. The issue I'm having is that every once in a while I'll search for something and it comes up with a bunch of XML errors. That would be fine, except it then won't display the information. Here is the code.
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
$search = $_GET['q'];
$username = '';
$password = '';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://myanimelist.net/api/anime/search.xml?q=$search");
curl_setopt($ch, CURLOPT_USERPWD,$username . ":" . $password);
curl_setopt($ch, CURLOPT_HEADER, 'Magic Browser');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_TIMEOUT, 10 );
$data = curl_exec($ch);
$xml = simplexml_load_string($data);
$image = $xml->entry[0]->image;
$title = $xml->entry[0]->title;
$status = $xml->entry[0]->status;
$synopsis = $xml->entry[0]->synopsis;
echo "$image <br><br><b>Title</b>: $title <br> <b>Status</b>: $status <br><b>Synopsis</b>: $synopsis";
?>
EDIT FIXED
<?php
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
ini_set('display_errors', 1);
$search = $_GET['q'];
$username = '';
$password = '';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://myanimelist.net/api/anime/search.xml?q=$search");
curl_setopt($ch, CURLOPT_USERPWD,$username . ":" . $password);
curl_setopt($ch, CURLOPT_HEADER, 'Magic Browser');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10 );
$data = curl_exec($ch);
// changed the encoding; I don't know if it helped
$data = str_replace('utf-8', 'iso-8859-1', $data);
// replaced the — character, so now it works
$data = str_replace('—', ' ', $data);
$xml = simplexml_load_string($data);
$image = $xml->entry[0]->image;
$title = $xml->entry[0]->title;
$status = $xml->entry[0]->status;
$synopsis = $xml->entry[0]->synopsis;
echo "$image <br><br><b>Title</b>: $title <br> <b>Status</b>: $status <br><b>Synopsis</b>: $synopsis";
?>
An example of what I mean is located here: http://vs3.yuribot.com/mal.php?q=naruto
It took a while, but now it's fixed; I commented the places that helped fix it. Thank you everyone for your help.
You should check whether simplexml_load_string returned false (meaning the call failed, e.g. the remote server was not available) and whether there is some sort of error message in the XML. If there is, skip the info and display the error instead. I can't tell you how the errors are represented in that XML, but there is most likely an error tag or something similar.
Edit: I just saw in the PHP docs that simplexml_load_string requires a well-formed XML string. Maybe you should check whether you really received well-formed XML. Best is to look at the docs and adapt the code as needed.
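A minimal sketch of that check, using libxml's error collection (the messages shown are placeholders):
// Sketch: detect a malformed or failed response before touching $xml->entry.
libxml_use_internal_errors(true);        // collect parse errors instead of printing warnings

$xml = simplexml_load_string($data);
if ($xml === false) {
    // not well-formed XML: log the parser errors and show a friendly message
    foreach (libxml_get_errors() as $error) {
        error_log(trim($error->message));
    }
    libxml_clear_errors();
    exit('The search service returned an invalid response, please try again.');
}

if (!isset($xml->entry[0])) {
    exit('No results found.');           // well-formed, but the feed is empty
}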
If you are concerned with the PHP "Notice" or "Warning" messages that sometimes seem to appear on your example, then you can prefix the relevant function call (in this case, "simplexml_load_string") with the "@" symbol to suppress any such messages.
For example:
$xml = @simplexml_load_string($data);
For more information, see PHP's manual for error control.
<?php echo file_get_contents ("http://www.google.com/"); ?>
but I only want to get the contents of a certain tag at that URL ... how do I do that?
I need to echo the content between a tag's opening and closing, not the whole page.
Refer to the PHP manual and cURL, which will also help you.
You may also use a user-defined function instead of file_get_contents():
function get_content($URL) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $URL);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo get_content('http://example.com');
Hope it will resolve your issue.
I think you want to extract content from a specific HTML tag in the file. You can use regular expressions for this; however, to parse the HTML document properly, see the following link:
http://php.net/manual/en/class.domdocument.php
libxml_use_internal_errors(true);
$url = "http://stackoverflow.com/questions/15947331/php-echo-file-get-contents-how-to-get-content-in-a-certain-tag";
$dom = new DomDocument();
$dom->loadHTML(file_get_contents($url));
foreach ($dom->getElementsByTagName('a') as $element) {
    echo $element->nodeValue . '<br/>';
}
exit;
More info: http://www.php.net/manual/en/class.domdocument.php
There you can see how to select elements by id or class, how to get elements' attribute values etc.
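For instance, continuing from the $dom built above, a short sketch (the id "main" is made up purely for illustration):
// Sketch: select an element via XPath and read attribute values with DOMDocument.
$xpath = new DOMXPath($dom);

// by id (XPath works even when no DTD declares the id attribute)
$node = $xpath->query('//*[@id="main"]')->item(0);   // "main" is a hypothetical id
if ($node !== null) {
    echo $node->nodeValue . '<br/>';
}

// attribute values: the href of every link
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href') . '<br/>';
}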
Note: It's better to get the content via cURL instead of file_get_contents. For example:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
Also note that on some websites you have to specify options like CURLOPT_USERAGENT etc., otherwise the content may not be returned.
Here are the other options: http://www.php.net/manual/en/function.curl-setopt.php
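For example, a user agent (and redirect following) can be added to the helper above like this; the UA string here is arbitrary:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    // some sites refuse requests that don't look like they come from a browser
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyScraper/1.0)');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);     // follow redirects as well
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}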
I need to get some information about some plants and put it into a MySQL table.
My knowledge of cURL and the DOM is practically nil, but I've come to this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
Now I'm really lost about how to move on from this point. Can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one-time extraction and not a rolling project, personally I'd use quick-and-dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "/<b>(.*?)<\/b>\s*<br>(.*?)<\/?(td|p)>/si";
You'll need to do some pre and post cleaning before it will get them all though.
I don't envy you having this task...
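If you do go the regex route, usage would look roughly like this (a sketch; $data is the HTML fetched in the question, and the pattern is the one quoted above):
// Sketch: pull the <b>title</b> / value pairs out of $data with the pattern above.
$pattern = "/<b>(.*?)<\/b>\s*<br>(.*?)<\/?(td|p)>/si";

if (preg_match_all($pattern, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $title = trim(strip_tags($m[1]));
        $value = trim(strip_tags($m[2]));
        echo $title . ' => ' . $value . PHP_EOL;
    }
}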
Your best bet will be to wrap this in PHP ;)
Yes, this is an ugly hack for ugly HTML code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use Simple HTML DOM and you'll be able to access any element, or any element's content, you wish. Its API is very straightforward.
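For example, with Simple HTML DOM the table from the question can be walked roughly like this (a sketch; table index 8 is the one used in the question and may need adjusting, and $data is the HTML fetched with cURL):
// Sketch: iterate the cells of the question's table with Simple HTML DOM.
include 'simple_html_dom.php';

$html  = str_get_html($data);          // $data fetched with cURL as in the question
$table = $html->find('table', 8);      // zero-based index, as in the question

if ($table) {
    foreach ($table->find('td') as $td) {
        echo trim($td->plaintext) . PHP_EOL;
    }
}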
You can try something like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows except the header rows
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach ($table_rows as $row => $tr) {
    foreach ($tr->childNodes as $td) {
        $data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
    }
    $data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>