Parsing values from an ASP web page using PHP and XPath

I'm trying to scrape this web page:
http://prontosoccorso.usl4.toscana.it/attesa/home.asp
using PHP and XPath to get the numbers under the red, yellow, green and white colored circles.
(NOTE: you may see different values if you browse the page yourself; that doesn't matter, since they change dynamically.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
$xpath_for_parsing = '//*[@id="prontosoccorso"]/tbody/tr[2]/td[2]';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
The code runs fine, but the result is always 0!
I've noticed that if I use
$xpath_for_parsing = '//*[@id="prontosoccorso"]';
the result is
Situazione aggiornata al giorno 30/12/2017 alle ore 14:09 Rosso Giallo Verde Azzurro Bianco Pazienti in attesa (totale 0) 0 0 0 0 0 Pazienti in visita (totale 0) 0 0 0 0 0 Pazienti trattati nelle ultime ore 0 0 0 0 0
so the 0 result for my values is consistent (and if you run curl http://prontosoccorso.usl4.toscana.it/attesa/home.asp from the command line, you'll see that the values are all zero as well).
Analyzing the page with the browser console, I can't find the request that fetches the real values. Any help or suggestions?
Thank you in advance.

One thing to notice is that even in a browser the page starts off with 0s in all the fields, which is why I tried loading the page twice. That alone didn't work, so I then made it store the cookies between calls, and the values started to turn up.
The code is mainly what you have, plus extra curl_setopt() calls to create a cookie file (you may be able to do this once and have it always work, but don't quote me on that).
The XPath only fetches the first row of fields, but it can easily be adapted for the other rows.
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://prontosoccorso.usl4.toscana.it/attesa/home.asp';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$cookies = "./cookie.txt";
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookies);
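// First request: the values are still 0, but the cookie gets stored.
$data = curl_exec($ch);
// Second request on the same handle: the cookie is sent back and the values turn up.
$data = curl_exec($ch);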
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$xpath_for_parsing = '//table[@id="prontosoccorso"]/tbody/tr[2]/td';
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
echo $theValue = $node->nodeValue.PHP_EOL;
}
You may be able to add some logic that checks if all values are 0 to reload the page. But this code just calls curl_exec() twice.
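The "check if all values are 0" idea could be sketched roughly as follows. The helper names allZero() and fetchUntilNonZero(), the $fetch callable and the retry limit are assumptions made for illustration, not part of the code above; $fetch stands in for whatever runs curl_exec() plus the XPath query and returns the cell values.

```php
<?php
// Returns true when every value is numerically zero (or the list is empty).
// Non-numeric text also counts as zero because of the (int) cast.
function allZero(array $values): bool
{
    foreach ($values as $value) {
        if ((int) trim((string) $value) !== 0) {
            return false;
        }
    }
    return true;
}

// Calls $fetch (hypothetical: performs the request and returns the cell
// values) up to $maxAttempts times, stopping at the first non-zero result.
function fetchUntilNonZero(callable $fetch, int $maxAttempts = 3): array
{
    $values = [];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $values = $fetch();
        if (!allZero($values)) {
            break;
        }
    }
    return $values;
}
```

Capping the number of attempts matters here, because the page may legitimately show all zeros, in which case an unbounded loop would never finish.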

Related

Right Xpath for HTML elements?

I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
using PHP and XPath to get the value 7 next to the string "CODICE GIALLO".
(NOTE: you may see a different value if you browse the page yourself; that doesn't matter, since it changes dynamically.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[2]/table/tbody/tr[4]/td[2]/table/tbody/tr[2]/td[2]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
This way I get "N.D." as output, not "7" as I expected.
Reading "Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?", I saw that the problem could be the <tbody> tag, so I removed it from my original XPath and tried my code with:
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table/tr[2]/td[2]/b';
but the result is still "N.D." instead of "7".
Using
$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tr[1]/td/table/tr[3]/td[2]/table/tr[4]/td[2]/table';
the result is "Codice GIALLO 7".
How can I get only the "7" value?
Any suggestions or examples?

This one should do the trick:
//td[.="Codice GIALLO"]/following-sibling::td/b
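For what it's worth, here is a minimal sketch of how that expression behaves with DOMXPath. The sample markup is an assumption (the real page layout may differ); only the XPath itself comes from the answer above.

```php
<?php
// Sample markup is an assumption: the real page layout may differ.
$html = '<table><tr><td>Codice ROSSO</td><td><b>2</b></td></tr>'
      . '<tr><td>Codice GIALLO</td><td><b>7</b></td></tr></table>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Find the cell whose text is the label, then the <b> in the cell next to it.
$nodes = $xpath->query('//td[.="Codice GIALLO"]/following-sibling::td/b');

// Fall back to 'N.D.' when the label is not found.
$theValue = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : 'N.D.';
print $theValue . PHP_EOL;
```

Because the query anchors on the label text rather than on the table nesting, it does not depend on how many wrapper tables surround the cell.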

Get right XPath for HTML elements

I need to scrape this HTML page ...
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
using PHP and XPath to get values such as the 0 under the string "CODICE BIANCO".
(NOTE: you may see different values if you browse the page yourself; that doesn't matter, since they change dynamically.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
I've extracted the XPath using both the Chrome and Firefox web consoles.
Any suggestions or examples?

Both Chrome and Firefox most probably "improve" the original HTML by adding <tbody> elements inside <table>, because the original HTML does not contain them. cURL does not do this, and that's why your XPath fails. Try this one instead:
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
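A small sketch of the <tbody> difference, assuming a simplified stand-in for the page (the markup below is made up for illustration). DOMDocument::loadHTML() keeps the table as written, so a path copied from the browser with tbody in it finds nothing, while the same path without tbody matches.

```php
<?php
// Simplified stand-in for the page: a table written without <tbody>.
$html = '<div id="contentint"><table><tr><td><b>0</b></td></tr></table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Path as a browser console would report it (browsers insert <tbody>).
$withTbody = $xpath->query('//*[@id="contentint"]/table/tbody/tr/td/b')->length;
// Same path with the tbody step removed, matching the HTML as served.
$withoutTbody = $xpath->query('//*[@id="contentint"]/table/tr/td/b')->length;

printf("with tbody: %d, without tbody: %d\n", $withTbody, $withoutTbody);
```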

Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth anchoring on something relatively near the data you're looking for. I've only done the XPath, but it basically navigates from the text "CODICE BIANCO" and finds the data relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This can still break when the coders change the page format, but it keeps the dependency on the page structure as local as possible.

curl_exec returns empty string

I'm still a bit new to using curl to pull data and I've recently started using Fiddler to help find what options need to be set.
I'm trying to see if I can pull an image from a site. I first hit a search page, set the search parameters, then start following links in the results. When I try to follow the image link in one of the results, I get an empty string returned from curl_exec().
The weird thing is that at one point it worked: I got the data back and successfully saved the image locally. But then it stopped, and I have no idea what I was doing differently when it worked. Naturally, everything works OK in the browser. :(
I'm using Simple HTML DOM to parse the results and cURL for the actual page requests. curl_error() does not show an error, and curl_getinfo() thinks everything is OK too. It's probably something trivial, but I'm not sure how to troubleshoot it beyond where I am.
<?php
include 'includes/simple_html_dom.php';
$url = "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/Corrections/InmateInquiry.aspx";
// Get Cookie - ASP.NET_SessionId
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$r = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $r, $matches);
$cookies = array();
foreach($matches[1] as $item)
{
parse_str($item, $cookie);
$cookies = array_merge($cookies, $cookie);
}
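// parse_str() turns the "." in the cookie name into "_", so read the value
// with the underscore key but send the cookie under its real name
$sessionCookie = "ASP.NET_SessionId=".$cookies['ASP_NET_SessionId'];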
// now load up page into Simple HTML DOM and get all inputs - ignore buttons and populate our dates
$startDate = "02%2F01%2F2000";
$endDate = "02%2F07%2F2016";
$getInputs = str_get_html($r);
$inputs = $getInputs->find('input');
$inputs_array = array();
$buttons_array = array();
for ($i=0; $i<count($inputs); $i++)
{
if ($inputs[$i]->type != "submit")
{
$inputs_array[$inputs[$i]->id] = $inputs[$i]->value;
if (stripos($inputs[$i]->id, "FromDate") > 0)
$inputs_array[$inputs[$i]->id] = $startDate;
if (stripos($inputs[$i]->id, "ToDate") > 0)
$inputs_array[$inputs[$i]->id] = $endDate;
}
}
// build up our curl data - includes hidden inputs, our to & from dates, plus the Search button
$curl_data = http_build_query($inputs_array)."&ctl00%24DefaultContent%24uxSearch=Search";
// POST the data, include session cookie
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $curl_data);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
$response = curl_exec($ch);
// this shows that we can get data
// find the links from the HTML
$htmlDom = str_get_html($response); // load up Simple HTML DOM
// get the table of results
$divTable = $htmlDom->find('div#ctl00_DefaultContent_uxResultsWrapper',0)->find('table',0);
$rows = $divTable->find('tr');
for ($i=1; $i<count($rows);$i++)
{
if ($i>3) break; // limit the length of script for debugging
$link = $rows[$i]->find('td',1)->find('a',0)->href;
// build up query to get inmate details from the link above
$url = "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/Corrections/".$link;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
$page = curl_exec($ch);
$pageData = str_get_html($page);
// Now find the Photo, there's a thumb in div.BookingPhotos
// It is linked to a full size image, the link is of the form http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/GetImage.aspx?ImageKey=17C030IS, but in the href, it has ../GetImage.aspx?ImageKey=xxxx
$photoLink = $pageData->find('div.BookingPhotos',0)->find('a',0)->href;
// get rid of .. and put the base URL on the front
$imgLink = str_replace("..", "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal", $photoLink);
// now attempt to pull the image
$ch = curl_init($imgLink);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
// here is the PROBLEM - NO DATA RETURNED
$imgData = curl_exec($ch); // I get a header back, but NO data
}
?>

how do you skip errored lines in xml when using php?

I'm making an anime search using PHP and the MyAnimeList API. The issue I'm having is that every once in a while I search for something and it comes up with a bunch of XML errors. That would be fine, but it then doesn't display the information. Here is the code:
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
$search = $_GET['q'];
$username = '';
$password = '';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://myanimelist.net/api/anime/search.xml?q=$search");
curl_setopt($ch, CURLOPT_USERPWD,$username . ":" . $password);
curl_setopt($ch, CURLOPT_USERAGENT, 'Magic Browser'); // user agent string belongs in CURLOPT_USERAGENT, not CURLOPT_HEADER
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_TIMEOUT, 10 );
$data = curl_exec($ch);
$xml = simplexml_load_string($data);
$image = $xml->entry[0]->image;
$title = $xml->entry[0]->title;
$status = $xml->entry[0]->status;
$synopsis = $xml->entry[0]->synopsis;
echo "$image <br><br><b>Title</b>: $title <br> <b>Status</b>: $status <br><b>Synopsis</b>: $synopsis";
?>
EDIT FIXED
<?php
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
ini_set('display_errors', 1);
$search = $_GET['q'];
$username = '';
$password = '';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://myanimelist.net/api/anime/search.xml?q=$search");
curl_setopt($ch, CURLOPT_USERPWD,$username . ":" . $password);
curl_setopt($ch, CURLOPT_USERAGENT, 'Magic Browser'); // user agent string belongs in CURLOPT_USERAGENT, not CURLOPT_HEADER
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10 );
$data = curl_exec($ch);
//changed the encoding I don't know if it helped
$data = str_replace('utf-8', 'iso-8859-1', $data);
//replaced the &mdash; entity (not defined in plain XML) so now it works
$data = str_replace('&mdash;', ' ', $data);
$xml = simplexml_load_string($data);
$image = $xml->entry[0]->image;
$title = $xml->entry[0]->title;
$status = $xml->entry[0]->status;
$synopsis = $xml->entry[0]->synopsis;
echo "$image <br><br><b>Title</b>: $title <br> <b>Status</b>: $status <br><b>Synopsis</b>: $synopsis";
?>
An example of what I mean is here: http://vs3.yuribot.com/mal.php?q=naruto
It took a while, but now it's fixed. I commented the places that helped fix it. Thank you everyone for your help.

You should check whether simplexml_load_string() returned false, which it does when the call failed (for example, the remote server was not available), and whether the XML contains some sort of error message. If so, skip the info and display the error instead. I can't tell you how errors are reported in the XML, but there is most likely a tag or something similar.
Edit: I just saw in the PHP docs that simplexml_load_string() requires a well-formed XML string, so check that what you received really is well-formed XML. It's best to look at the docs and adapt the code to what you need.
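A minimal sketch of that check, assuming you want the parser messages rather than PHP warnings. The function name parseXml() and the sample XML are made up for illustration; libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are the standard PHP functions for collecting parse errors.

```php
<?php
// Parses $data and returns [SimpleXMLElement|null, string[] errorMessages].
function parseXml(string $data): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($data);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    return [$xml === false ? null : $xml, $errors];
}

// Usage with a made-up response body.
[$xml, $errors] = parseXml('<anime><entry><title>Naruto</title></entry></anime>');
echo $xml !== null ? $xml->entry[0]->title : implode("\n", $errors), PHP_EOL;
```

With this shape the caller decides what to show: the parsed entry when parsing succeeded, or the collected error lines when it did not.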

If you are concerned with the PHP "Notice" or "Warning" messages that sometimes appear in your example, you can prefix the relevant function call (in this case, simplexml_load_string()) with the "@" symbol to suppress them.
For example:
$xml = @simplexml_load_string($data);
For more information, see the PHP manual's page on error control operators.

getting parent element's index in xml and php?

I am trying to get the parent element of an XML tag. Basically, I need to go through multiple <HotelRoomResponse> results and find the one that contains a child tag with this exact value: <roomTypeCode>17918</roomTypeCode>. I am not sure how to do this or what the best way would be, because I then need to get ALL the information in that specific <HotelRoomResponse>. Here is an example XML response:
<HotelRoomResponse>
<cancellationPolicy> </cancellationPolicy>
<rateCode>200482409</rateCode>
<roomTypeCode>17918</roomTypeCode>
<rateDescription>
Deluxe Sunset View - All Inclusive-Up to $300Resort Credit
</rateDescription>
<roomTypeDescription>
Deluxe Sunset View - All Inclusive-Up to $300Resort Credit
</roomTypeDescription>
<supplierType>E</supplierType>
</HotelRoomResponse>
There are several results of this type, and I need to loop through them and find this specific one.
Here is how I am connecting to the XML:
$ch = curl_init();
$fp = fopen('room_request.xml','w');
curl_setopt($ch, CURLOPT_URL, "http://api.ean.com/ean-services/rs/hotel/v3/avail?cid=55505&minorRev=13&apiKey=4sr8d8bsn75tpcuja6ypx5g3&locale=en_US&currencyCode=USD&customerIpAddress=10.184.2.9&customerUserAgent=Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/535.11+(KHTML,+like+Gecko)+Chrome/17.0.963.79+Safari/535.11&customerSessionId=&xml=<HotelRoomAvailabilityRequest><hotelId>".$hid."</hotelId><arrivalDate>05/14/2012</arrivalDate><departureDate>05/18/2012</departureDate><RoomGroup><Room><numberOfAdults>3</numberOfAdults><numberOfChildren>0</numberOfChildren><childAges>0</childAges></Room></RoomGroup><includeDetails>true</includeDetails></HotelRoomAvailabilityRequest>");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/xml'));
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FILE, $fp);
$val = curl_exec($ch);
curl_close($ch);//Close curl session
fclose($fp); //Close file overwrite
$avail = simplexml_load_file('room_request.xml');
Any ideas are welcome.

To find all HotelRoomResponse nodes that have a roomTypeCode child node with the value '17918', use the following:
$match = $avail->xpath("/HotelRoomResponse[child::roomTypeCode[text() = '17918']]");
EDIT: $match will be an array holding all matches.

Ok figured it out!!!! Here is what I used to get a node with text = value. Then I got all sibling elements.
// load as file
$contents = new SimpleXMLElement($source,null,true);
$result = $contents->xpath('HotelRoomResponse[roomTypeCode="17918"]');
foreach($result as $key=>$node)
{
$cancelPolicy = $node->cancellationPolicy;
}

// Third argument true: treat the first argument as a path, not as XML data.
$xml = new SimpleXMLElement('room_request.xml', 0, true);
/* Search for <HotelRoomResponse><roomTypeCode> */
$result = $xml->xpath('/HotelRoomResponse/roomTypeCode');
The result is a list of nodes you can then check, getting the parent node where appropriate.
See http://www.php.net/manual/en/simplexmlelement.xpath.php
Edit 2:
The key was the namespace.
<?php
$ch = curl_init();
$fp = fopen('room_request.xml','w');
curl_setopt($ch, CURLOPT_URL, "http://travellinginmexico.com/test/room_request.xml");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/xml'));
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FILE, $fp);
$val = curl_exec($ch);
curl_close($ch);//Close curl session
fclose($fp); //Close file overwrite
$xml = new SimpleXMLElement(file_get_contents('room_request.xml'));
/* Search for <HotelRoomResponse><roomTypeCode> */
$xml->registerXPathNamespace('ns2', 'http://v3.hotel.wsapi.ean.com/');
$result = $xml->xpath("HotelRoomResponse[child::roomTypeCode[text() = '153725']]");
foreach($result as $obj=>$node)
{
var_dump($node->roomTypeCode);
}
With the example you sent, this gets the specific information.
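To tie the answers together, here is a self-contained sketch that selects the matching <HotelRoomResponse> and then reads everything inside it. The sample XML, including the root element name, is an assumption modelled on the snippet in the question; the real response may be shaped differently.

```php
<?php
// Made-up response: a namespaced root with unqualified child elements.
$source = <<<XML
<ns2:HotelRoomAvailabilityResponse xmlns:ns2="http://v3.hotel.wsapi.ean.com/">
  <HotelRoomResponse>
    <rateCode>200482409</rateCode>
    <roomTypeCode>17918</roomTypeCode>
    <supplierType>E</supplierType>
  </HotelRoomResponse>
  <HotelRoomResponse>
    <rateCode>200482410</rateCode>
    <roomTypeCode>153725</roomTypeCode>
    <supplierType>E</supplierType>
  </HotelRoomResponse>
</ns2:HotelRoomAvailabilityResponse>
XML;

$xml = new SimpleXMLElement($source);

// Select the parent element directly by the value of its child.
$matches = $xml->xpath('//HotelRoomResponse[roomTypeCode="17918"]');

foreach ($matches as $room) {
    // Every child of the matched element, as name => value pairs.
    foreach ($room->children() as $name => $value) {
        echo $name, ': ', trim((string) $value), PHP_EOL;
    }
}
```

Putting the condition in a predicate returns the <HotelRoomResponse> element itself, so there is no need to find <roomTypeCode> first and then walk back up to its parent.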
