I am trying to learn how to use curl, but I do not fully understand how it works yet. How can I use curl (or other functions) to access only one data entry of a table (the top one)? So far I am only able to retrieve the entire website. How can I echo only the table, and specifically its first entry? My code is:
<?php
$ch = curl_init("http://www.w3schools.com/html/html_tables.asp");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
Using curl is a good start, but it's not going to be enough. As hanky suggested, you also need to use DOMDocument, and you can include DOMXPath as well.
Sample Code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.w3schools.com/html/html_tables.asp');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
libxml_use_internal_errors(true);
$html = curl_exec($ch); // the whole document (in string) goes in here
$dom = new DOMDocument();
$dom->loadHTML($html); // load it
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// point it to the particular table
// table with a class named 'reference', second row (first data), get the td
$table_row = $xpath->query('//table[@class="reference"]/tr[2]/td');
foreach($table_row as $td) {
echo $td->nodeValue . ' ';
}
Should output:
Jill Smith 50
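To try the XPath step without a network request, the same query can be run against an inline HTML string. This is a minimal sketch: the markup below is a made-up stand-in for the remote table (the real page may differ), and firstDataRow is a hypothetical helper name.

```php
<?php
// Hypothetical sample markup standing in for the remote page.
$html = <<<'HTML'
<html><body>
<table class="reference">
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
</body></html>
HTML;

// Hypothetical helper: returns the cell texts of the first data row (the second <tr>).
function firstDataRow(string $html): array
{
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $cells = [];
    foreach ($xpath->query('//table[@class="reference"]/tr[2]/td') as $td) {
        $cells[] = trim($td->nodeValue);
    }
    return $cells;
}

echo implode(' ', firstDataRow($html)), PHP_EOL; // prints "Jill Smith 50"
```

Returning an array rather than echoing inside the loop makes it easier to pick out a single cell, for example firstDataRow($html)[0] for the first entry only.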
Using Xpath in a PHP (v8.1) environment, I am trying to fetch all IMG tags from a dummy website:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.someurl.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);
$images = $xpath->evaluate("//img");
echo serialize($images); //gives me: O:11:"DOMNodeList":0:{}
echo $doc->saveHTML(); //outputs entire website in series with wiped <HTML>,<HEAD>,<BODY> tags
I don't understand why I don't get any results for whatever tags I am trying to address with XPath (in this case all img tags, but I've tried a bunch of variations!).
The second issue I am having is that, when looking at the output of the second echo statement (which outputs the entire grabbed HTML), I realize that the HTML page is not complete. What I am getting is everything except the <HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> tags (but the actual contents still exist!), as if everything was appended in series. Is it supposed to be this way?
I'm trying to fetch a product name and price on this website Toplivo.bg
I am using the Simple HTML DOM parser to get it. Here is my code
<?php
include_once('simple_html_dom.php');
$link="https://toplivo.bg/en/products/Construction-materials/Dry-construction-mixtures/Screeds-and-flooring";
$html = file_get_html($link);
//Price
foreach ($html->find('div[class="content"]') as $text){
echo $text -> plaintext.'<br>';
}
?>
The problem is that first, I need to select the warehouse on the website to get the price for "Baumit Cement screed Baumit Solido E160, 25 kg".
Can I select it by default through PHP code? For example, I want to select the "Plovdiv region -> Plovdiv Store"
Thanks for helping!
This can be achieved using cURL. Complete code below:
<?php
include_once('simple_html_dom.php');
$link = "https://toplivo.bg/en/products/Construction-materials/Dry-construction-mixtures/Screeds-and-flooring";
// let's use curl to create a get request first to select a store while keeping the session using a cookie file
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://toplivo.bg/izborNaSklad/39');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie-45fg.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie-45fg.txt');
$output = curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, $link); // now let's fetch the raw content of the store products page
$output = curl_exec($ch);
$html = str_get_html($output); // since we have the raw input, we can use the str_get_html method instead of file_get_html
//Price
foreach ($html->find('div[class="content"]') as $text){
echo $text->plaintext . '<br>';
}
?>
I need to scrape this HTML page:
http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3
using PHP and XPath, to get values such as the 0 under the string "CODICE BIANCO".
(NOTE: you may see different values on that page if you browse it; that doesn't matter, they change dynamically.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
include "./tmp/vendor/autoload.php";
$url = 'http://www1.usl3.toscana.it/default.asp?page=ps&ospedale=3';
//$xpath_for_parsing = '/html/body/div/div[2]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tbody/tr[1]/td/table/tbody/tr[3]/td[1]/table/tbody/tr[11]/td[3]/b';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
I've extracted the xpath using both the Chrome and Firefox web consoles ...
Suggestions / examples?
Both Chrome and Firefox most probably improve the original HTML by adding <tbody> elements inside <table>, because the original HTML does not contain them. cURL returns the raw markup and DOMDocument does not add them either, which is why your XPath fails. Try this one instead:
$xpath_for_parsing = '//*[@id="contentint"]/table[2]/tr[1]/td/table/tr[3]/td[1]/table/tr[11]/td[3]/b';
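The <tbody> difference can be checked locally. The following is a small sketch on made-up markup (not the real page) that runs the same path once with the browser-style tbody step and once without it:

```php
<?php
// Hypothetical markup: note there is no <tbody> in the source.
$html = '<div id="contentint"><table><tr><td><b>0</b></td></tr></table></div>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Path as it would be copied from a browser console, with the browser-inserted <tbody>.
$withTbody = $xpath->query('//*[@id="contentint"]/table/tbody/tr/td/b');
// Same path with the <tbody> step removed.
$withoutTbody = $xpath->query('//*[@id="contentint"]/table/tr/td/b');

echo 'with tbody: ', $withTbody->length, PHP_EOL;       // prints "with tbody: 0"
echo 'without tbody: ', $withoutTbody->length, PHP_EOL; // prints "without tbody: 1"
```

If a page might or might not contain <tbody> in its source, writing the step as table//tr matches both cases, at the cost of also matching rows of nested tables.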
Rather than relying on what is potentially quite a fragile hierarchy (which we all find ourselves building at times), it may be worth looking for something relatively near the data you're looking for. I've just done the XPath, but it basically navigates from the text "CODICE BIANCO" and finds the data relative to that string.
$xpath_for_parsing = '//*[text()="CODICE BIANCO"]/../../following-sibling::tr[1]//descendant::b[2]';
This is still breakable when the coders change the page format, but it tries to localise the code as much as possible.
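To illustrate the idea of anchoring on a label, here is a sketch on invented markup. The table layout, the helper name valueAfterLabel and the simplified expression (nearest row of the label, then the next row) are assumptions for the example; they do not reproduce the real page or the exact expression above.

```php
<?php
// Hypothetical markup: each label row is followed by a row holding its value.
$html = <<<'HTML'
<table>
  <tr><td><b>CODICE ROSSO</b></td></tr>
  <tr><td><b>2</b></td></tr>
  <tr><td><b>CODICE BIANCO</b></td></tr>
  <tr><td><b>0</b></td></tr>
</table>
HTML;

// Hypothetical helper: finds the <b> holding $label, then reads the first <b>
// in the next table row. Returns null when the label is not found.
function valueAfterLabel(string $html, string $label): ?string
{
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Assumes $label contains no double quotes, since it is embedded in the expression.
    $query = sprintf(
        '//b[normalize-space()="%s"]/ancestor::tr[1]/following-sibling::tr[1]//b',
        $label
    );
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

echo valueAfterLabel($html, 'CODICE BIANCO'), PHP_EOL; // prints "0"
```

normalize-space() is used here instead of text() so that stray whitespace around the label does not break the match.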
I am trying to use cURL to grab an XML file associated with this URL, and then parse the XML file using DOMXPath.
There are no errors in the output at this point; it just does not display anything. I tried to catch some errors but was unable to figure it out. Any direction would be amazing.
<?php
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
function tideTime() {
$ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
$fp = fopen("8721138.xml", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
$dom = new DOMDocument();
@$dom->loadHTML($ch);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//time");
$arr = array();
foreach ($entries as $entry) {
$tide = $entry->nodeValue;
}
echo $tide;
}
?>
You're trying to load the cURL resource handle as the DOM, which it is not. The curl functions either output directly or return the output as a string.
$ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
curl_setopt($ch, CURLOPT_HEADER, 0);
// return the response body as a string instead of printing it or writing it to a file
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);
// optional: keep a copy of the response on disk
file_put_contents("8721138.xml", $data);
$dom = new DomDocument();
$dom->loadHTML($data);
// the rest of the code
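Assuming $data holds the response body as a string (which requires CURLOPT_RETURNTRANSFER to be enabled) and that the body is well-formed XML, "the rest of the code" might look like the sketch below. The sample document is invented for illustration, since the real feed's structure is not shown here; timesFromXml is a hypothetical helper name, and loadXML is swapped in for loadHTML because the input is assumed to be XML.

```php
<?php
// Hypothetical XML standing in for the downloaded response.
$xml = <<<'XML'
<datainfo>
  <data>
    <item><date>2011/12/09</date><time>03:12 AM</time></item>
    <item><date>2011/12/09</date><time>09:27 AM</time></item>
  </data>
</datainfo>
XML;

// Hypothetical helper: collects the text of every <time> element.
function timesFromXml(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml); // XML parser rather than the HTML one

    $xpath = new DOMXPath($dom);
    $times = [];
    foreach ($xpath->evaluate('//time') as $entry) {
        $times[] = trim($entry->nodeValue);
    }
    return $times;
}

echo implode(', ', timesFromXml($xml)), PHP_EOL; // prints "03:12 AM, 09:27 AM"
```

Collecting the values in an array keeps all entries; in the question's loop each iteration overwrites $tide, so only the last value would be echoed, and tideTime() also has to be called for anything to be printed.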
It seems you are querying an XPath that does not match anything. Make sure the XML file actually contains elements matching ("//time"). Are you sure that what you grab is an XML file, or did you just save it into an .xml file?
If we look at that page, the XML seems to be generated by JavaScript. Look at http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138&text=datafiles%2F8721138%2F09122011%2F877%2F&imagename=images/8721138/09122011/877/8721138_2011-12-10.gif&bdate=20111209&timelength=daily&timeZone=2&dataUnits=1&interval=&edate=20111210&StationName=Ponce Inlet, Halifax River&Stationid_=8721138&state=FL&primary=Subordinate&datum=MLLW&timeUnits=2&ReferenceStationName=GOVERNMENT CUT, MIAMI HARBOR ENTRANCE&HeightOffsetLow=*1.00&HeightOffsetHigh=* 1.18&TimeOffsetLow=33&TimeOffsetHigh=5&pageview=dayly&print_download=true&Threshold=&thresholdvalue=
Maybe you can grab that.
How do I get all the links in a webpage using PHP?
I need to get a list of the links, e.g.:
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google)
The situation is:
I'm building a crawler and I want it to get all the links that exist in a database table.
There are a couple of ways to do this, but I would approach it as follows.
Use cURL to fetch the page, i.e.:
// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html);
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$link = $href->getAttribute('href');
$text = $href->nodeValue;
// Do what you want with the link, print it out:
echo $text , ' -> ' , $link;
// Or save this in an array for later processing..
$links[$i]['href'] = $link;
$links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
This should pretty much do it for you.
The only part I am not 100% sure of is what happens when the link wraps an image rather than text. I have no idea, so you would need to test and filter those out.
Hope this gives you an idea of how to scrape links. Happy coding.
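The link extraction, including the image case, can be tested on an inline string. This is a sketch under assumptions: the markup is made up, extractLinks is a hypothetical helper, and the hasImage flag is just one possible way to mark anchors that wrap an <img>.

```php
<?php
// Hypothetical markup: a text link, an image link, and an anchor without href.
$html = <<<'HTML'
<html><body>
<a href="http://www.google.com">Google</a>
<a href="/logo"><img src="logo.png" alt="Logo"></a>
<a name="top">No href here</a>
</body></html>
HTML;

// Hypothetical helper: returns href, text and an image flag for every <a> with an href.
function extractLinks(string $html): array
{
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href'     => $a->getAttribute('href'),
            'text'     => trim($a->nodeValue), // empty string for image-only links
            'hasImage' => $xpath->query('.//img', $a)->length > 0,
        ];
    }
    return $links;
}

foreach (extractLinks($html) as $link) {
    echo $link['text'], ' -> ', $link['href'], PHP_EOL;
}
```

On this sample the image-only anchor yields an empty text, so filtering on hasImage or on an empty text value is enough to separate it from ordinary links.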