I just learned what scraping and cURL are a few hours ago, and I have been playing with them since then. Nevertheless, I am now facing something strange. The code below works fine with some sites and not with others (of course I modified the URL and the XPath...). Note that no error is raised when I test whether curl_exec executed properly, so the problem must come from somewhere after that. My questions are as follows:
How can I check if the new DOMDocument has been created properly: if(??)
How can I check if the new DOMDocument has been populated properly with the HTML?
...if a new DOMXPath object has been created?
Hope I was clear. Thank you in advance for your replies. Cheers. Marc
My PHP:
<?php
$target_url = "http://www.somesite.com";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings caused by invalid HTML (# would comment the call out entirely)
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link: $url";
}
?>
Use a try/catch to check whether the document object was created, then check the return value of loadHTML() to determine whether the HTML was loaded into the document. You can wrap the creation of the DOMXPath object in a try/catch as well.
try
{
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    if($loaded)
    {
        // loaded OK
    }
    else
    {
        // could not load HTML
    }
}
catch(Exception $e)
{
    // document could not be created, see $e->getMessage()
}
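For the XPath side, a similar sketch (assuming $dom already holds the loaded document; the '//a' expression here is just a placeholder for your own path):

try
{
    $xpath = new DOMXPath($dom);
    // query() returns false for a malformed expression, or an empty list if nothing matched
    $result = $xpath->query('//a');
    if ($result === false || $result->length === 0)
    {
        // invalid expression or no matches - re-check the XPath
    }
}
catch(Exception $e)
{
    // XPath object could not be created, see $e->getMessage()
}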
Problem solved. The error came from Firebug, which gave me a wrong XPath. Big thanks to MrCode for his support...
I want to get the img src values on a page. For example, if the page is
https://www.google.com
then the result should be something like:
https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
https://www.google.com/ff.png
https://www.google.com/idk.jpg
I want something like this!
Thanks
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "https://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <img> tags
foreach($dom->getElementsByTagName('img') as $link) {
    # Show the src attribute
    echo $link->getAttribute('src');
    echo "<br />";
}
?>
Here it is
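One thing to note: getAttribute('src') returns whatever is in the markup, which may be a relative path. If you want absolute URLs like the ones in the example output, a crude sketch of one way to resolve them (reusing the $url base from the code above; this does not handle protocol-relative paths) could be:

foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // prepend the base URL to paths that are not already absolute
    if (strpos($src, 'http://') !== 0 && strpos($src, 'https://') !== 0) {
        $src = rtrim($url, '/') . '/' . ltrim($src, '/');
    }
    echo $src . "<br />";
}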
I have searched around on Stack Overflow and the web and must be missing something here. I have not found exactly what I am looking for; maybe it's called something else. The code below grabs everything fine in the first folder, but will not grab items from other folders. For example, it grabs everything in front of the first /, but if the site has mysite.com/folder2/ it will not grab folder2. Everything is linked, and it does travel backwards too: if you put in the longest link on the site it will go all the way back to the front of the site. I am not sure what I am missing; any pointers would be great. The site is a Joomla site that I am trying to scrape.
<?php
function storelink($web, $taken) {
    $query = "INSERT INTO scanned (web, taken) VALUES ('$web', '$taken')";
    mysql_query($query) or die('Error, insert query failed');
}
$target_web = "mysite.com";
$userAgent = 'bobsbot(http://www.somebot.com/bot.html)';
// make the cURL request to $target_web
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_web);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $web = $href->getAttribute('href');
    storeLink($web, $target_web);
    echo "<br />Link saved: $web";
}
?>
If I understand you correctly, you want to spider a site and save all URLs. This means you need to recurse when you encounter a URL.
The function you use to start the spider is called saveLink($web, $taken). The function you call when encountering a link is storeLink($web, $target_web). Shouldn't that be saveLink($web, $target_web)?
saveLink() should be recursive and also execute the cURL request. The cURL URL should be set to the link encountered. This way, it will parse the DOM of all links encountered and follow all links in them.
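A rough sketch of that recursive approach (not from the original post: the fetchPage() helper and the $visited array are illustrative names, and the INSERT is the same one used by storelink() above):

$visited = array();

function fetchPage($url, $userAgent) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

function saveLink($web, $taken) {
    global $visited;
    if (isset($visited[$web])) {
        return; // already spidered this URL
    }
    $visited[$web] = true;

    // store the link (same INSERT as storelink() above)
    $query = "INSERT INTO scanned (web, taken) VALUES ('$web', '$taken')";
    mysql_query($query) or die('Error, insert query failed');

    // fetch the page behind this link and follow its links recursively
    $html = fetchPage($web, 'bobsbot(http://www.somebot.com/bot.html)');
    if (!$html) {
        return;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        saveLink($hrefs->item($i)->getAttribute('href'), $web);
    }
}

Note this sketch does not resolve relative links or restrict the crawl to the target domain; you would want to add both before letting it loose.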
I am trying to use cURL to grab an XML file associated with this URL, and then parse the XML file using DOMXPath.
There are no output errors at this point; it just does not display anything. I tried to catch some errors but was unable to figure it out. Any direction would be amazing.
<?php
if (!function_exists('curl_init')) {
    die('Sorry cURL is not installed!');
}

function tideTime() {
    $ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
    $fp = fopen("8721138.xml", "w");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);

    $dom = new DOMDocument();
    #$dom->loadHTML($ch);
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//time");
    $arr = array();
    foreach ($entries as $entry) {
        $tide = $entry->nodeValue;
    }
    echo $tide;
}
?>
You're trying to load the cURL resource handle as the DOM, which it is not. The cURL functions either output directly or return the output as a string.
$ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string
$data = curl_exec($ch);
curl_close($ch);
// keep the local copy the original code was writing via CURLOPT_FILE
file_put_contents("8721138.xml", $data);

$dom = new DomDocument();
$dom->loadHTML($data);
// the rest of the code
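For completeness, "the rest of the code" from the question would then run against this $dom, along these lines (since the feed is XML, loadXML($data) may be a better fit than loadHTML(), but either lets the query run if the element is actually present):

$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//time");
foreach ($entries as $entry) {
    echo $entry->nodeValue . "<br />";
}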
It seems you are trying to query an XPath that is not available. Make sure the XML file actually contains elements matching ("//time"). Are you sure that what you grab is an XML file, or did you just save it with an .xml extension?
If we look at that page, it seems the XML is generated by JavaScript. Look at http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138&text=datafiles%2F8721138%2F09122011%2F877%2F&imagename=images/8721138/09122011/877/8721138_2011-12-10.gif&bdate=20111209&timelength=daily&timeZone=2&dataUnits=1&interval=&edate=20111210&StationName=Ponce Inlet, Halifax River&Stationid_=8721138&state=FL&primary=Subordinate&datum=MLLW&timeUnits=2&ReferenceStationName=GOVERNMENT CUT, MIAMI HARBOR ENTRANCE&HeightOffsetLow=*1.00&HeightOffsetHigh=* 1.18&TimeOffsetLow=33&TimeOffsetHigh=5&pageview=dayly&print_download=true&Threshold=&thresholdvalue=
Maybe you can grab that instead.
How do I get all the links in a webpage using PHP?
I need to get a list of the links. For example, given a link like:
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google).
The situation is: I'm building a crawler and I want it to get all the links that exist into a database table.
There are a couple of ways to do this, but the way I would approach it is something like the following.
Use cURL to fetch the page, i.e.:
// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings caused by invalid HTML
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;
    // Do what you want with the link, print it out:
    echo $text , ' -> ' , $link;
    // Or save this in an array for later processing..
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
This should pretty much do it for you.
The only part I am not 100% sure of is what happens when the link wraps an image rather than plain text; you would need to test those cases and filter them out.
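As a rough guess at that filtering (not from the original answer), you could skip anchors that contain an <img> child or have no visible text, something like:

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    // skip anchors that wrap an image or have no visible text
    if ($href->getElementsByTagName('img')->length > 0 || trim($href->nodeValue) === '') {
        continue;
    }
    echo trim($href->nodeValue) , ' -> ' , $href->getAttribute('href');
}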
Hope this gives you an idea of how to scrape links, happy coding.
Why does this code produce empty output on my hosting, but work well on my local machine?
$raw = file_get_contents($rssURL);
$xml = new SimpleXmlElement($raw);
echo "<b>RSS Items:</b><br /><br />";
foreach($xml->channel->item as $item) {
    echo $item->title."<br />";
}
libxml version: 2.6.32; libxml2 version: 2.6.32
I also tried this code:
# INSTANTIATE CURL.
$curl = curl_init();
# CURL SETTINGS.
curl_setopt($curl, CURLOPT_URL, $rssURL);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_VERBOSE, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
# GRAB THE XML FILE.
$xml = curl_exec($curl);
curl_close($curl);
# SET UP XML OBJECT.
$xmlObj = simplexml_load_string($xml);
echo "<b>RSS Items:</b><br /><br />";
foreach($xmlObj->channel->item as $item) {
    echo $item->title."<br />";
}
echo "<br /><b>var_dump:</b><br><br>";
var_dump(libxml_get_errors());
The result was array(0) { }
Are there any differences between coding this snippet for Windows and Linux (I don't think so)?
Any ideas?
Start with http://www.php.net/manual/en/function.libxml-get-errors.php and find out what errors simplexml_load_string() is throwing since it returns false on error.
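A minimal sketch of how to surface those errors: note that libxml_get_errors() only collects anything if internal error handling has been switched on first, which may also be why your var_dump showed array(0) { }.

libxml_use_internal_errors(true); // collect libxml errors instead of emitting warnings
$xmlObj = simplexml_load_string($xml);
if ($xmlObj === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), " (line ", $error->line, ")<br />";
    }
    libxml_clear_errors();
}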
Also, your provider might not let you make outside calls from your software, just a thought.
I use GoDaddy and I have to put in a proxy to make outbound calls:
curl_setopt ($curl,CURLOPT_PROXY,'http://proxy.shr.secureserver.net:3128');