I have searched around on Stack Overflow and the web and must be missing something here; I have not found exactly what I am looking for, so maybe it is called something else. The code below grabs everything in the first folder fine, but it will not grab items from other folders. For example, it grabs everything in front of the first /, but if the site has mysite.com/folder2/ it never reaches folder2, even though everything is linked. It does travel backwards, though: if you start it at the deepest link of the site, it works its way back to the front page. I am not sure what I am missing; any pointers would be great. The site is a Joomla site that I am trying to scrape.
<?php
function storeLink($web, $taken) {
    // NOTE: $web and $taken should be escaped (e.g. with mysql_real_escape_string) before use
    $query = "INSERT INTO scanned (web, taken) VALUES ('$web', '$taken')";
    mysql_query($query) or die('Error, insert query failed');
}
$target_web = "mysite.com";
$userAgent = 'bobsbot(http://www.somebot.com/bot.html)';
// make the cURL request to $target_web
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_web);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses parse warnings from malformed HTML
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $web = $href->getAttribute('href');
    storeLink($web, $target_web);
    echo "<br />Link saved: $web";
}
?>
If I understand you correctly, you want to spider a site and save all URLs. This means you need to recurse when you encounter a URL.
The function you use to start the spider is called saveLink($web, $taken). The function you call when encountering a link is storeLink($web, $target_web). Shouldn't that be saveLink($web, $target_web)?
saveLink() should be recursive and also execute the cURL request. The cURL URL should be set to the link encountered. This way, it will parse the DOM of all links encountered and follow all links in them.
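A rough sketch of that idea is below. It is not a drop-in fix: the visited list, the depth limit, and the naive relative-link handling are my own assumptions, and it reuses the storeLink() helper and user agent from the question.
// Recursive spider sketch: fetch a page, save every on-site link, then follow it.
function spider($url, $base, &$visited, $depth = 0, $maxDepth = 3) {
    if ($depth > $maxDepth || isset($visited[$url])) {
        return; // already seen or too deep
    }
    $visited[$url] = true;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'bobsbot(http://www.somebot.com/bot.html)');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);
    if (!$html) {
        return;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    for ($i = 0; $i < $hrefs->length; $i++) {
        $link = $hrefs->item($i)->getAttribute('href');
        // Naive handling of relative links such as /folder2/index.html
        if (strpos($link, 'http') !== 0) {
            $link = rtrim($base, '/') . '/' . ltrim($link, '/');
        }
        // Only save and follow links that stay on the target site
        if (strpos($link, $base) !== 0) {
            continue;
        }
        storeLink($link, $base);
        spider($link, $base, $visited, $depth + 1, $maxDepth);
    }
}

$visited = array();
spider('http://mysite.com', 'http://mysite.com', $visited);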
I have been looking around for a while trying to make this work, but it seems I can't do it by myself. I am using cURL to get some information from a website and store it in a MySQL database. What I have right now is the following code:
$target_url = "[http:\[//\]iliria98\[.\]com][1]"; //delete [ and ] to get the url correctly
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 100);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$document = new DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
$selector = new DOMXPath($document);
//$anchors = $selector->query('//div[@class="single"]/div[2]');
$anchors = $selector->query('//div[@class="single"]/div');
foreach ($anchors as $div) {
    $text = $div->nodeValue;
    $valuta_arr = explode(',', $text);
    var_dump($valuta_arr);
    echo $text;
}
And the output is not correct: it gets all the currency codes from the website, but the currency values are only from the first row, the USD one.
What I want is to get the values from the HTML table at the URL specified and insert them into the database for every currency. The database table has these columns: id, currency, sell, buy, date.
I didn't get to the MySQL insert code yet, since I have been struggling for three days just to get the information from that website.
Hope that someone can help me on this.
Thank you to everyone.
If you try to get this page from the console with curl http://iliria98.com, you will find that the widget is filled in by a JS script:
$('div#usd1').append('<div style="position: absolute; background: transparent; width: 100%; height: 100%; left: 0; top: 0; z-index: 9999;"></div>')
$(".kursiweb .single").eq(0).find("div").eq(1).html("114<sup>.20</sup>"); $(".kursiweb .single").eq(0).find("div").eq(2).html("116");
and so on.
So you can only get the data you need from this script in the source HTML returned by curl, not from the DOM document, simply because curl has no JS engine.
Another way to go is to use something like PhantomJS.
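As a rough illustration of the first option, here is a sketch that pulls the injected values out of those .html("...") calls with a regular expression, reusing $html from the question's cURL call. The pattern and the meaning of the indexes are assumptions based on the snippet above, so check them against the real page source:
// Sketch only: extract the values the widget script injects, e.g.
// $(".kursiweb .single").eq(0).find("div").eq(1).html("114<sup>.20</sup>");
preg_match_all(
    '/\.eq\((\d+)\)\.find\("div"\)\.eq\((\d+)\)\.html\("([^"]+)"\)/',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    $row   = (int) $m[1];       // which .single block, i.e. which currency row (assumed)
    $col   = (int) $m[2];       // which inner div, i.e. buy vs sell column (assumed)
    $value = strip_tags($m[3]); // "114<sup>.20</sup>" becomes "114.20"
    echo "row $row, col $col: $value\n";
}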
I am trying to write a tool in PHP that detects whether a remote website uses Flash. So far I have written a script that detects whether embed or object tags exist, which is an indicator that Flash may be in use, but some sites obfuscate their code, which renders this check useless.
include_once('simple_html_dom.php');
$flashTotalCount = 0;
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$url = "http://www.example.com"; // site to check (placeholder)
$html = file_get_contents_curl($url);

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses parse warnings from malformed HTML

// count <embed> and <object> tags
foreach ($doc->getElementsByTagName('embed') as $pageEmbed) {
    $flashTotalCount++;
}
foreach ($doc->getElementsByTagName('object') as $pageObject) {
    $flashTotalCount++;
}
if ($flashTotalCount == 0) {
    echo "NO FLASH";
} else {
    echo "FLASH";
}
Would anyone know of a way to check whether a website uses Flash, or whether there is header information indicating that Flash is being used? Any advice would be helpful.
As far as I understand, Flash can also be loaded by JavaScript, so you would have to actually execute the web page. For that you will have to use a tool like this:
http://seleniumhq.org/docs/02_selenium_ide.html#the-waitfor-commands-in-ajax-applications
I don't think it is usable from PHP.
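If executing the page is not an option, one crude static heuristic, extending the question's own approach rather than the Selenium suggestion above, is to also scan the raw HTML for common Flash markers. It will still miss Flash that is injected purely by JavaScript:
// Crude static check for Flash markers in the raw HTML (reusing $html from the
// question's script). This is only a heuristic and misses JS-injected Flash.
function looksLikeFlash($html) {
    $markers = array('.swf', 'swfobject', 'application/x-shockwave-flash');
    foreach ($markers as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}

echo looksLikeFlash($html) ? "FLASH (probably)" : "NO FLASH (as far as the static HTML shows)";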
I am trying to retrieve the content of web pages and check whether a page contains certain error keywords I am monitoring (instead of manually loading each URL every time to check the sites, I hope to do this programmatically and flag errors when they occur).
I have tried XMLHttpRequest. I am able to get the HTML content, like what I see when I "view source" on the page. But the pages I monitor run on SharePoint and the web parts are dynamically generated. I believe that if an error occurs when loading these parts, I would not be able to flag it, because the HTML I pull will not contain the error, just the usual paths to the web parts.
cURL seems to do the same. I just read about DOMDocument and I was wondering whether DOMDocument processes the code or just breaks the HTML into a hierarchical structure.
I only wish to have the content of the URL (like what you get when you save a website as txt in IE, not the HTML). Or if I can further process the HTML, that would be good too. How can I do that? Any help will be really appreciated. :)
Why do you want to strip the HTML? It's better to use it!
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);

libxml_use_internal_errors(true); // suppress parse warnings from real-world HTML
$oDom = new DOMDocument();
$oDom->loadHTML($data);

// Go through the DOM and look for the error (it works much the same whether it is
// <p class="error">error message</p> or whatever markup the page uses)
$errors = $oDom->getElementsByTagName("error"); // or however you get errors
foreach ($errors as $error) {
    if (strstr($error->nodeValue, 'SOME ERROR')) {
        echo 'SOME ERROR occurred';
    }
}
If you don't want to do that, you can just do:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
if (strstr($data, 'SOME_ERROR')) {
    echo 'SOME ERROR occurred';
}
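To turn either variant into the monitoring described in the question, a small wrapper over a list of URLs is enough. A rough sketch, where the URL list and keywords are placeholders you would replace with your own:
// Hypothetical list of pages to monitor and keywords that indicate a problem.
$pagesToCheck  = array('http://example.com/page1', 'http://example.com/page2');
$errorKeywords = array('SOME ERROR', 'Unable to display this Web Part');

foreach ($pagesToCheck as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $data = curl_exec($ch);
    curl_close($ch);

    foreach ($errorKeywords as $keyword) {
        if ($data !== false && strstr($data, $keyword)) {
            echo "Error keyword '$keyword' found on $url\n";
        }
    }
}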
I just learned what scraping and cURL are a few hours ago, and since then I have been playing with them. Nevertheless, I am facing something strange now. The code below works fine with some sites and not with others (of course I modified the URL and the XPath...). Note that no error is raised when I test whether curl_exec was executed properly, so the problem must come from somewhere after that. My questions are as follows:
How can I check if the new DOMDocument has been created properly: if(??)
How can I check if the new DOMDocument has been populated properly with html?
...if a new DOMXPath object has been created?
Hope I was clear. Thank you in advance for your replies. Cheers. Marc
My php:
<?php
$target_url = "http://www.somesite.com";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the matching nodes on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link: $url";
}
?>
Use a try/catch to check if the document object was created, then check the return value of loadHTML() to determine if the HTML was loaded into the document. You can use a try/catch on the XPath object as well.
try
{
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    if ($loaded)
    {
        // loaded OK
    }
    else
    {
        // could not load HTML
    }
}
catch (Exception $e)
{
    // document could not be created, see $e->getMessage()
}
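For the third question: constructing DOMXPath on a valid document rarely fails, but query() returns false when the expression itself is malformed, so the check can follow the same pattern (a sketch; 'somepath' is the placeholder from the question):
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('somepath'); // returns false if the expression is malformed

if ($hrefs === false)
{
    // invalid XPath expression
}
elseif ($hrefs->length === 0)
{
    // valid expression but no matches -- often a sign the path is wrong for this site
}
else
{
    // matched nodes, safe to iterate
}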
Problem solved. The error came from Firebug, which gave a wrong path. Big thanks to MrCode for his support...
Possible Duplicate:
Parse Website for URLs
How do I get all the links in a webpage using PHP?
I need to get a list of the links. For example, given:
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google).
The situation is:
I'm building a crawler and I want it to get all the links that exist in a database table.
There are a couple of ways to do this, but the way I would approach this is something like the following,
Use cURL to fetch the page, i.e.:
// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html);
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;

    // Do what you want with the link, print it out:
    echo $text, ' -> ', $link;

    // Or save it in an array for later processing:
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
This should pretty much do it for you.
The only part I am not 100% sure about is what happens when the anchor wraps an image instead of plain text; I have not tested that case, so you would need to test and filter those out.
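One way to do that filtering, sketched below with my own assumptions about what you want to keep (non-empty link text and http(s) or relative hrefs only), reusing the $hrefs node list from above:
// Keep only anchors with non-empty text and a usable href; skip javascript:,
// mailto:, and image-only anchors (whose nodeValue is empty).
$filtered = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = trim($href->getAttribute('href'));
    $text = trim($href->nodeValue);

    if ($text === '' || $link === '') {
        continue; // image-only or empty anchor
    }
    $scheme = parse_url($link, PHP_URL_SCHEME);
    if ($scheme !== null && $scheme !== 'http' && $scheme !== 'https') {
        continue; // javascript:, mailto:, tel:, etc.
    }
    $filtered[] = array('href' => $link, 'text' => $text);
}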
Hope this gives you an idea of how to scrape links, happy coding.