I want to download a page from the web. This works in an ordinary browser like Firefox, but when I use file_get_contents() the server refuses, replying that it understands the request but doesn't allow such downloads.
So what can I do? I think I have seen scripts (in Perl) that imitate a real browser by setting a user agent and cookies, so that the server thinks the script is a real web browser.
Does anyone know how this can be done?
Use cURL.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the UA
curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');
// Alternatively, lie, and pretend to be a browser
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
(From http://uk.php.net/manual/en/curl.examples-basic.php)
Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.
function __construct($useragent,$url)
{
$this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
$this->url=$url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$this->xpath = new DOMXPath($dom);
}
...
public function displayResults($site)
{
$data=$this->path[0]->length;
for($i=0;$i<$data;$i++)
{
$delData=$this->path[0]->item($i);
//setting the href and title properties
$urlSite=$delData->getElementsByTagName('a')->item(0)->getAttribute('href');
$titleSite=$delData->getElementsByTagName('a')->item(0)->nodeValue;
//setting the saves and additional data
$saves=$delData->getElementsByTagName('span')->item(0)->nodeValue;
if ($saves==NULL)
{
$saves=0;
}
//build the array
$this->newSiteBookmark[$i]['source']='delicious.com';
$this->newSiteBookmark[$i]['url']=$urlSite;
$this->newSiteBookmark[$i]['title']=$titleSite;
$this->newSiteBookmark[$i]['saves']=$saves;
}
}
The latter is part of a class that scrapes data from delicious.com. Not very legal though.
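For readers who want the DOMDocument/DOMXPath step in isolation, here is a minimal sketch. It is not the class above: extract_links() and the sample markup are made up for illustration, and it uses libxml_use_internal_errors() rather than the @ operator to keep parser warnings quiet.

```php
<?php
// Sketch: pull link URLs and link text out of an HTML string with DOMXPath.
// extract_links() and the sample markup are hypothetical.
function extract_links(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'url'   => $a->getAttribute('href'),
            'title' => trim($a->nodeValue),
        ];
    }
    return $links;
}

$sample = '<html><body><div><a href="http://example.com/">Example</a></div></body></html>';
$links = extract_links($sample);
```

In real use, $sample would be the string returned by curl_exec() with CURLOPT_RETURNTRANSFER set.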
This answer takes your comment on Rich's answer into account.
The site is probably checking whether you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these for your cURL request:
//pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
//pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
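If you set several of these at once, curl_setopt_array() keeps them in one place. This is only a sketch: browser_like_options() is a made-up helper, the URLs are placeholders, and the actual request is left commented out.

```php
<?php
// Sketch: gather the "look like a browser" settings in one array.
// browser_like_options() is a hypothetical helper; the URLs are placeholders.
function browser_like_options(string $url, string $referer, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_REFERER        => $referer,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ];
}

$options = browser_like_options(
    'http://domainofthesite.com/page',
    'http://domainofthesite.com',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6'
);

$ch = curl_init();
curl_setopt_array($ch, $options);
// $html = curl_exec($ch); // uncomment to perform the request
```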
Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:
$handle = fopen("http://www.example.com/", "r"); // open the specified URL for reading
It's especially useful if cURL isn't available, though it requires allow_url_fopen to be enabled.
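On its own, fopen() does not address the user-agent check from the question. One way to handle that without cURL is an HTTP stream context, which both fopen() and file_get_contents() accept. A minimal sketch, with a placeholder Referer and the network calls commented out:

```php
<?php
// Sketch: send browser-like headers without cURL by using an HTTP stream context.
// The User-Agent is one quoted earlier in this thread; the Referer is a placeholder.
$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
        'header'     => "Referer: http://www.example.com/\r\n",
        'timeout'    => 10, // seconds
    ],
]);

// Both calls accept the context as their last argument here.
// $html   = file_get_contents('http://www.example.com/', false, $context);
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```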
I have an extremely simple script:
<?php
$jsonurl = "http://api.wipmania.com/json";
$json = file_get_contents($jsonurl);
echo $json;
?>
It works for this URL, but when I call it with this URL: https://erikberg.com/nba/standings.json
it is not echoing the data. What is the reason for this? I'm probably missing a concept here. Thanks
The problem with that particular URL is that the server expects a User-Agent other than the default that PHP uses with file_get_contents().
Here is a better example using cURL. It's more robust, although it takes more lines of code to configure and run:
// create curl resource
$ch = curl_init();
// set the URL
curl_setopt($ch, CURLOPT_URL, 'https://erikberg.com/nba/standings.json');
// Return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Fake the User Agent for this particular API endpoint
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string.
$output = curl_exec($ch);
// close curl resource to free up system resources.
curl_close($ch);
// You have your JSON response here
echo $output;
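Before using $output, it can help to confirm that the request succeeded and that the body really is JSON. A minimal sketch follows. decode_json_body() is a hypothetical helper and the sample string is made up, not real API output.

```php
<?php
// Sketch: turn the raw cURL result into an array, failing loudly on bad input.
// decode_json_body() is a hypothetical helper, not part of the original answer.
function decode_json_body($body): array
{
    if ($body === false) {
        // curl_exec() returns false when the transfer itself fails.
        throw new RuntimeException('Request failed: curl_exec() returned false');
    }
    $data = json_decode($body, true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($data)) {
        throw new RuntimeException('Response is not a JSON object or array: ' . json_last_error_msg());
    }
    return $data;
}

// Invented sample body; in real use pass the $output from curl_exec().
$standings = decode_json_body('{"standings_date":"made-up","standing":[]}');
```

An HTML error page returned in place of JSON then raises an exception instead of silently echoing nothing useful.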
I'm trying to get a page's content with cURL or file_get_contents. It works on many websites, but not on a friend's server.
I think there is some protection based on headers or something similar. I get the following error code: 401 forbidden. If I open the same page in a normal browser, it works.
Here is my code for the file_get_contents function:
$homepage = file_get_contents('http://192.168.1.3');
echo $homepage; // just a test to see if the page is loaded, it's not.
if (preg_match("/my regex/", $homepage)) {
// ... some code
}
I also tried with cURL:
$url = urlencode('http://192.168.1.3');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
$result = curl_exec($ch) or die("Not working");
curl_close($ch);
echo $result; // not working ..
Nothing works. Maybe I should add more options with curl_setopt...
Thanks.
PS: If I try on Linux with wget I get an error, but with aria2c it works.
HTTP status 401 means Unauthorized. You need to send the server a username and password.
With file_get_contents, you can pass a stream context as the third parameter and use it to set header information.
You'd be better off using cURL, though, since file_get_contents is mainly intended for reading local files and is a blocking function. Add the following option for basic authentication:
curl_setopt($ch,CURLOPT_USERPWD,"my_username:my_password");
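For the file_get_contents route mentioned above, a stream context can carry the credentials. This sketch assumes the server uses HTTP Basic authentication (a 401 can also mean Digest or another scheme). basic_auth_context() is a made-up name and the credentials are the same placeholders.

```php
<?php
// Sketch: build a stream context that sends HTTP Basic credentials.
// basic_auth_context() is a hypothetical helper; the credentials are placeholders.
function basic_auth_context(string $user, string $password)
{
    // Basic auth is "user:password", base64-encoded, in an Authorization header.
    $credentials = base64_encode($user . ':' . $password);
    return stream_context_create([
        'http' => [
            'header' => "Authorization: Basic $credentials\r\n",
        ],
    ]);
}

$context = basic_auth_context('my_username', 'my_password');
// $homepage = file_get_contents('http://192.168.1.3/', false, $context);
```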
Try this updated version with a user agent:
<?php
$curlSession = curl_init();
curl_setopt($curlSession, CURLOPT_URL, 'http://192.168.1.3/');
curl_setopt($curlSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curlSession, CURLOPT_BINARYTRANSFER, true);
curl_setopt($curlSession, CURLOPT_RETURNTRANSFER, true);
$homepage = curl_exec($curlSession);
curl_close($curlSession);
echo $homepage ;
?>
If you still get a blank page, install this add-on for Firefox and inspect the request headers and response headers.
I've written a small PHP script for grabbing images with curl and saving them locally.
It reads the urls for the images from my db, grabs it and saves the file to a folder.
Tested and works on a couple other websites before, fails with a new one I'm trying it with.
I did some reading around, modified the script a bit but still nothing.
Please suggest what to look out for.
$query_products = "SELECT * from product";
$products = mysql_query($query_products, $connection) or die(mysql_error());
$row_products = mysql_fetch_assoc($products);
$totalRows_products = mysql_num_rows($products);
do {
$ch = curl_init ($row_products['picture']);
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20110319 Firefox/4.0');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$rawdata = curl_exec ($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close ($ch);
if($http_status==200){
$fp = fopen("images/products/".$row_products['productcode'].".jpg", 'w');
fwrite($fp, $rawdata);
fclose($fp);
echo ' -- Downloaded '.$newname.' to local: '.$newname.'';
} else {
echo ' -- Failed to download '.$row_products['picture'].'';
}
usleep(500);
} while ($row_products = mysql_fetch_assoc($products));
Your target website may require/check a combination of things. In order:
Location. Some websites only allow the referer to be a certain value (either their site or no referer, to prevent hotlinking)
Incorrect URL
Cookies. Yes, this can be checked
Authentication of some sort
The only way to do this is to sniff what a normal request looks like and mimic it. Your MSIE user-agent string looks different from a genuine MSIE UA, however, and I'd consider changing it to an exact copy of a real one if I were you.
Could you get curl to output to a file (using the setopt for the output stream) and tell us what error code you are getting, along with the URL of an image? This will help me be more precise.
Also, 0 isn't a success; it's a failure.
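Whatever the cause turns out to be, it helps to stop the script from writing error pages to disk as .jpg files. The sketch below is one possible check, not part of the original script. is_usable_jpeg() is a hypothetical helper, and it assumes the pictures are JPEGs, as the .jpg file name suggests.

```php
<?php
// Sketch: decide whether a downloaded body is worth saving as a .jpg file.
// is_usable_jpeg() is a hypothetical helper; it assumes the images are JPEGs.
function is_usable_jpeg(int $httpStatus, $body): bool
{
    if ($httpStatus !== 200) {
        return false; // 0 means no HTTP response was received at all
    }
    if (!is_string($body) || strlen($body) < 3) {
        return false; // curl_exec() returned false or an empty body
    }
    // JPEG files start with the bytes FF D8 FF.
    return substr($body, 0, 3) === "\xFF\xD8\xFF";
}
```

In the loop above, the call would be is_usable_jpeg($http_status, $rawdata) in place of the bare status comparison.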
Recently I moved my scraping code with Curl to CodeIgniter. I'm using Curl CI library from http://philsturgeon.co.uk/code/codeigniter-curl. I put the scraping process in a controller and then I found the execution time of my scraping is slower than the one I built in plain PHP.
It took 12 seconds for CodeIgniter to output the result, whereas it only takes 6 seconds in plain PHP. Both are including the parsing process with the HTML DOM parser.
Here's my Curl code in CodeIgniter:
function curl($url, $postdata=false)
{
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$this->curl->create($url);
$this->curl->ssl(false);
$options = array(
'URL' => $url,
'HEADER' => 0,
'AUTOREFERER' => true,
'FOLLOWLOCATION' => true,
'TIMEOUT' => 60,
'RETURNTRANSFER' => 1,
'USERAGENT' => $agent,
'COOKIEJAR' => dirname(__FILE__) . "/cookie.txt",
'COOKIEFILE' => dirname(__FILE__) . "/cookie.txt",
);
if($postdata)
{
$this->curl->post($postdata, $options);
}
else
{
$this->curl->options($options);
}
return $this->curl->execute();
}
Non-CodeIgniter (plain PHP) code:
function curl($url ,$binary=false,$post=false,$cookie =false ){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt ($ch, CURLOPT_URL, $url );
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
if($cookie){
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
}
if($binary)
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
if($post){
// build a URL-encoded query string from the POST array
$post_array_string1 = http_build_query($post);
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_array_string1);
}
return curl_exec ($ch);
}
Does anyone know why the CodeIgniter version is slower? Or could it be the simple_html_dom parser?
I'm not sure I know the exact answer for this, but I have a few observations about Curl & CI as I use it extensively.
Check for the state of DNS caches/queries.
I noticed a substantial speedup when code was uploaded to a hosted staging server from my dev desktop. It was traced to a DNS issue that was solved by rebooting a bastion host... You can sometimes check this by using IP addresses instead of hostnames.
Phil's 'library' is really just a wrapper.
All he's really done is map CI-style functions to the PHP Curl library. There's almost nothing else going on. I spent some time poking around (I forget why) and it was really unremarkable. That said, there may well be some general CI overhead - you might see what happens in another similar framework (Fuel, Kohana, Laravel, etc).
Check your reverse lookup.
Some APIs do reverse DNS checks as part of their security scanning. Sometimes hostnames or other headers are badly set in buried configs and can cause real headaches.
Use Chrome's Postman extension to debug REST APIs.
No comment, it's brilliant - https://github.com/a85/POSTMan-Chrome-Extension/wiki and you have fine grained control of the 'conversation'.
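The DNS point can also be checked from PHP itself, because curl_getinfo() reports cumulative timings for the last transfer. A rough sketch follows. summarize_timings() is a made-up helper and the numbers are invented. For HTTPS or redirected requests the phases are only approximate.

```php
<?php
// Sketch: split a transfer's total time into phases using curl_getinfo() output.
// summarize_timings() is a hypothetical helper; the sample numbers are invented.
function summarize_timings(array $info): array
{
    // cURL reports each value as seconds elapsed since the start of the transfer.
    return [
        'dns'      => $info['namelookup_time'],
        'connect'  => $info['connect_time'] - $info['namelookup_time'],
        'wait'     => $info['starttransfer_time'] - $info['connect_time'],
        'download' => $info['total_time'] - $info['starttransfer_time'],
    ];
}

// In real use: $info = curl_getinfo($ch); called after curl_exec($ch).
$info = [
    'namelookup_time'    => 5.0,
    'connect_time'       => 5.25,
    'starttransfer_time' => 5.75,
    'total_time'         => 6.0,
];
$phases = summarize_timings($info);
```

A large 'dns' share, as in the invented numbers, would point at name resolution and not at the framework.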
I would have to know more about the CI library and whether it does any extra work on the gathered data, but I would try renaming your method to something other than the library name. I have had issues with the Facebook library where calling it from a method named facebook caused problems. $this->curl could be ambiguous as to whether you mean the library or the method.
Also, try adding the debug profiler and see what it comes up with. Add this either in the construct or the method:
$this->output->enable_profiler(TRUE);
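To see whether the extra seconds go to the request or to the parsing, each step can also be timed by hand in both versions. A minimal sketch: time_call() is a made-up helper, and the usleep() call only stands in for the real fetch or parse step.

```php
<?php
// Sketch: time one step in isolation with microtime().
// time_call() is a hypothetical helper, not part of CodeIgniter.
function time_call(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

[$value, $seconds] = time_call(function () {
    usleep(20000); // stand-in for the cURL fetch or the simple_html_dom parse
    return 'done';
});
printf("step took %.3f s\n", $seconds);
```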
How can I make a simple cURL request to the Flickr API that does the following:
Get the X most recent photo URLs + captions from collection Y?
Where "X" is the number of photo URLs and "Y" is the collection name.
This code is part of an existing application and I'm not allowed to use scripts like PHPFlickr for help.
What is the problem with using an already tested PHP API? Doing it on your own, you will probably need to take care of a lot of things, such as authentication, sizes, etc.
Edit:
Here is some simple code using cURL. I hope it helps. I grabbed the idea from here
<?php
$ch = @curl_init();
@curl_setopt($ch, CURLOPT_URL, "http://api.flickr.com/services/feeds/groups_pool.gne?id=675729@N22&lang=en-us&format=json");
@curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
@curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
@curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
@curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$response = @curl_exec($ch);
$errno = @curl_errno($ch);
$error = @curl_error($ch);
if( $errno == CURLE_OK) {
$pics = json_decode($response);
}
?>
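One caveat, as far as I know: with format=json this feed wraps its payload in a jsonFlickrFeed(...) callback, so json_decode() on the raw response may return null. The sketch below unwraps the body and keeps the first X photos. latest_photos() is a made-up helper, the "items", "title" and "media" => "m" keys are assumptions about the feed layout, and the sample string is invented. Note also that this feed is a group pool, not a collection.

```php
<?php
// Sketch: unwrap a JSONP-style feed body and keep the first $limit photos.
// Assumptions: the body looks like jsonFlickrFeed({...}) and each item has
// "title" and "media" => "m" keys. latest_photos() is a hypothetical helper.
function latest_photos(string $body, int $limit): array
{
    // Keep only the text between the first "{" and the last "}".
    $start = strpos($body, '{');
    $end = strrpos($body, '}');
    if ($start === false || $end === false || $end < $start) {
        return [];
    }
    $feed = json_decode(substr($body, $start, $end - $start + 1), true);
    if (!is_array($feed) || !isset($feed['items']) || !is_array($feed['items'])) {
        return [];
    }
    $photos = [];
    foreach (array_slice($feed['items'], 0, $limit) as $item) {
        $photos[] = [
            'url'     => $item['media']['m'] ?? null,
            'caption' => $item['title'] ?? '',
        ];
    }
    return $photos;
}

// Invented sample; in real use pass the $response from curl_exec().
$sample = 'jsonFlickrFeed({"items":[{"title":"One","media":{"m":"http://example.com/1.jpg"}},'
        . '{"title":"Two","media":{"m":"http://example.com/2.jpg"}}]})';
$photos = latest_photos($sample, 1);
```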