I've written a small PHP script that grabs images with curl and saves them locally.
It reads the image URLs from my database, downloads each one and saves the file to a folder.
It was tested and works on a couple of other websites, but it fails with a new one I'm trying it with.
I did some reading around and modified the script a bit, but still nothing.
Please suggest what to look out for.
$query_products = "SELECT * from product";
$products = mysql_query($query_products, $connection) or die(mysql_error());
$row_products = mysql_fetch_assoc($products);
$totalRows_products = mysql_num_rows($products);
do {
$ch = curl_init ($row_products['picture']);
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20110319 Firefox/4.0');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$rawdata = curl_exec ($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close ($ch);
if($http_status==200){
$fp = fopen("images/products/".$row_products['productcode'].".jpg", 'w');
fwrite($fp, $rawdata);
fclose($fp);
echo ' -- Downloaded '.$row_products['picture'].' to local: images/products/'.$row_products['productcode'].'.jpg';
} else {
echo ' -- Failed to download '.$row_products['picture'].'';
}
usleep(500);
} while ($row_products = mysql_fetch_assoc($products));
Your target website may require or check a combination of things. In order:
Referer. Some websites only accept certain Referer values (either their own site or no Referer at all) to prevent hotlinking.
Incorrect URL.
Cookies. Yes, these can be checked too.
Authentication of some sort.
The only way to find out is to sniff what a normal browser request looks like and mimic it. Your MSIE user-agent string looks different from a genuine MSIE UA, however, and I'd consider changing it to an exact copy of a real one if I were you.
Could you get curl to write its output to a file (using the setopt for the output stream) and tell us what error code you are getting, along with the URL of an image? This will help me be more precise.
Also, 0 isn't a success - it's a failure
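To make the "what error code are you getting" step concrete, here is a minimal sketch. It assumes the 0 mentioned above is the HTTP status code, which is what curl_getinfo($ch, CURLINFO_HTTP_CODE) reports when no HTTP response arrived at all. The function name describe_transfer and its messages are hypothetical, not part of curl.

```php
<?php
// Hypothetical helper: turns curl's error number and the HTTP status code
// into a short diagnosis. Pass curl_errno($ch) and
// curl_getinfo($ch, CURLINFO_HTTP_CODE), both read before curl_close($ch).
function describe_transfer(int $errno, int $httpStatus): string
{
    if ($errno !== 0) {
        // Non-zero errno: DNS failure, timeout, TLS problem, and so on.
        return "transport error (curl errno $errno), no usable response";
    }
    if ($httpStatus === 0) {
        return 'no HTTP response received';
    }
    if ($httpStatus === 401 || $httpStatus === 403) {
        return "HTTP $httpStatus: server refused the request (check auth, referer, cookies)";
    }
    if ($httpStatus >= 200 && $httpStatus < 300) {
        return "HTTP $httpStatus: success";
    }
    return "HTTP $httpStatus: request did not succeed";
}
```

In the download loop you could call echo describe_transfer(curl_errno($ch), curl_getinfo($ch, CURLINFO_HTTP_CODE)); just before curl_close($ch), so every failed image reports why it failed.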
Related
The following curl call succeeds every time if and only if $data is printed after the call, with curl_getinfo() returning
[content_type] => text/html; charset=UTF-8
If $data is not printed, the curl call sometimes returns the same result as above and sometimes returns $data as "Loading...", which means the page has not finished loading yet, with curl_getinfo() returning
[content_type] => text/html
Furthermore, when using print_r($data), I can see the output of print_r(curl_getinfo($ch)); on my website being updated several times while the curl call is running. What... The.... F?
(The setopt list has grown larger as I try to find a solution, LOL.)
Ooh.. yeah, even if I print $data after it has been returned to the caller and stored in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the url I'm retrieving contains javascript which gets run when I "print" it on my website? Why does it work occasionally without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
Edit: Until further notice, I've put the curl call in a while loop that checks whether the downloaded size is above a certain threshold. I've capped the loop at 10 iterations, and so far that is enough, i.e. it manages to download the content of interest. The extra time is barely noticeable.
function curl_get_contents($url) {
global $dbg;
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
//curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch,CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
$data = curl_exec($ch);
if ($dbg) {
print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) used below
if(curl_errno($ch)){
echo 'Curl error: ' . curl_error($ch);
} else {
echo "ALL GOOD <br>";
}
}
curl_close($ch);
//echo $data; // If I do this...
//print_r($data); // ... or this. curl is success 100%.
return $data;
}
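The retry workaround described in the edit can be factored into a small helper. This is only a sketch of that idea, not a fix for the underlying cause: the name fetch_until_large_enough, the byte threshold and the optional pause are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: calls $fetch until it returns a string of at least
// $minBytes bytes, or until $maxAttempts calls have been made.
// Returns the first acceptable body, or null if every attempt fell short.
function fetch_until_large_enough(callable $fetch, int $minBytes, int $maxAttempts = 10, int $pauseMicros = 0): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch();
        if (is_string($body) && strlen($body) >= $minBytes) {
            return $body;
        }
        if ($pauseMicros > 0 && $attempt < $maxAttempts) {
            usleep($pauseMicros); // brief pause before the next try
        }
    }
    return null;
}
```

With the function from the question it could be called as fetch_until_large_enough(function () use ($url) { return curl_get_contents($url); }, 2000); where 2000 is an arbitrary example threshold.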
I have an extremely simple script:
<?php
$jsonurl = "http://api.wipmania.com/json";
$json = file_get_contents($jsonurl);
echo $json;
?>
It works for this URL, but when I call it with this URL: https://erikberg.com/nba/standings.json
it is not echoing the data. What is the reason for this? I'm probably missing a concept here. Thanks
The problem with that particular URL is that the server expects a User-Agent header, and file_get_contents() does not send one by default (unless user_agent is set in php.ini).
Here is a better example using cURL. It's more robust, although it takes more lines of code to configure and run:
// create curl resource
$ch = curl_init();
// set the URL
curl_setopt($ch, CURLOPT_URL, 'https://erikberg.com/nba/standings.json');
// Return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Fake the User Agent for this particular API endpoint
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string.
$output = curl_exec($ch);
// close curl resource to free up system resources.
curl_close($ch);
// You have your JSON response here
echo $output;
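If you would rather stay with file_get_contents(), a stream context can supply the User-Agent as well. The sketch below assumes the missing header is the only thing the server objects to; the helper name user_agent_context and the default agent string are made up for illustration.

```php
<?php
// Builds a stream context that makes file_get_contents() send a User-Agent.
// The default agent string below is a made-up placeholder.
function user_agent_context(string $userAgent = 'MyScript/1.0 (+http://www.example.com/)')
{
    return stream_context_create([
        'http' => [ // the "http" options are also used for https:// URLs
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds
        ],
    ]);
}
```

Usage would then look like $json = file_get_contents($jsonurl, false, user_agent_context());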
I'm trying to get a page's content with cURL or file_get_contents. It works on many websites, but on a friend's server it doesn't.
I think there is some protection based on headers or something like that. I get the following error code: 401 forbidden. If I try to reach the same page with a normal browser, it works.
Here is my code for the file_get_contents function:
$homepage = file_get_contents('http://192.168.1.3');
echo $homepage; // just a test to see if the page is loaded, it's not.
if (preg_match("/my regex/", $homepage)) {
// ... some code
}
I also tried with cURL:
$url = 'http://192.168.1.3'; // do not urlencode() a whole URL: it would escape the "://" and break it
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
$result = curl_exec($ch) or die("Not working");
curl_close($ch);
echo $result; // not working ..
Nothing works; maybe I should pass more options to curl_setopt ...
Thanks.
PS: If I try with wget on Linux I get an error, but with aria2c it works.
HTTP status 401 means Unauthorized. You need to send the server a username and password.
With file_get_contents, you can pass a stream context as the third parameter, which lets you set header information.
You would be better off using curl, though, since file_get_contents is mainly intended for reading local files and is a blocking call. Add the following option for basic authentication:
curl_setopt($ch,CURLOPT_USERPWD,"my_username:my_password");
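For the file_get_contents route mentioned above, here is a minimal sketch of a stream context that sends HTTP Basic credentials. It assumes the server uses Basic authentication (a Digest-protected server would need curl instead); basic_auth_context is a hypothetical helper name and the credentials are placeholders.

```php
<?php
// Builds a stream context carrying an HTTP Basic Authorization header.
// Basic auth is just "username:password", base64-encoded.
function basic_auth_context(string $username, string $password)
{
    $credentials = base64_encode($username . ':' . $password);
    return stream_context_create([
        'http' => [
            'header' => "Authorization: Basic $credentials\r\n",
        ],
    ]);
}
```

It would be used as $homepage = file_get_contents('http://192.168.1.3', false, basic_auth_context('my_username', 'my_password'));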
Try this updated version with a user agent:
<?php
$curlSession = curl_init();
curl_setopt($curlSession, CURLOPT_URL, 'http://192.168.1.3/');
curl_setopt($curlSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curlSession, CURLOPT_BINARYTRANSFER, true);
curl_setopt($curlSession, CURLOPT_RETURNTRANSFER, true);
$homepage = curl_exec($curlSession);
curl_close($curlSession);
echo $homepage ;
?>
If you still get a blank page, install this add-on in Firefox and inspect the request headers and response headers.
How can I make a simple cURL request to the Flickr API that does the following:
Get the X most recent photo URLs + captions from collection Y?
Where "X" is the number of photo URLs and "Y" is the collection name.
This code is part of an existing application and I'm not allowed to use scripts like PHPFlickr for help.
What is the problem with using an already tested PHP API? Doing it on your own, you will probably have to take care of a lot of things such as authentication, sizes, etc.
Edit:
I'll post some simple code using curl; I hope it helps. I grabbed the idea from here
<?php
$ch = curl_init();
// The group id contains "@". nojsoncallback=1 asks Flickr for plain JSON
// instead of the jsonFlickrFeed(...) wrapper, which json_decode() cannot parse.
curl_setopt($ch, CURLOPT_URL, "http://api.flickr.com/services/feeds/groups_pool.gne?id=675729@N22&lang=en-us&format=json&nojsoncallback=1");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$response = curl_exec($ch);
$errno = curl_errno($ch);
$error = curl_error($ch);
curl_close($ch);
if ($errno == CURLE_OK) {
$pics = json_decode($response);
}
?>
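To get from the raw response to "the X most recent photo URLs + captions", something like the following could work. It is a sketch under assumptions: it covers the group pool feed used above rather than a collection, and it assumes each feed item carries a "title" and a "media" object with an "m" URL. The function name extract_photos is hypothetical.

```php
<?php
// Hypothetical helper: pulls up to $limit photo URLs and captions out of a
// Flickr feed response. Tolerates the jsonFlickrFeed(...) wrapper as well
// as plain JSON.
function extract_photos(string $response, int $limit): array
{
    $json = trim($response);
    if (strpos($json, 'jsonFlickrFeed(') === 0) {
        $json = substr($json, strlen('jsonFlickrFeed('));
        $json = rtrim($json, " \t\r\n;");
        $json = substr($json, 0, -1); // drop the closing ")"
    }
    $feed = json_decode($json, true);
    if (!is_array($feed) || !isset($feed['items']) || !is_array($feed['items'])) {
        return [];
    }
    $photos = [];
    foreach (array_slice($feed['items'], 0, max(0, $limit)) as $item) {
        if (!isset($item['media']['m'])) {
            continue; // skip items without an image URL
        }
        $photos[] = [
            'url'     => $item['media']['m'],
            'caption' => $item['title'] ?? '',
        ];
    }
    return $photos;
}
```

Called as extract_photos($response, 5), it returns an array of ['url' => ..., 'caption' => ...] entries in feed order, or an empty array if the response could not be decoded.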
I want to download a page from the web. This is allowed when using a regular browser like Firefox, but when I use file_get_contents the server refuses, replying that it understands the request but doesn't allow such downloads.
So what can I do? I think I saw in some scripts (in Perl) a way to make a script look like a real browser by setting a user agent and cookies, which makes servers think the script is a real web browser.
Does anyone have an idea how this can be done?
Use cURL.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the UA
curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');
// Alternatively, lie, and pretend to be a browser
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
(From http://uk.php.net/manual/en/curl.examples-basic.php)
Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.
function __construct($useragent,$url)
{
$this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
$this->url=$url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent); // send the full UA string built above
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html); // "@" suppresses the warnings that malformed HTML triggers
$this->xpath = new DOMXPath($dom);
}
...
public function displayResults($site)
{
$data=$this->path[0]->length;
for($i=0;$i<$data;$i++)
{
$delData=$this->path[0]->item($i);
//setting the href and title properties
$urlSite=$delData->getElementsByTagName('a')->item(0)->getAttribute('href');
$titleSite=$delData->getElementsByTagName('a')->item(0)->nodeValue;
//setting the saves and additional data
$saves=$delData->getElementsByTagName('span')->item(0)->nodeValue;
if ($saves==NULL)
{
$saves=0;
}
//build the array
$this->newSiteBookmark[$i]['source']='delicious.com';
$this->newSiteBookmark[$i]['url']=$urlSite;
$this->newSiteBookmark[$i]['title']=$titleSite;
$this->newSiteBookmark[$i]['saves']=$saves;
}
}
The latter is part of a class that scrapes data from delicious.com. Not very legal, though.
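For readers who only want the DOMDocument/DOMXPath part, here is a small self-contained sketch that works on an HTML string you have already downloaded. It is not the class above: extract_links is a hypothetical helper, and the XPath expression simply selects every link rather than anything site-specific.

```php
<?php
// Hypothetical helper: returns the href and link text of every <a> in $html.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'url'   => $anchor->getAttribute('href'),
            'title' => trim($anchor->nodeValue),
        ];
    }
    return $links;
}
```

Using libxml_use_internal_errors() here is an alternative to the "@" operator: real-world pages are rarely valid HTML, and this keeps the parser warnings out of the output without hiding other errors.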
This answer takes your comment on Rich's answer into account.
The site is probably checking whether you are a real user via the HTTP Referer or the User-Agent string. Try setting these on your curl handle:
//pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
//pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
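If the site also checks cookies, curl can keep them in a cookie file between requests. The sketch below gathers a browser-like set of options into one array for curl_setopt_array(); the helper name browser_like_options, the cookie file path and the agent string are placeholders, and whether the target site actually needs any of this is an assumption.

```php
<?php
// Hypothetical helper: returns curl options that mimic a browser a little
// more closely: a user agent, a cookie file and, optionally, a Referer.
function browser_like_options(string $url, string $cookieFile, ?string $referer = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here before the request
    ];
    if ($referer !== null) {
        $options[CURLOPT_REFERER] = $referer;
    }
    return $options;
}
```

Usage: $ch = curl_init(); curl_setopt_array($ch, browser_like_options('http://www.example.com/page', '/tmp/cookies.txt', 'http://www.example.com/')); $output = curl_exec($ch); curl_close($ch); Requesting the site's front page first with the same cookie file lets any session cookie be stored before you fetch the protected page.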
Another way to do it (though others have pointed out a better way), is to use PHP's fopen() function, like so:
$handle = fopen("http://www.example.com/", "r");//open specified URL for reading
It's especially useful if cURL isn't available.
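A slightly fuller sketch of the fopen() route follows. It assumes allow_url_fopen is enabled in php.ini, which is required for http:// URLs; read_url is a hypothetical helper, and the optional array is passed through as "http" stream context options (for example a user agent).

```php
<?php
// Hypothetical helper: opens $url with fopen() and returns its full
// contents, or false if the stream could not be opened.
function read_url(string $url, ?array $httpOptions = null)
{
    $context = stream_context_create($httpOptions === null ? [] : ['http' => $httpOptions]);
    // "@" hides the warning fopen() raises on failure; the return value
    // is checked instead.
    $handle = @fopen($url, 'r', false, $context);
    if ($handle === false) {
        return false;
    }
    $contents = stream_get_contents($handle);
    fclose($handle);
    return $contents;
}
```

For example, read_url('http://www.example.com/', ['user_agent' => 'My App (http://www.example.com/)']) would send that user agent along with the request.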