Best method for bulk downloading images from website

Best method for bulk downloading images from website - php

I will download a lot of images (+20.000) from a website to my server and i'm trying to figure out the best way to do this since there's so many images to download.
Currently I have the code below which works in testing. But is there a better solution or should I use some software to do this?
foreach ($products as $product) {
$url = $product->img;
$imgName = $product->product_id
$path = "images/";
$img = $path . $imgName . ".png";
file_put_contents($img, file_get_contents($url));
}
Also, is there a chance that I will break something or crash the website when I download that many images at once?

first off, i agree with #Rudy Palacois here, wget would probably be better. that said, if you want to do it in PHP, curl would be much faster than file_get_contents, for 2 reasons.
1: unlike file_get_contents, curl can reuse the same connection to download multiple files, while file_get_contents will create & close a new connection for each download, that takes time, thus curl will be faster (as long as you're not using CURLOPT_FORBID_REUSE / CURLOPT_FRESH_CONNECT , anyway)
2: curl stops the download when the Content-Length http header's bytes has been downloaded. but file_get_contents completely ignores this header, and keeps downloading everything it can, until the connection is closed. this can again be much slower than curl's approach, because it's up to the web server when the connection will close, on some servers, it's A LOT slower than reading Content-Length bytes.
(and generally, curl is faster than file_get_contents because curl supports compressed transfers, gzip and deflate, which file_get_contents does not do... but that's generally not applicable for images, most common image formats are already pre-compressed. notable exceptions include .bmp images, though)
like this:
$ch = curl_init ();
curl_setopt ( $ch, CURLOPT_ENCODING, '' ); // if you're downloading files that benefit from compression (like .bmp images), this line enables compressed transfers.
foreach ( $products as $product ) {
$url = $product->img;
$imgName = $product->product_id;
$path = "images/";
$img = $path . $imgName . ".png";
$img=fopen($img,'wb');
curl_setopt_array ( $ch, array (
CURLOPT_URL => $url,
CURLOPT_FILE => $img
) );
curl_exec ( $ch );
fclose($img);
// file_put_contents ( $img, file_get_contents ( $url ) );
}
curl_close ( $ch );
edit: fixed a code-breaking typo, it's called CURLOPT_FILE, not CURLOPT_OUTFILE
edit 2: CURLOPT_FILE wants a file resource, not a filepath, fixed that too x.x

If you have access to shell, you could use WGET, I mean, the main problem with php, if you are executing this code from a browser, is the execution time, it will stop after a few minutes or it can be loading forever and get stucked, but if you have a complete URL and a pattern, as I can see, you can create a file with the URLs, one URL per line, list.txt, for example and then execute
wget -i list.txt
Check this answer too https://stackoverflow.com/a/14578517/5415074

Related

Running Out Of Memory With Fread

I'm using Backblaze B2 to store files and am using their documentation code to upload via their API. However their code uses fread to read the file, which is causing issues for files that are larger than 100MB as it tries to load the entire file into memory. Is there a better way to this that doesn't try to load the entire file into RAM?
$file_name = "file.txt";
$my_file = "<path-to-file>" . $file_name;
$handle = fopen($my_file, 'r');
$read_file = fread($handle,filesize($my_file));
$upload_url = ""; // Provided by b2_get_upload_url
$upload_auth_token = ""; // Provided by b2_get_upload_url
$bucket_id = ""; // The ID of the bucket
$content_type = "text/plain";
$sha1_of_file_data = sha1_file($my_file);
$session = curl_init($upload_url);
// Add read file as post field
curl_setopt($session, CURLOPT_POSTFIELDS, $read_file);
// Add headers
$headers = array();
$headers[] = "Authorization: " . $upload_auth_token;
$headers[] = "X-Bz-File-Name: " . $file_name;
$headers[] = "Content-Type: " . $content_type;
$headers[] = "X-Bz-Content-Sha1: " . $sha1_of_file_data;
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_POST, true); // HTTP POST
curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // Receive server response
$server_output = curl_exec($session); // Let's do this!
curl_close ($session); // Clean up
echo ($server_output); // Tell me about the rabbits, George!
I have tried using:
curl_setopt($session, CURLOPT_POSTFIELDS, array('file' => '#'.realpath('file.txt')));
However I get an error response: Error reading uploaded data: SocketTimeoutException(Read timed out)
Edit: Streaming the filename withing the CURL also doesn't seem to work.

The issue you are having is related to this.
fread($handle,filesize($my_file));
With the filesize in there you might as well just do file_get_contents. it's much better memory wise to read 1 line at a time with fget
$handle = fopen($myfile, 'r');
while(!feof($handle)){
$line = fgets($handle);
}
This way you only read one line into memory, but if you need the full file contents you will still hit a bottleneck.
The only real way is to stream the upload.
I did a quick search and it seems the default for CURL is to stream the file if you give it the filename
$post_data['file'] = 'myfile.csv';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
You can see the previous answer for more details
Is it possible to use cURL to stream upload a file using POST?
So as long as you can get past the sha1_file It looks like you can just stream the file, which should avoid the memory issues. There may be issues with time limit though. Also I can't really think of a way around getting the hash if that fails.
Just FYI, personally I never tried this, typically i just us sFTP for large file transfers. So I don't know if it has to be specially post_data['file'] I just copied that from the other answer.
Good luck...
UPDATE
Seeing as streaming seems to have failed (see comments).
You may want to test the streaming to make sure it works. I don't know what all that would involve, maybe stream a file to your own server? Also I am not sure why it wouldn't work "as advertised" and you may have tested it already. But it never hurts to test something, never assume something works until you know for sure. It very easy to try something new as a solution, only to miss a setting or put a path in wrong and then fall back to thinking its all based on the original issue.
I've spent a lot of time tearing things apart only to realize I had a spelling error. I'm pretty adept a programing these days so I typically overthink the errors too. My point is, be sure it's not a simple mistake before moving on.
Assuming everything is setup right, I would try file_get_contents. I don't know if it will be any better but it's more meant to open whole files. It also would seem to be more Readable in the code, because then it's clear that the whole file is needed. It just seems more semantically correct if nothing else.
You can also increase the RAM PHP has access to by using
ini_set('memory_limit', '512M')
You can even go higher then that, depending on your server. The highest I went before was 3G, but the server I uses has 54GB of ram and that was a one time thing, (we migrated 130million rows from MySql to MongoDB, the innodb index was eating up 30+GB ). Typically I run with 512M and have some scripts that routinely need 1G. But I wouldn't just up the Memory willy-nilly. That is usually a last resort for me after optimizing and testing. We do a lot of heavy processing that is why we have such a big server, we also have 2 slave servers (among other things) that run with 16GB each.
As far as what size to put, typically I increment it by 128M tell it works, then add an extra 128M just to be sure, but you might want to go in smaller steps. Typically people always use multiples of 8, but I don't know if that make to much difference these days.
Again, Good Luck.

Can I retrieve a remote image without quality loss?

I'm trying to retrieve a remote image, but when I view the image (either output straight to browser or saved to disk and viewed) the quality is always lower.
Is there a way to do this without quality loss? Here are the methods I've used, but with the same result.
$imagePath is a path like http://www.example.com/myimage.jpg and $filePath is where the image will be saved to.
curl
$ch = curl_init($imagePath);
$fp = fopen($filePath, 'wb');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
file_get_contents
$tmpImage = file_get_contents($imagePath);
file_put_contents($filePath, $tmpImage);
I'll get a screenshot to show the quality issues shortly.
If you look around the KIA logo you can see of the left the quality issues I'm having.
Update:
Upon some further testing with different images it seems the quality issues are different with each image. Some has no issues at all.
Also the image which the screenshots are from above, based on the long url to the image, it seems like the image has already been resized and had it's quality alter before it gets to my script, so I'm thinking that could account for these issues too.

Using file_get_contents vs curl for file size

I have a file uploading script running on my server which also features remote uploads.. Everything works fine but I am wondering what is the best way to upload via URL. Right now I am using fopen to get the file from the remote url pasted in the text box named "from". I have heard that fopen isn't the best way to do it. Why is that?
Also I am using file_get_contents to get the file size of the file from the URL. I have heard that curl is better on that part. Why is that and also how can I apply these changes to this script?
<?php
$from = htmlspecialchars(trim($_POST['from']));
if ($from != "") {
$file = file_get_contents($from);
$filesize = strlen($file);
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
}
?>

You can use filesize to get the file size of a file on disk.
file_get_contents actually gets the file into memory so $filesize = strlen(file_get_contents($from)); already gets the file, you just don't do anything with it other than find it size. You can substitute for you fwrite call file_put_contents;
See: file_get_contents and file_put_contents .
curl is used when you need more access to the HTTP protocol. There are many questions and examples on StackOverflow using curl in PHP.
So we can first download the file, in this example I wll use file_get_contents, get its size, then put the file in the directory on your local disk.
$tmpFile = file_get_contents($from);
$fileSize = strlen($tmpFile);
// you could do a check for file size here
$newFileName = "./uploads/$rand2";
file_put_contents($newFileName, $tmpFile);
In your code you have move_upload($_FILES['from']['tmp_name'], $move); but $_FILES is only applicable when you have a <input type="file"> element, which it doesn't seem you have.
P.S. You should probably white-list characters that you allow in a filename for instance $goodFilename = preg_replace("/^[^a-zA-Z0-9]+$/", "-", $filename) This is often easier to read and safer.
Replace:
while (!feof($file)) {
$move = "./uploads/" . $rand2;
move_upload($_FILES['from']['tmp_name'], $move);
$newfile = fopen("./uploads/" . $rand2, "wb");
file_put_contents($newfile, $file);
}
With:
$newFile = "./uploads/" . $rand2;
file_put_contents($newfile, $file);
The whole file is read in by file_get_contents the whole file is written by file_put_contents

As far as I understand your question: You want to get the filesize of a remote fiel given by a URL, and you're not sure which solution ist best/fastest.
At first, the biggest difference between CURL, file_get_contents() and fread() in this context is that CURL and file_get_contents() put the whole thing into memory, while fopen() gives you more control over what parts of the file you want to read. I think fopen() and file_get_contents() are nearly equivalent in your case, because you're dealing with small files and you actually want to get the whole file. So it doesn't make any difference in terms of memory usage.
CURL is just the big brother of file_get_contents(). It is actually a complete HTTP-Client rather than some kind of a wrapper for simple functions.
And talking about HTTP: Don't forget there's more to HTTP than GET and POST. Why don't you just use the resource's meta-data to check it's size before you even get it? That's one thing the HTTP method HEAD is meant for. PHP even comes with a built in function for getting the headers: get_headers(). It has some flaws though: It still sends a GET request, which makes it probably a little slower, and it follows redirects, which may cause security issues. But you can fix this pretty easily by adjusting the default context:
$opts = array(
'http' =>
array(
'method' => 'HEAD',
'max_redirects'=> 1,
'ignore_errors'=> true
)
);
stream_context_set_default($opts);
Done. Now you can simply get the headers:
$headers = get_headers('http://example.com/pic.png', 1);
//set the keys to lowercase so we don't have to deal with lower- and upper case
$lowerCaseHeaders = array_change_key_case($headers);
// 'content-length' is the header we're interested in:
$filesize = $lowerCaseHeaders['content-length'];
NOTE: filesize() will not work on a http / https stream wrapper, because stat() is not supported (http://php.net/manual/en/wrappers.http.php).
And that's pretty much it. Of course you can achieve the same with CURL just as easy if you like it better. The approach would be same (reding the headers).
And here's how you get the file and it's size (after downloading) with CURL:
// Create a CURL handle
$ch = curl_init();
// Set all the options on this handle
// find a full list on
// http://au2.php.net/manual/en/curl.constants.php
// http://us2.php.net/manual/en/function.curl-setopt.php (for actual usage)
curl_setopt($ch, CURLOPT_URL, 'http://example.com/pic.png');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store what is returned to a variable
// This actually contains the raw image data now, you could
// pass it to e.g. file_put_contents();
$data = curl_exec($ch);
// get the required info about the request
// find a full list on
// http://us2.php.net/manual/en/function.curl-getinfo.php
$filesize = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
// close the handle after you're done
curl_close($ch);
Pure PHP approach: http://codepad.viper-7.com/p8mlOt
Using CURL: http://codepad.viper-7.com/uWmsYB
For a nicely formatted and human readable output of the file size I've learned this amazing function from Laravel:
function get_file_size($size)
{
$units = array('Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB');
return #round($size / pow(1024, ($i = floor(log($size, 1024)))), 2).' '.$units[$i];
}
If you don't want to deal with all this, you should check out Guzzle. It's a very powerful and extremely easy to use library for any kind HTTP stuff.

How to work around a site forbidding me to scrape their images with PHP

I'm scraping a site, searching for JPGs to download.
Scraping the site's HTML pages works fine.
But when I try getting the JPGs with CURL, copy(), fopen(), etc., I get a 403 forbiden status.
I know that's because the site owners don't want their images scraped, so I understand a good answer would be just don't do it, because they don't want you to.
Ok, but let's say it's ok and I try to work around this, how could this be achieved?
If I get the same URL with a browser, I can open the image perfectly, it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often.
From my understanding, it could be that either the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG.
Or that maybe PHP is using some user agent for the requests that the server can detect and filter out.
Anyway, have any idea?

Actually it was quite simple.
As #Leigh suggested, it only took spoofing an http referrer with the option CURLOPT_REFERER.
In fact for every request, I just provided the domain name as the referrer and it worked.

Are you able to view the page through a browser? Wouldn't a simple search of the page source find all images?
` $findme = '.jpg';
$pos = strpos($html, $findme);
if ($pos === false) {
echo "The string '$findme' was not found in the string '$html'";
} else {
echo "Images found..
///grab image location code
} `

Basic image retrieval:
Using the GD Library plugin commonly installed by default with many web hosts. This is something of an ugly hack but some may find the fact it can be done this way useful.
$remote_img = 'http://www.somwhere.com/images/image.jpg';
$img = imagecreatefromjpeg($remote_img);
$path = 'images/';
imagejpeg($img, $path);
Classic cURL image grabbing function for when you have extracted the location of the image from the donor pages HTML.
function save_image($img,$fullpath){
$ch = curl_init ($img);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
$rawdata=curl_exec($ch);
curl_close ($ch);
if(file_exists($fullpath)){
unlink($fullpath);
}
$fp = fopen($fullpath,'x');
fwrite($fp, $rawdata);
fclose($fp);
}
If the basic cURL image grabbing function fails then the donor site probably has some form of server side defences in place to prevent retrieval and so you are probably breaching the terms of service by proceeding further. Though rare some sites do create images 'on the fly' using the GD library module, so what may look like a link to an image is actually a PHP script and that could be checking for things like a cookie, referer or session value being passed to it before the image is created and outputted.

PHP: get remote file size with strlen? (html)

I was looking at PHP docs for fsockopen and whatnot and they say you can't use filesize() on a remote file without doing some crazy things with ftell or something (not sure what they said exactly), but I had a good thought about how to do it:
$file = file_get_contents("http://www.google.com");
$filesize = mb_strlen($file) / 1000; //KBs, mb_* in case file contains unicode
Would this be a good method? It seemed so simple and good to use at the time, just want to get any thoughts if this could run into problems or not be the true file size.
I only wish to use this on text (websites) by the way not binary.

This answer requires PHP5 and cUrl. It first checks the headers. If Content-Length isn't specified, it uses cUrl to download it and check the size (the file is not saved anywhere though--just temporarily in memory).
<?php
echo get_remote_size("http://www.google.com/");
function get_remote_size($url) {
$headers = get_headers($url, 1);
if (isset($headers['Content-Length'])) return $headers['Content-Length'];
if (isset($headers['Content-length'])) return $headers['Content-length'];
$c = curl_init();
curl_setopt_array($c, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array('User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3'),
));
curl_exec($c);
return curl_getinfo($c, CURLINFO_SIZE_DOWNLOAD);
}
?>

You should look at the get_headers() function. It will return a hash of HTTP headers from an HTTP request. The Content-length header may be a better judge of the size of the actual content, if it's present.
That being said, you really should use either curl or streams to do a HEAD request instead of a GET. Content-length should be present, which saves you the transfer. It will be both faster and more accurate.

it will fetch the whole file and then calculate the filesize (rather the string length) out of the retrieved data. usually filesize can tell the filesize directly from the filesystem without reading the whole file first.
so this will be rather slow, and will everytime fetch the whole file before being able to retrieve the filesize (string length

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.