Best bulk image download practice using curl? - php

I have a script running for a Laravel 5.4 web application that is supposed to download a large number of images (10k). I'm wondering what the best way to handle this would be. I currently grab the base64_encode() data from the remote image and write it to a local folder with the function file_put_contents(). This works fine, but some images can take more than 10 seconds to download/write; imagine that times ten thousand. Fair enough, these images are rather big, but I would like to see this process happen faster, and thus I am asking for advice!
My current process is like this:
I read a JSON file containing all the image links I have to download.
I convert the JSON data to an array with json_decode() and loop through all the links with a foreach() loop, letting curl handle the rest.
All the relevant parts of the code look like this:
<?php
// Defining the paths for easy access.
$__filePath  = public_path() . DIRECTORY_SEPARATOR . "importImages" . DIRECTORY_SEPARATOR . "images" . DIRECTORY_SEPARATOR . "downloadList.json";
$__imagePath = public_path() . DIRECTORY_SEPARATOR . "importImages" . DIRECTORY_SEPARATOR . "images";

// Decode the JSON data into an array readable by PHP.
$this->imagesToDownloadList = json_decode(file_get_contents($__filePath));

// Loop through the image list and try to download
// all of the images that are present within the array.
foreach ($this->imagesToDownloadList as $IAN => $imageData) {
    $__imageGetContents = $this->curl_get_contents($imageData->url);
    $__imageBase64 = ($__imageGetContents) ? base64_encode($__imageGetContents) : false;

    if (!file_put_contents($__imagePath . DIRECTORY_SEPARATOR . $imageData->filename, base64_decode($__imageBase64))) {
        return false;
    }
}

return true;
And the curl_get_contents() function looks like this:
<?php
private function curl_get_contents($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Disabling SSL verification is insecure outside trusted environments.
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
I hope someone can enlighten me with possible improvements that I could apply to the way I'm currently handling this mass download.
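
Since the main cost here is waiting on each download sequentially, one common improvement is to run the transfers in parallel with PHP's curl_multi API. Below is a minimal sketch rather than a drop-in fix: downloadBatch() is a hypothetical helper, the batch size of 20 is an arbitrary concurrency cap, and it assumes the $imagesToDownloadList items and $__imagePath from the code above.

<?php
// Hypothetical helper: download one batch of images in parallel.
function downloadBatch(array $batch, $imagePath)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($batch as $imageData) {
        $ch = curl_init($imageData->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = array('ch' => $ch, 'filename' => $imageData->filename);
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // Wait for network activity instead of busy-looping.
        }
    } while ($active && $status === CURLM_OK);

    // Write each finished download straight to disk.
    foreach ($handles as $entry) {
        $data = curl_multi_getcontent($entry['ch']);
        if (is_string($data) && $data !== '') {
            file_put_contents($imagePath . DIRECTORY_SEPARATOR . $entry['filename'], $data);
        }
        curl_multi_remove_handle($mh, $entry['ch']);
        curl_close($entry['ch']);
    }
    curl_multi_close($mh);
}

// Usage: process the list in batches so only 20 requests run at once.
foreach (array_chunk((array) $this->imagesToDownloadList, 20) as $batch) {
    downloadBatch($batch, $__imagePath);
}

Note that $this-> only makes sense inside the class from the question; outside it, pass the list in as a plain variable.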

Related

Upload large files to Dropbox via HTTP API

I am currently implementing an upload mechanism for files from my webserver into my Dropbox app directory.
As stated in the API docs, there is the /upload endpoint (https://www.dropbox.com/developers/documentation/http/documentation#files-upload) which accepts files up to 150MB in size. However, I'm dealing with images and videos with a potential size of up to 2GB.
Therefore I need to use the upload_session endpoints. There is an endpoint to start the session (https://www.dropbox.com/developers/documentation/http/documentation#files-upload_session-start), one to append data and one to finish the session.
What is currently unclear to me is how exactly to use these endpoints. Do I have to split my file on my server into 150MB chunks (how would I do that with a video file?) and then upload the first chunk with /start, the next chunks with /append and the last one with /finish? Or can I just specify the file and the API somehow (??) does the splitting for me? Obviously not, but I somehow can't get my head around how I should calculate, split and store the chunks on my webserver without losing the session in between...
Any advice or links for further reading are greatly appreciated. Thank you!
As Greg mentioned in the comments, you decide how to manage the "chunks" of the files. In addition to his .NET example, Dropbox has a good upload session implementation in the JavaScript upload example of the Dropbox API v2 JavaScript SDK.
At a high-level, you're splitting up the file into smaller sizes (aka "chunks") and passing those to the upload_session mechanism in a specific order. The upload mechanism has a few parts that need to be used in the following order:
Call /files/upload_session/start. Use the resulting session_id as a parameter in the following methods so Dropbox knows which session you're interacting with.
Incrementally pass each "chunk" of the file to /files/upload_session/append_v2. A couple of things to be aware of:
You maintain a cursor (the session_id plus your current offset into the file), which is passed as a parameter in each consecutive call to this method, with the offset updated after every chunk.
The final call should include the property "close": true, which closes the session so it can be finished.
Pass the final cursor (and commit info) to /files/upload_session/finish. If you see the new file metadata in the response, then you did it!!
If you're uploading many files instead of large ones, then /files/upload_session/finish_batch and /files/upload_session/finish_batch/check are the way to go.
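To make the order concrete, here is roughly what the Dropbox-API-Arg header carries at each step (a sketch based on the endpoints named above; the session id, offsets and path are made-up examples):

/files/upload_session/start:     {"close": false}
/files/upload_session/append_v2: {"cursor": {"session_id": "AAAAAAAAxxxxxxxx", "offset": 50000000}}
/files/upload_session/finish:    {"cursor": {"session_id": "AAAAAAAAxxxxxxxx", "offset": 100000000}, "commit": {"path": "/destination_folder/file.zip", "mode": "overwrite"}}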
I know this is an old post, but here is a fully functional solution for your problem. Maybe someone else will find it useful. :)
<?php
$backup_folder = glob('/var/www/test_folder/*.{sql,gz,rar,zip}', GLOB_BRACE); // Accepted file types (sql,gz,rar,zip)
$token = '<ACCESS TOKEN>'; // Dropbox Access Token
$append_url = 'https://content.dropboxapi.com/2/files/upload_session/append_v2';
$start_url  = 'https://content.dropboxapi.com/2/files/upload_session/start';
$finish_url = 'https://content.dropboxapi.com/2/files/upload_session/finish';

if (!empty($backup_folder)) {
    foreach ($backup_folder as $single_folder_file) {
        $file_name = basename($single_folder_file); // File name
        $destination_folder = 'destination_folder'; // Dropbox destination folder

        $info_array = array();
        $info_array["close"] = false;
        $headers = array(
            'Authorization: Bearer ' . $token,
            'Content-Type: application/octet-stream',
            'Dropbox-API-Arg: ' . json_encode($info_array)
        );

        $chunk_size = 50000000; // 50 MB
        $fp = fopen($single_folder_file, 'rb');
        $fileSize = filesize($single_folder_file); // File size
        $tosend = $fileSize;
        $first = $tosend > $chunk_size ? $chunk_size : $tosend;

        // Start the upload session with the first chunk.
        $ch = curl_init($start_url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, fread($fp, $first));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $response = curl_exec($ch);
        $tosend -= $first;

        // Parse the session id from the JSON response.
        $session_id = json_decode($response)->session_id;
        $position = $first;
        $info_array["cursor"] = array();
        $info_array["cursor"]["session_id"] = $session_id;

        // Append full-sized chunks until less than one chunk remains.
        while ($tosend > $chunk_size) {
            $info_array["cursor"]["offset"] = $position;
            $headers[2] = 'Dropbox-API-Arg: ' . json_encode($info_array);
            curl_setopt($ch, CURLOPT_URL, $append_url);
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
            curl_setopt($ch, CURLOPT_POSTFIELDS, fread($fp, $chunk_size));
            curl_exec($ch);
            $tosend -= $chunk_size;
            $position += $chunk_size;
        }

        // Finish the session: send the remaining bytes along with the commit info.
        unset($info_array["close"]);
        $info_array["cursor"]["offset"] = $position;
        $info_array["commit"] = array();
        $info_array["commit"]["path"] = '/' . $destination_folder . '/' . $file_name;
        $info_array["commit"]["mode"] = array();
        $info_array["commit"]["mode"][".tag"] = "overwrite";
        $info_array["commit"]["autorename"] = true;
        $info_array["commit"]["mute"] = false;
        $info_array["commit"]["strict_conflict"] = false;
        $headers[2] = 'Dropbox-API-Arg: ' . json_encode($info_array);
        curl_setopt($ch, CURLOPT_URL, $finish_url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $tosend > 0 ? fread($fp, $tosend) : null);
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);

        unlink($single_folder_file); // Remove files from server folder
    }
}

What would be the best way to collect the titles (in bulk) of a subreddit

I am looking to collect the titles of all of the posts on a subreddit, and I wanted to know what would be the best way of going about this.
I've looked around and found some stuff about Python and bots. I've also had a brief look at the API and am unsure in which direction to go.
As I do not want to commit only to find out 90% of the way through that it won't work, I'm asking if someone could point me in the right direction in terms of language and extras, e.g. any software needed such as pip for Python.
My own experience is in web languages such as PHP, so I initially thought a web app would do the trick, but I am unsure whether this would be the best way and how to go about it.
So, as my question stands:
What would be the best way to collect the titles (in bulk) of a subreddit?
Or, if that is too subjective:
How do I retrieve and store all the post titles of a subreddit?
Preferably it needs to:
do more than 1 page of (25) results
save to a .txt file
Thanks in advance.
PHP; in 25 lines:
<?php
$subreddit = 'pokemon';
$max_pages = 10;

// Set variables with default data
$page = 0;
$after = '';
$titles = '';

do {
    // Set URL you want to fetch (https: reddit redirects plain http, which would fail the 200 check below)
    $url = 'https://www.reddit.com/r/' . $subreddit . '/new.json?limit=25&after=' . $after;
    $ch = curl_init($url);
    // Set curl option of header to false (don't need them)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Set curl option of nobody to false as we need the body
    curl_setopt($ch, CURLOPT_NOBODY, 0);
    // Set curl timeout of 5 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    // Set curl to return output as string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Execute curl
    $output = curl_exec($ch);
    // Get HTTP code of request
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Close curl
    curl_close($ch);

    // If HTTP code is 200 (success)
    if ($status == 200) {
        // Decode JSON into PHP object
        $json = json_decode($output);
        // Set "after" for next curl iteration (reddit's pagination)
        $after = $json->data->after;
        // Loop through each post and collect its title
        foreach ($json->data->children as $k => $v) {
            $titles .= $v->data->title . "\n";
        }
    }
    // Increment page number
    $page++;
// Loop while current page number is less than maximum pages
} while ($page < $max_pages);

// Save titles to text file
file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $titles);
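
One caveat worth adding (my addition based on reddit's API guidelines, not part of the original answer): reddit throttles clients that use a default user agent, so it helps to set a descriptive User-Agent alongside the other curl options above:

// Hypothetical addition: identify the script to reddit's API to avoid
// throttling of default user agents ("yourusername" is a placeholder).
curl_setopt($ch, CURLOPT_USERAGENT, 'php:subreddit-titles:v1.0 (by /u/yourusername)');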

Curl caching a downloaded image?

I have a script that tails a log file for a song change event. When that happens, a Tweet is auto-generated and sent to Twitter. The problem is that I recently updated my script so that it also includes the album art with the Tweet. I am downloading the image using curl, posting the media, then removing the file before the next Tweet. What is throwing me off is that the image displayed in the Tweet is for the previous song playing and not the current one. I want to know: if I download a file using curl, then use PHP's unlink command, would that same image be downloaded again? Here is my code below.
// Analyze prepared tweet for issues with length
$tweeted = ($front . $fixedArtistNow . $hyphen . " " . $fixedTitleNow . $tag);
if (strlen($tweeted) <= 140) {
    try {
        // $path = ("/home/soundcheck/public_html/images/artwork.png");
        // if(unlink($path));

        // Grab album art from the song that is currently playing
        $fresh = date("y-m-d-G-i-s");
        $ch = curl_init('http://soundcheck.xyz:8000/playingart?sid=1?' . $fresh);
        var_dump($ch);
        $fp = fopen('/home/soundcheck/public_html/images/artwork.png', 'wb');
        curl_setopt($fp, CURLOPT_FRESH_CONNECT, TRUE);
        curl_setopt($fp, CURLOPT_FORBID_REUSE, TRUE);
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);
        // copy("/home/soundcheck/public_html/images/artwork.png","/home/soundcheck/public_html/images/album.png");
        // $twitter->send($front . $fixedArtistNow . $hyphen . " " . $fixedTitleNow . $tag);
        $media1 = $twitter->upload('media/upload', ['media' => '/home/soundcheck/public_html/images/artwork.png']);
        // $media2 = $connection->upload('media/upload', ['media' => '/path/to/file/kitten2.jpg']);
        $parameters = [
            'status' => $tweeted,
            'media_ids' => implode(',', [$media1->media_id_string]),
        ];
        $result = $twitter->post('statuses/update', $parameters);
        // unlink("/home/soundcheck/public_html/images/artwork.png");
The URL I am getting the image from never changes, but the image does, if that makes any difference. If anyone knows a better way to download the album art, please let me know. Thanks!
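
A likely culprit in the snippet above: CURLOPT_FRESH_CONNECT and CURLOPT_FORBID_REUSE are set on the file pointer $fp instead of the curl handle $ch, so they never take effect. Here is a minimal corrected sketch (my own, reusing the question's URL and paths; the nocache parameter name is a made-up cache-buster):

<?php
// Sketch: fetch the current album art to a local file with connection
// reuse disabled. The nocache parameter is a hypothetical cache-buster.
$fresh = date("y-m-d-G-i-s");
$ch = curl_init('http://soundcheck.xyz:8000/playingart?sid=1&nocache=' . $fresh);
$fp = fopen('/home/soundcheck/public_html/images/artwork.png', 'wb');

curl_setopt($ch, CURLOPT_FRESH_CONNECT, true); // set on $ch, not $fp
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);  // set on $ch, not $fp
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);
curl_close($ch);
fclose($fp);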

Multiple file uploads with cURL

I'm using cURL to transfer image files from one server to another using PHP. This is my cURL code:
// Transfer the original image and thumbnail to our storage server
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, 'http://' . $server_data['hostname'] . '.localhost/transfer.php');
curl_setopt($ch, CURLOPT_POST, true);
$post = array(
    'upload[]' => '#' . $tmp_uploads . $filename,
    'upload[]' => '#' . $tmp_uploads . $thumbname,
    'salt' => 'q8;EmT(Vx*Aa`fkHX:up^WD^^b#<Lm:Q'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
$resp = curl_exec($ch);
This is the code in transfer.php on the server I'm uploading to:
if ($_FILES && $_POST['salt'] == 'q8;EmT(Vx*Aa`fkHX:up^WD^^b#<Lm:Q') {
    // Save the files
    foreach ($_FILES['upload']['error'] as $key => $error) {
        if ($error == UPLOAD_ERR_OK) {
            move_uploaded_file($_FILES['upload']['tmp_name'][$key], $_FILES['upload']['name'][$key]);
        }
    }
}
All seems to work, apart from one small logic error. Only one file is getting saved on the server I'm transferring to. This is probably because I'm calling both images upload[] in my post fields array, but I don't know how else to do it. I'm trying to mimic doing this:
<input type="file" name="upload[]" />
<input type="file" name="upload[]" />
Anyone know how I can get this to work? Thanks!
Here is your error in the curl call: var_dump($post) will show that you are clobbering the array entries of your $post array, since the key strings are identical. Make this change:
$post = array(
    'upload[0]' => '#' . $tmp_uploads . $filename,
    'upload[1]' => '#' . $tmp_uploads . $thumbname,
    'salt' => 'q8;EmT(Vx*Aa`fkHX:up^WD^^b#<Lm:Q'
);
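
Worth noting as an aside: the @-prefix upload syntax used here was deprecated in PHP 5.5 and removed in PHP 7 in favour of CURLFile. A sketch of the same POST using it (same variable names as the question):

$post = array(
    'upload[0]' => new CURLFile($tmp_uploads . $filename),  // absolute paths required
    'upload[1]' => new CURLFile($tmp_uploads . $thumbname),
    'salt'      => 'q8;EmT(Vx*Aa`fkHX:up^WD^^b#<Lm:Q'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);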
The code itself looks OK, but I don't know about your move() target directory. You're using the original uploaded filename (as specified in your curl script) as the target of the move, with no overwrite checking and no path data. If the two uploaded files have the same filename, you'll overwrite the first processed image with whichever one got processed second by PHP.
Try putting some debugging around the move() command:
if (!move_uploaded_file($_FILES['upload']['tmp_name'][$key], $_FILES['upload']['name'][$key])) {
    echo "Unable to move $key/";
    echo $_FILES['upload']['tmp_name'][$key];
    echo ' to ';
    echo $_FILES['upload']['name'][$key];
}
(I split the echo onto multiple lines for legibility).

Posting with PHP and Curl, deep array

I'm trying to post via curl. I've been using the same code over and over again with no problem, but now I need to be able to use an array for posts (I'm not sure if there's a proper term for that?).
I should clarify that it's specifically a file I'm trying to post, but I can't get it working with a string either, so I don't think it's to do with that.
This is absolutely fine:
$uploadData = array();
$uploadData['uploads'] = "#".$file;
$uploadData['iagree'] = 'on';
This doesn't appear to work:
$uploadData = array();
$uploadData['uploads'][0] = "#".$file;
$uploadData['iagree'] = 'on';
In the second example I'm trying to replicate an input with the attribute name="uploads[]".
Obviously I'm trying to curl an external site, but if I experiment by curling a page on my own server so that I can see what's being sent, I can see that the uploads array is being converted to a string:
print_r($_POST);
print_r($_FILES);
returns:
Array
(
    [uploads] => Array
    [iagree] => on
)
Array
(
)
This is my full Curl:
$uploadData = array();
$uploadData['uploads'][] = "#".$file;
$uploadData['iagree'] = 'on';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $theLink);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $uploadData);
$upload_response = curl_exec($ch);
curl_close($ch);
I've tried to give as much information as possible, but if I've missed something, feel free to ask and I'll provide more.
Other than that, does anyone have any suggestions or solutions?
Using $uploadData['uploads[]'] = "#".$file; and passing it as an array should work; just keep in mind you need the absolute path to the file.
There is no mechanism in 'simple' HTTP (multipart/form-data or application/x-www-form-urlencoded) for sending 'arrays'. However, PHP interprets the [ and ] characters in key-value pairs as special. PHP is alone in that, AFAIK; it's not an HTTP mechanism, it's just the input parsing PHP does, much like it replaces .'s in the names of values with _. curl is a third-party library that lives separately from PHP, and as such does not understand multidimensional arrays.
Try passing the query string:
$uploadData = 'uploads[]=#' . $file . '&iagree=on&uploads[]=#' . $file2;
See if that works for you.
EDIT
Reading through the manual, the string needs to be urlencoded, try this:
$uploadData = urlencode('uploads[]=#' . $file . '&iagree=on&uploads[]=#' . $file2);
Received this information from the curl_setopt() man page:
Note:
Passing an array to CURLOPT_POSTFIELDS will encode the data as multipart/form-data, while passing a URL-encoded string will encode the data as application/x-www-form-urlencoded.
I may have used the urlencode improperly, try this:
$uploadData = 'uploads[]=' . urlencode('#' . $file) . '&iagree=' . urlencode('on') . '&uploads[]=' . urlencode('#' . $file2);
UPDATE
Ok this is my last shot at it. Reading through some user comments at the curl page I found something about serializing the sub-array. So:
$uploadData = array('iagree' => 'on', 'uploads' => serialize(array('#' . $file)));
Hopefully that is the key. If that does not work... well, it may not be possible to do.
Give that a shot and see if it works. (Sorry for the trial and error; I do not have a way to test it!)
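
For completeness, here is a sketch of the variant that matches the first answer above on current PHP, where the @ prefix has been removed in favour of CURLFile. Numeric-indexed keys sidestep PHP's bracket parsing while still arriving as uploads[0] on the receiving end:

// Sketch for PHP 5.5+: post the file under uploads[0] using CURLFile
// instead of the removed '@' prefix. $file must be an absolute path.
$uploadData = array(
    'uploads[0]' => new CURLFile($file),
    'iagree'     => 'on'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $uploadData);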
