Download millions of images from external website - php

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.
I'm at a complete loss as to how to do this efficiently. I played with some numbers and concluded that, at a rate of 0.5 seconds per image, downloading ~10M images from an external server could take upwards of ~58 days to complete, which is obviously unacceptable.
Each photo seems to be roughly ~50KB, but that can vary with some being larger, much larger, and some being smaller.
I've been testing by simply using:
copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');
I've also tried cURL, wget, and others.
I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.
Pseudo-code based on the XML feed we're set to receive. We're parsing the XML using PHP:
<listing>
<listing_id>12345</listing_id>
<listing_photos>
<photo>http://example.com/photo1.jpg</photo>
<photo>http://example.com/photo2.jpg</photo>
<photo>http://example.com/photo3.jpg</photo>
<photo>http://example.com/photo4.jpg</photo>
<photo>http://example.com/photo5.jpg</photo>
<photo>http://example.com/photo6.jpg</photo>
<photo>http://example.com/photo7.jpg</photo>
<photo>http://example.com/photo8.jpg</photo>
<photo>http://example.com/photo9.jpg</photo>
<photo>http://example.com/photo10.jpg</photo>
</listing_photos>
</listing>
So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).
Any thoughts?

I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.
I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.
The way the proxy works is that we encrypt the image URL and then URL-encode the encrypted string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup, the img src attribute is something like http://ourproxy.com/getImage.ashx?image=xtX957z.
When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
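The proxy above is ASP.NET (getImage.ashx), but a rough PHP equivalent of the same idea might look like the sketch below. The token encryption, the cache directory and the hard-coded content type are placeholders for illustration, not the actual implementation:
<?php
// getImage.php -- minimal sketch of the on-demand caching proxy described above.
// decrypt_image_token() stands in for whatever encryption scheme you choose.
function decrypt_image_token($token) {
    $url = openssl_decrypt($token, 'aes-256-cbc', 'your-secret-key', 0, str_repeat('0', 16));
    return $url === false ? null : $url;
}

$cacheDir = __DIR__ . '/image-cache';
$url = decrypt_image_token(isset($_GET['image']) ? $_GET['image'] : '');
if ($url === null) {
    http_response_code(400);
    exit;
}

// Derive the cache filename from the URL, e.g. yourvendorcom.img1.jpg
$host = preg_replace('/[^a-z0-9]/i', '', (string) parse_url($url, PHP_URL_HOST));
$cacheFile = $cacheDir . '/' . $host . '.' . basename(parse_url($url, PHP_URL_PATH));

// Cache miss: fetch from the vendor once, write to disk, then always serve from disk
if (!is_file($cacheFile)) {
    $data = @file_get_contents($url);
    if ($data === false) {
        http_response_code(502);
        exit;
    }
    file_put_contents($cacheFile, $data, LOCK_EX);
}

header('Content-Type: image/jpeg'); // assume JPEG photos for this sketch
readfile($cacheFile);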

You can save all the links into a database table (this becomes your "job queue").
Then create a script which, in a loop, grabs a job and processes it (fetches the image for a single link and marks the job record as done).
You can run that script multiple times, e.g. under supervisord, so the job queue is processed in parallel. If it's too slow you can simply start another worker script (as long as bandwidth doesn't slow you down).
If any script hangs for some reason, you can easily run it again and it will only fetch the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.
Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting, just query the "job queue" table.
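A minimal sketch of such a worker, assuming a photo_jobs table with id, url, local_path and status columns (all names are placeholders for your own schema):
<?php
// worker.php -- claim pending jobs one at a time and download the image.
$pdo = new PDO('mysql:host=localhost;dbname=realestate', 'user', 'pass');

while (true) {
    // Atomically claim one pending job so parallel workers don't grab the same row
    $pdo->beginTransaction();
    $job = $pdo->query("SELECT id, url, local_path FROM photo_jobs
                        WHERE status = 'pending' LIMIT 1 FOR UPDATE")->fetch(PDO::FETCH_ASSOC);
    if (!$job) {
        $pdo->commit();
        sleep(5); // nothing to do; wait and poll again
        continue;
    }
    $pdo->prepare("UPDATE photo_jobs SET status = 'working' WHERE id = ?")
        ->execute(array($job['id']));
    $pdo->commit();

    // Download, then mark the job done (or failed so it can be retried)
    $ok = @copy($job['url'], $job['local_path']);
    $pdo->prepare("UPDATE photo_jobs SET status = ? WHERE id = ?")
        ->execute(array($ok ? 'done' : 'failed', $job['id']));
}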

Before you do this
As @BrokenBinar said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep to cap your requests at whatever rate they can support.
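For example, a crude throttle (the 10 requests per second figure is only an assumption; use whatever rate the host tells you):
<?php
// Throttled sequential download; $photoUrls comes from the parsed XML feed.
$requestsPerSecond = 10; // assumed limit
$photoUrls = array(/* ... photo URLs from the feed ... */);
foreach ($photoUrls as $url) {
    copy($url, '/path/to/folder/' . basename($url));
    usleep((int) (1000000 / $requestsPerSecond)); // pause between requests
}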
Curl Multi
Anyway, use cURL. This is somewhat of a duplicate answer, but copied here anyway:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// Create one handle per URL and add it to the multi handle
for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers in parallel
do {
    curl_multi_exec($master, $running);
} while ($running > 0);

// Collect the responses
$results = array();
for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);
From: PHP Parallel curl requests
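Applied to the original question, the same pattern can process the photo URLs in batches and write each image straight to disk with CURLOPT_FILE instead of buffering it in memory. A rough sketch, where the URL list, batch size and target directory are assumptions:
<?php
$photoUrls = array(/* ... URLs parsed from the XML feed ... */);
$targetDir = '/path/to/folder';
$batchSize = 50; // tune to what the remote host can handle

foreach (array_chunk($photoUrls, $batchSize) as $batch) {
    $mh = curl_multi_init();
    $handles = array();
    $files = array();

    foreach ($batch as $i => $url) {
        $fp = fopen($targetDir . '/' . basename($url), 'w');
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the body straight to the file
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
        $files[$i] = $fp;
    }

    // Run the batch; curl_multi_select avoids burning CPU while waiting
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $i => $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        fclose($files[$i]);
    }
    curl_multi_close($mh);
}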
Another solution:
Pthreads
<?php
class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run() {
        $this->response_body = file_get_contents($this->request_url);
    }
}

class WebWorker extends Worker {
    public function run() {}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
$thread = 0;
while ($thread++ < $max) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack($job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);

$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed->response_body);
}
printf("Total of %d bytes\n", $length);
?>
Source: PHP testing between pthreads and curl
You should really use the search feature, ya know :)

Related

PHP CodeIgniter: How to run processes in the background without getting my whole website stuck

I have a website on an Ubuntu LAMP server that has a form which collects variables and submits them to a function that handles them. The function calls other functions in the controller that "explode" the variables, arrange them in an array, and run a "for" loop over each variable, fetching new data from slow APIs and inserting the new data into the relevant tables in the database.
Whenever I submit the form, the whole website gets stuck (only for my IP; on other desktops the website keeps working normally) and I don't get redirected until the requested redirect("new/url"); is finally reached.
I have been researching this issue for a while and found this post as an example:
Continue PHP execution after sending HTTP response
After studying how this works on the server side, which is explained really well in this video: https://www.youtube.com/watch?v=xVSPv-9x3gk
I wanted to start learning the syntax and found out that this apparently only works from the CLI and not from Apache, but I wasn't sure.
I opened this post a few days ago: PHP+fork(): How to run a fork in a PHP code
and after getting everything working on the server side, installing pcntl ("fork") and figuring out the differences between the php.ini files on the server (I edited the apache2 php.ini, don't be mistaken), I stopped getting the errors I used to get for the "fork", but the processes don't run in the background, and I don't get redirected.
This is the controller after adding fork:
<?php
// Registers a new keyword for prod to the DB.
public function add_keyword() {
    $keyword_p = $this->input->post('key_word');
    $prod = $this->input->post('prod_name');
    $prod = $this->kas_model->search_prod_name($prod);
    $prod = $prod[0]->prod_id;
    $country = $this->input->post('key_country');
    $keyword = explode(", ", $keyword_p);
    var_dump($keyword);
    $keyword_count = count($keyword);
    echo "the keyword count: $keyword_count";
    for ($i = 0; $i < $keyword_count; $i++) {
        // create your next fork
        $pid = pcntl_fork();
        if (!$pid) {
            //*** get new vars from $keyword_count
            //*** run API functions to get new data_arrays
            //*** inserts new data for each $keyword_count to the DB
            print "In child $i\n";
            exit($i);
            // end child
        }
    }
    // we are the parent (main), check children (optional)
    while (pcntl_waitpid(0, $status) != -1) {
        $status = pcntl_wexitstatus($status);
        echo "Child $status completed\n";
    }
    // your other main code: Redirect to main page.
    redirect('banana/kas');
}
?>
And this is the controller without the fork:
// Registers a new keyword for prod to the DB.
public function add_keyword() {
    $keyword_p = $this->input->post('key_word');
    $prod = $this->input->post('prod_name');
    $prod = $this->kas_model->search_prod_name($prod);
    $prod = $prod[0]->prod_id;
    $country = $this->input->post('key_country');
    $keyword = explode(", ", $keyword_p);
    var_dump($keyword);
    $keyword_count = count($keyword);
    echo "the keyword count: $keyword_count";
    // problematic part that needs forking
    for ($i = 0; $i < $keyword_count; $i++) {
        // get new vars from $keyword_count
        // run API functions to get new data_arrays
        // inserts new data for each $keyword_count to the DB
    }
    // Redirect to main page.
    redirect('banana/kas');
}
The for ($i=0; $i < $keyword_count; $i++) loop is the part that I want to run in the background, because it takes too much time.
So now:
How can I get this working the way I explained? Because from what I see, fork isn't what I'm looking for, or I might be doing this wrong.
I will be happy to learn new techniques, so suggestions on how to do this in different ways are welcome. I am a self-learner, and I found out about the great advantages of Node.js, for example, which could have worked perfectly in this case if I had learned it. I will consider learning Node.js in the future; sending server requests and getting back responses is awesome ;).
If there is a need to add more information about something, please tell me in the comments and I will add it to my post if it's relevant and I missed it.
What you're really after is a queue or a job system: one script runs all the time, waiting for something to do. Your original PHP script then just adds a job to the list and continues its own processing as normal.
There are a few implementations of this; take a look at something like https://laravel.com/docs/5.1/queues
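A framework-agnostic sketch of the same idea, assuming a hypothetical jobs table: the controller only inserts a row and redirects, and a separate CLI script (run by cron or supervisord) does the slow API work:
<?php
// 1) In the controller (inside add_keyword()): just enqueue the slow work.
//    The `jobs` table and its columns are assumptions for illustration.
foreach ($keyword as $kw) {
    $this->db->insert('jobs', array(
        'type'    => 'fetch_keyword_data',
        'payload' => json_encode(array('keyword' => $kw, 'prod' => $prod, 'country' => $country)),
        'status'  => 'pending',
    ));
}
redirect('banana/kas'); // the browser gets its redirect immediately

// 2) worker.php, run separately from the CLI (cron or supervisord):
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
while ($job = $pdo->query("SELECT * FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1")
                  ->fetch(PDO::FETCH_ASSOC)) {
    $data = json_decode($job['payload'], true);
    // ... call the slow APIs and insert the results into the relevant tables ...
    $pdo->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")->execute(array($job['id']));
}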

Too slow Http Client in Zend Framework 1.12

I want to send ~50 requests to different pages on the same domain and then use the DOM object to extract the URLs of articles.
The problem is that this number of requests takes over 30 sec.
for ($i = 1; $i < 51; $i++) {
    $url = 'http://example.com/page/' . $i . '/';
    $client = new Zend_Http_Client($url);
    $response = $client->request();
    $dom = new Zend_Dom_Query($response); // even without these two lines, execution takes too long
    $results = $dom->query('li');
}
Is there any way to speed this up?
It's a general problem by design, not the code itself. If you're doing a for-loop over 50 items, each opening a request to a remote URI, things get pretty slow, since every request waits for the response from the remote URI. E.g. if a request takes ~0.6 sec to complete, multiply that by 50 and you get an execution time of 30 seconds!
The other problem is that most web servers limit the number of (open) connections per client to a specific amount. So even if you were able to do 50 requests simultaneously (which you currently are not), things wouldn't speed up measurably.
In my opinion there is only one solution (without any deep-going changes):
Change the number of requests per execution. Make chunks of, e.g., only 5-10 per (script) call and trigger them by an external call (e.g. run them via cron).
To do:
Build a wrapper function which is able to save the state of its current run ("I did requests 1-10 on my last run, so now I have to call 11-20") in a file or database, and trigger this function from a cron job.
Example code (untested) for illustration:
[...]
private static $_chunks = 10; // amount of calls per run

public function cronAction() {
    $lastrun = 0; // here: load the last-run parameter saved in a local file or database
    $this->crawl($lastrun);
}

private function crawl($lastrun) {
    $limit = self::$_chunks + $lastrun;
    for ($i = $lastrun; $i < $limit; $i++) {
        [...] // do stuff here
    }
    // here: set the $lastrun parameter to its new value in the local file / database
}
[...]
I can't think of a way to speed it up but you can increase the timeout limit in PHP if that is your concern:
for ($i = 1; $i < 51; $i++) {
    set_time_limit(30); // this restarts the timer to 30 seconds, starting now
    // Do long things here
}

is there any better way to use Youtube PHP API

I am using the YouTube PHP Zend API library.
With this API, I first send a request to get the temporary/confirmation code.
Then a request to get the access token.
After this I want to fetch the user information, so another request is made to
https://gdata.youtube.com/feeds/api/users/default
for the current user. It returns the URL with the userId.
Then finally I get the user information from that URL, which is in XML form.
I am fed up with this many requests; it takes a lot of time as well.
Is there another way to get these things while reducing the number of curl/ajax requests?
You can use curl_multi_* to do requests for different users in parallel. It won't speed up the process for every single user, but since you can do 10-30 or more requests in parallel, it will speed up the whole deal.
The only complication is that you will need a separate cookie file for every request. Here's sample code to get you started:
$chs = array();
$cmh = curl_multi_init();

// $tc is the number of requests you want to run in parallel
for ($t = 0; $t < $tc; $t++) {
    $chs[$t] = curl_init();
    // set $chs[$t] options here (URL, CURLOPT_RETURNTRANSFER, cookie file, ...)
    curl_multi_add_handle($cmh, $chs[$t]);
}

$running = null;
do {
    curl_multi_exec($cmh, $running);
} while ($running > 0);

$contents = array();
for ($t = 0; $t < $tc; $t++) {
    $contents[$t] = curl_multi_getcontent($chs[$t]);
    // work with $contents[$t]
    curl_multi_remove_handle($cmh, $chs[$t]);
    curl_close($chs[$t]);
}
curl_multi_close($cmh);
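For the "separate cookie file for every request" part, the per-handle options (set inside the first loop above) might look like this; the $userUrls array and the cookie paths are placeholders:
// Give each handle its own cookie jar so sessions don't bleed into each other
curl_setopt($chs[$t], CURLOPT_URL, $userUrls[$t]);
curl_setopt($chs[$t], CURLOPT_RETURNTRANSFER, true);
curl_setopt($chs[$t], CURLOPT_COOKIEFILE, "/tmp/cookies_$t.txt"); // read cookies from here
curl_setopt($chs[$t], CURLOPT_COOKIEJAR,  "/tmp/cookies_$t.txt"); // write cookies back here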

Ajax call to php for csv file manipulation hangs

Okay so I have a button. When pressed it does this:
Javascript
$("#csv_dedupe").live("click", function(e) {
file_name = 'C:\\server\\xampp\\htdocs\\Gene\\IMEXporter\\include\\files\\' + $("#IMEXp_import_var-uploadFile-file").val();
$.post($_CFG_PROCESSORFILE, {"task": "csv_dupe", "file_name": file_name}, function(data) {
alert(data);
}, "json")
});
This ajax call gets sent out to this:
PHP
class ColumnCompare {
    function __construct($column) {
        $this->column = $column;
    }

    function compare($a, $b) {
        if ($a[$this->column] == $b[$this->column]) {
            return 0;
        }
        return ($a[$this->column] < $b[$this->column]) ? -1 : 1;
    }
}

if ($task == "csv_dupe") {
    $file_name = $_REQUEST["file_name"];
    // Hard-coded input
    $array_var = array();
    $sort_by_col = 9999;

    // Open csv file and dump contents
    if (($handler = fopen($file_name, "r")) !== FALSE) {
        while (($csv_handler = fgetcsv($handler, 0, ",")) !== FALSE) {
            $array_var[] = $csv_handler;
        }
    }
    fclose($handler);

    // Copy original csv data array to be compared later
    $array_var2 = $array_var;

    // Find email column
    $new = $array_var[0];
    $findme = 'email';
    $counter = 0;
    foreach ($new as $key) {
        $pos = strpos($key, $findme);
        if ($pos === false) {
            $counter++;
        } else {
            $sort_by_col = $counter;
        }
    }
    if ($sort_by_col === 9999) {
        echo 'COULD NOT FIND EMAIL COLUMN';
        return;
    }

    // Temporarily remove headers from array
    $headers = array_shift($array_var);

    // Create object for sorting by a particular column
    $obj = new ColumnCompare($sort_by_col);
    usort($array_var, array($obj, 'compare'));

    // Remove duplicates from a column
    array_unshift($array_var, $headers);
    $newArr = array();
    foreach ($array_var as $val) {
        $newArr[$val[$sort_by_col]] = $val;
    }
    $array_var = array_values($newArr);

    // Write the CSV back to the file
    $sout = fopen($file_name, 'w');
    foreach ($array_var as $fields) {
        fputcsv($sout, $fields);
    }
    fclose($sout);

    // How many dupes were there?
    $number = count($array_var2) - count($array_var);
    echo json_encode($number);
}
This PHP reads all the data from a CSV file, columns and rows, and using the fgetcsv function it assigns all the data to an array. I also have code in there that dedupes (finds and removes a copy of a duplicate) the CSV file by a single column, keeping the row and column structure of the entire array intact.
The only problem is that, even though it works with the small files of 10 or so rows I tested with, it does not work for files with 25,000 rows.
Now before you say it, I have gone into my php.ini file and changed max_input, the file size limit, max execution time etc. to astronomical values to ensure PHP can accept file sizes of up to 999999999999999MB and run its script for a few hundred years.
I used a file with 25,000 records and executed the script. It's been two hours and Fiddler still shows that an HTTP response has not yet been sent back. Can someone please give me some ways to optimize my server and my code?
That code came from a user who helped me in another question I posted on how to do this in the first place. My concern now is that, even though I tested it and it works, I want to know how to make it work in less than a minute. Excel can dedupe a column of a million records in a few seconds; why can't PHP do this?
Sophie, I assume that you are not experienced at writing this type of application because IMO this isn't the way to approach this. So I'll pitch this accordingly.
When you have a performance problem like this, you really need to binary chop the problem to understand what is going on. So step 1 is to decouple the PHP timing problem from AJAX and get a simple understanding of why your approach is so unresponsive. Do this using a locally installed PHP-cgi, or even use your web install and issue a header('Content-Type: text/plain') and dump out microtimings of each step. How long does the CSV read take, ditto the sort, then the nodup, then the write? Do this for a range of CSV file sizes, going up by 10x in row count each time.
Also do a memory_get_usage() at each step to see how you are chomping through memory, because your approach is a real memory hog and you are probably erroring out by hitting the configured memory limits; a phpinfo() will tell you what they are.
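A minimal sketch of that kind of instrumentation (the step names are placeholders for your actual code):
<?php
header('Content-Type: text/plain');

// Print elapsed time since the previous checkpoint and current memory usage
function checkpoint($label) {
    static $last = null;
    $now = microtime(true);
    printf("%-10s %8.3f s %8.1f MB\n",
        $label,
        $last === null ? 0 : $now - $last,
        memory_get_usage(true) / 1048576);
    $last = $now;
}

checkpoint('start');
// ... read the CSV ...
checkpoint('read');
// ... sort ...
checkpoint('sort');
// ... dedupe ...
checkpoint('dedupe');
// ... write ...
checkpoint('write');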
The read, nodup and write are all O(N), but the sort is O(N log N) at best and O(N²) at worst. Your sort is also calling a PHP method per comparison, so it will be slow.
What I don't understand is why you are even doing the sort, since your nodup algo does not make use of the fact that the rows are sorted.
(BTW, the sort will also sort the header row inline, so you need to unshift it before you do the sort if you still want to do it.)
There are other issues that you need to think about, such as:
Using a raw parameter as a filename makes you vulnerable to attack. Better to fix the path relative to, say, DOCROOT/Gene/IMEXporter/include and enforce some grammar on the file names.
You need to think about the atomicity of reading and rewriting large files as a response to a web request: what happens if two clients make the request at the same time?
Lastly, you compare this to Excel; well, loading and saving Excel files can take time, and Excel doesn't have to scale to respond to tens or hundreds of users at the same time. In a transactional system you typically use a D/B backend for this sort of thing, and if you are using a web interface for compute-heavy tasks, you need to accept the Apache (or equivalent server) hard memory and timing constraints and shape your algorithms and approach accordingly.
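To illustrate the point about the sort being unnecessary, here is a hedged sketch of a single-pass dedupe that streams the CSV row by row and only remembers which email values it has already seen; the file handling and the 'email' header match follow the question's code:
<?php
$in  = fopen($file_name, 'r');
$out = fopen($file_name . '.tmp', 'w');

// Find the email column from the header row
$headers = fgetcsv($in);
$emailCol = null;
foreach ($headers as $i => $h) {
    if (strpos($h, 'email') !== false) {
        $emailCol = $i;
    }
}
if ($emailCol === null) {
    echo 'COULD NOT FIND EMAIL COLUMN';
    return;
}
fputcsv($out, $headers);

// Keep the first row per email, skip the rest -- no sort, no full-file arrays
$seen = array();
$dupes = 0;
while (($row = fgetcsv($in)) !== false) {
    $key = $row[$emailCol];
    if (isset($seen[$key])) {
        $dupes++;
        continue;
    }
    $seen[$key] = true;
    fputcsv($out, $row);
}
fclose($in);
fclose($out);
rename($file_name . '.tmp', $file_name); // replace the original in one step

echo json_encode($dupes);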

Multiple RSS feeds with PHP (performance)

In my recent project I work with multiple RSS feeds. I want to list only the latest posts from all of them, sorted by timestamp.
My issue is that I have about 20 different feeds and the page takes 6 seconds to load (while testing with only 10 feeds).
What can I do to make it perform better?
I use simplexml:
simplexml_load_file($url);
Which i append to an array:
function appendToArray($key, $value) {
    $this->array[$key] = $value;
}
Just before displaying it I apply krsort:
krsort($this->array);
Should I cache it somehow?
You could cache them, but you would still have the problem of the page taking ages to load if caches have expired.
You could have a PHP script which runs in the background (e.g. via a cron job) and periodically downloads the feeds you are subscribed to into a database, then you can do much faster fetching/filtering of the data when you want to display it.
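A hedged sketch of such a background fetcher; the feed list and cache directory are assumptions, and the page then reads the cached XML instead of hitting the feeds directly:
<?php
// fetch_feeds.php -- run from cron (e.g. every 10 minutes); caches each feed to disk.
$feeds = array(
    'http://example.com/feed1.rss',
    'http://example.com/feed2.rss',
);
$cacheDir = __DIR__ . '/feed-cache';

foreach ($feeds as $url) {
    $xml = @file_get_contents($url);
    if ($xml !== false) {
        file_put_contents($cacheDir . '/' . md5($url) . '.xml', $xml, LOCK_EX);
    }
}

// In the page, load from the cache instead of the network:
// $feed = simplexml_load_file($cacheDir . '/' . md5($url) . '.xml');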
Have you done any debugging, e.g. logging microtime at various points in your code?
You'll find that it's the loading of the RSS feeds, rather than the parsing, that takes the time, but you might find that this is down to the time each RSS feed takes to generate.
Save those ten feeds as static xml files, point your script at them and see how fast it takes to load.
You can load the RSS feeds in parallel with curl_multi. That could speed up your script, especially if you're using blocking calls at the moment.
A small example (from http://www.rustyrazorblade.com/2008/02/curl_multi_exec/) :
$nodes = array('http://www.google.com', 'http://www.microsoft.com', 'http://www.rustyrazorblade.com');
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

echo "results: ";
for ($i = 0; $i < $node_count; $i++) {
    $results = curl_multi_getcontent($curl_arr[$i]);
    echo($i . "\n" . $results . "\n");
}
echo 'done';
More info can be found at Asynchronous/parallel HTTP requests using PHP multi_curl and How to use curl_multi() without blocking (amongst others).
BTW, to process the feeds after they are loaded using curl_multi, you will have to use simplexml_load_string instead of simplexml_load_file, of course.
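For example, continuing from the loop above:
// Parse each fetched feed from the string returned by curl_multi_getcontent
for ($i = 0; $i < $node_count; $i++) {
    $xml = curl_multi_getcontent($curl_arr[$i]);
    $feed = simplexml_load_string($xml);
    if ($feed !== false) {
        // e.g. collect the feed's items into your array keyed by timestamp
    }
}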
Yes, of course, caching is the only sensible solution.
It is better to set up a cron job to retrieve these feeds and store the data locally.
