On S3 I've got around 100M files (2.5MB each) in this hierarchy:
id_folder / date_folder / hour_file.raw
I've tried 3 different ways to fetch them ASAP:
I started with the Laravel Storage facade (I'm using Laravel):
Storage::disk('s3')->get($filePath); -> this one is the slowest
Then I googled a little and found this class:
Amazon S3 PHP Class
http://undesigned.org.za/2007/10/22/amazon-s3-php-class/
I also tried to follow Amazon's instructions for creating an S3Client and using the getObject function, and it's still slow...
So, I need to get a lot of files from S3 to EC2 - what is the fastest way to do it?
Thanks!
If I'm understanding everything you're saying, there's not going to be a way around downloading that many objects being slow. 100,000,000 * 2.5MB = 250TB. That's a lot of data. There are things you can do to make it more efficient though.
If you try to get many (i.e. thousands of) objects "at once" by synchronously downloading them with S3Client::getObject, it will take forever. You get a little further with S3Client::getObjectAsync, which returns a GuzzleHttp\Promise\Promise, but that alone isn't really asynchronous: the requests don't execute concurrently just because you created promises, and GuzzleHttp\Promise\Promise::wait blocks until its request completes. Simply iterating through a loop and calling wait on each promise in turn will still take forever.
However, if you break your requests up and execute them in batches of promises that run simultaneously, you can shave significant time off. Guzzle provides a few ways to wait on an array of promises, but I prefer the GuzzleHttp\Promise\unwrap function, which returns an array containing the result of each promise it is given.
Below is a generator I've written that does just that:
// Requires: use Aws\Result; use Aws\S3\Exception\S3Exception;
//           use function GuzzleHttp\Promise\unwrap;
public function getObjectsBatch($bucket, $keys, $chunkSize = 350)
{
    foreach (array_chunk($keys, $chunkSize) as $chunk) {
        $promises = [];
        foreach ($chunk as $key) {
            $promises[] = $this->getClient()->getObjectAsync([
                'Bucket' => $bucket,
                'Key'    => $key,
            ])->then(
                function (Result $res) use ($key) {
                    // remember which key this result belongs to
                    $res->offsetSet('Key', $key);
                    return $res;
                },
                function (S3Exception $res) {
                    // return the exception instead of throwing, so one failed
                    // object doesn't reject the whole batch
                    return $res;
                }
            );
        }
        // block until every promise in this batch has settled, then yield the results
        yield unwrap($promises);
    }
}
I'm using this to download thousands of objects, and stream them to the user as they are downloaded.
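For example, consuming the generator might look something like this (a sketch; 'my-bucket' and the /tmp path are placeholders, and the error check relies on the fail callback above returning the S3Exception instead of throwing it):

foreach ($this->getObjectsBatch('my-bucket', $keys) as $batch) {
    foreach ($batch as $result) {
        if ($result instanceof S3Exception) {
            continue; // this object failed; skip it (or log it)
        }
        // Each Aws\Result carries the body stream plus the 'Key' we attached
        file_put_contents('/tmp/' . basename($result['Key']), (string) $result['Body']);
    }
}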
The size of your batch is important. In the example, I'm executing 350 requests at a time. I've done a bit of testing and that seems to be the most efficient: I downloaded 4,500 objects from S3 using various batch sizes, running each test 10 times, and a batch size of 350 came out on top.
But your specific use case, downloading 250TB of data at one time, will take a long time no matter how you do it. You'll also quickly run out of memory if you don't save the files to disk, and then you'll have to worry about disk space instead. I'm not sure why you need to download that many files, but it doesn't seem like a good idea.
Related
I have a PHP script that processes data downloaded from multiple REST APIs into a standardized format and builds an array or table of this data. The script currently executes everything synchronously and therefore takes too long.
I have been trying to learn how to execute the function that fetches and processes the data simultaneously or asynchronously, so that the total time is roughly the time of the slowest call. From my research it appears that ReactPHP or Amp are the right tools.
However, I have been unsuccessful in creating test code that actually executes correctly. A simple example is attached, with mysquare() representing my more complex function. Due to a lack of examples online of exactly what I'm trying to achieve, I have resorted to a brute-force approach with the 3 experiments listed in my code.
Q1: Am I using the right tool for the job?
Q2: Can you fix my example code to execute asynchronously?
NB: I am a real beginner, so the simplest possible code example with a minimum of high level programming lingo would be appreciated.
<?php
require_once("../vendor/autoload.php");

for ($i = 0; $i <= 4; $i++) {
    // Experiment 1
    $deferred[$i] = new React\Promise\Deferred(function () use ($i) {
        echo $i."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($i => $i * $i);
    });
    // Experiment 2
    $promise[$i] = $deferred[$i]->promise(function () use ($i) {
        echo $i."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($i => $i * $i);
    });
    // Experiment 3
    $functioncall[$i] = function () use ($i) {
        echo $i."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($i => $i * $i);
    };
}

$promises = React\Promise\all($deferred);     // Doesn't work
$promises = React\Promise\all($promise);      // Doesn't work
$promises = React\Promise\all($functioncall); // Doesn't work
// print_r($promises); // Doesn't return array of results but a complex object

// This is what I would like to execute simultaneously with a variety of inputs
function mysquare($x)
{
    echo $x."\n";
    usleep(rand(0, 3000000)); // Simulates long network call
    return array($x => $x * $x);
}
Asynchronous doesn't mean multiple threads execute in parallel. Two functions can only really run at the 'same time' if they (for example) do I/O such as an HTTP request.
usleep() blocks, so you gain nothing. Both ReactPHP and Amp have their own 'sleep' function that's built right into the event loop.
For the same reason you cannot just use curl, because it also blocks out of the box. You need to use the HTTP libraries that React and Amp provide and/or recommend.
Since your end goal is just doing HTTP requests, you could also skip these frameworks entirely and just use the curl_multi functions. They're a bit hard to use, though.
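For reference, a bare-bones sketch of that curl_multi approach (the URLs are placeholders):

<?php
// Run several HTTP requests concurrently with curl_multi and collect the bodies.
$urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until none are still running
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

print_r(array_map('strlen', $results)); // show how many bytes each request returned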
I'm answering my own question in an attempt to help other users; however, this solution was developed without the help of an experienced programmer, so I do not know if it is ultimately the best way to do this.
TL;DR
I switched from ReactPHP, which I didn't understand, to amphp/parallel-functions, which offers a simplified end-user interface. Sample code using this interface is attached.
<?php
require_once("../vendor/autoload.php");

use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

$start = \microtime(true);

$mysquare = function ($x) {
    sleep($x); // Simulates long network call
    //echo $x."\n";
    return $x * $x;
};

print_r(wait(parallelMap([5, 4, 3, 2, 1, 6, 7, 8, 9, 10], $mysquare)));

print 'Took ' . (\microtime(true) - $start) . ' seconds.' . \PHP_EOL;
The example code executes in 10.2 seconds which is slightly longer than the longest running instance of $mysquare().
In my actual use case I was able to fetch data via HTTP from 90 separate sources in around 5 seconds.
Notes:
The amphp/parallel-functions library appears to be using threads under the hood. From my preliminary experience this seems to require a lot more memory than a single-threaded PHP script, but I haven't yet ascertained the full impact. This was highlighted when I passed a large array (65MB) to $mysquare via a "use ($myarray)" expression: it brought the code to a standstill and increased execution time so much that it took orders of magnitude longer than synchronous execution, and memory usage peaked at over 5GB at one point, leading me to believe that amphp was duplicating $myarray for each worker. Reworking my code to avoid the "use ($myarray)" expression fixed that problem.
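A minimal sketch of that kind of rework, assuming each worker only needs its own item of data rather than the whole array (the job structure and URLs are illustrative):

<?php
require_once("../vendor/autoload.php");

use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

// Instead of capturing one huge array with "use ($myarray)" (which gets
// serialized and copied to every worker), pass each worker only the item
// it actually needs as its argument.
$jobs = [
    ['url' => 'https://example.com/feed1'], // placeholder inputs
    ['url' => 'https://example.com/feed2'],
    ['url' => 'https://example.com/feed3'],
];

$worker = function (array $job) {
    // Fetch and process a single source; file_get_contents is just a stand-in
    $body = file_get_contents($job['url']);
    return strlen($body);
};

print_r(wait(parallelMap($jobs, $worker)));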
I have a daily cron job which gets an XML file from a web service. Sometimes it is large, containing more than 10K products, and the file can be around 14MB.
What I need to do is parse the XML into objects and then process them. The processing is quite complicated. It's not just putting the data directly into the database: I need to do a lot of operations on it and finally write it into many database tables.
It is all in one PHP script. I don't have any experience dealing with large data sets.
So the problem is that it takes a lot of memory and a very long time. I raised my localhost PHP memory_limit to 4G and it ran for 3.5 hours before finishing successfully. But my production host doesn't allow that much memory.
I did some research, but I am very confused about the right way to deal with this situation.
Here is a sample of my code:
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    // it will loop over 10K items
    foreach ($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        // my processing code here; it calls other functions that do a lot of things
        processing($data);
    }
    unset($results);
}
As a start, don't use SimpleXMLElement on the whole document. SimpleXMLElement loads everything into memory and is not efficient for large data. Here is a snippet from real code; you'll need to adapt it to your case, but I hope you get the general idea.
$reader = new XMLReader();
$reader->xml($xml);

// Get cursor to first article
while ($reader->read() && $reader->name !== 'article');

// Iterate articles
while ($reader->name === 'article') {
    $doc = new DOMDocument('1.0', 'UTF-8');
    $article = simplexml_import_dom($doc->importNode($reader->expand(), true));
    processing($article);
    $reader->next('article');
}

$reader->close();
$article is a SimpleXMLElement which can be processed further.
This way you save a lot of memory, because only a single article node is in memory at a time.
Additionally, if each processing() call takes a long time, you can turn it into a background process that runs separately from the main script, so several processing() calls can run in parallel (see the sketch below).
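A rough sketch of one way to do that, assuming a hypothetical worker script called process_item.php:

// Inside the XMLReader loop, instead of calling processing($article) inline,
// write the node to a temporary file and hand it to a background worker.
// process_item.php is a hypothetical script that reads the file and runs
// the processing logic; the trailing "&" detaches it from this process.
$tmp = tempnam(sys_get_temp_dir(), 'item_');
file_put_contents($tmp, $article->asXML());
exec('php process_item.php ' . escapeshellarg($tmp) . ' > /dev/null 2>&1 &');

In practice you would also want to limit how many workers run at once (or push the jobs onto a proper queue) so that 10K items don't spawn 10K processes at the same time.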
Key hints:
Dispose of data during processing.
By "dispose" I mean overwrite it with blank data. BTW, unset is slower than overwriting with null.
Use functions or static methods; avoid creating OOP instances as much as possible.
One extra question: how long does it take to loop over your XML without doing [lots of things]?
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    //it will loop over 10K
    foreach ($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        //my processing code here, it will call other functions to do a lot of things
        //processing($data);
    }
    //unset($results); // no need
}
So I've been trying my hand at Laravel's Eloquent chunking, but I've run into a problem. Consider the following code (a much simplified version of my problem):
$data = DB::connection('mydb')->table('bigdata')
    ->chunk(200, function($data) {
        echo memory_get_usage();
        foreach ($data as $d) {
            Model::create(array(
                'foo' => $d->bar,
                ...
                //etc
            ));
        }
    });
So when I run this code, my memory usage output looks like this:
19039816
21490096
23898816
26267640
28670432
31038840
So without jumping into php.ini and changing the memory_limit value any clue why it isn't working? According to the documentation: "If you need to process a lot (thousands) of Eloquent records, using the chunk command will allow you to do without eating all of your RAM".
I tried unset($data) after the foreach loop, but it did not help. Any clue as to how I can make use of chunk, or did I misinterpret what it does?
Chunking data doesn't reduce memory usage here; you need to paginate directly at the database level.
For example, first fetch the first 200 rows ordered by id, and after processing those 200, fire the query again with a where clause asking for the next 200 results.
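A minimal sketch of that keyset-pagination idea with the query builder (it assumes the table has an auto-incrementing id column; the batch size of 200 matches the question):

$lastId = 0;
do {
    // Fetch the next 200 rows strictly after the last id we have processed
    $rows = DB::connection('mydb')->table('bigdata')
        ->where('id', '>', $lastId)
        ->orderBy('id')
        ->limit(200)
        ->get();

    foreach ($rows as $d) {
        Model::create(array('foo' => $d->bar /* , ...etc */));
        $lastId = $d->id;
    }
} while (count($rows) === 200);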
You can use lazy collections to improve memory usage for a big collection of data. They use PHP generators under the hood. Take a look at the cursor example here: https://laravel.com/docs/5.4/eloquent#chunking-results
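For example, a sketch against the same table from the question; cursor() yields one row at a time via a generator instead of loading the whole result set:

foreach (DB::connection('mydb')->table('bigdata')->orderBy('id')->cursor() as $d) {
    // Only the current row is held in memory
    Model::create(array('foo' => $d->bar /* , ...etc */));
}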
So I have this function for making non-blocking curl requests. It works fine for what I've tested so far (small numbers of requests), but I need it to scale up to thousands of requests (maybe a maximum of 10,000). My issue is that I don't want to run into problems with too many parallel requests running at once.
What would you suggest to rate-limit the requests? usleep? Requests in batches? The function is below:
function poly_curl($requests){
    $queue = curl_multi_init();
    $curl_array = array();
    $count = 0;
    foreach ($requests as $request) {
        $curl_array[$count] = curl_init($request);
        curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($queue, $curl_array[$count]);
        $count++;
    }

    $running = NULL;
    do {
        curl_multi_exec($queue, $running);
    } while ($running > 0);

    $res = array();
    $count = 0;
    foreach ($requests as $request) {
        $res[$count] = curl_multi_getcontent($curl_array[$count]);
        $count++;
    }

    $count = 0;
    foreach ($requests as $request) {
        curl_multi_remove_handle($queue, $curl_array[$count]);
        $count++;
    }
    curl_multi_close($queue);

    return $res;
}
I think curl_multi_exec is bad for this purpose, because even if you use batches of 100, 99 requests could be finished and you would still have to wait for the last one to complete.
But you need 100 parallel requests, and when one finishes, another should start immediately. So you cannot use curl_multi_exec at all.
I would use a normal producer-consumer setup with a constant number of consumers, each consumer processing only one URL at a time. For example, php-resque started with COUNT=100 php resque.php.
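A rough sketch of what that producer-consumer split might look like with php-resque (the queue name and job class are made up for illustration; see the php-resque README for the exact setup):

<?php
// Producer: push one job per URL onto a queue.
require 'vendor/autoload.php';

Resque::setBackend('localhost:6379'); // assumes a local Redis instance
foreach ($urls as $url) {
    Resque::enqueue('downloads', 'DownloadJob', array('url' => $url));
}

// Consumer: each worker pulls jobs and handles one URL at a time.
// Start e.g. 100 workers with: QUEUE=downloads COUNT=100 php resque.php
class DownloadJob
{
    public function perform()
    {
        $body = file_get_contents($this->args['url']);
        // ... process / store $body ...
    }
}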
You may want to implement something that is called Exponential Backoff (wikipedia).
Basically, it is an algorithm that lets you dynamically scale your processes depending on some feedback.
You define a rate in your application, and on the first time-out, error, or whatever condition you decide, you decrease this rate until the requests complete.
You may implement it easily using the HTTP response code for example.
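A bare-bones sketch of that idea for a single request (the attempt limit and delays are arbitrary; a real version would also cap the total wait and decide which status codes are worth retrying):

<?php
// Retry a URL with exponentially growing delays when the transfer fails
// or the server returns an error status code.
function fetch_with_backoff($url, $maxAttempts = 5)
{
    $delay = 1; // seconds before the first retry
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($body !== false && $code < 400) {
            return $body; // success
        }

        sleep($delay);
        $delay *= 2; // back off: 1s, 2s, 4s, 8s, ...
    }
    return false; // give up after $maxAttempts tries
}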
The last time I was doing something like this, it involved downloading and "parsing" files. I was able to process only 4 subpages at a time, limited by a very weak processor (2 cores with HT). I ended up with two queues: one for waiting, one for in-progress. Every time a task left the second queue, a new one was taken from the first.
It may sound complicated, but it ended up as two loops, one inside the other, and simple count()s.
By the way, considering such a high request rate I would think about using Node.js, for simplicity, or anything more non-blocking and more suitable for daemons than PHP. As long as threads are a PHP weak point, it just doesn't fit here.
PS: nice & useful bit of code, thanks.
We used to face the same problem with C++ connection-pooling code. The approach in those days involved some serious analysis.
But the essence was that we created a pool, and requests got processed depending on the number of available connections. We also assigned a maximum pool size (this was determined by testing).
What you really need is a way to determine how many requests are being processed and to put a limit on it. In your case that is $count.
Just compare $count to a maximum value (say, $max) and stop there. Define the value depending on the system the program runs on; $max could be hardcoded or dynamic.
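With the poly_curl() function from the question, that cap can be as simple as chunking the URL list into batches of $max (the value of 100 is just an example; tune it to your system):

$max = 100; // maximum number of requests in flight at once (example value)
$results = array();

foreach (array_chunk($requests, $max) as $batch) {
    // poly_curl() from the question runs one batch concurrently and
    // returns once every request in the batch has finished
    $results = array_merge($results, poly_curl($batch));
}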
I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes somewhat like this:
I fetch a base URL and get all secondary URLs from this page, then for each secondary URL I fetch it, process the found page, download some photos (which takes quite a long time) and store this data in the database, then fetch the next URL and repeat the process.
In this process, I think I am wasting some time fetching the secondary URL at the start of each iteration. So I am trying to fetch the next URLs in parallel while processing the first iteration.
The solution in my mind is to call a PHP script from the main process, say a downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.
My questions are:
How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
Is there any place other than a database to store the downloaded data, such as shared memory?
Is there any chance that data gets corrupted while storing and retrieving, and how do I avoid this?
Also, please let me know if anyone has a better plan.
When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait for all of them to complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing that too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main part to give you the general idea:
public function launch() {
$channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
$activeJobs = array();
$running = 0;
do {
// pick jobs for free channels:
while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
// take free channel, (re)init curl handle and let
// queued object set options
$chId = key($freeChannels);
if (empty($channels[$chId])) {
$channels[$chId] = curl_init();
}
$job = array_pop($this->jobQueue);
$job->init($channels[$chId]);
curl_multi_add_handle($this->master, $channels[$chId]);
$activeJobs[$chId] = $job;
unset($freeChannels[$chId]);
}
$pending = count($activeJobs);
// launch them:
if ($pending > 0) {
while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
// poke it while it wants
curl_multi_select($this->master);
// wait for some activity, don't eat CPU
while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
// some connection(s) finished, locate that job and run response handler:
$pending--;
$chId = array_search($info['handle'], $channels);
$content = curl_multi_getcontent($channels[$chId]);
curl_multi_remove_handle($this->master, $channels[$chId]);
$freeChannels[$chId] = NULL;
// free up this channel
if ( !array_key_exists($chId, $activeJobs) ) {
// impossible, but...
continue;
}
$activeJobs[$chId]->onComplete($content);
unset($activeJobs[$chId]);
}
}
} while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version, the $jobs are instances of a separate class, not of controllers or models. They just handle setting cURL options, parsing the response, and calling a given onComplete callback.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if it's not just the retrieving that takes time, but the processing as well... And it isn't true parallel handling. But I still hope it helps. :)
P.S. It did the trick for me. :) An 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "OK, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client-side, using JavaScript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to; it then does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
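A quick sketch of the calling side of that idea (mysite.tld/huntergatherer is the hypothetical endpoint described above):

<?php
// POST a JSON array of URLs to the hypothetical hunter/gatherer endpoint
// and decode the JSON array of results it returns.
$urls = array('http://example.com/page1', 'http://example.com/page2');

$ch = curl_init('http://mysite.tld/huntergatherer');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($urls));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$results = json_decode(curl_exec($ch), true);
curl_close($ch);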
Aside from the curl multi solution, another option is just having a batch of Gearman workers. If you go this route, I've found supervisord a nice way to start a load of daemon workers.
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try executing Python pycurl scripts from PHP. It's easier and faster than PHP curl.