So I have this function for making non-blocking curl requests. It works fine on what I've tested so far (small numbers of requests). But I need this to scale up to thousands of requests (maybe 10,000 at most). My concern is running too many parallel requests at once.
What would you suggest to rate-limit the requests? usleep()? Sending the requests in batches? The function is below:
function poly_curl($requests) {
    $queue = curl_multi_init();
    $curl_array = array();
    $count = 0;

    // Create one easy handle per request and add it to the multi handle
    foreach ($requests as $request) {
        $curl_array[$count] = curl_init($request);
        curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($queue, $curl_array[$count]);
        $count++;
    }

    // Drive all transfers until none are still running
    $running = null;
    do {
        curl_multi_exec($queue, $running);
    } while ($running > 0);

    // Collect the response bodies
    $res = array();
    $count = 0;
    foreach ($requests as $request) {
        $res[$count] = curl_multi_getcontent($curl_array[$count]);
        $count++;
    }

    // Clean up the handles
    $count = 0;
    foreach ($requests as $request) {
        curl_multi_remove_handle($queue, $curl_array[$count]);
        $count++;
    }
    curl_multi_close($queue);

    return $res;
}
I think curl_multi_exec is bad for this purpose, because even if you use batches in groups of 100, 99 requests could already be finished and you would still have to wait for the last one to complete.
But you need 100 parallel requests, and when one finishes another should start immediately, so you cannot use curl_multi_exec at all.
I would use a normal producer-consumer algorithm with multiple (a constant number of) consumers, each consumer processing only one URL at a time. For example php-resque, started with COUNT=100 php resque.php. A minimal sketch follows.
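For illustration only, a rough sketch of what a php-resque job for this could look like, assuming the usual Resque::enqueue()/perform() conventions; the FetchUrl class and the 'fetch' queue name are placeholders, not anything from the question:
// fetch_job.php -- hypothetical job class; php-resque calls perform() on each dequeued job
class FetchUrl
{
    public function perform()
    {
        $ch = curl_init($this->args['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        // store $body wherever your application needs it
    }
}

// producer side: enqueue one job per URL
foreach ($urls as $url) {
    Resque::enqueue('fetch', 'FetchUrl', array('url' => $url));
}

// then run the workers, e.g.: QUEUE=fetch COUNT=100 php resque.php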
You may want to implement something called Exponential Backoff (see Wikipedia).
Basically, it is an algorithm that allows you to dynamically scale your processes depending on some feedback.
You define a rate in your application, and on the first timeout, error, or whatever condition you decide on, you decrease this rate until the requests finish.
You can implement it easily using the HTTP response code, for example.
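A rough sketch of that idea with plain curl, assuming a 429/5xx response or a timeout is the signal to back off; the retry count and delay values are arbitrary choices, not part of the answer above:
function fetch_with_backoff($url, $maxRetries = 5)
{
    $delay = 1; // seconds; doubles after each failure

    for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        // success: no timeout/transport error, no 429, no 5xx
        if ($body !== false && $code != 429 && $code < 500) {
            return $body;
        }

        sleep($delay);
        $delay *= 2; // exponential backoff
    }

    return false; // gave up
}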
The last time I did something like this it involved downloading and "parsing" files. I was able to process only 4 subpages at a time, limited by a very weak processor (2 cores with HT). I ended up with two queues: one for waiting, one for in-process. Every time a task left the second queue, a new one was taken from the first.
It may sound complicated, but it ended up as two loops inside each other and simple count()'s; a sketch of the idea is below.
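A minimal sketch of that two-queue idea using curl_multi; $urls holds the work and the cap of 4 in-process tasks matches the hardware limit mentioned above, both just placeholders:
$waiting = $urls;         // queue 1: not yet started
$inProcess = array();     // queue 2: currently running curl handles
$maxInProcess = 4;
$mh = curl_multi_init();

do {
    // top up the in-process queue from the waiting queue
    while (count($inProcess) < $maxInProcess && count($waiting) > 0) {
        $ch = curl_init(array_shift($waiting));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $inProcess[] = $ch;
    }

    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // don't burn CPU while waiting

    // move finished tasks out of the in-process queue
    while ($info = curl_multi_info_read($mh)) {
        $ch = $info['handle'];
        $content = curl_multi_getcontent($ch); // parse/store the result here
        curl_multi_remove_handle($mh, $ch);
        $key = array_search($ch, $inProcess, true);
        unset($inProcess[$key]);
        curl_close($ch);
    }
} while (count($inProcess) > 0 || count($waiting) > 0);

curl_multi_close($mh);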
By the way, considering such a high rate I would think about using Node.js - for simplicity - or anything more non-blocking and better suited for daemons than PHP. Since threads are a PHP weak point, it just does not fit there.
PS: nice & useful bit of code, thanks.
We used to face the same problem with C++ connection-pooling code. The approach in those days involved some serious analysis.
But the essence was that we created a pool, and requests would get processed depending on the number of available slots. We also assigned a maximum number of connections per pool. (This was determined by testing.)
What you really need is a way to determine how many requests are being processed and to put a limit on it. In your case that is $count.
Just compare $count to a maximum value (say, $max) and stop there. Define the value depending on the system the program runs on; $max could be hard-coded or dynamic. One simple way to apply that limit is sketched below.
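One simple way to bolt that limit onto the original poly_curl() is to slice the request list into chunks of at most $max and run each chunk through the existing function; $max = 100 is just an example value:
$max = 100; // tune to what the target servers and your own machine can handle
$results = array();

foreach (array_chunk($requests, $max) as $chunk) {
    // each chunk runs at most $max requests in parallel
    $results = array_merge($results, poly_curl($chunk));
}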
I have a PHP script that processes data downloaded from multiple REST APIs into a standardized format and builds an array or table of this data. The script currently executes everything synchronously and therefore takes too long.
I have been trying to learn how to execute the function that fetches and processes the data, simultaneously or asynchronously so that the total time is the time of the slowest call. From my research it appears that ReactPHP or Amp are the correct tools.
However, I have been unsuccessful in creating test code that actually executes correctly. A simple example is attached, with mysquare() representing my more complex function. Due to a lack of examples on the net of exactly what I'm trying to achieve, I have been forced into a brute-force approach, with three experiments listed in my code.
Q1: Am I using the right tool for the job?
Q2: Can you fix my example code to execute asynchronously?
NB: I am a real beginner, so the simplest possible code example with a minimum of high level programming lingo would be appreciated.
<?php
require_once("../vendor/autoload.php");

for ($i = 0; $i <= 4; $i++) {
    // Experiment 1
    $deferred[$i] = new React\Promise\Deferred(function () use ($i) {
        echo $x."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($x => $x * $x);
    });

    // Experiment 2
    $promise[$i] = $deferred[$i]->promise(function () use ($i) {
        echo $x."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($x => $x * $x);
    });

    // Experiment 3
    $functioncall[$i] = function () use ($i) {
        echo $x."\n";
        usleep(rand(0, 3000000)); // Simulates long network call
        return array($x => $x * $x);
    };
}

$promises = React\Promise\all($deferred);     // Doesn't work
$promises = React\Promise\all($promise);      // Doesn't work
$promises = React\Promise\all($functioncall); // Doesn't work
// print_r($promises); // Doesn't return array of results but a complex object

// This is what I would like to execute simultaneously with a variety of inputs
function mysquare($x)
{
    echo $x."\n";
    usleep(rand(0, 3000000)); // Simulates long network call
    return array($x => $x * $x);
}
Asynchronous doesn't mean multiple threads executing in parallel. Two functions can only really run at the 'same time' if they (for example) do I/O such as an HTTP request.
usleep() blocks, so you gain nothing. Both ReactPHP and Amp have some kind of 'sleep' function of their own that's built right into the event loop.
For the same reason you will not be able to just use curl, because it also blocks out of the box. You need to use the HTTP libraries that React and Amp provide and/or recommend.
Since your end goal is just doing HTTP requests, you could also skip these frameworks entirely and just use the curl_multi functions. They're a bit hard to use, though.
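To illustrate the non-blocking 'sleep' idea, here is a rough ReactPHP sketch that replaces usleep() with an event-loop timer. It assumes react/event-loop and react/promise are installed and only simulates the delay; it is not the answerer's or the asker's actual solution:
<?php
require_once("../vendor/autoload.php");

$loop = React\EventLoop\Factory::create();

// mysquare() as a promise: resolve after a random delay instead of blocking with usleep()
function mysquare_async($x, $loop)
{
    $deferred = new React\Promise\Deferred();
    $loop->addTimer(rand(0, 3000000) / 1000000, function () use ($x, $deferred) {
        $deferred->resolve(array($x => $x * $x));
    });
    return $deferred->promise();
}

$promises = array();
for ($i = 0; $i <= 4; $i++) {
    $promises[$i] = mysquare_async($i, $loop);
}

React\Promise\all($promises)->then(function ($results) {
    print_r($results); // all five results, after roughly the longest single delay
});

$loop->run();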
I'm answering my own question in an attempt to help other users; however, this solution was developed without the help of an experienced programmer, so I do not know if it is ultimately the best way to do this.
TL;DR
I switched from ReactPHP, which I didn't understand, to amphp/parallel-functions, which offers a simplified end-user interface. Sample code using this interface is attached.
<?php
require_once("../vendor/autoload.php");

use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

$start = \microtime(true);

$mysquare = function ($x) {
    sleep($x); // Simulates long network call
    //echo $x."\n";
    return $x * $x;
};

print_r(wait(parallelMap([5, 4, 3, 2, 1, 6, 7, 8, 9, 10], $mysquare)));

print 'Took ' . (\microtime(true) - $start) . ' seconds.' . \PHP_EOL;
The example code executes in 10.2 seconds, which is slightly longer than the longest-running instance of $mysquare().
In my actual use case I was able to fetch data via HTTP from 90 separate sources in around 5 seconds.
Notes:
The amphp/parallel-functions library appears to be using threads under the hood. From my preliminary experience this appears to require a lot more memory than a single-threaded PHP script, but I haven't yet ascertained the full impact. This was highlighted when I passed a 65 MB array to $mysquare via a "use ($myarray)" expression: it brought the code to a standstill and increased execution time so dramatically that it took orders of magnitude longer than synchronous execution, and memory usage peaked at over 5 GB at one point, leading me to believe that amphp was duplicating $myarray for each worker. Reworking my code to avoid the "use ($myarray)" expression fixed that problem.
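One way to avoid capturing a big array in the closure, while sticking to the parallelMap()/wait() calls shown above (same imports), is to pass each worker only the slice of data it needs as part of its input element. $myarray and the per-item slicing here are placeholders for whatever the real data looks like:
// instead of: $mysquare = function ($x) use ($myarray) { ... };
// bundle each input value with just the data that item needs:
$jobs = array();
foreach ([5, 4, 3, 2, 1] as $x) {
    $jobs[] = array('x' => $x, 'data' => $myarray[$x] ?? null); // small per-item slice
}

$worker = function ($job) {
    // only 'x' and its own slice travel to the worker, not the whole 65 MB array
    return $job['x'] * $job['x'];
};

print_r(wait(parallelMap($jobs, $worker)));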
I tried to run a massive update of field values through an API and I ran into maximum execution time for my PHP script.
I divided my job into smaller tasks to run them asynchronously as smaller jobs...
Asynchronous PHP calls?
I found this post and it looks about right, but the comments are a little off-putting... Will using curl to run external script files prevent the caller file from hitting the maximum execution time, or will curl still wait for a response from the server and kill my page?
The question really is: how do you do asynchronous jobs in PHP? Something like Ajax.
EDIT:
There is a project management tool which has lots of rows of data.
I am using this tools API to access the rows of data and display them on my page.
The user using my tool will select multiple rows of data with a checkbox, and type a new value into a box.
The user will then press an "update row values" button which runs an update script.
This update script divides the hundreds or thousands of items possibly selected into groups of 100.
At this point I was going to use some asynchronous method to contact the project management tool and update all 100 items.
Because updating those items could take that server a long time, I need to make sure that my original page, which splits up those jobs, is not left waiting on that operation, so that I can fire off more requests to update items and my server page can tell my user: "Okay, the update is currently happening; it may take a while and we'll send an email once it's complete."
$step = 100;
$itemCount = GetItemCountByAppId( $appId );
$loopsRequired = $itemCount / $step;
$loopsRequired = ceil( $loopsRequired );
$process = array();

for( $a = 0; $a < $loopsRequired; $a++ )
{
    $items = GetItemsByAppId( $appId, array(
        "amount" => $step,
        "offset" => ( $step * $a )
    ) );

    foreach( $items[ "items" ] as $key => $item )
    {
        foreach( $fieldsGroup as $fieldId => $fieldValues )
        {
            $itemId = $item->__attributes[ "item_id" ];

            /*array_push( $process, array(
                "itemId" => $itemId,
                "fieldId" => $fieldId,
            ) );*/

            UpdateFieldValue( $itemId, $fieldId, $fieldValues );
            // This Update function is actually calling the server and I assume it must be
            // waiting for a response... thus my code times out after 30 secs of execution
        }
    }

    //curl_post_async($url, $params);
}
If you are using PHP-CLI, try Threads (the pthreads extension), or pcntl_fork() for a non-thread-safe build. A minimal fork sketch is below.
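A minimal pcntl_fork() sketch (CLI only, requires the pcntl extension). It reuses UpdateFieldValue() from the question above; $itemIds, $fieldId and $fieldValues stand in for the selected rows and values and are assumptions, not working code from the question:
$batches = array_chunk($itemIds, 100); // split the selected item IDs into groups of 100
$children = array();

foreach ($batches as $batch) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid == 0) {
        // child process: update its own batch of 100, then exit
        foreach ($batch as $itemId) {
            UpdateFieldValue($itemId, $fieldId, $fieldValues);
        }
        exit(0);
    }
    // parent: remember the child's PID and keep forking
    $children[] = $pid;
}

// parent waits for all children so the script doesn't leave zombies behind
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}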
Depending on how you implement it, asynchronous PHP might be used to decouple the web request from the processing and therefore isolate the web request from any timeout in the processing (but you could do the same thing within a single thread). Will breaking the task into smaller concurrent parts make it run faster? Probably not; usually this will extend the total time it takes for the job to complete. About the only time this is not the case is when you have a very large processing capacity and can distribute the task effectively (e.g. map-reduce). Are HTTP calls (curl) an efficient way to distribute work like this? No. There are other methods, including synchronous and asynchronous messaging, batch processing, process forking and threads, each with their own benefits and complications - and we don't know what problem you are actually trying to solve.
So even before we get to your specific questions, this does not look like a good strategy.
Will using curl to run external script files prevent the caller file triggering maximum execution time
It will be constrained by whatever timeouts are configured on the target server - if that's the same server as the invoking script, then it will be the same timeouts.
will the curl still wait for a response from the server and kill my page?
I don't know what you're asking here - it rather implies that there are functional dependencies you've not told us about.
It sounds like you've picked a solution and are now trying to make it fit your problem.
Using curl_multi_*, I want to execute a piece of code every X requests. Is there any way of doing so?
The only way I came up with is to check the $still_running variable from curl_multi_exec; unfortunately, it doesn't work reliably (it's inconsistent - sometimes it jumps from 7 to 1 without going through 6, 5, etc.).
Here's the code I came up with (it doesn't always work because, as I said, $still_running is inconsistent):
$still_running = null;
$callbackExecuted = 1; // Counts how many times callback function was executed.

// execute the handles
do
{
    // Execute callback every 5 requests
    if ($numberOfRequests - $still_running === 5 * $callbackExecuted)
    {
        callback();
        $callbackExecuted++;
    }
    curl_multi_exec($mh, $still_running);
} while ($still_running > 0);
First off, curl_multi_exec() drives N transfers in parallel, and every invocation drives them a little bit further. The $still_running counter is then how many of the transfers are still in progress when curl_multi_exec() returns. It can easily take hundreds (or more) of invocations to finish one or more transfers.
If you want to act on a transfer being completed, you can watch how $still_running decreases as transfers complete, or you can use curl_multi_info_read() to really see when each transfer is done.
Finally: your code example needs attention! Due to the lack of a call to curl_multi_select(), this program will busy-loop like crazy and burn 100% CPU until all transfers are done. That's not a very nice thing to do. A sketch addressing both points follows.
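A rough sketch of that loop with curl_multi_select() and curl_multi_info_read(), counting completed transfers so the callback fires every 5 completions; $mh and callback() are taken from the question, everything else is illustrative:
$still_running = null;
$completed = 0;

do {
    curl_multi_exec($mh, $still_running);
    curl_multi_select($mh); // wait for activity instead of spinning at 100% CPU

    // drain the message queue: one entry per finished transfer
    while ($info = curl_multi_info_read($mh)) {
        $completed++;
        if ($completed % 5 === 0) {
            callback(); // runs exactly once per 5 completed requests
        }
    }
} while ($still_running > 0);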
I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes somewhat like this:
I fetch a base URL and get all secondary URLs from that page; then for each secondary URL I fetch it, process the page I found, download some photos (which takes quite a long time) and store this data in a database; then I fetch the next URL and repeat the process.
In this process, I think I am wasting some time fetching the secondary URL at the start of each iteration, so I am trying to fetch the next URLs in parallel while processing the first iteration.
The solution in my mind is to call a PHP script from the main process, say a downloader, which downloads all the URLs (with curl_multi or wget) and stores them in some database.
My questions are:
How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
Where should I store the downloaded data, such as shared memory? Of course, somewhere other than the database.
Is there any chance that the data gets corrupted while storing and retrieving it, and how do I avoid this?
Also, please let me know if anyone has a better plan.
When I hear someone is using curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait until all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing that too, but then I found out that it is possible to remove and add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main part to give you the general idea:
public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    do {
        // pick jobs for free channels:
        while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
            // take a free channel, (re)init curl handle and let
            // the queued object set its options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);

        // launch them:
        if ($pending > 0) {
            while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
            // poke it while it wants
            curl_multi_select($this->master);
            // wait for some activity, don't eat CPU
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished, locate that job and run the response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                $freeChannels[$chId] = NULL;
                // free up this channel
                if ( !array_key_exists($chId, $activeJobs) ) {
                    // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version the $jobs are actually instances of a separate class, not of controllers or models. They just handle setting cURL options, parsing the response and calling a given onComplete callback.
With this structure, new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if not just the retrieving but the processing takes time as well... and it isn't true parallel handling. But I still hope it helps. :) A sketch of what such a job class might look like follows.
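For illustration only, a hypothetical job class consistent with the init()/onComplete() calls in the snippet above; the class name, the URL property and the callback wiring are my own assumptions, not part of the author's library:
class FetchJob // hypothetical; shape matches how launch() drives its jobs
{
    private $url;
    private $callback;

    public function __construct($url, callable $callback)
    {
        $this->url = $url;
        $this->callback = $callback;
    }

    // called by launch() to configure the (re)used curl handle
    public function init($ch)
    {
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    }

    // called by launch() with the response body once the transfer finishes
    public function onComplete($content)
    {
        call_user_func($this->callback, $this->url, $content);
    }
}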
P.S. It did the trick for me. :) An 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP things rarely work exactly as they're supposed to... It was like "OK, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to; it then does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON. A rough sketch of the calling side is below.
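A rough sketch of the calling side with plain curl, assuming the hypothetical mysite.tld/huntergatherer endpoint from above accepts a JSON array of URLs and responds with a JSON array of results:
$urls = array('http://example.com/a', 'http://example.com/b'); // up to ~100 per call

$ch = curl_init('http://mysite.tld/huntergatherer'); // hypothetical endpoint
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($urls));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$results = json_decode(curl_exec($ch), true); // one entry per URL, as returned by the gatherer
curl_close($ch);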
Aside from the curl_multi solution, another one is just having a batch of Gearman workers. If you go this route, I've found supervisord a nice way to start a load of daemon workers. A rough worker/client sketch follows.
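A rough sketch using the PHP gearman extension, assuming a Gearman server on the default localhost:4730 and a function name of 'fetch_url' (both placeholders); supervisord would then keep N copies of the worker script running:
// worker.php -- run many copies of this (e.g. via supervisord) for parallelism
$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $ch = curl_init($job->workload());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);
    // store or process $body here; background jobs discard the return value
    return $body;
});
while ($worker->work());

// client.php -- queue the URLs as background jobs and return immediately
$client = new GearmanClient();
$client->addServer();
foreach ($urls as $url) {
    $client->doBackground('fetch_url', $url);
}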
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try executing python-pycurl scripts from PHP. Easier and faster than PHP curl.
I have a data aggregator that relies on scraping several sites, and indexing their information in a way that is searchable to the user.
I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs basically 24/7).
Running a multi-curl request in a simple while loop is fairly slow. I sped it up by doing individual curl requests in a background process, which works faster, but sooner or later the slower requests start piling up, which ends up crashing the server.
Are there more efficient ways of scraping data? Perhaps command-line curl?
With a large number of pages, you'll need some sort of multithreaded approach, because you will be spending most of your time waiting on network I/O.
The last time I played with PHP, threads weren't all that great an option, but perhaps that's changed. If you need to stick with PHP, that means you'll be forced to go with a multi-process approach: split your workload into N work units and run N instances of your script, each of which receives one work unit.
Languages that provide robust and good thread implementations are another option. I've had good experiences with threads in ruby and C, and it seems like Java threads are also very mature and reliable.
Who knows - maybe PHP threads have improved since the last time I played with them (~4 years ago) and are worth a look.
In my experience, running a curl_multi request with a fixed number of threads is the fastest way. Could you share the code you're using so we can suggest some improvements? This answer has a fairly decent implementation of curl_multi with a threaded approach; here is the reproduced code:
// -- create all the individual cURL handles and set their options
// (assumes $urls holds the URL list and BLOCK_SIZE is defined, e.g. define('BLOCK_SIZE', 8);)
$curl_handles = array();
foreach ($urls as $url) {
    $curl_handles[$url] = curl_init();
    curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
    // set other curl options here
}

// -- start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();
$i = 0;           // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously

foreach ($curl_handles as $a_curl_handle) {
    $i++; // increment the position-counter

    // add the handle to the curl_multi_handle and to our tracking "block"
    curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
    $block[] = $a_curl_handle;

    // -- check to see if we've got a "full block" to run or if we're at the end of our list of handles
    if (($i % BLOCK_SIZE == 0) or ($i == count($curl_handles))) {
        // -- run the block
        $running = NULL;
        do {
            // track the previous loop's number of handles still running so we can tell if it changes
            $running_before = $running;

            // run the block or check on the running block and get the number of sites still running in $running
            curl_multi_exec($curl_multi_handle, $running);

            // if the number of sites still running changed, print out a message with the number of sites that are still running.
            if ($running != $running_before) {
                echo("Waiting for $running sites to finish...\n");
            }
        } while ($running > 0);

        // -- once the number still running is 0, curl_multi_ is done, so check the results
        foreach ($block as $handle) {
            // HTTP response code
            $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
            // cURL error number
            $curl_errno = curl_errno($handle);
            // cURL error message
            $curl_error = curl_error($handle);

            // output if there was an error
            if ($curl_error) {
                echo(" *** cURL error: ($curl_errno) $curl_error\n");
            }

            // remove the (used) handle from the curl_multi_handle
            curl_multi_remove_handle($curl_multi_handle, $handle);
        }

        // reset the block to empty, since we've run its curl_handles
        $block = array();
    }
}

// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);
The trick is to not load too many URLs at once; if you do, the whole process will hang until the slower requests are complete. I suggest using a BLOCK_SIZE of 8 or greater if you have the bandwidth.
If you want to run single curl requests, you can start background processes under Linux in PHP like this:
proc_close( proc_open("php -q yourscript.php parameter1 parameter2 > /dev/null 2>&1 &", array(), $dummy) );
You can use the parameters to give your PHP script some information about which URLs to use, like LIMIT in SQL.
You can keep track of the running processes by saving their PIDs somewhere, to keep a desired number of processes running at the same time or to kill processes that have not finished in time. A sketch of that bookkeeping is below.
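A rough sketch of that bookkeeping with proc_open()/proc_get_status(), capping the number of concurrent workers; yourscript.php, its --limit/--offset parameters, $totalRows and the cap of 10 are all placeholders, not part of the answer above:
$maxWorkers = 10;
$workers = array(); // process handles of currently running children
$offset = 0;

while ($offset < $totalRows || count($workers) > 0) {
    // reap finished workers
    foreach ($workers as $k => $proc) {
        $status = proc_get_status($proc); // $status['pid'] is available if you need the PID
        if (!$status['running']) {
            proc_close($proc);
            unset($workers[$k]);
        }
    }

    // start new workers while there is room and work left
    while (count($workers) < $maxWorkers && $offset < $totalRows) {
        $cmd = "php -q yourscript.php --limit=100 --offset=$offset > /dev/null 2>&1";
        $workers[] = proc_open($cmd, array(), $pipes); // no & here: keep the handle so we can track it
        $offset += 100;
    }

    sleep(1); // avoid a tight polling loop
}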