How do we make multiple file_get_contents() calls in PHP run faster?

We're building an API client that makes repeated file_get_contents() calls.
I have an array of user IDs, and file_get_contents() is called once per entry in the array. We will be doing thousands of requests.
function request($userid) {
global $access_token;
$url = 'https://<api-domain>'.$userid.'?access_token='.$access_token;
$response = file_get_contents($url);
return json_decode($response);
}
function loopUserIds($arrayUserIds) {
global $countFollowers;
$arrayAllUserIds = array();
foreach ($arrayUserIds as $userid) {
$followers = request($userid);
...
}
...
}
My concern is that this takes a long time, since the function is called once per iteration of the loop. How can we make these many file_get_contents() requests run faster?

As @HankyPanky mentioned, you can use curl_multi_exec() to run many requests concurrently.
Something like this should help:
function fetchAndProcessUrls(array $urls, callable $f) {
$multi = curl_multi_init();
$reqs = [];
foreach ($urls as $url) {
$req = curl_init();
curl_setopt($req, CURLOPT_URL, $url);
curl_setopt($req, CURLOPT_HEADER, 0);
curl_setopt($req, CURLOPT_RETURNTRANSFER, 1); // needed so curl_multi_getcontent() returns the body
curl_multi_add_handle($multi, $req);
$reqs[] = $req;
}
// While we're still active, execute curl
$active = null;
// Execute the handles
do {
$mrc = curl_multi_exec($multi, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($multi) != -1) {
do {
$mrc = curl_multi_exec($multi, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
// Collect the results and close the handles
foreach ($reqs as $req) {
$f(curl_multi_getcontent($req));
curl_multi_remove_handle($multi, $req);
}
curl_multi_close($multi);
}
You can use it like so:
$urlArray = [ 'http://www.example.com/' , 'http://www.example.com/', ... ];
fetchAndProcessUrls($urlArray, function($requestData) {
/* do stuff here */
// e.g.
$jsonData = json_decode($requestData, true); // decode as an associative array
});

When curl_multi_exec() isn't available, you can still gain performance by reusing a single $ch handle instead of creating a new one for every file; cURL will then reuse the connection when downloading from the same host.
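As a sketch of that idea (the function name here is made up for illustration), handle reuse could look like this:

```php
// Fetch several URLs sequentially while reusing a single cURL handle,
// so keep-alive connections to the same host are reused between requests.
function fetch_all_reuse(array $urls): array
{
    $results = array();
    if (count($urls) === 0) {
        return $results;
    }
    $ch = curl_init(); // created once, outside the loop
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes per request
        $results[$url] = curl_exec($ch);
    }
    curl_close($ch);
    return $results;
}
```

This is still sequential, so it only saves connection setup time; curl_multi remains the better option when it is available.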

Related

Add Delay/Sleep in between url requests in multi curl php

I am fetching JSON data from an API that allows me 1 request per minute. I am currently using multi cURL, but I don't know how to run a request every 60 seconds so that I can fetch the 10 JSON files from 10 different URLs in 10 minutes. Please check the code and help me out.
function multi_thread_curl($urlArray, $optionArray, $nThreads)
{
//Group your urls into groups/threads.
$curlArray = array_chunk($urlArray, $nThreads, $preserve_keys = true);
//Iterate through each batch of urls.
$ch = 'ch_';
$results = array();
foreach ($curlArray as $threads) {
//Create your cURL resources.
foreach ($threads as $thread => $value) {
${$ch . $thread} = curl_init();
curl_setopt_array(${$ch . $thread}, $optionArray); //Set your main curl options.
curl_setopt(${$ch . $thread}, CURLOPT_URL, $value); //Set url.
}
//Create the multiple cURL handler.
$mh = curl_multi_init();
//Add the handles.
foreach ($threads as $thread => $value) {
curl_multi_add_handle($mh, ${$ch . $thread});
}
$active = null;
//execute the handles.
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//Get your data and close the handles.
foreach ($threads as $thread => $value) {
$results[$thread] = curl_multi_getcontent(${$ch . $thread});
curl_multi_remove_handle($mh, ${$ch . $thread});
}
//Close the multi handle exec.
curl_multi_close($mh);
}
return $results;
}
$optionArray = array(
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 20
);
$nThreads = 1;
$urlArray = array("url1", "url2","url3", "url10");
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);
I would recommend using a cron job to schedule the API call for one URL every minute, instead of a single script that takes 60 seconds per URL. You mentioned fetching the JSON response of 10 different URLs, so the script would run for 10 minutes, which is likely above the max_execution_time set in php.ini.
You could raise that limit with set_time_limit() and simply add a sleep(60) in your loop, which pauses the script at that line for 60 seconds.
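A rough sketch of the sleep() approach (the function name and $delaySeconds parameter are made up for illustration; you would pass 60 for this API):

```php
// Fetch one URL per interval, sleeping between requests to respect a
// rate limit such as 1 request per minute.
function fetch_with_delay(array $urls, int $delaySeconds): array
{
    set_time_limit(0); // lift max_execution_time for this long-running script
    $results = array();
    foreach ($urls as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // wait before every request after the first
        }
        $results[$url] = file_get_contents($url);
    }
    return $results;
}
```

Note that this trades throughput for compliance with the rate limit, which is the opposite of what curl_multi is for; with a 1-request-per-minute cap there is nothing to parallelize.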

PHP Multiple Simultaneous Data Requests

I am trying to get multiple data requests at once to speed up processing time. I have been able to get all of the data I need by looping requests one at a time, but it takes 30-60 seconds. I just discovered curl_multi, but I cannot get it to return data, and so in the new code below, I am trying to get 3 URLs to work, but it'll end up being probably 10+ requests at once. I edited out my keys and replaced them with xxxx, so I already know that it won't run as is, if you try.
Any help or direction is greatly appreciated!
Code that works for all requests, but is time consuming:
define('TIME', date('U')+3852);
require_once 'OAuth.php';
$consumer_key = 'xxxx';
$consumer_secret = 'xxxx';
$access_token = 'xxxx';
$access_secret = 'xxxx';
$i=0;
foreach ($expirations as $row){
$expiry_date = $expirations[$i];
$url = "https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$expiry_date";
//Option Expirations
//$url = 'https://api.tradeking.com/v1/market/options/expirations.xml?symbol=SPX';
//Account data
//$url = 'https://api.tradeking.com/v1/accounts';
$consumer = new OAuthConsumer($consumer_key,$consumer_secret);
$access = new OAuthToken($access_token,$access_secret);
$request = OAuthRequest::from_consumer_and_token($consumer, $access, 'GET', $url);
$request->sign_request(new OAuthSignatureMethod_HMAC_SHA1(), $consumer, $access);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, array($request->to_header()));
$response = curl_exec($ch);
//Turn XML ($response) into an array ($array)
$array=json_decode(json_encode(simplexml_load_string($response)),true);
//Trim excess fields from ($array) and define as ($chain)
$chain = $array['quotes']['quote'];
$bigarray[$i] = $chain;
$i++;
}
Returns a huge array as variable $bigarray
Code that isn't working for three simultaneous requests:
define('TIME', date('U')+3852);
require_once 'OAuth.php';
$consumer_key = 'xxxx';
$consumer_secret = 'xxxx';
$access_token = 'xxxx';
$access_secret = 'xxxx';
$consumer = new OAuthConsumer($consumer_key,$consumer_secret);
$access = new OAuthToken($access_token,$access_secret);
// Run the parallel get and print the total time
$s = microtime(true);
// Define the URLs
$urls = array(
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160311",
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160318",
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160325"
);
$pg = new ParallelGet($urls);
print "<br />Total time: ".round(microtime(true) - $s, 4)." seconds";
// Class to run parallel GET requests and return the transfer
class ParallelGet
{
function __construct($urls)
{
// Create get requests for each URL
$mh = curl_multi_init();
foreach($urls as $i => $url)
{
$request = OAuthRequest::from_consumer_and_token($consumer, $access, 'GET', $url);
$request->sign_request(new OAuthSignatureMethod_HMAC_SHA1(), $consumer, $access);
$ch[$i] = curl_init($url);
curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch[$i], CURLOPT_HTTPHEADER, array($request->to_header()));
curl_multi_add_handle($mh, $ch[$i]);
}
// Start performing the request
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
// Loop and continue processing the request
while ($runningHandles && $execReturnValue == CURLM_OK) {
// Wait forever for network
$numberReady = curl_multi_select($mh);
if ($numberReady != -1) {
// Pull in any new data, or at least handle timeouts
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
}
}
// Check for any errors
if ($execReturnValue != CURLM_OK) {
trigger_error("Curl multi read error $execReturnValue\n", E_USER_WARNING);
}
// Extract the content
foreach($urls as $i => $url)
{
// Check for errors
$curlError = curl_error($ch[$i]);
if($curlError == "") {
$res[$i] = curl_multi_getcontent($ch[$i]);
} else {
print "Curl error on handle $i: $curlError\n";
}
// Remove and close the handle
curl_multi_remove_handle($mh, $ch[$i]);
curl_close($ch[$i]);
}
// Clean up the curl_multi handle
curl_multi_close($mh);
// Print the response data
print_r($res);
}
}
?>
Returns:
Array ( [0] => [1] => [2] => )
Total time: 0.4572 seconds
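One likely cause (my reading, not confirmed in the original post): $consumer and $access are used inside ParallelGet::__construct(), but they are local variables of the outer script. PHP methods do not see variables from the calling scope, so inside the constructor they are undefined, the OAuth signing produces an unusable header, and every handle returns an empty body. A minimal sketch of the fix is to pass the dependencies in explicitly:

```php
// Sketch: inject the OAuth consumer and token instead of relying on
// variables from the calling scope (methods do not inherit them).
class ParallelGet
{
    private $consumer;
    private $access;

    public function __construct(array $urls, $consumer, $access)
    {
        $this->consumer = $consumer;
        $this->access = $access;
        // ...sign each request with $this->consumer / $this->access,
        // then run the curl_multi loop exactly as in the original code...
    }
}
```

Usage would then be `$pg = new ParallelGet($urls, $consumer, $access);`. Enabling error reporting would also have surfaced the undefined-variable notices.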

Multi-threaded cURL with SSL and redirect

I have a very simple scraper that does what I need, but it's very slow: it scrapes 2 pictures in 3 seconds, and I need to do at least 1000 pictures in a few seconds.
This is the code I use now
<?php
require_once('config.php');
//Calling PHasher class file.
include_once('classes/phasher.class.php');
$I = PHasher::Instance();
//Prevent execution timeout.
set_time_limit(0);
//Solving SSL Problem.
$arrContextOptions=array(
"ssl"=>array(
"verify_peer"=>false,
"verify_peer_name"=>false,
),
);
//Check if the database contains hashed pictures or if it's empty, Then start from the latest hashed picture or start from 4.
$check = mysqli_query($con, "SELECT fid FROM images ORDER BY fid DESC LIMIT 1;");
if(mysqli_num_rows($check) > 0){
$max_fid = mysqli_fetch_row($check);
$fid = $max_fid[0]+1;
} else {
$fid = 4;
}
$deletedProfile = "https://z-1-static.xx.fbcdn.net/rsrc.php/v2/yo/r/UlIqmHJn-SK.gif";
//Infinite while loop to fetch profile pictures and save them inside the avatar folder.
$initial = $fid;
while($fid = $initial){
$url = 'https://graph.facebook.com/'.$fid.'/picture?width=378&height=378';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
curl_setopt($ch, CURLOPT_HEADER, false); // no needs to pass the headers to the data stream
curl_setopt($ch, CURLOPT_NOBODY, true); // get the resource without a body
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
curl_exec($ch);
// get the last used URL
$lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
if($lastUrl == $deletedProfile){
$initial++;
}else{
$imageUrl = file_get_contents($url, false, stream_context_create($arrContextOptions));
$savedImage = dirname(__file__).'/avatar/image.jpg';
file_put_contents($savedImage, $imageUrl);
//Exclude deleted profiles or corrupted pictures.
if(getimagesize($savedImage) > 0 ){
//PHasher class call to hash the images to hexdecimal values or binary values.
$hash = $I->FastHashImage($savedImage);
$hex = $I->HashAsString($hash);
//Store Facebook id and hashed values for the images in hexa values.
mysqli_query($con, "INSERT INTO images(fid, hash) VALUES ('$fid', '$hex')");
$initial++;
} else {
$initial++;
}
}
}
?>
I haven't figured out how to do it yet, but what I am thinking of now is:
1- Divide the work into 1000 profiles per loop and store them in an array.
$items = array();
for($i=$fid; $i <= $fid+1000; $i++){
$url = 'https://graph.facebook.com/'.$i.'/picture?width=378&height=378';
$items[$i] = array($url);
}
but the results are incorrect; I want to know how to fix the output of the array.
Array ( [28990] => Array ( [0] => https://graph.facebook.com/28990/picture?width=378&height=378 )
[28991] => Array ( [0] => https://graph.facebook.com/28991/picture?width=378&height=378 )
[28992] => Array ( [0] => https://graph.facebook.com/28992/picture?width=378&height=378 )
[28993] => Array ( [0] => https://graph.facebook.com/28993/picture?width=378&height=378 )
[28994] => Array ( [0] => https://graph.facebook.com/28994/picture?width=378&height=378 )
[28995] => Array ( [0] => https://graph.facebook.com/28995/picture?width=378&height=378 )
[28996] => Array ( [0] => https://graph.facebook.com/28996/picture?width=378&height=378 )
[28997] => Array ( [0] => https://graph.facebook.com/28997/picture?width=378&height=378 )
2- Then I want to use the output array with multi cURL, which allows processing multiple cURL handles asynchronously.
3- Check each output URL: if it equals the deleted-profile URL, skip it; if not, convert it to a hash value using PHasher and store it in the DB.
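For step 1, the nested output comes from wrapping each URL in array(); storing the string directly gives the flat list the asker wants. A small sketch (the helper name is invented here):

```php
// Build a flat array of Graph API picture URLs, keyed by profile id.
function build_picture_urls(int $fid, int $count): array
{
    $items = array();
    for ($i = $fid; $i < $fid + $count; $i++) {
        // store the URL string itself, not array($url)
        $items[$i] = 'https://graph.facebook.com/' . $i . '/picture?width=378&height=378';
    }
    return $items;
}
```

The resulting array can be fed straight into a curl_multi loop, one handle per entry.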
I have just what you need, although I haven't been able to reach that kind of throughput (1000 parallel requests per second).
I forget where I got this from, but I am using it to download Reddit content:
class ParallelCurl {
public $max_requests;
public $options;
public $outstanding_requests;
public $multi_handle;
public function __construct($in_max_requests = 10, $in_options = array()) {
$this->max_requests = $in_max_requests;
$this->options = $in_options;
$this->outstanding_requests = array();
$this->multi_handle = curl_multi_init();
}
//Ensure all the requests finish nicely
public function __destruct() {
$this->finishAllRequests();
}
// Sets how many requests can be outstanding at once before we block and wait for one to
// finish before starting the next one
public function setMaxRequests($in_max_requests) {
$this->max_requests = $in_max_requests;
}
// Sets the options to pass to curl, using the format of curl_setopt_array()
public function setOptions($in_options) {
$this->options = $in_options;
}
// Start a fetch from the $url address, calling the $callback function passing the optional
// $user_data value. The callback should accept 3 arguments, the url, curl handle and user
// data, eg on_request_done($url, $ch, $user_data);
public function startRequest($url, $callback, $user_data = array(), $post_fields = null, $headers = null) {
if ($this->max_requests > 0)
$this->waitForOutstandingRequestsToDropBelow($this->max_requests);
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt_array($ch, $this->options);
curl_setopt($ch, CURLOPT_URL, $url);
if (isset($post_fields)) {
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}
if (is_array($headers)) {
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
}
curl_multi_add_handle($this->multi_handle, $ch);
$ch_array_key = (int) $ch;
$this->outstanding_requests[$ch_array_key] = array(
'link_url' => $url,
'callback' => $callback,
'user_data' => $user_data,
);
$this->checkForCompletedRequests();
}
// You *MUST* call this function at the end of your script. It waits for any running requests
// to complete, and calls their callback functions
public function finishAllRequests() {
$this->waitForOutstandingRequestsToDropBelow(1);
}
// Checks to see if any of the outstanding requests have finished
private function checkForCompletedRequests() {
/*
// Call select to see if anything is waiting for us
if (curl_multi_select($this->multi_handle, 0.0) === -1)
return;
// Since something's waiting, give curl a chance to process it
do {
$mrc = curl_multi_exec($this->multi_handle, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
*/
// fix for https://bugs.php.net/bug.php?id=63411
do {
$mrc = curl_multi_exec($this->multi_handle, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($this->multi_handle) != -1) {
do {
$mrc = curl_multi_exec($this->multi_handle, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
} else
return;
}
// Now grab the information about the completed requests
while ($info = curl_multi_info_read($this->multi_handle)) {
$ch = $info['handle'];
$ch_array_key = (int) $ch;
if (!isset($this->outstanding_requests[$ch_array_key])) {
die("Error - handle wasn't found in requests: '$ch' in " .
print_r($this->outstanding_requests, true));
}
$request = $this->outstanding_requests[$ch_array_key];
$url = $request['link_url'];
$content = curl_multi_getcontent($ch);
$callback = $request['callback'];
$user_data = $request['user_data'];
call_user_func($callback, $content, $url, $ch, $user_data);
unset($this->outstanding_requests[$ch_array_key]);
curl_multi_remove_handle($this->multi_handle, $ch);
}
}
// Blocks until there's less than the specified number of requests outstanding
private function waitForOutstandingRequestsToDropBelow($max) {
while (1) {
$this->checkForCompletedRequests();
if (count($this->outstanding_requests) < $max)
break;
usleep(10000);
}
}
}
The way this works is you pass to ParallelCurl::startRequest() a URL and a callback function (could be anonymous), and this queues a download for this URL, then calls the function when the download finishes.
$pcurl = new ParallelCurl(10, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_SSL_VERIFYPEER => 1,
));
$pcurl->startRequest($url, function($data) {
// download finished. $data is html or binary, whatever you requested
echo $data;
});

cURL Multi Threading?

I've heard a lot about php's multi threading with cURL but have never really tried it and I find it a bit tough to understand how it actually works. Could anyone convert this into curl_multi?
$path1 = array("path1", "path2", "path3"); //example
$path2 = array("path1", "path2", "path3"); //example
$opt = curl_init($path1);
curl_setopt($opt, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($opt);
curl_close($opt);
file_put_contents($path2, $content);
What I want to actually do is to download multiple files from the arrays path 1 into path 2 using curl_multi.
This is a nice project to start with:
https://github.com/jmathai/php-multi-curl
I am using curl multi and it is awesome indeed. I am using this to make faster push notifications.
https://github.com/Krutarth/FlashSnsPns
The accepted answer above is outdated/wrong, so the correct answer should be upvoted.
http://php.net/manual/en/function.curl-multi-init.php
Now, PHP supports fetching multiple URLs at the same time.
There is a very good function written up here: http://archevery.blogspot.in/2013/07/php-curl-multi-threading.html
This is the function:
function runRequests($url_array, $thread_width = 4) {
$threads = 0;
$master = curl_multi_init();
$curl_opts = array(CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 15,
CURLOPT_TIMEOUT => 15);
$results = array();
$count = 0;
foreach($url_array as $url) {
$ch = curl_init();
$curl_opts[CURLOPT_URL] = $url;
curl_setopt_array($ch, $curl_opts);
curl_multi_add_handle($master, $ch); //push URL for single rec send into curl stack
$results[$count] = array("url" => $url, "handle" => $ch);
$threads++;
$count++;
if($threads >= $thread_width) { //start running when stack is full to width
while($threads >= $thread_width) {
usleep(100);
while(($execrun = curl_multi_exec($master, $running)) === -1){}
curl_multi_select($master);
// a request was just completed - find out which one and remove it from stack
while($done = curl_multi_info_read($master)) {
foreach($results as &$res) {
if($res['handle'] == $done['handle']) {
$res['result'] = curl_multi_getcontent($done['handle']);
}
}
curl_multi_remove_handle($master, $done['handle']);
curl_close($done['handle']);
$threads--;
}
}
}
}
do { //finish sending remaining queue items when all have been added to curl
usleep(100);
while(($execrun = curl_multi_exec($master, $running)) === -1){}
curl_multi_select($master);
while($done = curl_multi_info_read($master)) {
foreach($results as &$res) {
if($res['handle'] == $done['handle']) {
$res['result'] = curl_multi_getcontent($done['handle']);
}
}
curl_multi_remove_handle($master, $done['handle']);
curl_close($done['handle']);
$threads--;
}
} while($running > 0);
curl_multi_close($master);
return $results;
}
You can just use it.
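runRequests() returns the bodies in memory; to cover the original question's goal of writing each download into a target path, a thin variant like this sketch (the function name and array shape are invented here) could map source URLs to destination files:

```php
// Download each source URL with curl_multi and write the body straight to
// its destination file via CURLOPT_FILE, so bodies are not buffered in PHP.
function download_all(array $srcToDest): int
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($srcToDest as $src => $dest) {
        $fp = fopen($dest, 'w');
        $ch = curl_init($src);
        curl_setopt($ch, CURLOPT_FILE, $fp); // stream the response into the file
        curl_multi_add_handle($mh, $ch);
        $handles[] = array($ch, $fp);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1); // sleep until there is network activity
        }
    } while ($running > 0);
    $saved = 0;
    foreach ($handles as list($ch, $fp)) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        fclose($fp);
        $saved++;
    }
    curl_multi_close($mh);
    return $saved;
}
```

Called as `download_all(array_combine($path1, $path2))`, assuming both arrays have the same length.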

PHP Parallel curl requests

I am doing a simple app that reads JSON data from 15 different URLs. I have a special need to do this server-side. I am using file_get_contents($url).
Since I am using file_get_contents($url), I wrote a simple script; here it is:
$websites = array(
$url1,
$url2,
$url3,
...
$url15
);
foreach ($websites as $website) {
$data[] = file_get_contents($website);
}
and it proved to be very slow, because it waits for each request to finish before starting the next one.
If you mean multi-curl then, something like this might help:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < $node_count; $i++)
{
$url =$nodes[$i];
$curl_arr[$i] = curl_init($url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i = 0; $i < $node_count; $i++)
{
$results[] = curl_multi_getcontent ( $curl_arr[$i] );
}
print_r($results);
I don't particularly like the approach of any of the existing answers.
Timo's code: it might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong; it might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.
Sudhir's code: it will not sleep when $still_running > 0, and it spam-calls the async function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% of one CPU core until everything has been downloaded; in other words, it fails to sleep while downloading.
Here's an approach with neither of those issues:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
$worker = curl_init($website);
curl_setopt_array($worker, [
CURLOPT_RETURNTRANSFER => 1
]);
curl_multi_add_handle($mh, $worker);
}
for (;;) {
$still_running = null;
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($err !== CURLM_OK) {
// handle curl multi error?
}
if ($still_running < 1) {
// all downloads completed
break;
}
// some haven't finished downloading, sleep until more data arrives:
curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
if ($info["result"] !== CURLE_OK) {
// handle download error?
}
$results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
curl_multi_remove_handle($mh, $info["handle"]);
curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);
Note that an issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously: if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you only want to download, say, 50 websites at a time, try:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
if ($max_connections < 1) {
throw new InvalidArgumentException("max_connections MUST be >=1");
}
foreach ($urls as $key => $foo) {
if (! is_string($foo)) {
throw new \InvalidArgumentException("all urls must be strings!");
}
if (empty($foo)) {
unset($urls[$key]); // ?
}
}
unset($foo);
// DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
$ret = array();
$mh = curl_multi_init();
$workers = array();
$work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
// > If an added handle fails very quickly, it may never be counted as a running_handle
while (1) {
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($still_running < count($workers)) {
// some workers finished, fetch their response and close them
break;
}
$cms = curl_multi_select($mh, 1);
// var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
}
while (false !== ($info = curl_multi_info_read($mh))) {
// echo "NOT FALSE!";
// var_dump($info);
{
if ($info['msg'] !== CURLMSG_DONE) {
continue;
}
if ($info['result'] !== CURLE_OK) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$info['result'],
"curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
), true);
}
} elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$err,
"curl error " . $err . ": " . curl_strerror($err)
), true);
}
} else {
$ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
}
curl_multi_remove_handle($mh, $info['handle']);
assert(isset($workers[(int) $info['handle']]));
unset($workers[(int) $info['handle']]);
curl_close($info['handle']);
}
}
// echo "NO MORE INFO!";
};
foreach ($urls as $url) {
while (count($workers) >= $max_connections) {
// echo "TOO MANY WORKERS!\n";
$work();
}
$neww = curl_init($url);
if (! $neww) {
trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
if ($return_fault_reason) {
$ret[$url] = array(
false,
- 1,
"curl_init() failed"
);
}
continue;
}
$workers[(int) $neww] = $url;
curl_setopt_array($neww, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT_MS => $timeout_ms
));
curl_multi_add_handle($mh, $neww);
// curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
}
while (count($workers) > 0) {
// echo "WAITING FOR WORKERS TO BECOME 0!";
// var_dump(count($workers));
$work();
}
curl_multi_close($mh);
return $ret;
}
That will download the entire list without downloading more than 50 URLs simultaneously.
(Even this approach stores all the results in RAM, though, so it may still run out of memory; if you want to store them in a database instead, the curl_multi_getcontent part can be modified to hand each result off rather than keeping it in a RAM-persistent variable.)
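The curl_multi_getcontent part is the only place that pins results in memory. A sketch of that modification (the factory name is invented here) is to hand each completed body to a sink callback, e.g. a database or file writer, and keep only a small reference:

```php
// Sketch: a sink that writes each completed body to disk instead of
// accumulating it in the $ret array; call it where the original code does
// $ret[...] = curl_multi_getcontent($info['handle']);
function make_file_sink(string $dir): callable
{
    return function (string $url, string $body) use ($dir): string {
        // one file per URL, named by hash so any URL maps to a safe filename
        $path = $dir . '/' . md5($url) . '.body';
        file_put_contents($path, $body);
        return $path; // the caller keeps only the small path string in RAM
    };
}
```

The same shape works for a database sink: replace file_put_contents() with an INSERT, and return the row id.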
I would like to provide a more complete example that neither pins the CPU at 100% nor crashes when there's a slight error or something unexpected.
It also shows you how to fetch the headers, the body, and the request info, and how to follow redirects manually.
Disclaimer: this code is intended to be extended and implemented into a library, or used as a quick starting point, and as such the functions inside it are kept to a minimum.
function mtime(){
return microtime(true);
}
function ptime($prev){
$t = microtime(true) - $prev;
$t = $t * 1000;
return str_pad($t, 20, 0, STR_PAD_RIGHT);
}
// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
// In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
// In practice it sometimes does
// So imagine that this just runs curl_multi_exec once and returns its value
do {
$state = curl_multi_exec($mh, $still_running);
// curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
// We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
} while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
return $state;
}
// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl, it also forces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1){
$umin = $minTime*1000000;
$start_time = microtime(true);
// it sleeps until there is some activity on any of the descriptors (curl files)
// it returns the number of descriptors (curl files that can have activity)
$num_descriptors = curl_multi_select($mh, $maxTime);
// if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
// but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
if($num_descriptors === -1){
usleep($umin);
}
// elapsed time, converted to microseconds so it compares correctly with $umin
$timespan = (microtime(true) - $start_time) * 1000000;
// This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
// This will reduce the runs so that each interval is separated by at least minTime
if($timespan < $umin){
usleep((int)($umin - $timespan));
//print "sleep for ".($umin - $timespan).PHP_EOL;
}
}
}
$handles = [
[
CURLOPT_URL=>"http://example.com/",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
],
[
CURLOPT_URL=>"http://www.php.net",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
// this function is called by curl for each header received
// This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
// https://stackoverflow.com/a/41135574
CURLOPT_HEADERFUNCTION=>function($ch, $header)
{
print "header from http://www.php.net: ".$header;
//$header = explode(':', $header, 2);
//if (count($header) < 2){ // ignore invalid headers
// return $len;
//}
//$headers[strtolower(trim($header[0]))][] = trim($header[1]);
return strlen($header);
}
]
];
//create the multiple cURL handle
$mh = curl_multi_init();
$chandles = [];
foreach($handles as $opts) {
// create cURL resources
$ch = curl_init();
// set URL and other appropriate options
curl_setopt_array($ch, $opts);
// add the handle
curl_multi_add_handle($mh, $ch);
$chandles[] = $ch;
}
//execute the multi handle
$prevRunning = null;
$count = 0;
do {
$time = mtime();
// $running contains the number of currently running requests
$status = curl_multi_exec_full($mh, $running);
$count++;
print ptime($time).": curl_multi_exec status=$status running $running".PHP_EOL;
// One less is running, meaning one has finished
if($running < $prevRunning){
print ptime($time).": curl_multi_info_read".PHP_EOL;
// msg: The CURLMSG_DONE constant. Other return values are currently not available.
// result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
// handle: Resource of type curl indicates the handle which it concerns.
while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {
$info = curl_getinfo($read['handle']);
if($read['result'] !== CURLE_OK){
// handle the error somehow
print "Error: ".$info['url'].PHP_EOL;
}
if($read['result'] === CURLE_OK){
/*
// This will automatically follow the redirect and still give you control over the previous page
// TODO: max redirect checks and redirect timeouts
if(isset($info['redirect_url']) && trim($info['redirect_url'])!==''){
print "running redirect: ".$info['redirect_url'].PHP_EOL;
$ch3 = curl_init();
curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
curl_setopt($ch3, CURLOPT_HEADER, 0);
curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
curl_multi_add_handle($mh,$ch3);
}
*/
print_r($info);
$body = curl_multi_getcontent($read['handle']);
print $body;
}
}
}
// Still running? keep waiting...
if ($running > 0) {
curl_multi_wait($mh);
}
$prevRunning = $running;
} while ($running > 0 && $status == CURLM_OK);
//close the handles
foreach($chandles as $ch){
curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);
print $count.PHP_EOL;
