PHP Parallel curl requests - php

I am building a simple app that reads JSON data from 15 different URLs. I have a special requirement: this has to be done server-side. I am using file_get_contents($url), and I wrote a simple script:
$websites = array(
$url1,
$url2,
$url3,
...
$url15
);
foreach ($websites as $website) {
$data[] = file_get_contents($website);
}
It proved to be very slow, because it waits for each request to finish before starting the next one.

If you mean multi-curl then, something like this might help:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < $node_count; $i++)
{
$url =$nodes[$i];
$curl_arr[$i] = curl_init($url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i = 0; $i < $node_count; $i++)
{
$results[] = curl_multi_getcontent ( $curl_arr[$i] );
}
print_r($results);

I don't particularly like the approach of any of the existing answers.
Timo's code: might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong; it might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.
Sudhir's code: will not sleep when $still_running > 0, and spam-calls the async function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% CPU (of 1 core) the whole time. In other words, it fails to sleep while downloading.
Here's an approach with neither of those issues:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
$worker = curl_init($website);
curl_setopt_array($worker, [
CURLOPT_RETURNTRANSFER => 1
]);
curl_multi_add_handle($mh, $worker);
}
for (;;) {
$still_running = null;
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($err !== CURLM_OK) {
// handle curl multi error?
}
if ($still_running < 1) {
// all downloads completed
break;
}
// some haven't finished downloading, sleep until more data arrives:
curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
if ($info["result"] !== CURLE_OK) {
// handle download error?
}
$results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
curl_multi_remove_handle($mh, $info["handle"]);
curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);
Note that an issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously: if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you need to download only, say, 50 websites at a time, maybe try:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
if ($max_connections < 1) {
throw new InvalidArgumentException("max_connections MUST be >=1");
}
foreach ($urls as $key => $foo) {
if (! is_string($foo)) {
throw new \InvalidArgumentException("all urls must be strings!");
}
if (empty($foo)) {
unset($urls[$key]); // ?
}
}
unset($foo);
// DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
$ret = array();
$mh = curl_multi_init();
$workers = array();
$work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
// > If an added handle fails very quickly, it may never be counted as a running_handle
while (1) {
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($still_running < count($workers)) {
// some workers finished, fetch their response and close them
break;
}
$cms = curl_multi_select($mh, 1);
// var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
}
while (false !== ($info = curl_multi_info_read($mh))) {
// echo "NOT FALSE!";
// var_dump($info);
{
if ($info['msg'] !== CURLMSG_DONE) {
continue;
}
if ($info['result'] !== CURLE_OK) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$info['result'],
"curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
), true);
}
} elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$err,
"curl error " . $err . ": " . curl_strerror($err)
), true);
}
} else {
$ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
}
curl_multi_remove_handle($mh, $info['handle']);
assert(isset($workers[(int) $info['handle']]));
unset($workers[(int) $info['handle']]);
curl_close($info['handle']);
}
}
// echo "NO MORE INFO!";
};
foreach ($urls as $url) {
while (count($workers) >= $max_connections) {
// echo "TOO MANY WORKERS!\n";
$work();
}
$neww = curl_init($url);
if (! $neww) {
trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
if ($return_fault_reason) {
$ret[$url] = array(
false,
- 1,
"curl_init() failed"
);
}
continue;
}
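// NOTE: on PHP 8+ curl handles are CurlHandle objects, and an (int) cast no longer yields a unique id;
// there, key by spl_object_id($neww) here and by spl_object_id($info['handle']) in $work above.
$workers[(int) $neww] = $url;
curl_setopt_array($neww, array(
CURLOPT_RETURNTRANSFER => 1,
// WARNING: the next two options disable TLS certificate verification; remove them unless you really need that.
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT_MS => $timeout_ms
));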
curl_multi_add_handle($mh, $neww);
// curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
}
while (count($workers) > 0) {
// echo "WAITING FOR WORKERS TO BECOME 0!";
// var_dump(count($workers));
$work();
}
curl_multi_close($mh);
return $ret;
}
That will download the entire list without ever downloading more than 50 URLs simultaneously.
(But even that approach stores all the results in RAM, so it may still run out of memory; if you want to store the results in a database instead, the curl_multi_getcontent part can be modified to write each response to the database rather than to a variable that stays in RAM.)
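As a rough sketch of that last point, each body can be handed to a storage function as soon as its transfer completes, so that `$ret` only holds short references. The helper name `save_body()` and the one-file-per-URL layout are assumptions for illustration, not part of the code above; a database insert would slot in the same way.

```php
<?php
// Hypothetical helper: persist one response body to disk, keyed by a hash of its URL.
// Returns the path that was written.
function save_body(string $dir, string $url, string $body): string
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("cannot create directory: $dir");
    }
    $path = $dir . DIRECTORY_SEPARATOR . sha1($url) . '.body';
    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("cannot write: $path");
    }
    return $path;
}

// Inside fetch_urls(), the line that keeps the body in $ret could then become
// something like the following, where $url stands for the URL looked up in $workers:
//   $ret[$url] = save_body('/tmp/fetched', $url, curl_multi_getcontent($info['handle']));
```

With this change the returned array maps URLs to file paths, and each body can be garbage-collected once it has been written.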

I would like to provide a more complete example without hitting the CPU at 100% and crashing when there's a slight error or something unexpected.
It also shows you how to fetch the headers, the body, request info and manual redirect following.
Disclaimer, this code is intended to be extended and implemented into a library or as a quick starting point, and as such the functions inside of it are kept to a minimum.
function mtime(){
return microtime(true);
}
function ptime($prev){
$t = microtime(true) - $prev;
$t = $t * 1000;
return str_pad($t, 20, 0, STR_PAD_RIGHT);
}
// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
// In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
// In practice it sometimes does
// So imagine that this just runs curl_multi_exec once and returns its value
do {
$state = curl_multi_exec($mh, $still_running);
// curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
// We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
} while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
return $state;
}
// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl, it also forces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1){
$umin = $minTime*1000000;
$start_time = microtime(true);
// it sleeps until there is some activity on any of the descriptors (curl files)
// it returns the number of descriptors (curl files that can have activity)
$num_descriptors = curl_multi_select($mh, $maxTime);
// if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
// but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
if($num_descriptors === -1){
usleep($umin);
}
$timespan = (microtime(true) - $start_time) * 1000000; // elapsed time in microseconds, the same unit as $umin
// This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
// This will reduce the runs so that each interval is separated by at least minTime
if($timespan < $umin){
usleep((int) ($umin - $timespan));
//print "sleep for ".($umin - $timespan).PHP_EOL;
}
}
$handles = [
[
CURLOPT_URL=>"http://example.com/",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
],
[
CURLOPT_URL=>"http://www.php.net",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
// this function is called by curl for each header received
// This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
// https://stackoverflow.com/a/41135574
CURLOPT_HEADERFUNCTION=>function($ch, $header)
{
print "header from http://www.php.net: ".$header;
//$header = explode(':', $header, 2);
//if (count($header) < 2){ // ignore invalid headers
// return $len;
//}
//$headers[strtolower(trim($header[0]))][] = trim($header[1]);
return strlen($header);
}
]
];
//create the multiple cURL handle
$mh = curl_multi_init();
$chandles = [];
foreach($handles as $opts) {
// create cURL resources
$ch = curl_init();
// set URL and other appropriate options
curl_setopt_array($ch, $opts);
// add the handle
curl_multi_add_handle($mh, $ch);
$chandles[] = $ch;
}
//execute the multi handle
$prevRunning = null;
$count = 0;
do {
$time = mtime();
// $running contains the number of currently running requests
$status = curl_multi_exec_full($mh, $running);
$count++;
print ptime($time).": curl_multi_exec status=$status running $running".PHP_EOL;
// One less is running, meaning one has finished
if($running < $prevRunning){
print ptime($time).": curl_multi_info_read".PHP_EOL;
// msg: The CURLMSG_DONE constant. Other return values are currently not available.
// result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
// handle: Resource of type curl indicates the handle which it concerns.
while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {
$info = curl_getinfo($read['handle']);
if($read['result'] !== CURLE_OK){
// handle the error somehow
print "Error: ".$info['url'].PHP_EOL;
}
if($read['result'] === CURLE_OK){
/*
// This will automatically follow the redirect and still give you control over the previous page
// TODO: max redirect checks and redirect timeouts
if(isset($info['redirect_url']) && trim($info['redirect_url'])!==''){
print "running redirect: ".$info['redirect_url'].PHP_EOL;
$ch3 = curl_init();
curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
curl_setopt($ch3, CURLOPT_HEADER, 0);
curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
curl_multi_add_handle($mh,$ch3);
}
*/
print_r($info);
$body = curl_multi_getcontent($read['handle']);
print $body;
}
}
}
// Still running? keep waiting...
if ($running > 0) {
curl_multi_wait($mh);
}
$prevRunning = $running;
} while ($running > 0 && $status == CURLM_OK);
//close the handles
foreach($chandles as $ch){
curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);
print $count.PHP_EOL;
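The commented-out header parsing in the CURLOPT_HEADERFUNCTION callback above can be pulled into a reusable closure. This is a minimal sketch, assuming a hypothetical `make_header_collector()` helper (not part of the answer's code) and that lower-cased header names are acceptable as array keys:

```php
<?php
// Hypothetical helper: returns a callback suitable for CURLOPT_HEADERFUNCTION
// that appends every "Name: value" line to $headers (passed by reference).
function make_header_collector(array &$headers): callable
{
    return function ($ch, string $line) use (&$headers): int {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            // The same header may appear more than once, so collect values in a list.
            $headers[strtolower(trim($parts[0]))][] = trim($parts[1]);
        }
        // Lines without a colon (status line, blank separator) are skipped.
        // curl expects the number of bytes handled; anything else aborts the transfer.
        return strlen($line);
    };
}
```

It would be wired in with something like `$headers = []; curl_setopt($ch, CURLOPT_HEADERFUNCTION, make_header_collector($headers));`, using one array per handle.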

Related

PHP multi cURL performance worse than sequential file_get_contents

I am writing an interface in which I must launch 4 HTTP requests to get some information.
I implemented the interface in 2 ways:
using sequential file_get_contents.
using multi curl.
I have benchmarked the 2 versions with JMeter. The results show that multi curl is much better than sequential file_get_contents when only 1 JMeter thread is making requests, but much worse with 100 threads.
The question is: what could cause the bad performance of multi curl?
My multi curl code is as below:
$curl_handle_arr = array ();
$master = curl_multi_init();
foreach ( $call_url_arr as $key => $url )
{
$curl_handle = curl_init( $url );
$curl_handle_arr [$key] = $curl_handle;
curl_setopt( $curl_handle , CURLOPT_RETURNTRANSFER , true );
curl_setopt( $curl_handle , CURLOPT_POST , true );
curl_setopt( $curl_handle , CURLOPT_POSTFIELDS , http_build_query( $params_arr [$key] ) );
curl_multi_add_handle( $master , $curl_handle );
}
$running = null;
$mrc = null;
do
{
$mrc = curl_multi_exec( $master , $running );
}
while ( $mrc == CURLM_CALL_MULTI_PERFORM );
while ( $running && $mrc == CURLM_OK )
{
if (curl_multi_select( $master ) != - 1)
{
do
{
$mrc = curl_multi_exec( $master , $running );
}
while ( $mrc == CURLM_CALL_MULTI_PERFORM );
}
}
foreach ( $call_url_arr as $key => $url )
{
$curl_handle = $curl_handle_arr [$key];
if (curl_error( $curl_handle ) == '')
{
$result_str_arr [$key] = curl_multi_getcontent( $curl_handle );
}
curl_multi_remove_handle( $master , $curl_handle );
}
curl_multi_close( $master );
1. Simple optimization
You should sleep for about 2500 microseconds if curl_multi_select fails.
In practice, it definitely fails at some point in each execution.
Without sleeping, your CPU resources get eaten up by lots of while (true) { } loops.
If you do nothing after some (not all) of the requests have finished,
you should make the maximum timeout (in seconds) larger.
Your code is written for old libcurl versions. As of libcurl version 7.20.0,
the state CURLM_CALL_MULTI_PERFORM is no longer returned.
So, the following code
$running = null;
$mrc = null;
do
{
$mrc = curl_multi_exec( $master , $running );
}
while ( $mrc == CURLM_CALL_MULTI_PERFORM );
while ( $running && $mrc == CURLM_OK )
{
if (curl_multi_select( $master ) != - 1)
{
do
{
$mrc = curl_multi_exec( $master , $running );
}
while ( $mrc == CURLM_CALL_MULTI_PERFORM );
}
}
should be
curl_multi_exec($master, $running);
do
{
if (curl_multi_select($master, 99) === -1)
{
usleep(2500);
continue;
}
curl_multi_exec($master, $running);
} while ($running);
Note
The timeout value of curl_multi_select should be tuned only if you want to do something like...
curl_multi_exec($master, $running);
do
{
if (curl_multi_select($master, $TIMEOUT) === -1)
{
usleep(2500);
continue;
}
curl_multi_exec($master, $running);
while ($info = curl_multi_info_read($master))
{
/* Do something with $info */
}
} while ($running);
Otherwise, the value should be extremely large.
(However, PHP_INT_MAX is too large; libcurl treats it as an invalid value.)
2. Easy experiment in one PHP process
I tested using my parallel cURL executor library: mpyw/co
(The preposition "for" in the function names is wrong and should be "by"; sorry for my poor English xD)
<?php
require 'vendor/autoload.php';
use mpyw\Co\Co;
function four_sequencial_requests_for_one_hundread_people()
{
for ($i = 0; $i < 100; ++$i) {
$tasks[] = function () use ($i) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => 'example.com',
CURLOPT_FORBID_REUSE => true,
CURLOPT_RETURNTRANSFER => true,
]);
for ($j = 0; $j < 4; ++$j) {
yield $ch;
}
};
}
$start = microtime(true);
yield $tasks;
$end = microtime(true);
printf("Time of %s: %.2f sec\n", __FUNCTION__, $end - $start);
}
function requests_for_four_hundreds_people()
{
for ($i = 0; $i < 400; ++$i) {
$tasks[] = function () use ($i) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => 'example.com',
CURLOPT_FORBID_REUSE => true,
CURLOPT_RETURNTRANSFER => true,
]);
yield $ch;
};
}
$start = microtime(true);
yield $tasks;
$end = microtime(true);
printf("Time of %s: %.2f sec\n", __FUNCTION__, $end - $start);
}
Co::wait(four_sequencial_requests_for_one_hundread_people(), [
'concurrency' => 0, // Zero means unlimited
]);
Co::wait(requests_for_four_hundreds_people(), [
'concurrency' => 0, // Zero means unlimited
]);
I ran this five times, and also in reverse order (the 3rd request was kicked xD). The results from the original post are not reproduced here.
These results show that too many concurrent TCP connections actually decrease throughput.
3. Advanced optimization
3-A. For different destinations
If you want to optimize for both few and many concurrent requests, the following dirty solution may help you.
Share the number of requesters using apcu_add / apcu_fetch / apcu_delete.
Switch methods (sequential or parallel) based on the current value.
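One way this could look, as a sketch only: it uses apcu_inc / apcu_dec for the counter instead of the apcu_fetch / apcu_delete calls mentioned above, because they update the value atomically. The function names `choose_strategy()` and `with_requester_count()`, the cache key, and the threshold of 20 are all assumptions made for illustration.

```php
<?php
// Decide how to fetch based on how many requesters are active right now.
// The threshold of 20 is an arbitrary placeholder; tune it by benchmarking.
function choose_strategy(int $active_requesters, int $threshold = 20): string
{
    return $active_requesters > $threshold ? 'sequential' : 'parallel';
}

// Run $fetch with the chosen strategy while counting this requester in APCu.
// Falls back to 'parallel' when APCu is missing or disabled (e.g. on the CLI).
function with_requester_count(callable $fetch, string $key = 'active_requesters')
{
    if (!function_exists('apcu_enabled') || !apcu_enabled()) {
        return $fetch('parallel');
    }
    apcu_add($key, 0);            // creates the counter only if it does not exist yet
    $active = apcu_inc($key);     // atomically registers this requester
    try {
        return $fetch(choose_strategy((int) $active));
    } finally {
        apcu_dec($key);           // always unregisters, even if $fetch throws
    }
}
```

The callback receives either 'parallel' or 'sequential' and is expected to pick the multi curl or the file_get_contents code path accordingly.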
3-B. For the same destinations
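CURLMOPT_PIPELINING will help you. This option bundles all HTTP/1.1 connections for the same destination into one TCP connection. (Note that libcurl 7.62.0 and later no longer support HTTP/1.1 pipelining, so there the value 1 has no effect; HTTP/2 multiplexing via the value 2, CURLPIPE_MULTIPLEX, is the remaining form of this option.)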
curl_multi_setopt($master, CURLMOPT_PIPELINING, 1);

PHP Multiple Simultaneous Data Requests

I am trying to get multiple data requests at once to speed up processing time. I have been able to get all of the data I need by looping requests one at a time, but it takes 30-60 seconds. I just discovered curl_multi, but I cannot get it to return data, and so in the new code below, I am trying to get 3 URLs to work, but it'll end up being probably 10+ requests at once. I edited out my keys and replaced them with xxxx, so I already know that it won't run as is, if you try.
Any help or direction is greatly appreciated!
Code that works for all requests, but is time consuming:
define('TIME', date('U')+3852);
require_once 'OAuth.php';
$consumer_key = 'xxxx';
$consumer_secret = 'xxxx';
$access_token = 'xxxx';
$access_secret = 'xxxx';
$i=0;
foreach ($expirations as $row){
$expiry_date = $expirations[$i];
$url = "https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$expiry_date";
//Option Expirations
//$url = 'https://api.tradeking.com/v1/market/options/expirations.xml?symbol=SPX';
//Account data
//$url = 'https://api.tradeking.com/v1/accounts';
$consumer = new OAuthConsumer($consumer_key,$consumer_secret);
$access = new OAuthToken($access_token,$access_secret);
$request = OAuthRequest::from_consumer_and_token($consumer, $access, 'GET', $url);
$request->sign_request(new OAuthSignatureMethod_HMAC_SHA1(), $consumer, $access);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, array($request->to_header()));
$response = curl_exec($ch);
//Turn XML ($response) into an array ($array)
$array=json_decode(json_encode(simplexml_load_string($response)),true);
//Trim excess fields from ($array) and define as ($chain)
$chain = $array['quotes']['quote'];
$bigarray[$i] = $chain;
$i++;
}
Returns a huge array as variable $bigarray
Code that isn't working for three simultaneous requests:
define('TIME', date('U')+3852);
require_once 'OAuth.php';
$consumer_key = 'xxxx';
$consumer_secret = 'xxxx';
$access_token = 'xxxx';
$access_secret = 'xxxx';
$consumer = new OAuthConsumer($consumer_key,$consumer_secret);
$access = new OAuthToken($access_token,$access_secret);
// Run the parallel get and print the total time
$s = microtime(true);
// Define the URLs
$urls = array(
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160311",
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160318",
"https://api.tradeking.com/v1/market/options/search.xml?symbol=SPX&query=xdate-eq%3A$20160325"
);
$pg = new ParallelGet($urls);
print "<br />Total time: ".round(microtime(true) - $s, 4)." seconds";
// Class to run parallel GET requests and return the transfer
class ParallelGet
{
function __construct($urls)
{
// Create get requests for each URL
$mh = curl_multi_init();
foreach($urls as $i => $url)
{
$request = OAuthRequest::from_consumer_and_token($consumer, $access, 'GET', $url);
$request->sign_request(new OAuthSignatureMethod_HMAC_SHA1(), $consumer, $access);
$ch[$i] = curl_init($url);
curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch[$i], CURLOPT_HTTPHEADER, array($request->to_header()));
curl_multi_add_handle($mh, $ch[$i]);
}
// Start performing the request
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
// Loop and continue processing the request
while ($runningHandles && $execReturnValue == CURLM_OK) {
// Wait forever for network
$numberReady = curl_multi_select($mh);
if ($numberReady != -1) {
// Pull in any new data, or at least handle timeouts
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
}
}
// Check for any errors
if ($execReturnValue != CURLM_OK) {
trigger_error("Curl multi read error $execReturnValue\n", E_USER_WARNING);
}
// Extract the content
foreach($urls as $i => $url)
{
// Check for errors
$curlError = curl_error($ch[$i]);
if($curlError == "") {
$res[$i] = curl_multi_getcontent($ch[$i]);
} else {
print "Curl error on handle $i: $curlError\n";
}
// Remove and close the handle
curl_multi_remove_handle($mh, $ch[$i]);
curl_close($ch[$i]);
}
// Clean up the curl_multi handle
curl_multi_close($mh);
// Print the response data
print_r($res);
}
}
?>
Returns:
Array ( [0] => [1] => [2] => )
Total time: 0.4572 seconds

How do we make multiple file_get_contents in PHP run faster?

We're building an API that calls file_get_contents repeatedly.
I have an array of user IDs, and file_get_contents() is called once per element of the array. We will do thousands of requests.
function request($userid) {
global $access_token;
$url = 'https://<api-domain>'.$userid.'?access_token='.$access_token;
$response = file_get_contents($url);
return json_decode($response);
}
function loopUserIds($arrayUserIds) {
global $countFollowers;
$arrayAllUserIds = array();
foreach ($arrayUserIds as $userid) {
$followers = request($userid);
...
}
...
}
My concern is that it takes a long time to get everything, since the function will also be called in a loop. Please advise how we can make these many file_get_contents() requests run faster.
As @HankyPanky mentioned, you can use curl_multi_exec() to run many requests concurrently.
Something like this should help:
function fetchAndProcessUrls(array $urls, callable $f) {
$multi = curl_multi_init();
$reqs = [];
foreach ($urls as $url) {
$req = curl_init();
curl_setopt($req, CURLOPT_URL, $url);
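curl_setopt($req, CURLOPT_HEADER, 0);
// Return the body as a string so curl_multi_getcontent() can retrieve it below
curl_setopt($req, CURLOPT_RETURNTRANSFER, true);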
curl_multi_add_handle($multi, $req);
$reqs[] = $req;
}
// While we're still active, execute curl
$active = null;
// Execute the handles
do {
$mrc = curl_multi_exec($multi, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($multi) != -1) {
do {
$mrc = curl_multi_exec($multi, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
// Pass each response to the callback, then remove the handle
foreach ($reqs as $req) {
$f(curl_multi_getcontent($req));
curl_multi_remove_handle($multi, $req);
}
curl_multi_close($multi);
}
You can use it like so:
$urlArray = [ 'http://www.example.com/' , 'http://www.example.com/', ... ];
fetchAndProcessUrls($urlArray, function($requestData) {
/* do stuff here */
// e.g.
$jsonData = json_decode($requestData, true); // decode into an associative array
});
When curl_multi_exec isn't available, you can still gain performance by reusing $ch instead of creating a new handle for every file; this reuses the connection when downloading from the same host.
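A minimal sketch of that handle-reuse idea, assuming a hypothetical `fetch_sequential()` helper and an arbitrary 30-second timeout (neither comes from the answer above):

```php
<?php
// Fetch URLs one after another through a single curl handle.
// Reusing the handle lets libcurl keep the connection alive between requests
// to the same host instead of reconnecting each time.
function fetch_sequential(array $urls): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 30, // arbitrary placeholder
    ]);
    $results = [];
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes per request
        $body = curl_exec($ch);
        $results[$url] = ($body === false) ? null : $body; // null marks a failed transfer
    }
    unset($ch); // releases the handle and its cached connection
    return $results;
}
```

The requests still run one at a time, so this only saves the connection setup cost; it does not overlap the waiting time the way curl_multi does.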

cURL Multi Threading?

I've heard a lot about php's multi threading with cURL but have never really tried it and I find it a bit tough to understand how it actually works. Could anyone convert this into curl_multi?
$path1 = array("path1", "path2", "path3"); //example
$path2 = array("path1", "path2", "path3"); //example
$opt = curl_init($path1);
curl_setopt($opt, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($opt);
curl_close($opt);
file_put_contents($path2, $content);
What I want to actually do is to download multiple files from the arrays path 1 into path 2 using curl_multi.
This is a nice project to start with...
https://github.com/jmathai/php-multi-curl
I am using curl multi and it is awesome indeed. I am using this to make faster push notifications.
https://github.com/Krutarth/FlashSnsPns
The accepted answer above is outdated/wrong, so the correct answer should be upvoted.
http://php.net/manual/en/function.curl-multi-init.php
Now, PHP supports fetching multiple URLs at the same time.
There is a very good function written by someone, http://archevery.blogspot.in/2013/07/php-curl-multi-threading.html
This is the function:
function runRequests($url_array, $thread_width = 4) {
$threads = 0;
$master = curl_multi_init();
$curl_opts = array(CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 15,
CURLOPT_TIMEOUT => 15);
$results = array();
$count = 0;
foreach($url_array as $url) {
$ch = curl_init();
$curl_opts[CURLOPT_URL] = $url;
curl_setopt_array($ch, $curl_opts);
curl_multi_add_handle($master, $ch); //push URL for single rec send into curl stack
$results[$count] = array("url" => $url, "handle" => $ch);
$threads++;
$count++;
if($threads >= $thread_width) { //start running when stack is full to width
while($threads >= $thread_width) {
usleep(100);
while(($execrun = curl_multi_exec($master, $running)) === -1){}
curl_multi_select($master);
// a request was just completed - find out which one and remove it from stack
while($done = curl_multi_info_read($master)) {
foreach($results as &$res) {
if($res['handle'] == $done['handle']) {
$res['result'] = curl_multi_getcontent($done['handle']);
}
}
curl_multi_remove_handle($master, $done['handle']);
curl_close($done['handle']);
$threads--;
}
}
}
}
do { //finish sending remaining queue items when all have been added to curl
usleep(100);
while(($execrun = curl_multi_exec($master, $running)) === -1){}
curl_multi_select($master);
while($done = curl_multi_info_read($master)) {
foreach($results as &$res) {
if($res['handle'] == $done['handle']) {
$res['result'] = curl_multi_getcontent($done['handle']);
}
}
curl_multi_remove_handle($master, $done['handle']);
curl_close($done['handle']);
$threads--;
}
} while($running > 0);
curl_multi_close($master);
return $results;
}
You can just use it.
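For the specific task in the question (download each entry of one array to the matching path in another), a shorter sketch may be enough. It assumes both arrays use the same keys; the function name `download_all()` and the 30-second timeout are placeholders, and bodies are buffered in memory before being written, which suits small files only.

```php
<?php
// Download $sources[$i] to $destinations[$i] for every key, in parallel.
// Returns an array of booleans (same keys) telling which downloads succeeded.
function download_all(array $sources, array $destinations): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($sources as $i => $src) {
        $ch = curl_init($src);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 30, // arbitrary placeholder
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers, sleeping in curl_multi_select() between rounds.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(2500); // select failed; back off briefly instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect per-transfer results and write successful bodies to disk.
    $ok = array_fill_keys(array_keys($sources), false);
    while (($info = curl_multi_info_read($mh)) !== false) {
        $i = array_search($info['handle'], $handles, true);
        if ($i !== false && $info['result'] === CURLE_OK) {
            $body = curl_multi_getcontent($info['handle']);
            $ok[$i] = file_put_contents($destinations[$i], $body) !== false;
        }
        curl_multi_remove_handle($mh, $info['handle']);
    }
    curl_multi_close($mh);
    return $ok;
}
```

Called as `download_all($path1, $path2)`, it starts every transfer at once, so for long lists it would need the same concurrency limit as runRequests() above.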

Optimising PHP cURL based link checker script - currently very slow

I'm using a PHP script (using cURL) to check whether:
The links in my database are correct (ie return HTTP status 200)
The links are in fact redirected and redirect to an appropriate/similar page (using the contents of the page )
The results of this are saved to a log file and emailed to me as an attachment.
This is all fine and working, however it is slow as all hell and half the time it times out and aborts itself early. Of note, I have about 16,000 links to check.
Was wondering how best to make this run quicker, and what I'm doing wrong?
Code below:
function echoappend ($file,$tobewritten) {
fwrite($file,$tobewritten);
echo $tobewritten;
}
error_reporting(E_ALL);
ini_set('display_errors', '1');
$filename=date('YmdHis') . "linkcheck.htm";
echo $filename;
$file = fopen($filename,"w+");
try {
$conn = new PDO('mysql:host=localhost;dbname=databasename',$un,$pw);
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
echo '<b>connected to db</b><br /><br />';
$sitearray = array("medical.posterous","ebm.posterous","behavenet","guidance.nice","www.rch","emedicine","www.chw","www.rxlist","www.cks.nhs.uk");
foreach ($sitearray as $key => $value) {
$site=$value;
echoappend ($file, "<h1>" . $site . "</h1>");
$q="SELECT * FROM link WHERE url LIKE :site";
$stmt = $conn->prepare($q);
$stmt->execute(array(':site' => 'http://' . $site . '%'));
$result = $stmt->fetchAll();
$totallinks = 0;
$workinglinks = 0;
foreach($result as $row)
{
$ch = curl_init();
$originalurl = $row['url'];
curl_setopt($ch, CURLOPT_URL, $originalurl);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
$output = curl_exec($ch);
if ($output === FALSE) {
echo "cURL Error: " . curl_error($ch);
}
$urlinfo = curl_getinfo($ch);
if ($urlinfo['http_code'] == 200)
{
echoappend($file, $row['name'] . ": <b>working!</b><br />");
$workinglinks++;
}
else if ($urlinfo['http_code'] == 301 || 302)
{
$redirectch = curl_init();
curl_setopt($redirectch, CURLOPT_URL, $originalurl);
curl_setopt($redirectch, CURLOPT_HEADER, 1);
curl_setopt($redirectch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($redirectch, CURLOPT_NOBODY, false);
curl_setopt($redirectch, CURLOPT_FOLLOWLOCATION, true);
$redirectoutput = curl_exec($redirectch);
$doc = new DOMDocument();
@$doc->loadHTML($redirectoutput);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
echoappend ($file, $row['name'] . ": <b>redirect ... </b>" . $title . " ... ");
if (strpos(strtolower($title),strtolower($row['name']))===false) {
echoappend ($file, "FAIL<br />");
}
else {
$header = curl_getinfo($redirectch);
echoappend ($file, $header['url']);
echoappend ($file, "SUCCESS<br />");
}
curl_close($redirectch);
}
else
{
echoappend ($file, $row['name'] . ": <b>FAIL code</b>" . $urlinfo['http_code'] . "<br />");
}
curl_close($ch);
$totallinks++;
}
echoappend ($file, '<br />');
echoappend ($file, $site . ": " . $workinglinks . "/" . $totallinks . " links working. <br /><br />");
}
$conn = null;
echo '<br /><b>connection closed</b><br /><br />';
} catch(PDOException $e) {
echo 'ERROR: ' . $e->getMessage();
}
Short answer is use the curl_multi_* methods to parallelize your requests.
The reason for the slowness is that web requests are comparatively slow. Sometimes VERY slow. Using the curl_multi_* functions lets you run multiple requests simultaneously.
One thing to be careful about is to limit the number of requests you run at once. In other words, don't run 16,000 requests at once. Maybe start at 16 and see how that goes.
The following example should help you get started:
<?php
//
// Fetch a bunch of URLs in parallel. Returns an array of results indexed
// by URL.
//
function fetch_urls($urls, $curl_options = array()) {
$curl_multi = curl_multi_init();
$handles = array();
$options = $curl_options + array(
CURLOPT_HEADER => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_NOBODY => true,
CURLOPT_FOLLOWLOCATION => true);
foreach($urls as $url) {
$handles[$url] = curl_init($url);
curl_setopt_array($handles[$url], $options);
curl_multi_add_handle($curl_multi, $handles[$url]);
}
$active = null;
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
while ($active && ($status == CURLM_OK)) {
if (curl_multi_select($curl_multi) != -1) {
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
}
}
if ($status != CURLM_OK) {
trigger_error("Curl multi read error $status\n", E_USER_WARNING);
}
$results = array();
foreach($handles as $url => $handle) {
$results[$url] = curl_getinfo($handle);
curl_multi_remove_handle($curl_multi, $handle);
curl_close($handle);
}
curl_multi_close($curl_multi);
return $results;
}
//
// The urls to test
//
$urls = array("http://google.com", "http://yahoo.com", "http://google.com/probably-bogus", "http://www.google.com.au");
//
// The number of URLs to test simultaneously
//
$request_limit = 2;
//
// Test URLs in batches
//
$redirected_urls = array();
for ($i = 0 ; $i < count($urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($urls, $i, $request_limit));
foreach($results as $url => $result) {
if ($result['http_code'] == 200) {
$status = "Worked!";
} else {
$status = "FAILED with {$result['http_code']}";
}
if ($result["redirect_count"] > 0) {
array_push($redirected_urls, $url);
echo "{$url}: redirected to {$result['url']} and {$status}\n";
} else {
echo "{$url}: {$status}\n";
}
}
}
//
// Handle redirected URLs
//
echo "Processing redirected URLs...\n";
for ($i = 0 ; $i < count($redirected_urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($redirected_urls, $i, $request_limit), array(CURLOPT_FOLLOWLOCATION => false));
foreach($results as $url => $result) {
if ($result['http_code'] == 301) {
echo "{$url} permanently redirected to {$result['url']}\n";
} else if ($result['http_code'] == 302) {
echo "{$url} temporarily redirected to {$result['url']}\n";
} else {
echo "{$url}: FAILED with {$result['http_code']}\n";
}
}
}
The above code processes a list of URLs in batches. It works in two passes. In the first pass, each request is configured to follow redirects and simply reports whether each URL ultimately led to a successful request or a failure.
The second pass processes any redirected URLs detected in the first pass and reports whether the redirect was a permanent redirection (meaning you can update your database with the new URL), or temporary (meaning you should NOT update your database).
NOTE:
In your original code, you have the following line, which will not work the way you expect it to:
else if ($urlinfo['http_code'] == 301 || 302)
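Because == binds more tightly than ||, this is evaluated as ($urlinfo['http_code'] == 301) || 302, and 302 is always truthy, so the expression is ALWAYS true. The correct expression is: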
else if ($urlinfo['http_code'] == 301 || $urlinfo['http_code'] == 302)
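The difference is easy to check in isolation. A minimal sketch, where `is_redirect()` is a hypothetical helper name:

```php
<?php
// The comparison runs first, then `|| 302` makes the whole expression truthy.
$code = 404;
var_dump($code == 301 || 302);   // bool(true), even though 404 is not a redirect

// Hypothetical helper that compares the code against each value explicitly.
function is_redirect(int $code): bool
{
    return in_array($code, [301, 302], true);
}

var_dump(is_redirect(404));      // bool(false)
var_dump(is_redirect(302));      // bool(true)
```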
Also, put
set_time_limit(0);
at the top of your script to stop it aborting when it hits 30 seconds.
