I have 1000 feed URLs sitting in a MySQL database table. I need to make an HTTP request to each of these URLs every 2 minutes. I wrote a PHP script to do that, but the script takes 5 min 30 sec to run.
I want to be able to finish all 1000 requests in under a minute. Is there a way to run multiple async processes to get the job done faster? Any help is appreciated. Thanks in advance.
Since your question is about sending HTTP requests, not really ping, you can use grequests (Requests + gevent) to do it easily and fast (in my experience, a couple of hundred URL requests take seconds):
import grequests
urls = [
    'http://www.python.org',
    'http://python-requests.org',
    'http://www.google.com',
]
rs = (grequests.get(u) for u in urls)
grequests.map(rs)  # [<Response [200]>, <Response [200]>, <Response [200]>]
Your PHP script takes 5 minutes to run because it is synchronous code, which means that for every request you send, you have to wait for the response to arrive before moving on to the next request.
The trick here is not to wait (or block, as many would call it) for responses but to go straight on to the next request, and you can achieve that easily with gevent (coroutine-based) or Node.js. You can read more on it here.
Have a look at the AnyEvent::Ping or AnyEvent::FastPing modules on CPAN.
Below is a straightforward example of using AnyEvent::Ping to ping 10,000 URLs:
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Ping;

my $cv   = AnyEvent->condvar;
my $ping = AnyEvent::Ping->new;
my @results;

for my $url (get_ten_thousand_urls()) {
    $cv->begin;

    # ping each URL just once
    $ping->ping( $url, 1, sub {
        # [ url address, ping status ]
        push @results, [ $url, $_[0]->[0][0] ];
        $cv->end;
    });
}

$cv->recv;

# now do something with @results
Some quick tests of the above using 10,000 random URLs all took just over 7 seconds to run on my MacBook Air. With tweaking and/or a faster event loop this time will drop further (the above used the default pure-Perl event loop).
NB. AnyEvent is an abstraction library which will allow you to use the async event system provided by (or installed on) your system. If you want to use a specific event loop then remember to install the relevant Perl module from CPAN, e.g. EV if using libev. AnyEvent will default to a pure-Perl event loop if nothing else is found (installed).
BTW, if you just need to check an HTTP request (i.e. not ping) then simply replace the AnyEvent::Ping part with AnyEvent::HTTP.
You tagged this with "python", so I'll assume that using Python is an option here. Look at the multiprocessing module. For example:
#!/usr/bin/env python

import multiprocessing
import os
import requests
import subprocess

addresses = ['1.2.3.4', '1.2.3.5', '4.2.2.1', '8.8.8.8']
null = open(os.devnull, 'w')

def fbstatus(count):
    """Returns the count, and the HTTP status code of a request to Facebook"""
    return (count,
            requests.get('http://www.facebook.com/status.php').status_code)

def ping(address):
    """Returns the address, and True if the ping returned in under 5 seconds or
    else False"""
    return address, not subprocess.call(['ping', '-c1', '-W5', address],
                                        stdout=null)

pool = multiprocessing.Pool(15)

if False:
    print pool.map(ping, addresses)
else:
    pool.map(fbstatus, range(1000))
New - Fetching pages
The fbstatus() function fetches a page from Facebook. This scaled almost linearly with the size of the pool up through 30 concurrent processes. With a small pool it averaged a total runtime of about 80 seconds on my laptop; at 30 workers it took about 3.75 seconds of wall-clock time to finish.
Old - Pinging
This uses the subprocess module to call the ping command with a 5-second timeout and a count of 1. It uses the return value of ping (0 for success, non-zero for failure) and negates it to get False for failure and True for success. The ping() function returns the address it was called with plus that boolean result.
The last bit creates a multiprocessing pool with 15 child processes, then calls ping() on each of the values in addresses. Since ping() returns its address, it's really easy to see the result of pinging each of those addresses.
Running it, I get this output:
[('1.2.3.4', False), ('1.2.3.5', False), ('4.2.2.1', True), ('8.8.8.8', True)]
That run took 5.039 seconds of wallclock time and 0% CPU. In other words, it spent almost 100% of its time waiting for ping to return. In your script, you'd want to use something like Requests to fetch your feed URLs (and not the literal ping command that I was using as an example), but the basic structure could be nearly identical.
You could try multithreaded pinging in Python.
Here is a good example:
#!/usr/bin/env python2.5
from threading import Thread
import subprocess
from Queue import Queue

num_threads = 4
queue = Queue()
ips = ["10.0.1.1", "10.0.1.3", "10.0.1.11", "10.0.1.51"]

# wraps system ping command
def pinger(i, q):
    """Pings subnet"""
    while True:
        ip = q.get()
        print "Thread %s: Pinging %s" % (i, ip)
        ret = subprocess.call("ping -c 1 %s" % ip,
                              shell=True,
                              stdout=open('/dev/null', 'w'),
                              stderr=subprocess.STDOUT)
        if ret == 0:
            print "%s: is alive" % ip
        else:
            print "%s: did not respond" % ip
        q.task_done()

# Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=pinger, args=(i, queue))
    worker.setDaemon(True)
    worker.start()

# Place work in queue
for ip in ips:
    queue.put(ip)

# Wait until worker threads are done to exit
queue.join()
I used Perl's POE Ping Component module for this task quite extensively.
[Update: Re-tested this with maxSockets = 100 and while connected to a very good network connection. The script finished in < 1 second, meaning the biggest factor is probably network throughput/latency, as previously noted. Your results will almost certainly vary. ;) ]
You can use Node.js for this, as its API for doing HTTP is powerful, clean, and simple. E.g. the following script completes ~1000 requests in less than one second on my MacBook Pro:
test.js
var http = require('http');

// # of simultaneous requests allowed
http.globalAgent.maxSockets = 100;

var n = 0;
var start = Date.now();

function getOne(url) {
    var id = n++;
    var req = http.get(url, function(res) {
        res.on('data', function(chunk) {
            // do whatever with response data here
        });
        res.on('end', function() {
            console.log('Response #' + id + ' complete');
            n--;
            if (n == 0) {
                console.log('DONE in ' + (Date.now() - start)/1000 + ' secs');
            }
        });
    });
}

// Fire off 1000 requests
for (var i = 0; i < 1000; i++) {
    getOne('http://www.facebook.com/status.php');
}
Outputs ...
$ node test.js
Response #3 complete
Response #0 complete
Response #2 complete
...
Response #999 complete
DONE in 0.658 secs
Thanks Alex Lunix for the suggestion. I looked up curl_multi_* and found a solution to do it in cURL, so I don't have to change my code much. But thanks to all the others for their answers. Here is what I did:
<?php
require("class.php");
$obj = new module();
$det = $obj->get_url();
$batch_size = 40;

function curlTest2($urls) {
    clearstatcache();
    $batch_size = count($urls);
    $return = '';
    echo "<br/><br/>Batch:";
    foreach ($urls as &$url)
    {
        echo "<br/>".$url;
        if(substr($url,0,4)!="http") $url = "http://".$url;
        $url = "https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=-1&q=".$url;
    }
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $chs = array();
    for ($i = 0; $i < $batch_size; $i++)
    {
        $ch = curl_init();
        array_push($chs, $ch);
    }
    for ($i = 0; $i < $batch_size; $i++)
    {
        curl_setopt($chs[$i], CURLOPT_HEADER, 1);
        curl_setopt($chs[$i], CURLOPT_NOBODY, 1);
        curl_setopt($chs[$i], CURLOPT_USERAGENT, $userAgent);
        curl_setopt($chs[$i], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($chs[$i], CURLOPT_CONNECTTIMEOUT, 15);
        curl_setopt($chs[$i], CURLOPT_FAILONERROR, 1);
        curl_setopt($chs[$i], CURLOPT_FRESH_CONNECT, 1);
        curl_setopt($chs[$i], CURLOPT_URL, $urls[$i]);
    }
    $mh = curl_multi_init();
    for ($i = 0; $i < $batch_size; $i++)
    {
        curl_multi_add_handle($mh, $chs[$i]);
    }
    $active = null;
    // execute the handles
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    while ($active && $mrc == CURLM_OK) {
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }
    // close the handles
    for ($i = 0; $i < $batch_size; $i++)
    {
        curl_multi_remove_handle($mh, $chs[$i]);
    }
    curl_multi_close($mh);
}

$startTime = time();
$urls = array();
foreach($det as $key=>$value){
    array_push($urls, $value['url']);
    if (count($urls) == $batch_size)
    {
        curlTest2($urls);
        $urls = array();
    }
}
echo "<br/><br/>Time: ".(time() - $startTime)."sec";
?>
This brought my processing time down from 332 sec to 18 sec. The code can probably be optimized a little, but you get the gist of it.
Related
I am writing a php script to interact with a CouchDb server. The script reads an SQL database and creates documents and PUTs them on the server. Each script runs every 5 minutes and puts about 2000 documents (creates and updates).
Running sync, this takes about 3 minutes to PUT all the docs. In a test I did using Node and promises, I found CouchDB can handle 100 async PUTs at the same time and respond back in only slightly more time than it took to do a single document. I want to utilize this feature in PHP instead.
I have PHP 5.3 and PHP 7.0.10 available on a Windows server.
How do I do this async?
My first thought was using the pclose(popen()) trick, but that spawns a new process each time, and even if I restrict this to 100 docs at a time (my tests show up to 700 at a time is doable), that would still result in 6 scripts creating and recreating a total of 600 new processes every 100/2000 docs every 5 minutes, or a total of 12,000 processes created and run every 5 minutes. I don't think Windows can handle that.
My second idea was to set up a basic Node script to handle it, with PHP creating and formatting the data, writing it to a file, and passing the file to a Node script to process asynchronously and report back to PHP using exec. But I am hoping to find a pure PHP solution.
I currently send requests to couch like this
private function sendJSONRequest($method, $url, $post_data = NULL)
{
    // Open socket
    $s = fsockopen($this->db_host, $this->db_port, $errno, $errstr);
    if (!$s) {
        throw new Exception("fsockopen: $errno: $errstr");
    }

    // Prepare request
    $request = "$method $url HTTP/1.0\r\n" .
        ($this->db_auth === false ? "" : "Authorization: $this->db_auth\r\n") .
        "User-Agent: couchdb-php/1.0\r\n" .
        "Host: $this->db_host:$this->db_port\r\n" .
        "Accept: application/json\r\n" .
        "Connection: close\r\n";

    if ($method == "POST" || $method == "PUT") {
        $json_data = json_encode($post_data);
        $request .= "Content-Type: application/json\r\n" .
            "Content-Length: " . strlen($json_data) . "\r\n\r\n" .
            $json_data;
    } else {
        $request .= "\r\n";
    }

    // Send request
    fwrite($s, $request);
    $response = "";

    // Receive response
    while (!feof($s)) {
        $response .= fgets($s);
    }

    $headers = array();
    $body = '';
    $reason = '';
    if (!empty($response)) {
        // Split header & body
        list($header, $body) = explode("\r\n\r\n", $response);

        // Parse header
        $first = true;
        foreach (explode("\r\n", $header) as $line) {
            if ($first) {
                $status = intval(substr($line, 9, 3));
                $reason = substr($line, 13);
                $first = false;
            } else {
                $p = strpos($line, ":");
                $headers[strtolower(substr($line, 0, $p))] = substr($line, $p + 2);
            }
        }
    } else {
        $status = 200;
    }

    // Return results
    return array($status, $reason, $headers, json_decode($body));
}
My PHP knowledge is only basic, so examples to learn from would be greatly appreciated.
Thank you
Guzzle is a PHP library that helps send HTTP requests and can do so asynchronously. The documentation for the async function can be found here.
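To give a concrete starting point, here is a minimal sketch of what that could look like with Guzzle's request pool (assuming Guzzle 6+ installed via Composer; the CouchDB base URL, the database name "mydb", and the $docs array are placeholders for your own data):
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// $docs: doc id => document data (placeholder for the docs you build from SQL)
$docs   = ['doc1' => ['foo' => 'bar'], 'doc2' => ['baz' => 'qux']];
$client = new Client(['base_uri' => 'http://127.0.0.1:5984/', 'timeout' => 10]);

// Generator that yields one PUT request per document
$requests = function () use ($docs) {
    foreach ($docs as $id => $doc) {
        yield new Request('PUT', "mydb/$id",
            ['Content-Type' => 'application/json'],
            json_encode($doc));
    }
};

// Send up to 100 requests concurrently, handling results as they complete
$pool = new Pool($client, $requests(), [
    'concurrency' => 100,
    'fulfilled'   => function ($response, $index) { /* e.g. check $response->getStatusCode() */ },
    'rejected'    => function ($reason, $index)   { /* e.g. log the failure */ },
]);
$pool->promise()->wait();
Note that Guzzle 6 requires PHP 5.5 or newer, so of the two versions mentioned in the question this would have to run under the PHP 7.0 install.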
It's been a little while since I researched this topic, but simply put, what you are looking for is a queue runner system. At my old employer I worked with a custom-built queue runner in PHP.
That means you have, e.g., 4 queue runners. These are PHP processes which watch a control table, maybe called "queue". Each time a queue entry is inserted with, say, status "new", a runner locks that entry and starts the work with a fork.
PHP forking: http://php.net/manual/de/function.pcntl-fork.php
So these 4 queue runners can each fork, let's say, 10 processes; then you have 40 parallel working processes.
To separate what each one does, the best way is another control table from which each job selects an amount of data with LIMIT and OFFSET queries. Let's say job 1 selects rows 0-20, job 2 rows 21-40, and so on.
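To make the forking part concrete, here is a rough, hedged sketch of that pattern using pcntl_fork (the get_jobs() and process() helpers are hypothetical placeholders for your own data access and work; pcntl is only available in CLI PHP on Unix-like systems):
<?php
// Sketch only: fork a fixed number of workers, each taking its own LIMIT/OFFSET slice.
$workers = 4;
$pids = array();

for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: fetch this worker's slice of rows and process them
        $jobs = get_jobs($i * 20, 20);   // hypothetical: SELECT ... LIMIT 20 OFFSET ($i * 20)
        foreach ($jobs as $job) {
            process($job);               // hypothetical worker function
        }
        exit(0);
    } else {
        // Parent: remember the child PID and keep forking
        $pids[] = $pid;
    }
}

// Parent waits for all children to finish
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}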
Edit:
After a little research, this looks quite similar to what I worked with: https://github.com/CoderKungfu/php-queue
My situation:
I have multiple servers running a raw TCP API that requires me to send a string to get information from them. I need to get a response within a timeout of 5 seconds. All APIs should be contacted at the same time, and from then on they get 5 seconds to respond. (So the maximum execution time is 5 seconds for all servers at once.)
I already managed to do so for HTTP/S APIs with PHP cURL:
// array of curl handles
$multiCurl = array();
// data to be returned
$result = array();
// multi handle
$mh = curl_multi_init();

foreach ($row_apis as $api) {
    $id   = $api[0];
    $ip   = $api[1];
    $port = $api[2];

    // URL from which data will be fetched
    $fetchURL = "$ip:$port/api/status";

    $multiCurl[$id] = curl_init();
    curl_setopt($multiCurl[$id], CURLOPT_URL, $fetchURL);
    //curl_setopt($multiCurl[$id], CURLOPT_HEADER, 0);
    curl_setopt($multiCurl[$id], CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    curl_setopt($multiCurl[$id], CURLOPT_CUSTOMREQUEST, "GET");
    curl_setopt($multiCurl[$id], CURLOPT_TIMEOUT, 5);
    curl_setopt($multiCurl[$id], CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $multiCurl[$id]);
}

$index = null;
do {
    curl_multi_exec($mh, $index);
} while ($index > 0);

// get content and remove handles
foreach ($multiCurl as $k => $ch) {
    $result[$k] = json_decode(curl_multi_getcontent($ch), true);
    curl_multi_remove_handle($mh, $ch);
}

// close
curl_multi_close($mh);
This sample fetches all APIs at once and waits 5 seconds for a response. It will never take longer than 5 seconds.
Is there a way to do the same thing with raw TCP APIs in PHP?
I already tried to use sockets and was able to get the information, but each API was fetched one after another, so the script takes way too long for multiple servers.
Thanks for your help.
EDIT:
I've tried to implement your suggestions and my code now looks like this:
$apis = array();
$apis[0] = array(1, "123.123.123.123", 1880);
$method = "summary";
$sockets = array();

// Create socket array
foreach ($apis as $api) {
    $id   = $api[0];
    $ip   = $api[1];
    $port = $api[2];

    $sockets[$id] = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    socket_set_nonblock($sockets[$id]);
    @socket_connect($sockets[$id], $ip, $port);
    //socket_write($sockets[$id], $method);
}

// Write to every socket
/*
foreach ($sockets as $socket) {
    socket_write($socket, $method);
    //fwrite($socket, "$method");
}
*/

// Read all sockets for 5 seconds
$write = NULL;
$except = NULL;
$responses = socket_select($sockets, $write, $except, 5);

// Check result
if ($responses === false) {
    echo "Did not work";
}
elseif ($responses > 0) {
    echo "At least one has responded";
}

// Access the data
// ???
But I'm getting a 0 as the result of socket_select...
When do I need to write the method to the socket?
And if I will get something back, how do I access the data that was in the response?
Absolutely. Set SO_SNDBUF to an appropriate size so you can send all the requests instantly/non-blockingly, then send all the requests, then start waiting for/reading the responses.
The easy way to do the reading is to call socket_set_block on the sockets and read all responses one by one, but this doesn't give a hard guarantee of a 5-second timeout (then again, neither does your example curl_multi code). If you need a 5-second timeout, use socket_set_nonblock and socket_select instead.
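To make that concrete, here is a rough, hedged sketch of the non-blocking-socket approach described above, reusing the $apis and $method variables from the question (error handling and partial writes are glossed over; the buffer size and the exact deadline handling are assumptions):
$sockets = array();
$buffers = array();

foreach ($apis as $api) {
    list($id, $ip, $port) = $api;
    $s = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    socket_set_nonblock($s);
    socket_set_option($s, SOL_SOCKET, SO_SNDBUF, 65536); // per the answer: make sure the write won't block
    @socket_connect($s, $ip, $port);                     // returns immediately in non-blocking mode
    $sockets[$id] = $s;
    $buffers[$id] = '';
}

// Wait (up to 5s) until the sockets become writable (i.e. connected), then send the request string
$read = null; $write = $sockets; $except = null;
if (socket_select($read, $write, $except, 5) > 0) {
    foreach ($write as $s) {
        socket_write($s, $method);
    }
}

// Collect whatever arrives until the 5-second deadline
$deadline = microtime(true) + 5;
$pending  = $sockets;
while ($pending && ($left = $deadline - microtime(true)) > 0) {
    $read = $pending; $write = null; $except = null;
    $sec  = (int) $left;
    $usec = (int) (($left - $sec) * 1000000);
    if (socket_select($read, $write, $except, $sec, $usec) < 1) {
        break; // timed out or failed: stop waiting
    }
    foreach ($read as $s) {
        $id    = array_search($s, $sockets, true);
        $chunk = socket_read($s, 8192);
        if ($chunk === false || $chunk === '') {
            unset($pending[$id]);     // peer closed the connection (or error): this API is done
        } else {
            $buffers[$id] .= $chunk;  // accumulate this API's response
        }
    }
}

// $buffers[$id] now holds whatever each API sent back within the deadline
This is only a sketch of the idea from the answer above; whether a response counts as "complete" when the peer closes the connection or when some terminator arrives depends on the particular TCP API.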
When I run a check on 10 URLs, if I am able to get a connection with the host server, the handle will return a success message (CURLE_OK).
When processing each handle, if a server refuses the connection, the handle will include an error message.
The problem
I assumed that when we get a bad handle, cURL would mark this handle but continue to process the unprocessed handles; however, this is not what seems to happen.
When we come across a bad handle, cURL will mark this handle as bad, but will not process the remaining unprocessed handles.
This can be hard to detect: if I do get a connection with all handles, which is what happens most of the time, then the problem is not visible (cURL only stops on the first bad connection).
For the test, I had to find a suitable site which loads slowly/refuses a certain number of simultaneous connections.
set_time_limit(0);
$l = array(
    'http://smotri.com/video/list/',
    'http://smotri.com/video/list/sports/',
    'http://smotri.com/video/list/animals/',
    'http://smotri.com/video/list/travel/',
    'http://smotri.com/video/list/hobby/',
    'http://smotri.com/video/list/gaming/',
    'http://smotri.com/video/list/mult/',
    'http://smotri.com/video/list/erotic/',
    'http://smotri.com/video/list/auto/',
    'http://smotri.com/video/list/humour/',
    'http://smotri.com/video/list/film/'
);
$mh = curl_multi_init();
$s = 0;
$f = 10;
while($s <= $f)
{
    $ch = curl_init();
    $curlsettings = array(
        CURLOPT_URL => $l[$s],
        CURLOPT_TIMEOUT => 0,
        CURLOPT_CONNECTTIMEOUT => 0,
        CURLOPT_RETURNTRANSFER => 1
    );
    curl_setopt_array($ch, $curlsettings);
    curl_multi_add_handle($mh,$ch);
    $s++;
}
$active = null;
do
{
    curl_multi_exec($mh,$active);
    curl_multi_select($mh);
    $info = curl_multi_info_read($mh);
    echo '<pre>';
    var_dump($info);
    if($info['result'] === CURLE_OK)
        echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
    if($info['result'] != 0)
        echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
} while ($active > 0);
curl_multi_close($mh);
I have dumped $info in the script, which asks the multi handle whether there is any new information on any handles whilst running.
When the script has ended we will see some bool(false) entries - printed when no new information was available (handles were still processing) - along with all handles if all were successful, or only some handles if one handle failed.
I have failed at fixing this; it's probably something I have overlooked, and I have gone too far down the road attempting to fix things which are not relevant.
Some attempts at fixing this were:
Assigning each $ch handle to an array - $ch[1], $ch[2] etc. - instead of adding the current $ch handle to the multi handle and then overwriting it, as in the test above.
Removing handles after success/failure with curl_multi_remove_handle.
Setting CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT to infinity.
Many more (I will update this post, as I have forgotten some of them).
Testing this with PHP version 5.4.14.
Hopefully I have illustrated the points well enough.
Thanks for reading.
I've been mucking around with your script for a while now trying to get it to work. It was only when I read "Repeated calls to this function will return a new result each time, until a FALSE is returned as a signal that there is no more to get at this point." in the docs for curl_multi_info_read (http://se2.php.net/manual/en/function.curl-multi-info-read.php) that I realized a while loop might work.
The extra while loop makes it behave exactly how you'd expect. Here is the output I get:
http://smotri.com/video/list/sports/ failed
http://smotri.com/video/list/travel/ failed
http://smotri.com/video/list/gaming/ failed
http://smotri.com/video/list/erotic/ failed
http://smotri.com/video/list/humour/ failed
http://smotri.com/video/list/animals/ success
http://smotri.com/video/list/film/ success
http://smotri.com/video/list/auto/ success
http://smotri.com/video/list/ failed
http://smotri.com/video/list/hobby/ failed
http://smotri.com/video/list/mult/ failed
Here's the code I used for testing:
<?php
set_time_limit(0);
$l = array(
    'http://smotri.com/video/list/',
    'http://smotri.com/video/list/sports/',
    'http://smotri.com/video/list/animals/',
    'http://smotri.com/video/list/travel/',
    'http://smotri.com/video/list/hobby/',
    'http://smotri.com/video/list/gaming/',
    'http://smotri.com/video/list/mult/',
    'http://smotri.com/video/list/erotic/',
    'http://smotri.com/video/list/auto/',
    'http://smotri.com/video/list/humour/',
    'http://smotri.com/video/list/film/'
);
$mh = curl_multi_init();
$s = 0;
$f = 10;
while($s <= $f)
{
    $ch = curl_init();
    if($s%2)
    {
        $curlsettings = array(
            CURLOPT_URL => $l[$s],
            CURLOPT_TIMEOUT_MS => 3000,
            CURLOPT_RETURNTRANSFER => 1,
        );
    }
    else
    {
        $curlsettings = array(
            CURLOPT_URL => $l[$s],
            CURLOPT_TIMEOUT_MS => 4000,
            CURLOPT_RETURNTRANSFER => 1,
        );
    }
    curl_setopt_array($ch, $curlsettings);
    curl_multi_add_handle($mh,$ch);
    $s++;
}
$active = null;
do
{
    $mrc = curl_multi_exec($mh,$active);
    curl_multi_select($mh);
    while($info = curl_multi_info_read($mh))
    {
        echo '<pre>';
        //var_dump($info);
        if($info['result'] === 0)
        {
            echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
        }
        else
        {
            echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
        }
    }
} while ($active > 0);
curl_multi_close($mh);
Hope that helps. For testing, just adjust CURLOPT_TIMEOUT_MS to suit your internet connection. I made it alternate between 3000 and 4000 milliseconds, as 3000 will fail for me and 4000 usually succeeds.
Update
After going through the PHP and libcurl docs I have found out how curl_multi_exec works (in libcurl it is curl_multi_perform). Upon first being called it starts handling transfers for all the added handles (added beforehand via curl_multi_add_handle).
The number it assigns to $active is the number of transfers still running. So if it's less than the total number of handles you have, then you know one or more transfers are complete. curl_multi_exec therefore acts as a kind of progress indicator as well.
As all transfers are handled in a non-blocking fashion (transfers can finish simultaneously), the while loop that curl_multi_exec sits in cannot map one iteration to one completed URL request.
All results are stored in a queue, so as soon as one or more transfers are complete you can call curl_multi_info_read to fetch them.
In my original answer I had curl_multi_info_read in a while loop. This loop keeps iterating until curl_multi_info_read finds no remaining data in the queue. After that, the outer while loop moves on to the next iteration if $active != 0 (meaning curl_multi_exec reports transfers still not complete).
To summarize, the outer loop keeps iterating while there are still transfers not completed, and the inner loop iterates only when there is data from a completed transfer.
The PHP documentation is pretty bad for the curl multi functions, so I hope this cleared a few things up. Below is an alternative way to do the same thing.
do
{
    curl_multi_exec($mh,$active);
} while ($active > 0);

// while($info = curl_multi_info_read($mh)) would also work here
for($i = 0; $i <= $f; $i++){
    $info = curl_multi_info_read($mh);
    if($info['result'] === 0)
    {
        echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
    }
    else
    {
        echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
    }
}
From this information you can also see curl_multi_select is not needed as you don't want something that blocks until there is activity.
With the code you provided in your question, it only seemed like cURL wasn't proceeding after a few failed transfers, but there was actually still data queued in the buffer. Your code just wasn't calling curl_multi_info_read enough times. The reason all the successful transfers were picked up by your code is that PHP runs on a single thread, so the script hung waiting for those requests. The timeouts for the failed requests didn't make PHP hang/wait that long, so the number of iterations the while loop did was less than the number of queued results.
Well, I am attempting to reuse the handles I've spawned in the initial process; however, after the first run it simply stops working. If I remove the handles and add them again (or recreate the entire multi handle), it works fine. What could be the culprit?
My code currently looks like this:
<?php
echo 'Handler amount: ';
$threads = (int) trim(fgets(STDIN));
if($threads < 1) {
    $threads = 1;
}

$s = microtime(true);
$url = 'http://mywebsite.com/some-script.php';
$mh = curl_multi_init();
$ch = array();
for($i = 0; $i < $threads; $i++) {
    $ch[$i] = curl_init($url);
    curl_setopt_array($ch[$i], array(
        CURLOPT_USERAGENT => 'Mozilla/5.0 (X11; Linux i686; rv:21.0) Gecko/20130213 Firefox/21.0',
        CURLOPT_REFERER => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY => true
    ));
    curl_multi_add_handle($mh, $ch[$i]);
}

while($mh) {
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while($running > 0);
    $e = microtime(true);
    $totalTime = number_format($e - $s, 2);
    if($totalTime >= 1) {
        echo floor($threads / $totalTime) . ' requests per second (total time '.$totalTime.'s)' . "\r";
        $s = microtime(true);
    }
}

foreach($ch as $handler) {
    curl_multi_remove_handle($mh, $handler);
    curl_close($handler);
}
curl_multi_close($mh);
?>
When I have CURLOPT_VERBOSE set to true, I see many "additional stuff not fine transfer.c:1037: 0 0" messages. I read about them in a different question, and it seems they are caused by some obvious things:
Too fast
Firewall
ISP restricting
AFAIK, this is not it, because if I recreate the handles every time, they successfully complete at about 79 requests per second (about 529 bytes each)
My process for reusing the handles:
Create the multi handler, and add the specified number of handles to the multi handler
While the multi handler is working, execute all the handles
After the while loop has stopped (it seems very unlikely that it will), close all the handles and the multi curl handler
It executes all handles once and then stops.
This is really stumping me. Any ideas?
I ran into the same problem (using C++ though) and found out that I need to remove the curl easy handle(s) and add them back in again. My solution was to remove all handles at the end of the curl_multi_perform loop and add them back in at the beginning of the outer loop, in which I reuse existing keep-alive connections:
for(;;) // loop using keep-alive connections
{
    curl_multi_add_handle(...)

    while ( stillRunning ) // curl_multi_perform loop
    {
        ...
        curl_multi_perform(...)
        ...
    }

    curl_multi_remove_handle(...)
}
Perhaps this also applies to your PHP scenario. Remember: don't curl_easy_cleanup or curl_easy_init the curl handle in between.
If you turn on CURLOPT_VERBOSE you can follow along in the console and verify that your connections are indeed reused. That has solved this problem for me.
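Applied to the PHP script in the question, that pattern might look roughly like this (a sketch only, reusing $mh and the $ch array from above; the per-round timing output is omitted):
while (true) {
    // add the easy handles (back) to the multi handle before each round
    foreach ($ch as $handle) {
        curl_multi_add_handle($mh, $handle);
    }

    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // remove them again, but do NOT curl_close() them,
    // so the keep-alive connections can be reused next round
    foreach ($ch as $handle) {
        curl_multi_remove_handle($mh, $handle);
    }

    // ... per-round timing / requests-per-second output as in the original script ...
}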
I am connecting to an unreliable API via file_get_contents. Since it's unreliable, I decided to put the api call into a while loop thusly:
$resultJSON = FALSE;
while(!$resultJSON) {
    $resultJSON = file_get_contents($apiURL);
    set_time_limit(10);
}
Putting it another way: say the API fails twice before succeeding on the 3rd try. Have I sent 3 requests, or have I sent as many hundreds of requests as will fit into that 3-second window?
file_get_contents(), like basically all functions in PHP, is a blocking call.
Yes, it is a blocking function. You should also check whether the value is specifically false (note that === is used, not ==). Lastly, you want to sleep for 10 seconds between attempts; set_time_limit() only sets the max execution time before the script is automatically killed.
set_time_limit(300); // Run for up to 5 minutes.

$resultJSON = false;
while($resultJSON === false)
{
    $resultJSON = file_get_contents($apiURL);
    sleep(10);
}
Expanding on @Sammitch's suggestion to use cURL instead of file_get_contents():
<?php
$apiURL = 'http://stackoverflow.com/';
$curlh = curl_init($apiURL);
// Use === not ==
// if ($curlh === FALSE) handle error;
curl_setopt($curlh, CURLOPT_FOLLOWLOCATION, TRUE); // maybe, up to you
curl_setopt($curlh, CURLOPT_HEADER, FALSE); // or TRUE, according to your needs
curl_setopt($curlh, CURLOPT_RETURNTRANSFER, TRUE);
// set your timeout in seconds here
curl_setopt($curlh, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curlh, CURLOPT_TIMEOUT, 30);
$resultJSON = curl_exec($curlh);
curl_close($curlh);
// if ($resultJSON === FALSE) handle error;
echo "$resultJSON\n"; // Now process $resultJSON
?>
There are a lot more curl_setopt options. You should check them out.
Of course, this assumes you have cURL available.
I am not aware of any function in PHP that does not "block". As an alternative, and if your server permits such things, you can:
Use pcntl_fork() and do other stuff in your script while waiting for the API call to go through.
Use exec() to call another script in the background [using &] to do the API call for you if pcntl_fork() is unavailable (a rough sketch of this follows below).
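As an illustration of the second option, a hedged sketch (the fetch_api.php worker script and its result file are hypothetical placeholders):
// Launch a background worker and return immediately;
// "> /dev/null 2>&1 &" detaches it so exec() does not wait for it to finish.
exec('php fetch_api.php > /dev/null 2>&1 &');

// ... do other useful work in this script ...

// Later, pick up whatever the worker wrote to disk (hypothetical result file)
$resultJSON = @file_get_contents('/tmp/api_result.json');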
However, if you literally cannot do anything else in your script without a successful call to that API then it doesn't really matter if the call 'blocks' or not. What you should really be concerned about is spending so much time waiting for this API that you exceed the configured max_execution_time and your script is aborted in the middle without being properly completed.
$max_calls = 5;
for( $i=1; $i<=$max_calls; $i++ ) {
    $resultJSON = file_get_contents($apiURL);
    if( $resultJSON !== false ) {
        break;
    } else if( $i === $max_calls ) {
        throw new Exception("Could not reach API within $max_calls requests.");
    }
    usleep(250000); // wait 250ms between attempts
}
It's worth noting that file_get_contents() has a default timeout of 60 seconds so you're really in danger of the script being killed. Give serious consideration to using cURL instead since you can set much more reasonable timeout values.