Throttle cURL using usleep() - php

I'm using a web service to send hundreds of HTTP POST requests. However, the service only allows 5 per second. I'm wondering if usleep() is the best way to throttle this. For example:
foreach($JSONarray['DATABASE'] as $E)
{
$aws = curl_init();
//curl stuff
curl_exec($aws);
curl_close($aws);
usleep(200000);
}

This is untested, but it should give you the idea of what I would do (and perhaps the snippet even works as-is):
// presets
$thissecond = time();
$cnt = 0;
foreach ($JSONarray['DATABASE'] as $E)
{
    while ($thissecond == time() && $cnt > 4) { // wait when we're going too fast
        usleep(100000); // wait 0.1 second and ask again
    }
    if ($thissecond != time()) { // remember to reset the second and the counter
        $thissecond = time();
        $cnt = 0;
    }
    // off with the payload
    $aws = curl_init();
    // curl stuff
    curl_exec($aws);
    curl_close($aws);
    // remember to count it all
    $cnt++;
}
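One thing to keep in mind: a fixed usleep(200000) sleeps on top of however long each request itself takes, so if the posts are slow you will throttle yourself well below 5 per second. Here is a minimal sketch of an alternative that only sleeps for whatever is left of each 0.2 s slot; the endpoint URL and the use of $E as the POST payload are assumptions:
$interval = 0.2; // seconds per request => at most 5 requests per second
foreach ($JSONarray['DATABASE'] as $E) {
    $start = microtime(true);

    $aws = curl_init('https://api.example.com/endpoint'); // hypothetical endpoint
    curl_setopt($aws, CURLOPT_POST, true);
    curl_setopt($aws, CURLOPT_POSTFIELDS, http_build_query($E)); // assumes $E is an array of fields
    curl_setopt($aws, CURLOPT_RETURNTRANSFER, true);
    curl_exec($aws);
    curl_close($aws);

    // sleep only for the remainder of the 0.2 s slot, if any
    $remaining = $interval - (microtime(true) - $start);
    if ($remaining > 0) {
        usleep((int) round($remaining * 1000000));
    }
}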

Related

PHP loop curl request one by one

How can I make a foreach or for loop advance only after the cURL response has been received?
For example:
for ($i = 1; $i <= 10; $i++) {
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,"http://www.example.com");
if(curl_exec($ch)){ // ?? - if request and data are completely received
// ?? - go to the next loop
}
// DONT go to the next loop until the above data is complete or returns true
}
I don't want it to move to the next iteration without having received the current cURL request's data. One by one: open the URL, wait for the response data, and only if something matched or came back true, go on to the next iteration.
You don't have to be bothered about the cURL part; I just want the loop to move one request at a time (given a specific condition or something) and not fire them all at once.
The loop should already work that way, because you're using the blocking cURL interface and not the cURL multi interface.
$ch = curl_init();
for ($i = 1; $i <= 10; $i++)
{
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page into $res instead of printing it
    $res = curl_exec($ch);
    // When execution reaches this point, the cURL call has completed. To know
    // whether it succeeded, check $res ($res === false on failure) or the cURL
    // error status. Without the examples below, this code simply hits the same
    // site ten times, one request after the other.

    // Example: skip the rest of this iteration if the call failed
    if ($res === false) {
        continue; // continue with the next iteration
    }
    // Extra code to be executed when the call was *successful* goes here.

    // A different example: give up entirely on a server-side error
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) >= 500) {
        break; // exit the loop immediately, aborting the remaining iterations
    }
    sleep(1); // wait 1 second before the next request
}
curl_close($ch);
Your code (as is) will not move to the next iteration until the curl call is completed.
A couple of issues to consider:
You could set higher timeouts for cURL so that ordinary communication delays are not treated as failures. CURLOPT_CONNECTTIMEOUT, CURLOPT_CONNECTTIMEOUT_MS (milliseconds), CURLOPT_DNS_CACHE_TIMEOUT, CURLOPT_TIMEOUT and CURLOPT_TIMEOUT_MS (milliseconds) can be used to increase the timeouts; a value of 0 makes cURL wait indefinitely for any of them.
If your cURL request fails for whatever reason, you can simply put an exit there to stop execution. That way it will not move on to the next URL.
If you want the script to continue even after a failure, log the result of the failed request and let the loop continue. Examining the log file will tell you what happened.
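To make both points concrete, here is a rough, untested sketch that raises the cURL timeouts and logs failures instead of aborting; the URL and the log-file path are placeholders:
$logFile = __DIR__ . '/curl_errors.log'; // hypothetical log location

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); // allow 30 s to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 120);       // allow 120 s for the whole transfer (0 = wait forever)

for ($i = 1; $i <= 10; $i++) {
    $res = curl_exec($ch);
    if ($res === false) {
        // Log the failure and keep going; replace this with exit; if one
        // failure should stop the whole run.
        file_put_contents($logFile, date('c') . ' ' . curl_error($ch) . "\n", FILE_APPEND);
        continue;
    }
    // process $res here
}
curl_close($ch);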
The continue control structure should be what you are looking for:
continue is used within looping structures to skip the rest of the current loop iteration and continue execution at the condition evaluation and then the beginning of the next iteration.
http://php.net/manual/en/control-structures.continue.php
for ($i = 1; $i <= 10; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
    if (curl_exec($ch)) { // the request completed and the data was received
        curl_close($ch);
        continue; // go on to the next iteration
    }
    curl_close($ch);
    // the loop only reaches this point when the request failed
}
You can break out of a loop with the break keyword:
foreach ($list as $thing) {
if ($success) {
// ...
} else {
break;
}
}
for ($i = 1; $i <= 10; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
    if (curl_exec($ch)) { // request and data completely received
        curl_close($ch);
        continue; // move on to the next iteration
    } else {
        curl_close($ch);
        break;    // stop looping on the first failure
    }
}
or
for ($i = 1; $i <= 10; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
    $failed = (curl_exec($ch) === false); // false means the request did not complete
    curl_close($ch);
    if ($failed) {
        break;
    }
}

multi-thread, multi-curl crawler in PHP

Hi everyone once again!
We need some help developing and implementing multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but rather the amount of work done on each link after the request.
Even with multi-curl, I would eventually have to find a way to run all of these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
Foreach ($links_to_be_scanned as $link)
If (There's less than 10 scanners running)
Launch_a_new_scanner($link)
Remove the link from $links_to_be_scanned array
Push the link into $links_on_queue array
Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Since, even using multi-curl, I would eventually have to wait in a regular foreach loop for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;
require dirname(__DIR__) . '/autoload.php';
$client = new Client;
// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);
$requests = array(
'so-home' => 'http://stackoverflow.com',
'so-php' => 'http://stackoverflow.com/questions/tagged/php',
'so-python' => 'http://stackoverflow.com/questions/tagged/python',
'so-http' => 'http://stackoverflow.com/questions/tagged/http',
'so-html' => 'http://stackoverflow.com/questions/tagged/html',
'so-css' => 'http://stackoverflow.com/questions/tagged/css',
'so-js' => 'http://stackoverflow.com/questions/tagged/javascript'
);
$onResponse = function($requestKey, Response $r) {
echo $requestKey, ' :: ', $r->getStatus();
};
$onError = function($requestKey, Exception $e) {
echo $requestKey, ' :: ', $e->getMessage();
};
$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client opens new connections for the first two requests and then reuses those same sockets for the remaining requests, queuing them until one of the two sockets becomes available.
You could try something like this. I haven't checked it, but you should get the idea:
$request_pool = array();
function CreateHandle($url) {
$handle = curl_init($url);
// set curl options here
return $handle;
}
function Process($data) {
global $request_pool;
// do something with the response data, e.g. extract new links, then queue
// follow-up requests ($some_new_url stands for whatever link you extracted)
array_push($request_pool , CreateHandle($some_new_url));
}
function RunMulti() {
global $request_pool;
$multi_handle = curl_multi_init();
$active_request_pool = array();
$running = 0;
$active_request_count = 0;
$active_request_max = 10; // adjust as necessary
do {
$waiting_request_count = count($request_pool);
while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
$request = array_shift($request_pool);
curl_multi_add_handle($multi_handle , $request);
$active_request_pool[(int)$request] = $request;
$waiting_request_count--;
$active_request_count++;
}
curl_multi_exec($multi_handle , $running);
curl_multi_select($multi_handle);
while($info = curl_multi_info_read($multi_handle)) {
$curl_handle = $info['handle'];
call_user_func('Process' , curl_multi_getcontent($curl_handle));
curl_multi_remove_handle($multi_handle , $curl_handle);
curl_close($curl_handle);
$active_request_count--;
}
} while($active_request_count > 0 || $waiting_request_count > 0);
curl_multi_close($multi_handle);
}
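A possible usage sketch, assuming $links_to_be_scanned is your initial array of URLs and that CreateHandle() sets CURLOPT_RETURNTRANSFER so that curl_multi_getcontent() has something to return:
$request_pool = array();
foreach ($links_to_be_scanned as $link) {
    array_push($request_pool, CreateHandle($link));
}
RunMulti(); // keeps up to 10 requests in flight until the pool drains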
You should look for a more robust solution to your problem. RabbitMQ is a very good solution that I have used. There is also Gearman, but the choice is yours; I prefer RabbitMQ.
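To give a flavour of the queue approach, here is a minimal, untested worker sketch using the php-amqplib library: the crawler publishes each discovered link to a queue, and you start several copies of this worker to scan links in parallel. The queue name, credentials, and the scan_link() helper are all assumptions.
<?php
// Worker sketch: run several copies of this script for parallelism.
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

require __DIR__ . '/vendor/autoload.php';

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('links_to_be_scanned', false, true, false, false);
$channel->basic_qos(null, 1, false); // hand each worker one unacknowledged link at a time

$callback = function (AMQPMessage $msg) use ($channel) {
    $link = $msg->body;
    // scan_link() is a hypothetical function wrapping your cURL request,
    // the DOM/XPath work, and storage of the results; it returns new links.
    foreach (scan_link($link) as $newLink) {
        // push newly discovered links back onto the same queue
        $channel->basic_publish(new AMQPMessage($newLink), '', 'links_to_be_scanned');
    }
    $msg->delivery_info['channel']->basic_ack($msg->delivery_info['delivery_tag']);
};

$channel->basic_consume('links_to_be_scanned', '', false, false, false, false, $callback);
while (count($channel->callbacks)) {
    $channel->wait();
}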
I will share the code I used to collect email addresses from a certain website; you can modify it to fit your needs.
There were some problems with relative URLs there, and I do not use cURL here.
<?php
error_reporting(E_ALL);
$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');
function scan_page($home, $full_url, &$writer) {
static $done = array();
$done[] = $full_url;
// Scan only internal links. Do not scan all the internet!))
if (strpos($full_url, $home) === false) {
return false;
}
$html = @file_get_contents($full_url);
if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
return false;
}
echo $full_url . '<br />';
preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
if (!empty($emails) && is_array($emails)) {
foreach ($emails as $email_group) {
if (is_array($email_group)) {
foreach ($email_group as $email) {
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
$writer->write($email);
}
}
}
}
}
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
if (is_array($matches)) {
foreach($matches as $match) {
if (!empty($match[2]) && is_scalar($match[2])) {
$url = $match[2];
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$url = $home . $url;
}
if (!in_array($url, $done)) {
scan_page($home, $url, $writer);
}
}
}
}
}
class RWriter {
private $_fh = null;
private $_written = array();
public function __construct($fname) {
$this->_fh = fopen($fname, 'w+');
}
public function write($line) {
if (in_array($line, $this->_written)) {
return;
}
$this->_written[] = $line;
echo $line . '<br />';
fwrite($this->_fh, "{$line}\r\n");
}
public function __destruct() {
fclose($this->_fh);
}
}
scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);

Libevent timeout loop exit

I'm having some difficulties getting the PHP libevent extension to break out of a loop on a timeout. Here's what I've got so far based on the demos on the PHP.net docs pages:
// From here: http://www.php.net/manual/en/libevent.examples.php
function print_line($fd, $events, $arg) {
static $max_requests = 0;
$max_requests++;
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
if ($max_requests == 10) {
// exit the loop after 10 events
echo " [EXIT]\n";
event_base_loopexit($arg[1]);
}
}
// create base and event
$base = event_base_new();
$event = event_new();
getTimer(); // Initialise time
$fd = STDIN;
event_set($event, $fd, EV_READ | EV_PERSIST, "print_line", array($event, $base));
event_base_set($event, $base);
event_add($event, 2000000);
event_base_loop($base);
// extract flags from bitmask
function getEventFlags ($ebm) {
$expFlags = array('EV_TIMEOUT', 'EV_SIGNAL', 'EV_READ', 'EV_WRITE', 'EV_PERSIST');
$ret = array();
foreach ($expFlags as $exf) {
if ($ebm & constant($exf)) {
$ret[] = $exf;
}
}
return $ret;
}
// Used to track time!
function getTimer () {
static $ts;
if (is_null($ts)) {
$ts = microtime(true);
return "Timer initialised";
}
$newts = microtime(true);
$r = sprintf("Delta: %.3f", $newts - $ts);
$ts = $newts;
return $r;
}
I can see that the timeout value passed to event_add() affects the events passed to print_line(): if the events are more than 2 seconds apart I get an EV_TIMEOUT instead of an EV_READ. What I want, however, is for libevent to call print_line() as soon as the timeout is reached, rather than waiting for the next event in order to report the timeout.
I've tried using event_base_loopexit($base, 2000000); this causes the event loop to exit immediately without blocking for events. I've also tried passing EV_TIMEOUT to event_set, which seems to have no effect at all.
Has anyone managed to get this working before? I know that the event_buffer_* stuff works with timeouts, but I want to use the standard event_base functions. One of the PECL bugs mentions event_timer_* functions, and those functions do exist on my system, but they're not documented at all.
The problem is the fgets() call in:
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
This blocks processing and waits for data from STDIN (but there is none on a timeout).
Change it to something like this:
$text = '';
if ($events & EV_READ) {
$text = fgets($fd);
}
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), $text);

Data Change on CURL Respond PHP

Just one question. I have some cURL-based code that sends a request to the server; if the response is 'valid' it runs a SQL query, but if the response is 'busy' I need to change the proxy the script is using.
I'm making it this way:
$proxys = file('http_proxy.txt');
...then...
for($n = 0, $count = count($proxys); $n <= $count; $n++) {
...and to change the proxy I used something like this:
$proxy = $proxys[$n + 1];
but it doesn't work.
Any suggestions?
Regards.
For starters, file('http_proxy.txt'); will retain the newlines in your file, so use the FILE_IGNORE_NEW_LINES flag to omit them. Then you can use break; to stop the loop after successfully using cURL on a proxy:
$proxys = file('http_proxy.txt', FILE_IGNORE_NEW_LINES);
foreach($proxys as $proxy)
{
$response = sendRequestTo($proxy);
if($response == 'valid')
{
performQuery($proxy);
break;
}
}
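sendRequestTo() and performQuery() are placeholders here; a rough sketch of what the cURL side might look like, with the target URL as an assumption:
function sendRequestTo($proxy)
{
    $ch = curl_init('http://www.example.com/api'); // hypothetical target URL
    curl_setopt($ch, CURLOPT_PROXY, $proxy);       // e.g. "127.0.0.1:8080" from http_proxy.txt
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $response = curl_exec($ch);
    curl_close($ch);
    // treat a transport failure like a 'busy' response so the loop tries the next proxy
    return $response === false ? 'busy' : trim($response);
}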

How can I use cURL to open multiple URLs simultaneously with PHP?

Here is my current code:
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error()); //Query the urls table
while($resultSet = mysql_fetch_array($SQL)){ //Put all the urls into one variable
// Now for some cURL to run it.
$ch = curl_init($resultSet['url']); //load the urls
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //No need to wait for it to load. Execute it and go.
curl_exec($ch); //Execute
curl_close($ch); //Close it off
} //While loop
I'm relatively new to cURL. By relatively new, I mean this is my first time using cURL. Currently it loads one URL for two seconds, then the next for two seconds, then the next. However, I want to load ALL of them at the same time. I'm sure it's possible; I'm just unsure how. If someone could point me in the right direction, I'd appreciate it.
You set up each cURL handle in the same way, then add them to a curl_multi_ handle. The functions to look at are the curl_multi_* functions documented here. In my experience, though, there were issues with trying to load too many URLs at once (though I can't find my notes on it at the moment), so the last time I used curl_multi_, I set it up to do batches of 5 URLs at a time.
edit: Here is a reduced version of the code I have using curl_multi_:
edit: Slightly rewritten and lots of added comments, which hopefully will help.
// -- create all the individual cURL handles and set their options
$curl_handles = array();
foreach ($urls as $url) {
$curl_handles[$url] = curl_init();
curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
// set other curl options here
}
// -- start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();
$i = 0; // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously
foreach ($curl_handles as $a_curl_handle) {
$i++; // increment the position-counter
// add the handle to the curl_multi_handle and to our tracking "block"
curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
$block[] = $a_curl_handle;
// -- check to see if we've got a "full block" to run or if we're at the end of out list of handles
if (($i % BLOCK_SIZE == 0) or ($i == count($curl_handles))) {
// -- run the block
$running = NULL;
do {
// track the previous loop's number of handles still running so we can tell if it changes
$running_before = $running;
// run the block or check on the running block and get the number of sites still running in $running
curl_multi_exec($curl_multi_handle, $running);
// if the number of sites still running changed, print out a message with the number of sites that are still running.
if ($running != $running_before) {
echo("Waiting for $running sites to finish...\n");
}
} while ($running > 0);
// -- once the number still running is 0, curl_multi_ is done, so check the results
foreach ($block as $handle) {
// HTTP response code
$code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// cURL error number
$curl_errno = curl_errno($handle);
// cURL error message
$curl_error = curl_error($handle);
// output if there was an error
if ($curl_error) {
echo(" *** cURL error: ($curl_errno) $curl_error\n");
}
// remove the (used) handle from the curl_multi_handle
curl_multi_remove_handle($curl_multi_handle, $handle);
}
// reset the block to empty, since we've run its curl_handles
$block = array();
}
}
// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);
Given that you don't need anything back from the URLs, you probably don't need a lot of what's there, but this is how I chunked the requests into blocks of BLOCK_SIZE, waited for each block to run before moving on, and caught errors from cURL.
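BLOCK_SIZE is a constant you define yourself; here is a small sketch of how this might be wired to the query from the question (the mysql_* calls and the url column simply mirror the original code):
define('BLOCK_SIZE', 5); // how many URLs to run simultaneously in each batch

// Build the $urls array that the snippet above expects.
$urls = array();
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error());
while ($resultSet = mysql_fetch_array($SQL)) {
    $urls[] = $resultSet['url'];
}
// ...then run the block-based curl_multi_ code above on $urls.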
