Libevent timeout loop exit - php

I'm having some difficulties getting the PHP libevent extension to break out of a loop on a timeout. Here's what I've got so far based on the demos on the PHP.net docs pages:
// From here: http://www.php.net/manual/en/libevent.examples.php
function print_line($fd, $events, $arg) {
    static $max_requests = 0;
    $max_requests++;
    printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
    if ($max_requests == 10) {
        // exit the loop after 10 reads
        echo " [EXIT]\n";
        event_base_loopexit($arg[1]);
    }
}
// create base and event
$base = event_base_new();
$event = event_new();
getTimer(); // Initialise time
$fd = STDIN;
event_set($event, $fd, EV_READ | EV_PERSIST, "print_line", array($event, $base));
event_base_set($event, $base);
event_add($event, 2000000);
event_base_loop($base);
// extract flags from bitmask
function getEventFlags ($ebm) {
    $expFlags = array('EV_TIMEOUT', 'EV_SIGNAL', 'EV_READ', 'EV_WRITE', 'EV_PERSIST');
    $ret = array();
    foreach ($expFlags as $exf) {
        if ($ebm & constant($exf)) {
            $ret[] = $exf;
        }
    }
    return $ret;
}
// Used to track time!
function getTimer () {
    static $ts;
    if (is_null($ts)) {
        $ts = microtime(true);
        return "Timer initialised";
    }
    $newts = microtime(true);
    $r = sprintf("Delta: %.3f", $newts - $ts);
    $ts = $newts;
    return $r;
}
I can see that the timeout value passed to event_add() affects the events passed to print_line(): if the events are more than 2 seconds apart I get an EV_TIMEOUT instead of an EV_READ. What I want, however, is for libevent to call print_line() as soon as the timeout is reached, rather than waiting for the next event in order to report the timeout.
I've tried using event_base_loopexit($base, 2000000), but this causes the event loop to exit immediately without blocking for events. I've also tried passing EV_TIMEOUT to event_set(), which seems to have no effect at all.
Has anyone managed to get this working before? I know that the event_buffer_* functions work with timeouts, but I want to use the standard event_base functions. One of the PECL bugs mentions the event_timer_* functions, and these do exist on my system, but they're not documented at all.

The problem is the fgets() call in:
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
This blocks and waits for data from STDIN, but on a timeout there is no data to read.
Change it to something like this:
$text = '';
if ($events & EV_READ) {
    $text = fgets($fd);
}
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), $text);

Sleep never continues execution

I have made a script that checks a server's availability.
The site was down and I was awaiting a fix (I was on call for a client and was waiting on a ticket from the provider), so to limit the number of calls I used sleep():
$client = new \GuzzleHttp\Client();
$available = false;
date_default_timezone_set('doesntMatter');
// The server was more likely to respond after 5 AM, hence the decrease in the interval
$hours = array( // minutes between calls based on the current hour
    0 => 30,
    1 => 30,
    2 => 30,
    3 => 30,
    4 => 20,
    5 => 20,
    6 => 10,
    7 => 10,
    8 => 10
);
$lastResponse = null;
while (!$available) {
    $time = time();
    $hour = date('G', $time);
    echo "\n Current hour " . $hour;
    try {
        $crawler = $client->request('GET', 'www.someSiteToCheck.com');
        $available = true; // when the server returns a status code of 200, available is set to TRUE
    } catch (\GuzzleHttp\Exception\ServerException $e) {}
    if (!$available) {
        $secondsToSleep = $hours[$hour] * 60;
        echo "\n Sleeping for " . $secondsToSleep;
        sleep($hours[$hour] * $secondsToSleep); // sleep until the next request
    } else {
        exec('start ringtone.mp3'); // blast my stereo to wake me up
    }
}
Problem:
When I started the script it went into an 1800-second sleep and froze; it never re-executed anything.
Given:
I tested my script with a sleep of 160 (for example) and it made multiple calls
Checked my power settings so that the machine wouldn't go into stand-by
Checked the error logs
(Even if obvious) I ran it in the CLI
Checked the sleep() documentation for issues, but nothing
Couldn't find anything related
I think you have an error in your logic.
For example:
When it's 5 AM,
then $secondsToSleep is 20 * 60 = 1200 sec.
When you call the sleep function, you multiply it by 20 again:
sleep($hours[$hour] * $secondsToSleep); => sleep(20 * 1200); => 24000 sec => roughly 6.67 hours
If you simply update your sleep parameter the result should be as expected.
if (!$available) {
    $secondsToSleep = $hours[$hour] * 60;
    echo "\n Sleeping for " . $secondsToSleep;
    sleep($secondsToSleep); // sleep until the next request
}
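When debugging long sleeps it can also help to print the expected wake-up time before calling sleep(), so a frozen script can be told apart from one that is simply still sleeping. A small sketch reusing the variables above (the date format is only an example):
$secondsToSleep = $hours[$hour] * 60;
echo "\n Sleeping for " . $secondsToSleep . ", expecting to wake at " . date('H:i:s', time() + $secondsToSleep);
sleep($secondsToSleep);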

Retrieving results from background Gearman job/task

Subject is quite self-explanatory, but I definitely need a fresh pair of eyes on this.
I am using the mmoreram/GearmanBundle Symfony2 bundle to send jobs for execution. So far, I have managed to send a job, execute it and return results. That part works as expected.
However, I am trying to do the same with background jobs/tasks. I know that, in this scenario, the client does not wait for the job to complete, but I was hoping that the job handle could help me with that (e.g. by retrieving the job status).
$gearman = $this->get('gearman');
$jobId = $gearman->doHighBackgroundJob("CsvWorker~parseCsv", json_encode(["foo", "bar", "123"]));
sleep(3);
// At this point, job has completed for sure (it's very simple)
var_dump($jobId);
var_dump($gearman->getJobStatus($jobId));
This outputs the following:
string 'H:localhost.localdomain:10' (length=26)
object(Mmoreram\GearmanBundle\Module\JobStatus)[410]
private 'known' => boolean false
private 'running' => boolean false
private 'completed' => int 0
private 'completionTotal' => int 0
The known => false, in particular, really puzzles me. During the job execution, I made sure to invoke the sendStatus() and sendComplete() methods correctly.
So, I guess, a general question would be: once the job has completed, is it still known to Gearman?
UPDATE:
I managed to make some code changes to the bundle which allow me to listen for data being returned by the job. That way, I may be able to persist it in a database; however, my client (the job creator) is still pretty much left in the dark about whether the job has actually finished.
Here is one option I found for solving this problem.
It is convenient when you need to run a task in the background and the result only needs to be available for a while afterwards.
Worker
$gmworker = new GearmanWorker();
$gmworker->addServer();
$gmworker->addFunction("long_running_task", "long_running_task_fn");
print "Waiting for job...\n";
while ($gmworker->work()) {
    if ($gmworker->returnCode() != GEARMAN_SUCCESS) {
        echo "return_code: " . $gmworker->returnCode() . "\n";
        break;
    }
}

function long_running_task_fn($job) {
    $mc = memcache_connect('localhost', 11211);
    $result = 1;
    $n = $job->workload();
    for ($i = 1; $i <= $n; $i++) {
        $result *= $i;
        $job->sendStatus($i, $n);
        sleep(1);
    }
    memcache_set($mc, $job->handle(), $result);
}
Client
<?php
if ($_POST['start']) {
    $gmc = new GearmanClient();
    $gmc->addServer();
    $handle = $gmc->doBackground('long_running_task', '10');
    header('Location: /client.php?handle=' . urlencode($handle));
}
if ($_GET['handle']) {
    $handle = $_GET['handle'];
    $gmc = new GearmanClient();
    $gmc->addServer();
    $status = $gmc->jobStatus($handle);
}
function get_result($handle) {
    $mc = memcache_connect('localhost', 11211);
    $reply = memcache_get($mc, $handle);
    memcache_close($mc);
    return $reply;
}
?>
As described in the PHP manual, a job stays known to the server only while it has not yet completed; once it finishes, its status is no longer reported as known.
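Building on that, here is a rough sketch of how the client could poll jobStatus() and then read the memcache-stored result via the get_result() helper above once the handle is no longer known (untested; the 1-second poll interval is an assumption):
$gmc = new GearmanClient();
$gmc->addServer();
do {
    // jobStatus() returns array(known, running, numerator, denominator)
    list($known, $running, $num, $den) = $gmc->jobStatus($handle);
    if ($known) {
        sleep(1); // still queued or running, check again shortly
    }
} while ($known);
echo "Result: " . get_result($handle) . "\n";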

PHP waiting for correct response with cURL

I don't know how to handle this.
There is an XML API server and I'm fetching its contents with cURL; that works fine. Now I have to poll the creditCardPreprocessors state. It has an 'in progress' state too, and PHP should wait until the process is finished. I've already tried sleep() and other approaches, but I can't get it to work. This is a simplified example variation of what I tried:
function process_state($xml) {
    if ($result = request($xml)) {
        // It'll return NULL on bad state for example
        return $result;
    }
    sleep(3);
    process_state($xml);
}
I know this can be an infinite loop, but I've tried adding a counter to exit once it reaches five; it won't exit, the server hangs, I get 500 errors for minutes, and Apache becomes unreachable for that vhost.
EDIT:
Another example
$i = 0;
$card_state = false;
// We're going to assume that request() returns NULL while the card state is still processing and TRUE once it's done
while (!$card_state && $i < 10) {
    $i++;
    if ($result = request('XML STUFF')) {
        $card_state = $result;
        break;
    }
    sleep(2);
}
The recursive method you've defined could cause problems depending on the response timing you get back from the server. I think you'd want to use a while loop here. It keeps the requests serialized.
$returnable_responses = array('code1', 'code2', 'code3'); // the array of responses that you want the loop to stop after receiving
$max_number_of_calls = 5; // or some number
$iterator = 0;
$result = NULL;
while (!in_array($result, $returnable_responses) && ($iterator < $max_number_of_calls)) {
    $result = request($xml);
    $iterator++;
}
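If the endpoint stays in the 'in progress' state for a while, it is probably also worth adding a short delay between attempts so you are not hammering the API. A variation of the loop above (the 2-second delay is just an assumption; tune it to the server):
$returnable_responses = array('code1', 'code2', 'code3');
$max_number_of_calls = 5;
$iterator = 0;
$result = NULL;
while (!in_array($result, $returnable_responses) && ($iterator < $max_number_of_calls)) {
    $result = request($xml);
    $iterator++;
    if (!in_array($result, $returnable_responses)) {
        sleep(2); // wait before polling again
    }
}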

multi-thread, multi-curl crawler in PHP

Hi everyone once again!
We need some help developing and implementing multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that it's not the multi-curl itself that's the tricky part, but the amount of work done with each link after the request.
Even with multi-curl, I would eventually have to find a way to run all of these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (there are links to be scanned)
    Foreach ($links_to_be_scanned as $link)
        If (there are fewer than 10 scanners running)
            Launch_a_new_scanner($link)
            Remove the link from the $links_to_be_scanned array
            Push the link into the $links_on_queue array
        Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Since even using multi-curl, I would eventually have to wait, looping on a regular foreach structure, for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links to an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'   => 'http://stackoverflow.com',
    'so-php'    => 'http://stackoverflow.com/questions/tagged/php',
    'so-python' => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'   => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'   => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'    => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'     => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example, the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the remaining requests, queuing them until one of the two sockets becomes available.
You could try something like this; I haven't checked it, but you should get the idea.
$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);
    // set curl options here
    return $handle;
}

function Process($data) {
    global $request_pool;
    // do something with data
    array_push($request_pool, CreateHandle($some_new_url));
}

function RunMulti() {
    global $request_pool;
    $multi_handle = curl_multi_init();
    $active_request_pool = array();

    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary
    do {
        $waiting_request_count = count($request_pool);
        while (($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle, $request);
            $active_request_pool[(int)$request] = $request;

            $waiting_request_count--;
            $active_request_count++;
        }

        curl_multi_exec($multi_handle, $running);
        curl_multi_select($multi_handle);
        while ($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process', curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle, $curl_handle);
            curl_close($curl_handle);
            $active_request_count--;
        }
    } while ($active_request_count > 0 || $waiting_request_count > 0);

    curl_multi_close($multi_handle);
}
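Usage would then look roughly like this (the start URL is only a placeholder): seed the pool with your initial links and let RunMulti() drain it, while Process() pushes newly discovered links back into the pool.
array_push($request_pool, CreateHandle('http://example.com/start-page'));
RunMulti();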
You should look for a more robust solution to your problem. RabbitMQ is a very good solution that I have used. There is also Gearman, but the choice is yours; I prefer RabbitMQ.
I will share with you the code I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
Note that I do not use cURL here.
<?php
error_reporting(E_ALL);

$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');

function scan_page($home, $full_url, &$writer) {
    static $done = array();
    $done[] = $full_url;

    // Scan only internal links. Do not scan all the internet!))
    if (strpos($full_url, $home) === false) {
        return false;
    }
    $html = @file_get_contents($full_url);
    if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
        return false;
    }

    echo $full_url . '<br />';
    preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);

    if (!empty($emails) && is_array($emails)) {
        foreach ($emails as $email_group) {
            if (is_array($email_group)) {
                foreach ($email_group as $email) {
                    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                        $writer->write($email);
                    }
                }
            }
        }
    }

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);

    if (is_array($matches)) {
        foreach ($matches as $match) {
            if (!empty($match[2]) && is_scalar($match[2])) {
                $url = $match[2];
                if (!filter_var($url, FILTER_VALIDATE_URL)) {
                    $url = $home . $url;
                }
                if (!in_array($url, $done)) {
                    scan_page($home, $url, $writer);
                }
            }
        }
    }
}

class RWriter {
    private $_fh = null;
    private $_written = array();

    public function __construct($fname) {
        $this->_fh = fopen($fname, 'w+');
    }

    public function write($line) {
        if (in_array($line, $this->_written)) {
            return;
        }
        $this->_written[] = $line;
        echo $line . '<br />';
        fwrite($this->_fh, "{$line}\r\n");
    }

    public function __destruct() {
        fclose($this->_fh);
    }
}

scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);

Gearman with multiple servers and php workers

I'm having a problem with Gearman workers running on multiple servers which I can't seem to solve.
The problem occurs when a worker server is taken offline (rather than the worker process simply being cancelled), which causes all the other worker processes to error and fail.
Example with just 1 client and 2 workers:
Client:
$client = new GearmanClient ();
$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');
$job = $client->do ('generate_tile', serialize ($arrData));
Worker:
$worker = new GearmanWorker ();
$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');
$worker->addFunction ('generate_tile', 'generate_tile');

while (1)
{
    if (!$worker->work ())
    {
        switch ($worker->returnCode ())
        {
            default:
                echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
                break;
        }
    }
}
function generate_tile ($job) { ... }
The worker code is being run on 2 separate servers. When every server is up and running both workers execute jobs as expected. When one of the worker processes is cancelled, the other worker executes all jobs as expected.
However, when the server with the cancelled worker process is shutdown and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.
I get the following set of errors from the remaining worker process:
Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....
When I start up the other server, without starting the worker process on it, the remaining worker process immediately springs into life and executes any remaining jobs.
It seems clear to me that I need some code in the worker process to cope with any servers that may be offline; however, I cannot see how to do this.
Many thanks,
Andy
Our tests with multiple Gearman servers show that if the last server in the list (192.168.1.201 in your case) is taken down, the workers stop executing in the way you describe. (Also, the workers grab jobs from the last server first: they only process jobs on .200 when there are no jobs on .201.)
This appears to be a bug with the linked list in the Gearman server, which has been reported as fixed multiple times, yet the bug persists in every available version of Gearman. Sorry, I know that's not a solution, but we had the same problem and didn't find one. (If someone can provide a working solution for this problem, I agree to give a large bounty.)
Further to #Darhazer's comment above: we found that as well and solved it like this:
// Gearman workers show a strong preference for servers at the end of a list so randomize the order
$worker = new GearmanWorker();
$s2 = explode(",", Configure::read('workers.servers'));
shuffle($s2);
$servers = implode(",", $s2);
$worker->addServers($servers);
We run 6 to 10 workers at any time, and expire them after they've completed x requests.
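For reference, "expiring" a worker after x requests can be as simple as counting completed jobs in the work loop and exiting, leaving a process supervisor to respawn the worker. A minimal sketch (the limit of 100 and the supervisor are assumptions, not part of our actual setup):
$jobsHandled = 0;
$maxJobs = 100; // hypothetical limit before the worker exits and is respawned
while ($jobsHandled < $maxJobs && $worker->work()) {
    if ($worker->returnCode() === GEARMAN_SUCCESS) {
        $jobsHandled++;
    }
}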
I use this class, which keeps track of which servers work for which jobs. It hasn't been thoroughly tested; I just wrote it now. I've pasted an edited version, so there might be a typo or some such, but otherwise it appears to solve the issue.
<?php
class MyGearmanClient {
    static $server = "server1,server2,server3";
    static $server_array = false;
    static $workingServers = false;
    static $gmclient = false;
    static $timeout = 5000;
    static $defaultTimeout = 5000;

    static function randomServer() {
        return self::$server_array[rand(0, count(self::$server_array) - 1)];
    }

    static function getServer($job = false) {
        if (self::$server_array == false) {
            self::$server_array = explode(",", self::$server);
            self::$workingServers = array();
        }
        $serverList = array();
        if ($job) {
            if (array_key_exists($job, self::$workingServers)) {
                foreach (self::$server_array as $server) {
                    if (array_key_exists($server, self::$workingServers[$job])) {
                        if (self::$workingServers[$job][$server]) {
                            $serverList[] = $server;
                        }
                    } else {
                        $serverList[] = $server;
                    }
                }
                if (count($serverList) == 0) {
                    # All servers have failed, need to insert all the servers again and retry.
                    $serverList = self::$workingServers[$job] = self::$server_array;
                }
                return $serverList[rand(0, count($serverList) - 1)];
            } else {
                return self::randomServer();
            }
        } else {
            return self::randomServer();
        }
    }

    static function serverWorked($server, $job) {
        self::$workingServers[$job][$server] = $server;
    }

    static function serverFailed($server, $job) {
        self::$workingServers[$job][$server] = false;
    }

    static function Connect($server = false, $job = false) {
        if ($server) {
            self::$server = self::getServer();
        }
        self::$gmclient = new GearmanClient();
        self::$gmclient->setTimeout(self::$timeout);
        # add the default job server
        self::$gmclient->addServer($server = self::getServer($job));
        return $server;
    }

    static function Destroy() {
        self::$gmclient = false;
    }

    static function Client($name, $vars, $timeout = false) {
        if (is_int($timeout)) {
            self::$timeout = $timeout;
        } else {
            self::$timeout = self::$defaultTimeout;
        }
        do {
            $server = self::Connect(false, $name);
            $value = self::$gmclient->do($name, $vars);
            $return_code = self::$gmclient->returnCode();

            if (!$value) {
                $error_message = self::$gmclient->error();
                if ($return_code == 47) {
                    self::serverFailed($server, $name);
                    if (count(self::$server_array) > 1) {
                        // ADDED SINGLE SERVER LOOP AVOIDANCE
                        // echo "Timeout on server $server, trying another server...\n";
                        continue;
                    } else {
                        return false;
                    }
                }
                echo "ERR: $error_message ($return_code)\n";
            }
            # printf("Worker has returned\n");
            $short_value = substr($value, 0, 80);
            switch ($return_code)
            {
                case GEARMAN_WORK_DATA:
                    echo "DATA: $short_value\n";
                    break;
                case GEARMAN_SUCCESS:
                    self::serverWorked($server, $name);
                    break;
                case GEARMAN_WORK_STATUS:
                    list($numerator, $denominator) = self::$gmclient->doStatus();
                    echo "Status: $numerator/$denominator\n";
                    break;
                case GEARMAN_TIMEOUT:
                    // self::Connect();
                    // Fall through
                default:
                    echo "ERR: $error_message " . self::$gmclient->error() . " ($return_code)\n";
                    break;
            }
        } while ($return_code != GEARMAN_SUCCESS);

        $rv = unserialize($value);
        return $rv["rv"];
    }
}

# Example usage:
# $rv = MyGearmanClient::Client("Function", $args);
?>
Since 'addServer' from the Gearman client is not working properly, this code picks a job server at random and, if that fails, tries the next one; this way you can balance the load.
// job servers
$jobservers = array('192.168.1.1', '192.168.1.2');

// prepare gearman client
$gmclient = new GearmanClient();

// shuffle job servers (deliver jobs equally by server)
shuffle($jobservers);

// add job servers
foreach ($jobservers as $jobserver) {
    // add random jobserver
    $gmclient->addServer($jobserver);
    // check the server state; if ok, end the foreach
    if (@$gmclient->ping('ping')) break;
    // if the connection fails, reset the client
    $gmclient = new GearmanClient();
}
Solution tested and working ok.
$client = new GearmanClient();
if (!$client->addServer("11.11.65.73", 4730)) {
    $client->addServer("11.11.65.79", 4730);
}
