Gearman with multiple servers and PHP workers - php

I'm having a problem with Gearman workers running on multiple servers which I can't seem to solve.
The problem occurs when a worker server is taken offline (rather than just the worker process being cancelled), and it causes all other worker processes to error and fail.
Example with just one client and two workers:
Client:
$client = new GearmanClient ();
$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');
$job = $client->do ('generate_tile', serialize ($arrData));
Worker:
$worker = new GearmanWorker ();
$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');
$worker->addFunction ('generate_tile', 'generate_tile');
while (1)
{
    if (!$worker->work ())
    {
        switch ($worker->returnCode ())
        {
            default:
                echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
                break;
        }
    }
}
function generate_tile ($job) { ... }
The worker code is being run on 2 separate servers. When every server is up and running both workers execute jobs as expected. When one of the worker processes is cancelled, the other worker executes all jobs as expected.
However, when the server with the cancelled worker process is shutdown and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.
I get the following set of errors from the remaining worker process:
Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....
When I start up the other server, without starting the worker process on it, the remaining worker process immediately jumps into life and executes any remaining jobs.
It seems clear to me that I need some code in the worker process to cope with any servers that may be offline; however, I cannot see how to do this.
Many thanks,
Andy

Our tests with multiple Gearman servers show that if the last server in the list (192.168.1.201 in your case) is taken down, the workers stop executing in the way you are describing. (Also, the workers grab jobs from the last server first: they process jobs on .200 only if there are no jobs on .201.)
It seems that this is a bug with the linked list in the Gearman server, which has reportedly been fixed multiple times, yet the bug persists in all available versions of Gearman. Sorry, I know that's not a solution, but we had the same problem and didn't find one either. (If someone can provide a working solution for this problem, I'm willing to offer a large bounty.)

Further to @Darhazer's comment above: we found that as well and solved it like this:
// Gearman workers show a strong preference for servers at the end of a list so randomize the order
$worker = new GearmanWorker();
$s2 = explode(",", Configure::read('workers.servers'));
shuffle($s2);
$servers = implode(",", $s2);
$worker->addServers($servers);
We run 6 to 10 workers at any time, and expire them after they've completed x requests.
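For completeness, a rough sketch of that recycling pattern (the limit of 500 jobs and the idea of letting a process supervisor such as supervisord restart the worker are my assumptions, not part of the setup above):
$worker = new GearmanWorker();
$worker->addServers($servers);                 // $servers shuffled as shown above
$worker->addFunction('generate_tile', 'generate_tile');

$jobsDone = 0;
$maxJobs  = 500;                               // assumed recycle threshold

while ($jobsDone < $maxJobs) {
    if (!$worker->work()) {
        echo "Error: " . $worker->returnCode() . ': ' . $worker->error() . "\n";
        continue;
    }
    $jobsDone++;
}
// Exit here; the supervisor starts a fresh worker process.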

I use this class, which keeps track of which jobs work on which servers. It hasn't been thoroughly tested; I just wrote it now. I've pasted an edited version, so there might be a typo or some such, but otherwise it appears to solve the issue.
<?php
class MyGearmanClient {
    static $server = "server1,server2,server3";
    static $server_array = false;
    static $workingServers = false;
    static $gmclient = false;
    static $timeout = 5000;
    static $defaultTimeout = 5000;

    static function randomServer() {
        return self::$server_array[rand(0, count(self::$server_array) - 1)];
    }

    static function getServer($job = false) {
        if (self::$server_array == false) {
            self::$server_array = explode(",", self::$server);
            self::$workingServers = array();
        }
        $serverList = array();
        if ($job) {
            if (array_key_exists($job, self::$workingServers)) {
                foreach (self::$server_array as $server) {
                    if (array_key_exists($server, self::$workingServers[$job])) {
                        if (self::$workingServers[$job][$server]) {
                            $serverList[] = $server;
                        }
                    } else {
                        $serverList[] = $server;
                    }
                }
                if (count($serverList) == 0) {
                    # All servers have failed, need to insert all the servers again and retry.
                    $serverList = self::$workingServers[$job] = self::$server_array;
                }
                return $serverList[rand(0, count($serverList) - 1)];
            } else {
                return self::randomServer();
            }
        } else {
            return self::randomServer();
        }
    }

    static function serverWorked($server, $job) {
        self::$workingServers[$job][$server] = $server;
    }

    static function serverFailed($server, $job) {
        self::$workingServers[$job][$server] = false;
    }

    static function Connect($server = false, $job = false) {
        if ($server) {
            self::$server = self::getServer();
        }
        self::$gmclient = new GearmanClient();
        self::$gmclient->setTimeout(self::$timeout);
        # add the default job server
        self::$gmclient->addServer($server = self::getServer($job));
        return $server;
    }

    static function Destroy() {
        self::$gmclient = false;
    }

    static function Client($name, $vars, $timeout = false) {
        if (is_int($timeout)) {
            self::$timeout = $timeout;
        } else {
            self::$timeout = self::$defaultTimeout;
        }
        do {
            $server = self::Connect(false, $name);
            $value = self::$gmclient->do($name, $vars);
            $return_code = self::$gmclient->returnCode();
            if (!$value) {
                $error_message = self::$gmclient->error();
                if ($return_code == 47) {
                    self::serverFailed($server, $name);
                    if (count(self::$server_array) > 1) {
                        // ADDED SINGLE SERVER LOOP AVOIDANCE
                        // echo "Timeout on server $server, trying another server...\n";
                        continue;
                    } else {
                        return false;
                    }
                }
                echo "ERR: $error_message ($return_code)\n";
            }
            # printf("Worker has returned\n");
            $short_value = substr($value, 0, 80);
            switch ($return_code) {
                case GEARMAN_WORK_DATA:
                    echo "DATA: $short_value\n";
                    break;
                case GEARMAN_SUCCESS:
                    self::serverWorked($server, $name);
                    break;
                case GEARMAN_WORK_STATUS:
                    list($numerator, $denominator) = self::$gmclient->doStatus();
                    echo "Status: $numerator/$denominator\n";
                    break;
                case GEARMAN_TIMEOUT:
                    // self::Connect();
                    // Fall through
                default:
                    echo "ERR: $error_message " . self::$gmclient->error() . " ($return_code)\n";
                    break;
            }
        } while ($return_code != GEARMAN_SUCCESS);
        $rv = unserialize($value);
        return $rv["rv"];
    }
}
# Example usage:
# $rv = MyGearmanClient::Client("Function", $args);
?>

Since addServer() on the Gearman client is not working properly, this code chooses a job server at random and, if it fails, tries the next one. This way you can also balance the load.
// job servers
$jobservers = array('192.168.1.1', '192.168.1.2');

// prepare gearman client
$gmclient = new GearmanClient();

// shuffle job servers (deliver jobs equally by server)
shuffle($jobservers);

// add job servers
foreach ($jobservers as $jobserver) {
    // add random jobserver
    $gmclient->addServer($jobserver);
    // check server state; if OK, end foreach
    if (@$gmclient->ping('ping')) break;
    // if the connection fails, reset the client
    $gmclient = new GearmanClient();
}

Solution tested and working ok.
$client = new GearmanClient();
if (!$client->addServer("11.11.65.73", 4730))
    $client->addServer("11.11.65.79", 4730);

Related

How to use multi-threading with this

I've been reading up on multi-threading with PHP, but I'm having a tough time integrating it into my command-line PHP script.
I read multithreading
and multithread foreach.
But I'm really not sure. Any thoughts on how to apply multi-threading here? The reason I need multi-threading is that Telnet takes forever (see shell script). But I can't write to my DB concurrently ($stmt2). I'm looping through my list of devices with $stmt->fetch.
Maybe I should run a task specifically, with just the telnet/shell-script call in the task, like this example:
$task = new class extends Thread {
    private $response;

    public function run()
    {
        $content = file_get_contents("http://google.com");
        preg_match("~<title>(.+)</title>~", $content, $matches);
        $this->response = $matches[1];
    }
};

$task->start() && $task->join();
var_dump($task->response); // string(6) "Google"
But I'm getting this error when I try to add it to my code below:
PHP Parse error: syntax error, unexpected T_CLASS in /opt/IBM/custom/NAC_Dslam/calix_swVerThreaded.php on line 100
this is the line:
$task = new class ...
My script looks like this:
$stmt = $mysqli->prepare("SELECT ip, model FROM TableD WHERE vendor = 'Calix' AND model in ('C7','E7') AND sw_ver IS NULL LIMIT 6000"); //AND ping_reply IS NULL AND software_version IS NULL
$stmt->bind_result($ip, $model); //list of ip's
if (!$stmt->execute())
{
    //err
}

$stmt2 = $mysqli2->prepare("UPDATE TableD SET sw_ver = ?
                            WHERE vendor = 'Calix'
                            AND ip = ? ");
$stmt2->bind_param("ss", $software, $ip);

while ($stmt->fetch()) {
    //initializing var's
    if (pingAddress($ip) == "alive") { //Ones that don't ping are dead to us.
        ///////this is the part that takes forever and should be multi-threaded/////
        //Call shell script to telnet to calix dslam and get version for that ip
        if ($model == "C7") {
            $task = new class extends Thread {
                private $itsOutput;
                public function run()
                {
                    exec("./calix_C7_swVer.sh $ip", $itsOutput); //takes forever/telnet
                                                                 //in shell script. Can't
                                                                 //be fixed. Each time I
                                                                 //call this script it's a
                                                                 //different ip
                }
            };
            $task->start() && $task->join();
            var_dump($task->itsOutput); //should be returned output above //takes forever to telnet
            //$output = $task->itsOutput;
            $output2 = array_reverse($output, true);
            if (!(preg_grep("/DENY/", $output2))) {
                $found = preg_grep("/COMPLD/", $output2);
                $ind = key($found);
                $version = explode(",", $output[$ind + 1]);
                if (strlen($version[3]) >= 1) { //if sw ver came back in an acceptable size
                    $software = $version[3];
                    $software = trim($software, '"'); //trim double quote (usually is there)
                    print "sw ver after trim: " . $software . "\n";
                    if (!$stmt2->execute()) { //write sw version to netcool db
                        $tempErr = "Failed to insert into dslam_elements_nac: " . $stmt2->error;
                        printf($tempErr . "\n"); //show mysql execute error if exists
                        $err->logThis($tempErr);
                    }
                    if (!$stmtX->execute()) { //commit it
                        $tempErr = "Failed to commit dslam_elements_nac: " . $stmtX->error;
                        printf($tempErr . "\n"); //show mysql execute error if exists
                        $err->logThis($tempErr);
                    }
                } //we got a version back
                else { //version not retrieved
                    //error processing
                } //didn't get sw ver
            } //not deny
        } //c7
        else if ($model == "E7") {
            exec("./calix_E7_swVer.sh $ip", $output);
            $output2 = array_reverse($output, true);
            if (!(preg_grep("/DENY/", $output2))) {
                $found = preg_grep("/yes/", $output2);
                $ind = key($found);
                $version = explode(" ", $output[$ind]);
                if (strlen($version[5]) >= 1) { //if sw ver came back in an acceptable size
                    $software = $version[5];
                    print "sw ver after trim: " . $software . "\n";
                    if (!$stmt2->execute()) { //write sw version to netcool db
                        $tempErr = "Failed to insert into dslam_elements_nac: " . $stmt2->error;
                        printf($tempErr . "\n"); //show mysql execute error if exists
                        $err->logThis($tempErr);
                    }
                    if (!$stmtX->execute()) { //commit it
                        //err processing
                    }
                } //we got a version back
                else { //version not retrieved
                    //handle it
                } //didn't get sw ver
            } //not deny
        }
    } // end "alive" check
} //while
Update:
I'm trying this (pcntl_fork), but it doesn't seem to be quite what I need because when I sleep(30), which I think is similar to my shell script call, other processes don't continue and do the next one.
<?php
declare(ticks = 1);

$max = 10;
$child = 0;

$res = array("aabc", "bcd", "cde", "eft", "ggg", "hhh", "iii", "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "aabc", "bcd", "cde", "eft", "ggg", "hhh", "iii", "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq");

function sig_handler($signo) {
    global $child;
    switch ($signo) {
        case SIGCHLD:
            //echo "SIGCHLD received\n";
            // clean up zombies
            $pid = pcntl_waitpid(-1, $status, WNOHANG);
            $child -= 1;
            //exit;
    }
}

pcntl_signal(SIGCHLD, "sig_handler");
//$website_scraper = new scraper();

foreach ($res as $r) {
    while ($child >= $max) {
        sleep(5); //echo " - sleep $child \n";
        //pcntl_waitpid(0,$status);
    }
    $child++;
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Could not fork:\n");
    }
    elseif ($pid) {
        // we're in the parent fork, don't do anything
    }
    else {
        //example of what a child process could do:
        print "child process stuff \n";
        sleep(30);
        //$website_scraper -> scraper("http://foo.com");
        exit;
    }
    while (pcntl_waitpid(0, $status) != -1) { //////???
        $status = pcntl_wexitstatus($status);
        echo "child $status completed \n";
    }
    print "did stuff \n";
}
?>
I've been reading up on multi-threading with PHP
Don't. PHP threading has very limited utility, as it cannot be used in a web server environment. It can only be used in command-line scripts.
The author of the PHP pthreads extension has written:
pthreads v3 is restricted to operating in CLI only: I have spent many years trying to explain that threads in a web server just don't make sense, after 1,111 commits to pthreads I have realised that, my advice is going unheeded.
So I'm promoting the advice to hard and fast fact: you can't use pthreads safely and sensibly anywhere but CLI.
If you need to communicate with multiple network devices in parallel, consider using stream_select to perform asynchronous I/O, or running multiple PHP processes as part of a worker queue to manage the connections.
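To make the second suggestion concrete, here is a minimal sketch of a fixed-size pool of forked worker processes (the pool size of 5 and the helper names getIpList()/processIp() are assumptions, not from the question):
<?php
$ips = getIpList();                  // e.g. the rows fetched from TableD
$maxChildren = 5;                    // assumed pool size
$children = array();

foreach ($ips as $ip) {
    // Throttle: wait for a free slot when the pool is full
    while (count($children) >= $maxChildren) {
        $pid = pcntl_wait($status);  // blocks until any child exits
        unset($children[$pid]);
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork\n");
    } elseif ($pid === 0) {
        processIp($ip);              // child: the slow telnet/shell work for one device
        exit(0);
    }
    $children[$pid] = $ip;           // parent: remember the child and move on
}

// Reap whatever is still running
while (count($children) > 0) {
    $pid = pcntl_wait($status);
    unset($children[$pid]);
}
In this sketch the children only do the slow telnet part; results would be written back by the parent (for example via a pipe or a temporary file), so the single DB connection is never shared across processes.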

PHP/Ajax Anonymous function, iteration causes error? + how to show standard PHP errors when using Ajax

I've developed a web scraper on one server, which works and does what I want it to do. Now I have to implement it in another environment and I've stumbled on an issue I did not have when developing, which I am having a hard time identifying.
The only real error I have to go on is (from JS console):
POST http://my.cool.page/pro/company/scrape 502 (Bad Gateway)
The development server (where it works) is using PHP 5.4.16, implementation server is on PHP 5.4.45. I am using the same versions of external code on both servers.
The circumstances for launching the scraper are a bit different in the new environment: it's now being loaded through Ajax rather than as its own page.
The ajax call:
$("#showScraperButton").click(function(){
$.post('/pro/company/scrape',
{
'url': url
},
function(result){
//code...
}
);
});
Function + case for scraping anchor tags, using Fabpot/Goutte:
function _getTagContent($crawler = '', $toScrape = '', $contentPatterns = '')
{
    $tagContent = array();
    ChromePhp::log("Hello _getTagContent");
    foreach ($toScrape as $tag) {
        $i = 0;
        switch ($tag) {
            case 'a':
                $n = $i;
                $crawler->filter($tag)->each(
                    function ($node) use (&$tagContent, &$n, &$tag, &$crawler)
                    {
                        $nodeText = trim($node->text());
                        $tagContent[$tag][$n]['value'] = $nodeText;
                        $linksCrawler = $crawler->selectLink($nodeText);
                        try {
                            $link = $linksCrawler->link();
                            $magicDidHappen = true;
                        }
                        catch (Exception $e) {
                            $magicDidHappen = false;
                        }
                        if ($magicDidHappen) {
                            $uri = $link->getUri();
                        }
                        else {
                            $uri = $node->attr('href');
                        }
                        $tagContent[$tag][$n]['uri'] = $uri;
                        $n++;
                    });
                break;
            default:
                break;
        }
    }
    return $tagContent;
}
This results in the error described above.
By commenting out each line in the case, I found that the error does not show until
$n++;
is called. If
$n++;
is NOT included, the final a element is indeed present in $tagContent.
This led me to believe that the attempt at iteration is the problem in this case, and that the code otherwise does not throw errors. I then tried with a different html tag, using similar syntax:
case 'h3':
    $n = $i;
    $crawler->filter($tag)->each(
        function ($node) use (&$tagContent, &$n, &$tag)
        {
            $tagContent[$tag][$n] = trim($node->text());
            $n++;
        });
    break;
However, this works as intended, giving me all 40 instances of h3 on the page I'm scraping.
From this I have some questions. Could it be related to PHP versions? Is there a way to print the "standard" PHP errors when doing Ajax calls (instead of, or in addition to, HTTP response codes)? I'm sure there is a hint to be found there as to what is failing. Thanks much for any help!
It now works using
case 'a':
    $crawler->filter($tag)->each(
        function ($node, $n) use (&$tagContent, &$tag, &$crawler)
        {
            $nodeText = trim($node->text());
            $tagContent[$tag][$n]['value'] = $nodeText;
            $linksCrawler = $crawler->selectLink($nodeText);
            try {
                $link = $linksCrawler->link();
                $magicDidHappen = true;
            }
            catch (Exception $e) {
                $magicDidHappen = false;
            }
            if ($magicDidHappen) {
                $uri = $link->getUri();
            }
            else {
                $uri = $node->attr('href');
            }
            $tagContent[$tag][$n]['uri'] = $uri;
            $n++;
        });
    break;
Moved $n out of the use () clause and into the function parameters. I believe ChromePhp might have been causing some issues here. Still don't really know what went wrong, but now it works...
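For what it's worth, in the versions of Symfony's DomCrawler I have seen, each() passes the node's position as the second argument to the closure, which would explain why accepting $n as a parameter works here without the use (&$n) reference.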

Retrieving results from background Gearman job/task

Subject is quite self-explanatory, but I definitely need a fresh pair of eyes on this.
I am using the mmoreram/GearmanBundle Symfony2 bundle to send jobs to execute. So far, I have managed to send a job, execute it and return results. That part works as expected.
However, I am trying to do the same with background jobs/tasks. I know that, in this scenario, the client does not wait for the job to complete, but I was hoping that the job handle could help me with that (e.g. retrieving the job status).
$gearman = $this->get('gearman');
$jobId = $gearman->doHighBackgroundJob("CsvWorker~parseCsv", json_encode(["foo", "bar", "123"]));
sleep(3);
// At this point, job has completed for sure (it's very simple)
var_dump($jobId);
var_dump($gearman->getJobStatus($jobId));
This outputs the following:
string 'H:localhost.localdomain:10' (length=26)
object(Mmoreram\GearmanBundle\Module\JobStatus)[410]
private 'known' => boolean false
private 'running' => boolean false
private 'completed' => int 0
private 'completionTotal' => int 0
The known => false, in particular, really puzzles me. During the job execution, I made sure to correctly invoke the sendStatus and sendComplete methods.
So, I guess, a general question would be: once the job has completed, is it still known to Gearman?
UPDATE:
I managed to add some code changes to the bundle which allowed me to listen for data being returned by the job. That way, I may be able to persist it in the database; however, my client (the job creator) is still pretty much left in the dark as to whether the job has actually finished.
I found here an option for solving the problem.
It is convenient when you need to complete a task and the answer is only needed for a while.
Worker
$gmworker = new GearmanWorker();
$gmworker->addServer();
$gmworker->addFunction("long_running_task", "long_running_task_fn");

print "Waiting for job...\n";
while ($gmworker->work()) {
    if ($gmworker->returnCode() != GEARMAN_SUCCESS) {
        echo "return_code: " . $gmworker->returnCode() . "\n";
        break;
    }
}

function long_running_task_fn($job) {
    $mc = memcache_connect('localhost', 11211);
    $result = 1;
    $n = $job->workload();
    for ($i = 1; $i <= $n; $i++) {
        $result *= $i;
        $job->sendStatus($i, $n);
        sleep(1);
    }
    memcache_set($mc, $job->handle(), $result);
}
Client
<?php
if ($_POST['start']) {
    $gmc = new GearmanClient();
    $gmc->addServer();
    $handle = $gmc->doBackground('long_running_task', '10');
    header('Location: /client.php?handle=' . urlencode($handle));
}

if ($_GET['handle']) {
    $handle = $_GET['handle'];
    $gmc = new GearmanClient();
    $gmc->addServer();
    $status = $gmc->jobStatus($handle);
}

function get_result($handle) {
    $mc = memcache_connect('localhost', 11211);
    $reply = memcache_get($mc, $handle);
    memcache_close($mc);
    return $reply;
}
?>
As described in the PHP manual, as long as the job is still known to the server, it has not yet completed; once it finishes, the server stops tracking it.
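A minimal usage sketch on the client side once a handle is available (the meaning of the array indices follows GearmanClient::jobStatus(); everything else is an assumption):
$status = $gmc->jobStatus($handle);
// jobStatus() returns array(known, running, numerator, denominator)
if (!$status[0] && !$status[1]) {
    // The server no longer knows the handle, i.e. the job has finished;
    // fetch whatever the worker stored under that handle in memcache.
    $result = get_result($handle);
}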

multi-thread, multi-curl crawler in PHP

Hi everyone once again!
We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem, to be honest; if anyone could provide a snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but rather the amount of operations done with each link after the request.
Even after the multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
    Foreach ($Link_to_scann as $link)
        If (There's less than 10 scanners running)
            Launch_a_new_scanner($link)
            Remove the link from $links_to_be_scanned array
            Push the link into $links_on_queue array
        Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Since even using multi-curl, I would eventually have to wait, looping on a regular foreach structure, for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'   => 'http://stackoverflow.com',
    'so-php'    => 'http://stackoverflow.com/questions/tagged/php',
    'so-python' => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'   => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'   => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'    => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'     => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.
You could try something like this. I haven't checked it, but you should get the idea:
$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);
    // set curl options here
    return $handle;
}

function Process($data) {
    global $request_pool;
    // do something with data
    array_push($request_pool, CreateHandle($some_new_url));
}

function RunMulti() {
    global $request_pool;

    $multi_handle = curl_multi_init();
    $active_request_pool = array();

    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary

    do {
        $waiting_request_count = count($request_pool);
        while (($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle, $request);
            $active_request_pool[(int)$request] = $request;

            $waiting_request_count--;
            $active_request_count++;
        }

        curl_multi_exec($multi_handle, $running);
        curl_multi_select($multi_handle);

        while ($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process', curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle, $curl_handle);
            curl_close($curl_handle);
            $active_request_count--;
        }
    } while ($active_request_count > 0 || $waiting_request_count > 0);

    curl_multi_close($multi_handle);
}
You should look for a more robust solution to your problem. RabbitMQ
is a very good solution that I have used. There is also Gearman, but the choice is yours;
I prefer RabbitMQ.
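As a rough illustration of the RabbitMQ route, a producer with php-amqplib might look like this (the queue name, credentials and autoload path are assumptions on my part):
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

require __DIR__ . '/vendor/autoload.php';

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('links_to_scan', false, true, false, false);

// Each discovered link becomes a message; independent consumer processes
// (the "scanners") pick them up and work in parallel.
foreach ($links_to_be_scanned as $link) {
    $msg = new AMQPMessage($link, array('delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT));
    $channel->basic_publish($msg, '', 'links_to_scan');
}

$channel->close();
$connection->close();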
I will share with you the code I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there, and I do not use cURL here.
<?php
error_reporting(E_ALL);

$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');

set_time_limit(0);
ini_set('memory_limit', '512M');

function scan_page($home, $full_url, &$writer) {
    static $done = array();
    $done[] = $full_url;

    // Scan only internal links. Do not scan all the internet!))
    if (strpos($full_url, $home) === false) {
        return false;
    }
    $html = @file_get_contents($full_url);
    if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
        return false;
    }
    echo $full_url . '<br />';

    preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
    if (!empty($emails) && is_array($emails)) {
        foreach ($emails as $email_group) {
            if (is_array($email_group)) {
                foreach ($email_group as $email) {
                    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                        $writer->write($email);
                    }
                }
            }
        }
    }

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
    if (is_array($matches)) {
        foreach ($matches as $match) {
            if (!empty($match[2]) && is_scalar($match[2])) {
                $url = $match[2];
                if (!filter_var($url, FILTER_VALIDATE_URL)) {
                    $url = $home . $url;
                }
                if (!in_array($url, $done)) {
                    scan_page($home, $url, $writer);
                }
            }
        }
    }
}

class RWriter {
    private $_fh = null;
    private $_written = array();

    public function __construct($fname) {
        $this->_fh = fopen($fname, 'w+');
    }

    public function write($line) {
        if (in_array($line, $this->_written)) {
            return;
        }
        $this->_written[] = $line;
        echo $line . '<br />';
        fwrite($this->_fh, "{$line}\r\n");
    }

    public function __destruct() {
        fclose($this->_fh);
    }
}

scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);

Libevent timeout loop exit

I'm having some difficulties getting the PHP libevent extension to break out of a loop on a timeout. Here's what I've got so far based on the demos on the PHP.net docs pages:
// From here: http://www.php.net/manual/en/libevent.examples.php
function print_line($fd, $events, $arg) {
    static $max_requests = 0;
    $max_requests++;
    printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
    if ($max_requests == 10) {
        // exit loop after 10 writes
        echo " [EXIT]\n";
        event_base_loopexit($arg[1]);
    }
}

// create base and event
$base = event_base_new();
$event = event_new();

getTimer(); // Initialise time
$fd = STDIN;

event_set($event, $fd, EV_READ | EV_PERSIST, "print_line", array($event, $base));
event_base_set($event, $base);
event_add($event, 2000000);
event_base_loop($base);

// extract flags from bitmask
function getEventFlags($ebm) {
    $expFlags = array('EV_TIMEOUT', 'EV_SIGNAL', 'EV_READ', 'EV_WRITE', 'EV_PERSIST');
    $ret = array();
    foreach ($expFlags as $exf) {
        if ($ebm & constant($exf)) {
            $ret[] = $exf;
        }
    }
    return $ret;
}

// Used to track time!
function getTimer() {
    static $ts;
    if (is_null($ts)) {
        $ts = microtime(true);
        return "Timer initialised";
    }
    $newts = microtime(true);
    $r = sprintf("Delta: %.3f", $newts - $ts);
    $ts = $newts;
    return $r;
}
I can see that the timeout value passed to event_add effects the events passed to print_line(), if these events are any more than 2 seconds apart I get an EV_TIMEOUT instead of an EV_READ. What I want however is for libevent to call print_line as soon as the timeout is reached rather than waiting for the next event in order to give me the timeout.
I've tried using event_base_loopexit($base, 2000000), this causes the event loop to exit immediately without blocking for events. I've also tried passing EV_TIMEOUT to event_set, this seems to have no effect at all.
Has anyone managed to get this working before? I know that the event_buffer_* stuff works with timeouts, however I want to use the standard event_base functions. One of the PECL bugs talks about event_timer_* functions and these functions do exist on my system, however they're not documented at all.
The problem is the fgets() in:
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
This blocks processing and waits for data from STDIN (but there is none on a timeout).
Change it to something like this:
$text = '';
if ($events & EV_READ) {
    $text = fgets($fd);
}
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), $text);
