Console script perform data import from external API. For boosting import loading performed in concurrent processes that is created by pcntl_fork command.
For communication with API cUrl is used. Communication performed by https protocol.
By some undefined reason periodically some children becomes zombie. There is no errors/warnings/notices in console and also there is not logs is written. Errors level is configured appropriately.
After investigation I suppose that there problem in curl extension since without it, with fake connection, there is no problems.
Also if run import in single process mode - there is no problems at all.
PHP: 7.2.4,
OS: Debian 9,
Curl: 7.59.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Maybe someone encountered similar problem or know possible reasons of this strange behavior?
Pseudo code sample of child logic (main part of child showed):
while (true) {
$socket->writeRawString(Signal::MESSAGE_REQUEST_DATA);
$response = $socket->readRawString();
if (Signal::MESSAGE_TERMINATE_PROCESS === $response) {
break;
}
$response = json_decode($response, true);
if (empty($response) || empty($response['deltaId'])) {
continue;
}
$delta = $this->providerConnection->getChanges($response['deltaId']);
if(empty($delta)) {
continue;
}
$xmlReader = new \XMLReader();
$xmlReader->XML($delta);
$xmlReader->read();
$xmlReader->read();
$hasNext = true;
while ($hasNext && 'updated' !== $xmlReader->name) {
$hasNext = $xmlReader->next();
}
if ('updated' !== $xmlReader->name) {
throw new \RuntimeException('Deltas file do not contain updated date.');
}
if (strtotime($xmlReader->readString()) < $endDateTimestamp) {
$socket->writeRawString(self::SIGNAL_END_DATE_REACHED);
continue;
}
}
posix_kill(\posix_getpid(), SIGTERM);
In providerConnection->getChanges($response['deltaId']); request performed via cUrl. For work with cUrl used Php cUrl class extension
As mentioned in my comments, your problem probably is, that childprocesses that died/finished need to be collected by the parent process, or they remain as zombies.
First solution:
Install a signal handler in the parent. Something like this:
pcntl_signal(SIGCHLD, [$this, 'handleSignals']);
With a signal handler that could look like this:
/**
* #param integer $signal
*/
public function handleSignals($signal) {
switch($signal) {
case SIGCHLD:
do {
$pid = pcntl_wait($status, WNOHANG);
} while($pid > 0);
break;
default:
//Nothing to do
}
}
I normally store the pids of forked children and check them all individually with pcntl_waitpid, but this could get you going.
Second Solution:
Use a double-fork to spawn the child-processes, if the parent does not need to wait for all sub-tasks to finish. A double fork looks like this:
$pid = pcntl_fork();
if ($pid == -1) handleError();
elseif ($pid == 0) { // child
$pid = pcntl_fork();
if ($pid == -1) handleChildError();
elseif($pid == 0) { // second level child
exit(startWork()); // init will take over this process
}
// exit first level child
exit(0);
} else {
// parent, wait for first level child
pcntl_wait($pid, $status); // forked child returns almost immediatly, so blocking wait is in order
}
I give up using cUrl for my task. Today I switched to Guzzle with StreamHandler instead of cUrl and it solved all my problems.
I suppose, that due to some internal errors in cUrl, system was killing my child processes.
This is not answer to my question. It just workaround of my problem for those who may also encounter similar problem.
Topic is still open for possible suggestions/explanations.
Related
Problem
The title really says it all. I had a Timeout class that handled timing out using pcntl_alarm(), and during ssh2_* calls, the signal handler simply does not fire when it should.
In its most basic form,
print date('H:i:s'); // A - start waiting, ideally 5 seconds at the most
pcntl_alarm(5);
// SSH attempt to download remote file /dev/random to ensure it blocks *hard*
// (omitted)
print date('H:i:s'); // B - get here once it realizes it's a tripeless-cat scenario
with a signal handler that also outputs the date, I would expect this (rows B and C might be inverted, that does not matter)
"A, at 00:00:00"
"C, at 00:00:05" <-- from the signal handler
"B, at 00:00:05"
but the above will obtain this disappointing result instead:
"A, at 00:00:00"
"C, at 00:01:29" <-- from the signal handler
"B, at 00:01:29"
In other words, the signal handler does fire, but it does so once another, longer, timeout has expired. I can only guess this timeout is inside ssh2_*. Quickly browsing through the source code yielded nothing obvious. What's worse, this ~90 seconds timeout is what I can reproduce by halting a download. In other cases when the firewall dropped the wrong SSH2 packet, I got a stuck process with an effectively infinite timeout.
As the title might reveal, I have already checked other questions and this is definitely not a mis-quoting of SIGALRM (also, the Timeout class worked beautifully). It seems like I need a louder alarm when libssh2 is involved.
Workaround
This modification of the Timeout yields a slightly less precise, but always working system, yet it does so in an awkward, clunky way. I'd really prefer for pcntl_alarm to work in all cases as it's supposed to.
public static function install() {
self::$enabled = function_exists('pcntl_async_signals') && function_exists('posix_kill') && function_exists('pcntl_wait');
if (self::$enabled) {
// Just to be on the safe side
pcntl_async_signals(true);
pcntl_signal(SIGALRM, static function(int $signal, $siginfo) {
// Child just died.
$status = null;
pcntl_wait($status);
static::$expired = true;
});
}
return self::$enabled;
}
public static function wait(int $time) {
if (self::$enabled) {
static::$expired = false;
// arrange for a SIGALRM to be delivered in $time seconds
// pcntl_alarm($time);
$ppid = posix_getpid();
$pid = pcntl_fork();
if ($pid == -1) {
throw new RuntimeException('Could not fork alarming child');
}
if ($pid) {
// save the child's PID
self::$thread = $pid;
return;
}
// we are the child. Send SIGALRM to the parent after requested timeout
sleep($time);
posix_kill($ppid, SIGALRM);
die();
}
}
/**
* Cancel the timeout and verify whether it expired.
**/
public static function expired(): bool {
if (self::$enabled) {
// Have we spawned an alarm?
if (self::$thread) {
// Yes, so kill it.
posix_kill(self::$thread, SIGTERM);
self::$thread = 0;
}
// Maybe.
$status = null;
pcntl_wait($status);
// pcntl_alarm(0);
}
return static::$expired;
}
Question
Can pcntl_alarm() be made to work as expected?
Is there a way to call multiple file_get_contents without waiting first one to finish. I have a few PHP scripts that do some heavy work and execution time of one is more than 1.5 min so I want to call them all at the same time to reduce execution time. Currently I have multiple file_get_contents but to run next file_get_contents the first one needs to finish first. I tried exec method but it's blocked on server (shared hosting).
Well there are ways of running tasks in parallel if thats what you're trying to achieve but bare in mind PHP is a high level scripting language and the original purpose of it was to serve stateless HTTP request.
There are extensions out there like Gearman, this extension allows applications to complete tasks in parallel.
Obviously this will be in vane for you as you're probably using a service for just hosting. Get something like a vps, a vps from OVH is cheaper than most web hosting services.
Yes, it is possible! But it is little bit tricky!
You have to play with forks.
I agree about queues and else... but just for the sake of ability, here is an example:
<?php
declare(ticks = 1);
$a = [
'https://twitter.com',
'https://facebook.com',
'https://stackoverflow.com',
'https://linkedin.com',
'https://github.com',
];
// Count of php threads (forks or processes).
$max = 3;
// Child counter.
$child = 0;
pcntl_signal(SIGCHLD, function ($signo) {
global $child;
if ($signo === SIGCLD) {
while (($pid = pcntl_wait($signo, WNOHANG)) > 0) {
$signal = pcntl_wexitstatus($signo);
$child--;
}
}
});
foreach ($a as $item) {
while ($child >= $max) {
sleep(1);
}
$child++;
$pid = pcntl_fork();
if ($pid) {
} else {
// Child fork.
sleep(1);
// HERE YOUR CODE:
echo file_get_contents($item);
exit(0);
}
}
while ($child != 0) {
sleep(1);
}
How should I multithread some php-cli code that needs a timeout?
I'm using PHP 5.6 on Centos 6.6 from the command line.
I'm not very familiar with multithreading terminology or code. I'll simplify the code here but it is 100% representative of what I want to do.
The non-threaded code currently looks something like this:
$datasets = MyLibrary::getAllRawDataFromDBasArrays();
foreach ($datasets as $dataset) {
MyLibrary::processRawDataAndStoreResultInDB($dataset);
}
exit; // just for clarity
I need to prefetch all my datasets, and each processRawDataAndStoreResultInDB() cannot fetch it's own dataset. Sometimes processRawDataAndStoreResultInDB() takes too long to process a dataset, so I want to limit the amount of time it has to process it.
So you can see that making it multithreaded would
Speed it up by allowing multiple processRawDataAndStoreResultInDB() to execute at the same time
Use set_time_limit() to limit the amount of time each one has to process each dataset
Notice that I don't need to come back to my main program. Since this is a simplification, you can trust that I don't want to collect all the processed datasets and do a single save into the DB after they are all done.
I'd like to do something like:
class MyWorkerThread extends SomeThreadType {
public function __construct($timeout, $dataset) {
$this->timeout = $timeout;
$this->dataset = $dataset;
}
public function run() {
set_time_limit($this->timeout);
MyLibrary::processRawDataAndStoreResultInDB($this->dataset);
}
}
$numberOfThreads = 4;
$pool = somePoolClass($numberOfThreads);
$pool->start();
$datasets = MyLibrary::getAllRawDataFromDBasArrays();
$timeoutForEachThread = 5; // seconds
foreach ($datasets as $dataset) {
$thread = new MyWorkerThread($timeoutForEachThread, $dataset);
$thread->addCallbackOnTerminated(function() {
if ($this->isTimeout()) {
MyLibrary::saveBadDatasetToDb($dataset);
}
}
$pool->addToQueue($thread);
}
$pool->waitUntilAllWorkersAreFinished();
exit; // for clarity
From my research online I've found the PHP extension pthreads which I can use with my thread-safe php CLI, or I could use the PCNTL extension or a wrapper library around it (say, Arara/Process)
https://github.com/krakjoe/pthreads (and the example directory)
https://github.com/Arara/Process (pcntl wrapper)
When I look at them and their examples though (especially the pthreads pool example) I get confused quickly by the terminology and which classes I should use to achieve the kind of multithreading I'm looking for.
I even wouldn't mind creating the pool class myself, if I had a isRunning(), isTerminated(), getTerminationStatus() and execute() function on a thread class, as it would be a simple queue.
Can someone with more experience please direct me to which library, classes and functions I should be using to map to my example above? Am I taking the wrong approach completely?
Thanks in advance.
Here comes an example using worker processes. I'm using the pcntl extension.
/**
* Spawns a worker process and returns it pid or -1
* if something goes wrong.
*
* #param callback function, closure or method to call
* #return integer
*/
function worker($callback) {
$pid = pcntl_fork();
if($pid === 0) {
// Child process
exit($callback());
} else {
// Main process or an error
return $pid;
}
}
$datasets = array(
array('test', '123'),
array('foo', 'bar')
);
$maxWorkers = 1;
$numWorkers = 0;
foreach($datasets as $dataset) {
$pid = worker(function () use ($dataset) {
// Do DB stuff here
var_dump($dataset);
return 0;
});
if($pid !== -1) {
$numWorkers++;
} else {
// Handle fork errors here
echo 'Failed to spawn worker';
}
// If $maxWorkers is reached we need to wait
// for at least one child to return
if($numWorkers === $maxWorkers) {
// $status is passed by reference
$pid = pcntl_wait($status);
echo "child process $pid returned $status\n";
$numWorkers--;
}
}
// (Non blocking) wait for the remaining childs
while(true) {
// $status is passed by reference
$pid = pcntl_wait($status, WNOHANG);
if(is_null($pid) || $pid === -1) {
break;
}
if($pid === 0) {
// Be patient ...
usleep(50000);
continue;
}
echo "child process $pid returned $status\n";
}
I've been using System_Daemon class to create a daemon to send sms.
The script worked perfect with php 5.3.8, but now, with php 5.4.9 it crashes but no error or notice messages is created.
In the function _fork, of System_Daemon class, always return a value that tells it's parent
static protected function _fork()
{
self::debug('forking {appName} daemon');
$pid = pcntl_fork();
if ($pid === -1) {
// Error
return self::warning('Process could not be forked');
} else if ($pid) {
// Parent
self::debug('Ending {appName} parent process');
// Die without attracting attention
exit();
} else {
// Child
self::$_processIsChild = true;
self::$_isDying = false;
self::$_processId = posix_getpid();
return true;
}
}
So, in _summon() function, where it ask about _fork() returned value, always is equal to false.
I've red this post, from another member who has a similar issue:
PHP Pear system_daemon doesn't fork
I've made his suggestions but with no success.
Can somebody, please, give me a hand with this?
I'm so sorry about my english, i made an effort to explain myself.
I have a function that needs to go over around 20K rows from an array, and apply an external script to each. This is a slow process, as PHP is waiting for the script to be executed before continuing with the next row.
In order to make this process faster I was thinking on running the function in different parts, at the same time. So, for example, rows 0 to 2000 as one function, 2001 to 4000 on another one, and so on. How can I do this in a neat way? I could make different cron jobs, one for each function with different params: myFunction(0, 2000), then another cron job with myFunction(2001, 4000), etc. but that doesn't seem too clean. What's a good way of doing this?
If you'd like to execute parallel tasks in PHP, I would consider using Gearman. Another approach would be to use pcntl_fork(), but I'd prefer actual workers when it's task based.
The only waiting time you suffer is between getting the data and processing the data. Processing the data is actually completely blocking anyway (you just simply have to wait for it). You will not likely gain any benefits past increasing the number of processes to the number of cores that you have. Basically I think this means the number of processes is small so scheduling the execution of 2-8 processes doesn't sound that hideous. If you are worried about not being able to process data while retrieving data, you could in theory get your data from the database in small blocks, and then distribute the processing load between a few processes, one for each core.
I think I align more with the forking child processes approach for actually running the processing threads. There is a brilliant demonstration in the comments on the pcntl_fork doc page showing an implementation of a job daemon class
http://php.net/manual/en/function.pcntl-fork.php
<?php
declare(ticks=1);
//A very basic job daemon that you can extend to your needs.
class JobDaemon{
public $maxProcesses = 25;
protected $jobsStarted = 0;
protected $currentJobs = array();
protected $signalQueue=array();
protected $parentPID;
public function __construct(){
echo "constructed \n";
$this->parentPID = getmypid();
pcntl_signal(SIGCHLD, array($this, "childSignalHandler"));
}
/**
* Run the Daemon
*/
public function run(){
echo "Running \n";
for($i=0; $i<10000; $i++){
$jobID = rand(0,10000000000000);
while(count($this->currentJobs) >= $this->maxProcesses){
echo "Maximum children allowed, waiting...\n";
sleep(1);
}
$launched = $this->launchJob($jobID);
}
//Wait for child processes to finish before exiting here
while(count($this->currentJobs)){
echo "Waiting for current jobs to finish... \n";
sleep(1);
}
}
/**
* Launch a job from the job queue
*/
protected function launchJob($jobID){
$pid = pcntl_fork();
if($pid == -1){
//Problem launching the job
error_log('Could not launch new job, exiting');
return false;
}
else if ($pid){
// Parent process
// Sometimes you can receive a signal to the childSignalHandler function before this code executes if
// the child script executes quickly enough!
//
$this->currentJobs[$pid] = $jobID;
// In the event that a signal for this pid was caught before we get here, it will be in our signalQueue array
// So let's go ahead and process it now as if we'd just received the signal
if(isset($this->signalQueue[$pid])){
echo "found $pid in the signal queue, processing it now \n";
$this->childSignalHandler(SIGCHLD, $pid, $this->signalQueue[$pid]);
unset($this->signalQueue[$pid]);
}
}
else{
//Forked child, do your deeds....
$exitStatus = 0; //Error code if you need to or whatever
echo "Doing something fun in pid ".getmypid()."\n";
exit($exitStatus);
}
return true;
}
public function childSignalHandler($signo, $pid=null, $status=null){
//If no pid is provided, that means we're getting the signal from the system. Let's figure out
//which child process ended
if(!$pid){
$pid = pcntl_waitpid(-1, $status, WNOHANG);
}
//Make sure we get all of the exited children
while($pid > 0){
if($pid && isset($this->currentJobs[$pid])){
$exitCode = pcntl_wexitstatus($status);
if($exitCode != 0){
echo "$pid exited with status ".$exitCode."\n";
}
unset($this->currentJobs[$pid]);
}
else if($pid){
//Oh no, our job has finished before this parent process could even note that it had been launched!
//Let's make note of it and handle it when the parent process is ready for it
echo "..... Adding $pid to the signal queue ..... \n";
$this->signalQueue[$pid] = $status;
}
$pid = pcntl_waitpid(-1, $status, WNOHANG);
}
return true;
}
}
you can use "PTHREADS"
very easy to install and works great on windows
download from here -> http://windows.php.net/downloads/pecl/releases/pthreads/2.0.4/
Extract the zip file and then
move the file 'php_pthreads.dll' to php\ext\ directory.
move the file 'pthreadVC2.dll' to php\ directory.
then add this line in your 'php.ini' file:
extension=php_pthreads.dll
save the file.
you just done :-)
now lets see example of how to use it:
class ChildThread extends Thread {
public $data;
public function run() {
/* Do some expensive work */
$this->data = 'result of expensive work';
}
}
$thread = new ChildThread();
if ($thread->start()) {
/*
* Do some expensive work, while already doing other
* work in the child thread.
*/
// wait until thread is finished
$thread->join();
// we can now even access $thread->data
}
for more information about PTHREADS read php docs here:
PHP DOCS PTHREADS
if you'r using WAMP like me, then you should add 'pthreadVC2.dll' into
\wamp\bin\apache\ApacheX.X.X\bin
and also edit the 'php.ini' file (same path) and add the same line as before
extension=php_pthreads.dll
GOOD LUCK!
What you are looking for is parallel which is a succinct concurrency API for PHP 7.2+
$runtime = new \parallel\Runtime();
$future = $runtime->run(function() {
for ($i = 0; $i < 500; $i++) {
echo "*";
}
return "easy";
});
for ($i = 0; $i < 500; $i++) {
echo ".";
}
printf("\nUsing \\parallel\\Runtime is %s\n", $future->value());
Output:
.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*..*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*
Using \parallel\Runtime is easy
Have a look at pcntl_fork. This allows you to spawn child processes which can then do the separate work that you need.
Not sure if a solution for your situation but you can redirect the output of system calls to a file, thus PHP will not wait until the program is finished. Although this may result in overloading your server.
http://www.php.net/manual/en/function.exec.php - If a program is started with this function, in order for it to continue running in the background, the output of the program must be redirected to a file or another output stream. Failing to do so will cause PHP to hang until the execution of the program ends.
There's Guzzle with its concurrent requests
use GuzzleHttp\Client;
use GuzzleHttp\Promise;
$client = new Client(['base_uri' => 'http://httpbin.org/']);
$promises = [
'image' => $client->getAsync('/image'),
'png' => $client->getAsync('/image/png'),
'jpeg' => $client->getAsync('/image/jpeg'),
'webp' => $client->getAsync('/image/webp')
];
$responses = Promise\Utils::unwrap($promises);
There's the overhead of promises; but more importantly Guzzle only works with HTTP requests and it works with version 7+ and frameworks like Laravel.