PHP forking and multiple child signals - php

I'm trying to write a script which creates a number of forked child processes using the pcntl_* functions.
Basically, there is a single script which runs in a loop for about a minute, periodically polling a database to see if there is a task to be run. If there is one, it should fork and run the task in a separate process so that the parent isn't held up by a long-running task.
Since there possibly could be a large number of tasks ready to be run, I want to limit the number of child processes that are created. Therefore, I am keeping track of the number of processes by incrementing a variable each time one is created (and then pausing if there's too many), and then decrementing it in a signal handler. Kind of like this:
define(ticks = 1);
$openProcesses = 0; // how many we have open
$max = 3; // the most we want open at a time
pcntl_signal(SIGCHLD, "childFinished");
while (!time_is_up()) {
if (there_is_something_to_do()) {
$pid = pcntl_fork();
if (!$pid) { // I am the child
foo(); // run the long-running task
exit(0); // and exit
} else { // I am the parent
++$openProcesses;
if ($openProcesses >= $max) {
pcntl_wait($status); // wait for any child to exit
} // before continuing
}
} else {
sleep(3);
}
}
function childFinished($signo) {
global $openProcesses;
--$openProcesses;
}
This works pretty much ok most of the time, except for when two or more processes finish simultaneously - the signal handler function is only called once, which throws out my counter. The reason for this is explained by "Anonymous" in the notes of the PHP manual:
Multiple children return less than the number of children exiting at a given moment SIGCHLD signals is normal behavior for Unix (POSIX) systems. SIGCHLD might be read as "one or more children changed status -- go examine your children and harvest their status values".
My question is this: How do I examine the children and harvest their status? Is there any reliable way to check how many child processes are open at any given time?
Using PHP 5.2.9

One way is to keep an array of the PIDs of the child processes, and in the signal handler check each PID to see if it's still running. The (untested) code would look like:
declare(ticks = 1);
$openProcesses = 0;
$procs = array();
$max = 3;
pcntl_signal(SIGCHLD, "childFinished");
while (!time_is_up()) {
if (there_is_something_to_do()) {
$pid = pcntl_fork();
if (!$pid) {
foo();
exit(0);
} else {
$procs[] = $pid; // add the PID to the list
++$openProcesses;
if ($openProcesses >= $max) {
pcntl_wait($status);
}
}
} else {
sleep(3);
}
}
function childFinished($signo) {
global $openProcesses, $procs;
// Check each process to see if it's still running
// If not, remove it and decrement the count
foreach ($procs as $key => $pid) if (posix_getpgid($pid) === false) {
unset($procs[$key]);
$openProcesses--;
}
}

You could have children send a SIGUSR1 to the parent when they start,then a SIGUSR2 before they exit. The other thing you are dealing with when using primitive signals is the kernel merging them, which it does not do with RT signals. In theory, ANY non-rt signal could be merged.
You might implement some kind of simple locking using sqlite, where only one child at a time can have the talking stick. Just make sure that children handle normally fatal signals so that they remain alive to free the lock.

I know this is about 8 years too late (and I hope you found an answer), but just in case it helps someone else I am going to answer.
The use of the pcntl_w* functions will be your friend here and you will probably want to implement a process reaper. The documentation is not very helpful and still does not contain any useful examples.
This would be a multi-part process:
1 - use pcntl_signal send trapped signals to your signal handler
2 - Do your looping/polling and within that loop;
3 - Iterate through the array of your children (which you will create below) and reap them as necessary
4 - fork(): This will consist of the following:
pcntl_async_signals(true);
$children = array();
while ($looping === true)
{
reapChildren();
if (($pid = pcntl_fork()) exit (1); // error
elseif ($pid) // parent
{
$children[] = $pid;
// close files/sockets/etc
posix_setpgid ($pid,posix_getpgrp());
}
else
{ // child
posix_setpgid(posix_getpid(),posix_getppid());
// ... jump to child function/object/code/etc ...
exit (0); // or whatever code you want to return
}
} // end of loop
In the reaper, you will need the following:
function reapChildren()
{
global $children;
foreach ($children as $idx => $pid)
{
$rUsage = array();
$status = 0; // integer which will be used as the $status pointer
$ret = pcntl_waitpid($pid, $status, WNOHANG|WUNTRACED, $rUsage);
if (pcntl_wifexited($status)) // the child exited normally
{
$exitCode = pcntl_wexitstatus($status); // returns the child exit status
}
if (pcntl_wifsignaled($status)) // the child received a signal
{
$signal = pcntl_wtermsig($status); // returns the signal that abended the child
}
if (pcntl_wifstopped($status))
{
$signal = pcntl_wstopsig($status); // returns the signal that stopped the child
}
}
}
The above reaper code will allow you to poll the status of your children and if you are using php7+, the $signalInfo array which is filled in at your signal handler will contain a lot of useful information you can use.. var_dump it.. check it out. Also, using pcntl_async_signals(true) in php7+ replaces the need for declare(ticks=1) and manually calling pcntl_signal_dispatch();
I hope this helps.

Related

usleep() in while loop causing issues - whole process freezes

I am using php, Laravel, Redis, and SQL on an Ubuntu localhost server. I have made a bunch of methods that return results from API searches after some processing. I am calling 5 of these methods which will be very slow if done synchronously, so I've been experimenting with async approaches (which I know php isn't optimised for). After a few approaches I have found some success with pcntl_fork(), but I'm running into some nasty problems.
Edit: After some messing around I have found that if I remove the while loop then the code afterward executes properly, I have removed the while loop and placed it in the second 'search' method. However it still causes a freeze of the system. This makes no sense as there shouldn't be an infinite loop as if I manually query the Redis db, all 5 results are there.
This is my code: (I have a few custom classes for making and processing the API calls, fyi these methods work flawlessly)
//this caches the individual api results to a Redis list
public static function cacheAsyncApiSearch(string $searchQuery, int $maxResults = 20)
{
$key = "search:".$searchQuery; //for Redis
if(!Redis::client()->exists($key)) {
for ($i = 0; $i < 5; $i++) {
// Create a child process
$pid = pcntl_fork();
if ($pid == -1) {
// Fork failed
exit(1);
} elseif ($pid) {
// This is the parent process
// I have tried many versions of pcntl_wait, none work! They all still don't allow code to be ran afterwards (even within this elseif block), and the best it does is cache the 1st api case (YouTube)
// while (!pcntl_wait($status, WNOHANG)) {
// $exitStatus = pcntl_wexitstatus($status);
// // Do something with the exit status of the child process
// }
// dd($pid);
// pcntl_waitpid($pid, $status, WUNTRACED);
} else {
//child processes
switch ($i) {
case 0:
$results = YouTube::search($searchQuery, $maxResults)['results'];
Redis::client()->rPush($key,SearchResultDTO::jsonEncodeArray($results));
SearchResultDTO::convertResultDTOToModels($results);
break;
case 1:
$results = Dailymotion::search($searchQuery, $maxResults)['results'];
Redis::client()->rPush($key,SearchResultDTO::jsonEncodeArray($results));
SearchResultDTO::convertResultDTOToModels($results);
break;
case 2:
$results = Vimeo::search($searchQuery, $maxResults)['results'];
Redis::client()->rPush($key,SearchResultDTO::jsonEncodeArray($results));
SearchResultDTO::convertResultDTOToModels($results);
break;
case 3:
$results = Twitch::search($searchQuery, 2)['results'];
Redis::client()->rPush($key,SearchResultDTO::jsonEncodeArray($results));
SearchResultDTO::convertResultDTOToModels($results);
break;
case 4:
$results = Podcasts::getPodcastsFromItunesResults(Podcasts::search($searchQuery, 2)["response"]->results);
Redis::client()->rPush($key,SearchResultDTO::jsonEncodeArray($results));
SearchResultDTO::convertResultDTOToModels($results);
break;
}
$i = 10000;
exit(0);
}
}
// for noting the process id of the given process that gets to this point
Redis::client()->lPush("search_pid:".$searchQuery, $pid);
// sets a time out for the redis cache
Redis::client()->expire($key, 60*60*4);
while (is_numeric( Redis::client()->lLen($key)) && Redis::client()->lLen($key) < 5) {
usleep(500000); // 0.5 seconds
// pcntl_waitpid(-1, $status); //does this even do anything? not for me
}
return false; // not already cached
}
return true; // already cached
}
This code somewhat works, It performs the api calls and caches the Redis perfectly. However when the method is ran, no code will be ran after it (unless redis has found a cached version and the process is not forked).
This made me think that all processes are being exited (possibly true? if so i dont know why), so I tried writing a version without the exit(0) line. This works, I can then perform code after the method call, however I noticed (when getting SQL race conditions) that all 6 (5 child, 1 parent) processes continued to run their own version of the code after this method (e.g. some database writes)
public static function search(string $searchQuery, int $maxResults = 20): array
{
$key = "search:".$searchQuery;
$results = [];
// the quoted method above
self::cacheAsyncApiSearch($searchQuery, $maxResults);
foreach (Redis::client()->lRange($key,0,-1) as $result){
$results = array_merge($results, SearchResultDTO::jsonDecodeArray($result));
}
$creatorDTOs = [];
$videoDTOs = [];
$streamDTOs = [];
$playlistDTOs = [];
$podcastDTOs = [];
/** #var SearchResultDTO $result */
foreach ($results as $result) {
match ($result->kind) {
Kind::Creator => $creatorDTOs[] = $result,
Kind::Video => $videoDTOs[] = $result,
Kind::Stream => $streamDTOs[] = $result,
Kind::Playlist => $playlistDTOs[] = $result,
Kind::Podcast => $podcastDTOs[] = $result,
};
}
// did this to test how many times the code was being ran (the list has 6 1's in it)
Redis::client()->lPush("here", '1');
// I know this code isn't completely efficient since I already called these conversion methods before, however I am just trying to get the forking stuff to work right now.
return [
"creators" => SearchResultDTO::convertResultDTOToModels($creatorDTOs),
"videos" => SearchResultDTO::convertResultDTOToModels($videoDTOs),
"streams" => SearchResultDTO::convertResultDTOToModels($streamDTOs),
"playlists" => SearchResultDTO::convertResultDTOToModels($playlistDTOs),
"podcasts" => SearchResultDTO::convertResultDTOToModels($podcastDTOs)
];
}
These DTO's (Data Transfer Objects) are being used to populate a UI. So for example, when I make a search (that isn't cached), the page is blank forever. But if I refresh the page (after the search is cached) then the results show just fine.
This is the most bizarre problem I have ever ran into and I really appreciate any help.
Edit please read:
After some messing around I have found that if I remove the while loop then the code afterward executes properly, I have removed the while loop and placed it in the second 'search' method. However it still causes a freeze of the system. This makes no sense as there shouldn't be an infinite loop as if I manually query the Redis db, all 5 results are there. And the dd("two") can never be excecated unless the usleep() is removed. Hopefully this narrows the problem down.
Edit 2 please read:
I have figured out that I can get the dd("two") to work when usleep() is reduced to 0.05s from 0.5 seconds, but it still doesnt seem to run long enough for it to work.
if(!self::cacheAsyncApiSearch($searchQuery, $maxResults))
{
// make sure Redis is properly returning a number not object
$len = Redis::client()->lLen($key);
while(!is_numeric($len)){
usleep(500000); // 0.5 seconds
$len = Redis::client()->lLen($key);
}
//dd($len); //this dd() works
while ($len < 5) {
dd("one"); // this dd() works
usleep(500000); // 0.5 seconds
dd("two"); **//$this does not work, why?**
$len = Redis::client()->lLen($key);
}
}

PHP pthreads failing when run from cron

Ok, so lets start slow...
I have a pthreads script running and working for me, tested and working 100% of the time when I run it manually from the command line via ssh. The script is as follows with the main thread process code adjusted to simulate random process' run time.
class ProcessingPool extends Worker {
public function run(){}
}
class LongRunningProcess extends Threaded implements Collectable {
public function __construct($id,$data) {
$this->id = $id;
$this->data = $data;
}
public function run() {
$data = $this->data;
$this->garbage = true;
$this->result = 'START TIME:'.time().PHP_EOL;
// Here is our actual logic which will be handled within a single thread (obviously simulated here instead of the real functionality)
sleep(rand(1,100));
$this->result .= 'ID:'.$this->id.' RESULT: '.print_r($this->data,true).PHP_EOL;
$this->result .= 'END TIME:'.time().PHP_EOL;
$this->finished = time();
}
public function __destruct () {
$Finished = 'EXITED WITHOUT FINISHING';
if($this->finished > 0) {
$Finished = 'FINISHED';
}
if ($this->id === null) {
print_r("nullified thread $Finished!");
} else {
print_r("Thread w/ ID {$this->id} $Finished!");
}
}
public function isGarbage() : bool { return $this->garbage; }
public function getData() {
return $this->data;
}
public function getResult() {
return $this->result;
}
protected $id;
protected $data;
protected $result;
private $garbage = false;
private $finished = 0;
}
$LoopDelay = 500000; // microseconds
$MinimumRunTime = 300; // seconds (5 minutes)
// So we setup our pthreads pool which will hold our collection of threads
$pool = new Pool(4, ProcessingPool::class, []);
$Count = 0;
$StillCollecting = true;
$CountCollection = 0;
do {
// Grab all items from the conversion_queue which have not been processed
$result = $DB->prepare("SELECT * FROM `processing_queue` WHERE `processed` = 0 ORDER BY `queue_id` ASC");
$result->execute();
$rows = $result->fetchAll(PDO::FETCH_ASSOC);
if(!empty($rows)) {
// for each of the rows returned from the queue, and allow the workers to run and return
foreach($rows as $id => $row) {
$update = $DB->prepare("UPDATE `processing_queue` SET `processed` = 1 WHERE `queue_id` = ?");
$update->execute([$row['queue_id']]);
$pool->submit(new LongRunningProcess($row['fqueue_id'],$row));
$Count++;
}
} else {
// 0 Rows To Add To Pool From The Queue, Do Nothing...
}
// Before we allow the loop to move on to the next part, lets try and collect anything that finished
$pool->collect(function ($Processed) use(&$CountCollection) {
global $DB;
$data = $Processed->getData();
$result = $Processed->getResult();
$update = $DB->prepare("UPDATE `processing_queue` SET `processed` = 2 WHERE `queue_id` = ?");
$update->execute([$data['queue_id']]);
$CountCollection++;
return $Processed->isGarbage();
});
print_r('Collecting Loop...'.$CountCollection.'/'.$Count);
// If we have collected the same total amount as we have processed then we can consider ourselves done collecting everything that has been added to the database during the time this script started and was running
if($CountCollection == $Count) {
$StillCollecting = false;
print_r('Done Collecting Everything...');
}
// If we have not reached the full MinimumRunTime that this cron should run for, then lets continue to loop
$EndTime = microtime(true);
$TimeElapsed = ($EndTime - $StartTime);
if(($TimeElapsed/($LoopDelay/1000000)) < ($MinimumRunTime/($LoopDelay/1000000))) {
$StillCollecting = true;
print_r('Ended To Early, Lets Force Another Loop...');
}
usleep($LoopDelay);
} while($StillCollecting);
$pool->shutdown();
So while the above script will run via a command line (which has been adjusted to the basic example, and detailed processing code has been simulated in the above example), the below command gives a different result when run from a cron setup for every 5 minutes...
/opt/php7zts/bin/php -q /home/account/cron-entry.php file=every-5-minutes/processing-queue.php
The above script, when using the above command line call, will loop over and over during the run time of the script and collect any new items from the DB queue, and insert them into the pool, which allows 4 processes at a time to run and finish, which is then collected and the queue is updated before another loop happens, pulling any new items from the DB. This script will run until we have processed and collected all processes in the queue during the execution of the script. If the script has not run for the full 5 minute expected period of time, the loop is forced to continue checking the queue, if the script has run over the 5 minute mark it allows any current threads to finish & be collected before closing. Note that the above code also includes a code based "flock" functionality which makes future crons of this idle loop and exit or start once the lock has lifted, ensuring that the queue and threads are not bumping into each other. Again, ALL OF THIS WORKS FROM THE COMMAND LINE VIA SSH.
Once I take the above command, and put it into a cron to run for every 5 minutes, essentially giving me a never ending loop, while maintaining memory, I get a different result...
That result is described as follows... The script starts, checks the flock, and continues if the lock is not there, it creates the lock, and runs the above script. The items are taken from the queue in the DB, and inserted into the pool, the pool fires off the 4 threads at a time as expected.. But the unexpected result is that the run() command does not seem to be executed, and instead the __destruct function runs, and a "Thread w/ ID 2 FINISHED!" type of message is returned to the output. This in turn means that the collection side of things does not collect anything, and the initiating script (the cron script itself /home/account/cron-entry.php file=every-5-minutes/processing-queue.php) finishes after everything has been put into the pool, and destructed. Which prematurely "finishes" the cron job, since there is nothing else to do but loop and pull nothing new from the queue, since they are considered "being processed" when processed == 1 in the queue.
The question then finally becomes... How do I make the cron's script aware of the threads that where spawned and run() them without closing the pool out before they can do anything?
(note... if you copy / paste the provided script, note that I did not test it after removing the detailed logic, so it may need some simple fixes... please do not nit-pick said code, as the key here is that pthreads works if the script is executed FROM the Command Line, but fails to properly run when the script is executed FROM a CRON. If you plan on commenting with non-constructive criticism, please go use your fingers to do something else!)
Joe Watkins! I Need Your Brilliance! Thanks In Advance!
After all of that, it seems that the issue was with regards to user permissions. I was setting this specific cron up inside of cpanel, and when running the command manually I was logged in as root.
After setting this command up in roots crontab, I was able to get it to successfully run the threads from the pool. Only issue I have now is some threads never finish, and sometimes I am unable to close the pool. But this is a different issue, so I will open another question elsewhere.
For those running into this issue, make sure you know who the owner of the cron is as it matters with php's pthreads.

How to make an asynchronous self-calling loop non-recursive

I'm writing a function in PHP that loops through an array, and then performs an asynchronous call on it (using a Promise).
The problem is that, the only way I can make this loop happen, is by letting a function call itself asynchronously. I run into the 100-nested functions problem really quick, and I would basically like to change it to not recur.
function myloop($data, $index = 0) {
if (!isset($data[$index])) {
return;
}
$currentItem = $data[$index];
$currentItem()->then(function() use ($data, $index) {
myloop($data, $index + 1);
});
}
For those that want to answer this from a practical perspective (e.g.: rewrite to not be asynchronous), I'm experimenting with functional and asynchronous patterns and I want to know if it is possible to do this with PHP.
I've written a possible solution in pseudo-code. The idea is to limit
the number of items running asynchronous at once by using a database
queue. myloop() is no longer directly recursive, instead being called
whenever an item finishes running. In the sample data, I've limited it
to 4 items concurrently (arbitrary value).
Basically, it's still recursively calling itself, but in a roundabout way,
avoiding the situation you mentioned of many nested called.
Execution Flow:
myloop() ---> queue
^ v
| |
'<-processor <-'
<?php
//----------
// database
//table: config
//columns: setting, value
//items: ACTIVE_COUNT, 0
// ITEM_CONCURRENT_MAX, 4
//table: queue
//columns: id, item, data, index, pid, status(waiting, running, finished), locked
// --- end pseudo-schema ---
<?php
// ---------------
// itemloop.php
// ---------------
//sends an item and associated data produced by myloop() into a database queue,
//to be processed (run asynchronous, but limited to how many can run at once)
function send_item_to_processor($item, $data, $index, $counter) {
//INSERT $item to a queue table, along with $data, $index (if needed), $counter, locked = 0
//status == waiting
}
//original code, slightly modified to remove direct recursion and implement
//the queue.
function myloop($data, $index = 0, $counter = 0) {
if (!isset($data[$index])) {
return;
}
$currentItem = $data[$index];
$currentItem()->then(function() use ($data, $index) {
//instead of directly calling `myloop()`, push item to
//database and let the processor worry about it. see below.
//*if you wanted currentItem to call a specific function after finishing,
//you could create an array of numbered functions and pass the function
//number along with the other data.*
send_item_to_processor($currentItem, $data, $index + 1, $counter + 1);
});
}
// ---------------
// processor.php
// ---------------
//handles the actual running of items. looks for a "waiting" item and
//executes it, updating various statuses along the way.
//*called from `process_queue()`*
function process_new_items() {
//select ACTIVE_COUNT, ITEM_CONCURRENT_MAX
//ITEM_COUNT = total records in the queue. this is done to
//short-circuit the execution of `process_queue()` whenever possible
//(which is called frequently).
if (ITEM_COUNT == 0 || $ACTIVE_COUNT >= $ITEM_CONCURRENT_MAX)
return FALSE;
//select item from queue where status = waiting AND locked = 0 limit 1;
//update item set status = running, pid = programPID
//update config ACTIVE_COUNT = +1
//**** asynchronous run item here ****//
return TRUE;
}
//main processor for the queue. first processes new/waiting items
//if it can (if too many items aren't already running), then processes
//dead/completed items. Upon an item.status == finished, `myloop()` is
//called from this function. Still technically a recursive call, but
//avoids out-of-control situations due to the asynchronous nature.
//this function could be called on a timer of some sort, such as a cronjob
function process_queue() {
if (!process_new_items())
return FALSE; //too many instances running, no need to process
//check queue table for items with status == finished or is_pid_valid(pid) == FALSE
$numComplete = count($rows);
//update all rows to locked = 1, in case process_queue() gets called again before
//we finish, resulting in an item potentially being processed as dead twice.
foreach($rows as $item) {
if (is_invalid(pid) || $status == finished) {
//and here is the call back to myloop(), avoiding a strictly recursive
//function call.
//*Not sure what to do with `$item` here -- might be passed back to `myloop()`?.*
//delete item(s) from queue
myloop(data, index, counter - 1);
//decrease config.ACTIVE_COUNT by $numComplete
}
}
}

How do I use $status returned by pcntl_waitpid()?

I have a parent/worker arrangement going on. The parent keeps the worker PIDs in an array, constantly checking that they are still alive with the following loop:
// $workers is an array of PIDs
foreach ($workers as $workerID => $pid) {
// Check if this worker still exists as a process
pcntl_waitpid($pid, $status, WNOHANG|WUNTRACED);
// If the worker exited normally, stop tracking it
if (pcntl_wifexited($status)) {
$logger->info("Worker $workerID exited normally");
array_splice($workers, $workerID, 1);
}
// If it has a session ID, then it's still living
if (posix_getsid($pid))⋅
$living[] = $pid;
}
// $dead is the difference between workers we've started
// and those that are still running
$dead = array_diff($workers, $living);
The problem is that pcntl_waitpid() is always setting $status to 0, so the very first time this loop is run, the parent thinks that all of its children have exited normally, even though they are still running. Am I using pcntl_waitpid() incorrectly, or expecting it to do something that it doesn't?
Simple, the child has not exited or stopped. You added the WNOHANG flag, so it will always return immediately (It tells the function not to wait for an event). What you should do is check the return value of pcntl_waitpid to see if anything of value was returned (assuming that you only want to run the contents of the loop if there's a status change):
foreach ($workers as $workerID => $pid) {
// Check if this worker still exists as a process
if (pcntl_waitpid($pid, $status, WNOHANG|WUNTRACED)) {
// If the worker exited normally, stop tracking it
if (pcntl_wifexited($status)) {
$logger->info("Worker $workerID exited normally");
array_splice($workers, $workerID, 1);
}
// If it has a session ID, then it's still living
if (posix_getsid($pid))⋅
$living[] = $pid;
}
}
You're indeed "using pcntl_waitpid() wrong" (Note the quotes)
Since you're using WNOHANG, only if pcntl_waitpid() returns the child's PID, you may evaluate what's in $status.
See return values for pcntl_waitpid().

Improving HTML scraper efficiency with pcntl_fork()

With the help from two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiently by wrapping my brain around with getting my scraper working with pcntl_fork.
If I split my php5-cli script into 10 separate chunks, I improve total runtime by a large factor so I know I am not i/o or cpu bound but just limited by the linear nature of my scraping functions.
Using code I've cobbled together from multiple sources, I have this working test:
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");
function doDomStuff($singleHref,$childPid) {
$html = new DOMDocument();
$html->loadHtmlFile($singleHref);
$xPath = new DOMXPath($html);
$domQuery = '//div[#id="slogan"]/h2';
$domReturn = $xPath->query($domQuery);
foreach($domReturn as $return) {
$slogan = $return->nodeValue;
echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
}
}
$pids = array();
foreach ($hrefArray as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref,$childPid);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
Which raises the following questions:
1) Given my hrefArray contains 4 urls - if the array was to contain say 1,000 product urls this code would spawn 1,000 child processes? If so, what is the best way to limit the amount of processes to say 10, and again 1,000 urls as an example split the child work load to 100 products per child (10 x 100).
2) I've learn that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feeds them off to child processes to do the processing - so spreading the load across 10 child workers.
My brain is telling I need to do something like the following (obviously this doesn't work, so don't run it):
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$maxChildWorkers = 10;
$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);
$domQuery = '//div[#id=productDetail]/a';
$domReturn = $xPath->query($domQuery);
$hrefsArray[] = $domReturn->getAttribute('href');
function doDomStuff($singleHref) {
// Do stuff here with each product
}
// To figure out: Split href array into $maxChilderWorks # of workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child process. Currently everything I've tried causes loops in the child processes. I.e. my hrefsArray gets built in the master, and in each subsequent child process.
I am sure I am going about this all totally wrong, so would greatly appreciate just general nudge in the right direction.
Introduction
pcntl_fork() is not the only way to improve performance HTML scraper while it might be a good idea to use Message Queue has Charles suggested but you still need a faster effective way to pull that request in your workers
Solution 1
Use curl_multi_init ... curl is actually faster and using multi curl gives you parallel processing
From PHP DOC
curl_multi_init Allows the processing of multiple cURL handles in parallel.
So Instead of using $html->loadHtmlFile('http://xxxx'); to load the files several times you can just use curl_multi_init to load multiple url at the same time
Here are some Interesting Implementations
php - Fastest way to check presence of text in many domains (above 1000)
php get all the images from url which width and height >=200 more quicker
How to prevent server from overloading during Curl requests in PHP
Solution 2
You can use pthreads to use multi-threading in PHP
Example
// Number of threads you want
$threads = 10;
// Treads storage
$ts = array();
// Your list of URLS // range just for demo
$urls = range(1, 50);
// Group Urls
$urlsGroup = array_chunk($urls, floor(count($urls) / $threads));
printf("%s:PROCESS #load\n", date("g:i:s"));
$name = range("A", "Z");
$i = 0;
foreach ( $urlsGroup as $group ) {
$ts[] = new AsyncScraper($group, $name[$i ++]);
}
printf("%s:PROCESS #join\n", date("g:i:s"));
// wait for all Threads to complete
foreach ( $ts as $t ) {
$t->join();
}
printf("%s:PROCESS #finish\n", date("g:i:s"));
Output
9:18:00:PROCESS #load
9:18:00:START #5592 A
9:18:00:START #9620 B
9:18:00:START #11684 C
9:18:00:START #11156 D
9:18:00:START #11216 E
9:18:00:START #11568 F
9:18:00:START #2920 G
9:18:00:START #10296 H
9:18:00:START #11696 I
9:18:00:PROCESS #join
9:18:00:START #6692 J
9:18:01:END #9620 B
9:18:01:END #11216 E
9:18:01:END #10296 H
9:18:02:END #2920 G
9:18:02:END #11696 I
9:18:04:END #5592 A
9:18:04:END #11568 F
9:18:04:END #6692 J
9:18:05:END #11684 C
9:18:05:END #11156 D
9:18:05:PROCESS #finish
Class Used
class AsyncScraper extends Thread {
public function __construct(array $urls, $name) {
$this->urls = $urls;
$this->name = $name;
$this->start();
}
public function run() {
printf("%s:START #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
if ($this->urls) {
// Load with CURL
// Parse with DOM
// Do some work
sleep(mt_rand(1, 5));
}
printf("%s:END #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
}
}
It seems like I suggest this daily, but have you looked at Gearman? There's even a well documented PECL class.
Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.
In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.

Categories