Improving HTML scraper efficiency with pcntl_fork() - php

With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by wrapping my brain around getting my scraper working with pcntl_fork.
If I split my php5-cli script into 10 separate chunks, I improve total runtime by a large factor, so I know I am not I/O or CPU bound but just limited by the linear nature of my scraping functions.
Using code I've cobbled together from multiple sources, I have this working test:
<?php

libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);

$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");

function doDomStuff($singleHref, $childPid) {
    $html = new DOMDocument();
    $html->loadHtmlFile($singleHref);
    $xPath = new DOMXPath($html);

    $domQuery = '//div[@id="slogan"]/h2';
    $domReturn = $xPath->query($domQuery);

    foreach ($domReturn as $return) {
        $slogan = $return->nodeValue;
        echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
    }
}

$pids = array();
foreach ($hrefArray as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref, $childPid);
        exit(0);
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
Which raises the following questions:
1) My hrefArray contains 4 URLs - but if the array were to contain, say, 1,000 product URLs, this code would spawn 1,000 child processes? If so, what is the best way to limit the number of processes to, say, 10, and with 1,000 URLs as an example, split the child workload to 100 products per child (10 x 100)?
2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feed them off to child processes to do the processing - so spreading the load across 10 child workers.
My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):
<?php

libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);

$maxChildWorkers = 10;

$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);

$domQuery = '//div[@id="productDetail"]/a';
$domReturn = $xPath->query($domQuery);
$hrefsArray[] = $domReturn->getAttribute('href');

function doDomStuff($singleHref) {
    // Do stuff here with each product
}

// To figure out: Split href array into $maxChildWorkers # of workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref);
        exit(0);
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child processes. Currently everything I've tried causes loops in the child processes, i.e. my hrefsArray gets built in the master and then again in each subsequent child process.
I am sure I am going about this all totally wrong, so I would greatly appreciate just a general nudge in the right direction.
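A minimal sketch of one possible shape (illustrative only, reusing the doDomStuff() function from above): build $hrefsArray once in the parent, array_chunk() it into $maxChildWorkers groups, and fork one child per group, so each child loops over just its own chunk before exiting.
<?php
$maxChildWorkers = 10;

// Built once, in the parent only (e.g. from the DOMXPath query above)
$hrefsArray = array(/* ... product URLs ... */);

// 1,000 URLs / 10 workers => chunks of 100
$chunks = array_chunk($hrefsArray, (int) ceil(count($hrefsArray) / $maxChildWorkers));

$pids = array();
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        $pids[] = $pid;                              // parent: remember the child, keep forking
    } else {
        foreach ($chunk as $singleHref) {
            doDomStuff($singleHref, posix_getpid()); // child: process only its own chunk
        }
        exit(0);                                     // child must exit so it never re-enters the loop
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);                    // parent: wait for all workers
}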

Introduction
pcntl_fork() is not the only way to improve the performance of an HTML scraper. While it might be a good idea to use a message queue as Charles suggested, you still need a fast, effective way to pull those requests in your workers.
Solution 1
Use curl_multi_init ... cURL is actually faster, and using multi cURL gives you parallel processing.
From the PHP docs:
curl_multi_init - Allows the processing of multiple cURL handles in parallel.
So instead of using $html->loadHtmlFile('http://xxxx'); to load the files one at a time, you can just use curl_multi_init to load multiple URLs at the same time.
Here are some interesting implementations:
php - Fastest way to check presence of text in many domains (above 1000)
php get all the images from url which width and height >=200 more quicker
How to prevent server from overloading during Curl requests in PHP
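A rough, untested sketch of driving curl_multi_init (the URLs are placeholders): download everything in parallel first, then run the DOMDocument parsing over the returned bodies.
<?php
$urls = array("http://example.com/page1", "http://example.com/page2");

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until every handle has finished
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running > 0);

foreach ($handles as $i => $ch) {
    $html = new DOMDocument();
    $html->loadHTML(curl_multi_getcontent($ch)); // parse the downloaded body
    // ... DOMXPath queries here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);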
Solution 2
You can use pthreads for multi-threading in PHP.
Example
// Number of threads you want
$threads = 10;

// Thread storage
$ts = array();

// Your list of URLs // range just for demo
$urls = range(1, 50);

// Group URLs
$urlsGroup = array_chunk($urls, floor(count($urls) / $threads));

printf("%s:PROCESS #load\n", date("g:i:s"));

$name = range("A", "Z");
$i = 0;
foreach ($urlsGroup as $group) {
    $ts[] = new AsyncScraper($group, $name[$i++]);
}

printf("%s:PROCESS #join\n", date("g:i:s"));

// Wait for all threads to complete
foreach ($ts as $t) {
    $t->join();
}

printf("%s:PROCESS #finish\n", date("g:i:s"));
printf("%s:PROCESS #finish\n", date("g:i:s"));
Output
9:18:00:PROCESS #load
9:18:00:START #5592 A
9:18:00:START #9620 B
9:18:00:START #11684 C
9:18:00:START #11156 D
9:18:00:START #11216 E
9:18:00:START #11568 F
9:18:00:START #2920 G
9:18:00:START #10296 H
9:18:00:START #11696 I
9:18:00:PROCESS #join
9:18:00:START #6692 J
9:18:01:END #9620 B
9:18:01:END #11216 E
9:18:01:END #10296 H
9:18:02:END #2920 G
9:18:02:END #11696 I
9:18:04:END #5592 A
9:18:04:END #11568 F
9:18:04:END #6692 J
9:18:05:END #11684 C
9:18:05:END #11156 D
9:18:05:PROCESS #finish
Class Used
class AsyncScraper extends Thread {

    public function __construct(array $urls, $name) {
        $this->urls = $urls;
        $this->name = $name;
        $this->start();
    }

    public function run() {
        printf("%s:START #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
        if ($this->urls) {
            // Load with CURL
            // Parse with DOM
            // Do some work
            sleep(mt_rand(1, 5));
        }
        printf("%s:END #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
    }
}

It seems like I suggest this daily, but have you looked at Gearman? There's even a well documented PECL class.
Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.
In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.
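A rough, untested sketch of what that looks like with the PECL classes (the 'scrape' function name, server address, and $hrefsArray are placeholders):
<?php
// worker.php - registers a "scrape" function and waits for jobs
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('scrape', function (GearmanJob $job) {
    $url = $job->workload();
    // ... fetch and parse $url, write the product data to the database ...
});
while ($worker->work());

<?php
// client.php - fire-and-forget one background job per product URL
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
foreach ($hrefsArray as $href) {
    $client->doBackground('scrape', $href);
}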

Related

usleep() in while loop causing issues - whole process freezes

I am using PHP, Laravel, Redis, and SQL on an Ubuntu localhost server. I have made a bunch of methods that return results from API searches after some processing. I am calling 5 of these methods, which would be very slow if done synchronously, so I've been experimenting with async approaches (which I know PHP isn't optimised for). After a few approaches I have found some success with pcntl_fork(), but I'm running into some nasty problems.
Edit: After some messing around I have found that if I remove the while loop then the code afterward executes properly; I have removed the while loop and placed it in the second 'search' method. However, it still causes a freeze of the system. This makes no sense, as there shouldn't be an infinite loop: if I manually query the Redis db, all 5 results are there.
This is my code (I have a few custom classes for making and processing the API calls; FYI, these methods work flawlessly):
// This caches the individual API results to a Redis list
public static function cacheAsyncApiSearch(string $searchQuery, int $maxResults = 20)
{
    $key = "search:".$searchQuery; // for Redis

    if (!Redis::client()->exists($key)) {
        for ($i = 0; $i < 5; $i++) {
            // Create a child process
            $pid = pcntl_fork();

            if ($pid == -1) {
                // Fork failed
                exit(1);
            } elseif ($pid) {
                // This is the parent process
                // I have tried many versions of pcntl_wait, none work! They all still don't allow code to be run afterwards (even within this elseif block), and the best it does is cache the 1st API case (YouTube)

                // while (!pcntl_wait($status, WNOHANG)) {
                //     $exitStatus = pcntl_wexitstatus($status);
                //     // Do something with the exit status of the child process
                // }

                // dd($pid);
                // pcntl_waitpid($pid, $status, WUNTRACED);
            } else {
                // Child processes
                switch ($i) {
                    case 0:
                        $results = YouTube::search($searchQuery, $maxResults)['results'];
                        Redis::client()->rPush($key, SearchResultDTO::jsonEncodeArray($results));
                        SearchResultDTO::convertResultDTOToModels($results);
                        break;
                    case 1:
                        $results = Dailymotion::search($searchQuery, $maxResults)['results'];
                        Redis::client()->rPush($key, SearchResultDTO::jsonEncodeArray($results));
                        SearchResultDTO::convertResultDTOToModels($results);
                        break;
                    case 2:
                        $results = Vimeo::search($searchQuery, $maxResults)['results'];
                        Redis::client()->rPush($key, SearchResultDTO::jsonEncodeArray($results));
                        SearchResultDTO::convertResultDTOToModels($results);
                        break;
                    case 3:
                        $results = Twitch::search($searchQuery, 2)['results'];
                        Redis::client()->rPush($key, SearchResultDTO::jsonEncodeArray($results));
                        SearchResultDTO::convertResultDTOToModels($results);
                        break;
                    case 4:
                        $results = Podcasts::getPodcastsFromItunesResults(Podcasts::search($searchQuery, 2)["response"]->results);
                        Redis::client()->rPush($key, SearchResultDTO::jsonEncodeArray($results));
                        SearchResultDTO::convertResultDTOToModels($results);
                        break;
                }
                $i = 10000;
                exit(0);
            }
        }

        // For noting the process id of the given process that gets to this point
        Redis::client()->lPush("search_pid:".$searchQuery, $pid);

        // Sets a timeout for the Redis cache
        Redis::client()->expire($key, 60*60*4);

        while (is_numeric(Redis::client()->lLen($key)) && Redis::client()->lLen($key) < 5) {
            usleep(500000); // 0.5 seconds
            // pcntl_waitpid(-1, $status); // does this even do anything? not for me
        }

        return false; // not already cached
    }
    return true; // already cached
}
This code somewhat works: it performs the API calls and caches to Redis perfectly. However, when the method is run, no code will run after it (unless Redis has found a cached version and the process is never forked).
This made me think that all processes are being exited (possibly true? if so I don't know why), so I tried writing a version without the exit(0) line. This works - I can then perform code after the method call - however I noticed (when getting SQL race conditions) that all 6 processes (5 child, 1 parent) continued to run their own version of the code after this method (e.g. some database writes).
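A minimal sketch of the usual shape (illustrative only, not the asker's code): each child exits right after its single API call, and the parent reaps every child PID before continuing, so only the parent ever runs whatever comes after.
<?php
$childPids = array();
for ($i = 0; $i < 5; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        exit(1);
    } elseif ($pid) {
        $childPids[] = $pid;   // parent: remember this child
    } else {
        // ... do one API call and rPush the result to Redis ...
        exit(0);               // child must exit here, never fall through
    }
}

// Parent only: block until every child has finished
foreach ($childPids as $pid) {
    pcntl_waitpid($pid, $status);
}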
public static function search(string $searchQuery, int $maxResults = 20): array
{
    $key = "search:".$searchQuery;
    $results = [];

    // The quoted method above
    self::cacheAsyncApiSearch($searchQuery, $maxResults);

    foreach (Redis::client()->lRange($key, 0, -1) as $result) {
        $results = array_merge($results, SearchResultDTO::jsonDecodeArray($result));
    }

    $creatorDTOs = [];
    $videoDTOs = [];
    $streamDTOs = [];
    $playlistDTOs = [];
    $podcastDTOs = [];

    /** @var SearchResultDTO $result */
    foreach ($results as $result) {
        match ($result->kind) {
            Kind::Creator => $creatorDTOs[] = $result,
            Kind::Video => $videoDTOs[] = $result,
            Kind::Stream => $streamDTOs[] = $result,
            Kind::Playlist => $playlistDTOs[] = $result,
            Kind::Podcast => $podcastDTOs[] = $result,
        };
    }

    // Did this to test how many times the code was being run (the list has 6 1's in it)
    Redis::client()->lPush("here", '1');

    // I know this code isn't completely efficient since I already called these conversion methods before, however I am just trying to get the forking stuff to work right now.
    return [
        "creators" => SearchResultDTO::convertResultDTOToModels($creatorDTOs),
        "videos" => SearchResultDTO::convertResultDTOToModels($videoDTOs),
        "streams" => SearchResultDTO::convertResultDTOToModels($streamDTOs),
        "playlists" => SearchResultDTO::convertResultDTOToModels($playlistDTOs),
        "podcasts" => SearchResultDTO::convertResultDTOToModels($podcastDTOs)
    ];
}
These DTOs (Data Transfer Objects) are being used to populate a UI. So for example, when I make a search (that isn't cached), the page is blank forever. But if I refresh the page (after the search is cached), then the results show just fine.
This is the most bizarre problem I have ever run into, and I really appreciate any help.
Edit please read:
After some messing around I have found that if I remove the while loop then the code afterward executes properly; I have removed the while loop and placed it in the second 'search' method. However, it still causes a freeze of the system. This makes no sense, as there shouldn't be an infinite loop: if I manually query the Redis db, all 5 results are there. And the dd("two") can never be executed unless the usleep() is removed. Hopefully this narrows the problem down.
Edit 2 please read:
I have figured out that I can get the dd("two") to work when usleep() is reduced to 0.05s from 0.5 seconds, but it still doesn't seem to run long enough for it to work.
if (!self::cacheAsyncApiSearch($searchQuery, $maxResults))
{
    // Make sure Redis is properly returning a number not an object
    $len = Redis::client()->lLen($key);
    while (!is_numeric($len)) {
        usleep(500000); // 0.5 seconds
        $len = Redis::client()->lLen($key);
    }

    // dd($len); // this dd() works

    while ($len < 5) {
        dd("one");          // this dd() works
        usleep(500000);     // 0.5 seconds
        dd("two");          // this does not work, why?
        $len = Redis::client()->lLen($key);
    }
}

Solution for calling a function doing lots of stuff in it by Cron?

function cronProcess() {
    # > 100,000 users
    $users = $this->UserModel->getUsers();
    foreach ($users as $user) {
        # Do lots of database Insert/Update/Delete, HTTP request stuff
    }
}
The problem happens when the number of users reaches ~ 100,000.
I call the function via cURL from crontab.
So what is the best solution for this?
I do a lot of bulk tasks in CakePHP, some processing millions of records. It's certainly possible to do, the key as others suggested is small batches in a loop.
If this is something you're calling from Cron, it's probably easier to use a Shell (< v3.5) or the newer Command class (v3.6+) than cURL.
Here's generally how I paginate large batches, including some helpful optional things like a progress bar, turning off hydration to speed things up slightly, and showing how many users/second the script was able to process:
<?php
namespace App\Command;

use Cake\Console\Arguments;
use Cake\Console\Command;
use Cake\Console\ConsoleIo;

class UsersCommand extends Command
{
    public function execute(Arguments $args, ConsoleIo $io)
    {
        // I'd guess a Finder would be a more Cake-y way of getting users than a custom "getUsers" function:
        // See https://book.cakephp.org/3.0/en/orm/retrieving-data-and-resultsets.html#custom-finder-methods
        $usersQuery = $this->UserModel->find('users');

        // Get a total so we know how many we're gonna have to process (optional)
        $total = $usersQuery->count();
        if ($total === 0) {
            $this->abort("No users found, stopping..");
        }

        // Hydration takes extra processing time & memory, which can add up in bulk. Optionally if able, skip it & work with $user as an array not an object:
        $usersQuery->enableHydration(false);

        $this->info("Grabbing $total users for processing");

        // Optionally show the progress so we can visually see how far we are in the process
        $progress = $io->helper('Progress')->init([
            'total' => $total
        ]);

        // Tune this page value to a size that solves your problem:
        $limit = 1000;
        $offset = 0;

        // Simply drawing the progress bar every loop can slow things down, optionally draw it only every n-loops,
        // this sets it to 1/5th the page size:
        $progressInterval = $limit / 5;

        // Optionally track the rate so we can evaluate the speed of the process, helpful for tuning $limit and evaluating enableHydration effects
        $startTime = microtime(true);

        do {
            $users = $usersQuery->offset($offset)->limit($limit)->toArray();
            $count = count($users);
            $index = 0;

            foreach ($users as $user) {
                $progress->increment(1);
                // Only draw occasionally, for speed
                if ($index % $progressInterval === 0) {
                    $progress->draw();
                }
                $index++;

                ### WORK TIME
                # Do your lots of database Insert/Update/Delete, HTTP request stuff etc. here
                ###
            }
            $progress->draw();

            $offset += $limit; // Increment your offset to the next page
        } while ($count > 0);

        $totalTime = microtime(true) - $startTime;
        $this->out("\nProcessed an average " . ($total / $totalTime) . " Users/sec\n");
    }
}
Check out these sections in the CakePHP docs:
Console Commands
Command Helpers
Using Finders & Disabling Hydration
Hope this helps!

pthreads stopping already running thread once a condition has been met

I'm still relatively new to PHP and trying to use pthreads to solve an issue. I have 20 threads running processes that end at varying times. Most finish in around 10 seconds or less. I don't need all 20, just the first 10 detected. Once I get to 10, I would like to kill the remaining threads, or continue on to the next step.
I have tried using set_time_limit of about 20 seconds for each of the threads, but they ignore it and keep running. I am looping through the jobs looking for the join, because I didn't want the rest of the program to run, but I'm stuck until the slowest one has finished. While pthreads has reduced the time from around a minute to about 30 seconds, I can shave off even more time, since the first 10 run in about 3 seconds.
Thanks for any help and here is my code:
$count = 0;
foreach ($array as $i) {
    $imgName = $this->smsId."_$count.jpg";
    $name = "LocalCDN/".$imgName;
    $stack[] = new AsyncImageModify($i['largePic'], $name);
    $count++;
}

// Run the threads
foreach ($stack as $t) {
    $t->start();
}

// Check if the threads have finished; push the coordinates into an array
foreach ($stack as $t) {
    if ($t->join()) {
        array_push($this->imgArray, $t->data);
    }
}
class AsyncImageModify extends \Thread {
    public $data;

    public function __construct($arg, $name) {
        $this->arg = $arg;
        $this->name = $name;
    }

    public function run() {
        // Tried putting the set_time_limit() here, didn't work
        if ($this->arg) {
            // Get the image
            $didWeGetTheImage = Image::getImage($this->arg, $this->name);
            if ($didWeGetTheImage) {
                $timestamp1 = microtime(true);
                print_r("Starting face detection $this->arg" . "\n");
                print_r(" ");
                $j = Image::process1($this->name);
                if ($j) {
                    // Let's go ahead and do our image manipulation at this point
                    $userPic = Image::process2($this->name, $this->name, 200, 200, false, $this->name, $j);
                    if ($userPic) {
                        $this->data = $userPic;
                        print_r("Back from process2; the image returned is $userPic");
                    }
                }
                $endTime = microtime(true);
                $td = $endTime - $timestamp1;
                print_r("Finished face detection $this->arg in $td seconds" . "\n");
                print_r($j);
            }
        }
    }
}
It is difficult to guess the functionality of Image::* methods, so I can't really answer in any detail.
What I can say is that there are very few machines I can think of that are suitable to run 20 concurrent threads in any case. A more suitable setup would be the worker/stackable model. A Worker thread is a reusable context that can execute task after task, implemented as Stackables; execution in a multi-threaded environment should always use the least number of threads to get the most work done.
Please see the pooling example and the other examples that are distributed with pthreads, available on GitHub; additionally, much information regarding usage is contained in past bug reports, if you are still struggling after that ...
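A rough, untested sketch of that worker/stackable shape (using the pthreads v1-era Worker/Stackable API the answer refers to; the class name, $imageList, and the pool size of 4 are illustrative, not from the question):
<?php
class ImageTask extends Stackable {
    public function __construct($url, $name) {
        $this->url  = $url;
        $this->name = $name;
    }
    public function run() {
        // fetch and process a single image here
    }
}

// A small, fixed pool of reusable workers instead of 20 one-shot threads
$workers = array();
for ($i = 0; $i < 4; $i++) {
    $workers[$i] = new Worker();
    $workers[$i]->start();
}

// Stack the tasks round-robin onto the workers; keep references so they aren't destroyed early
$tasks = array();
foreach ($imageList as $n => $img) {            // $imageList stands in for the asker's $array
    $task = new ImageTask($img['largePic'], "LocalCDN/{$n}.jpg");
    $tasks[] = $task;
    $workers[$n % 4]->stack($task);
}

// Let every worker drain its stack and shut down
foreach ($workers as $w) {
    $w->shutdown();
}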

PHP forking and multiple child signals

I'm trying to write a script which creates a number of forked child processes using the pcntl_* functions.
Basically, there is a single script which runs in a loop for about a minute, periodically polling a database to see if there is a task to be run. If there is one, it should fork and run the task in a separate process so that the parent isn't held up by a long-running task.
Since there could possibly be a large number of tasks ready to be run, I want to limit the number of child processes that are created. Therefore, I am keeping track of the number of processes by incrementing a variable each time one is created (and then pausing if there are too many), and decrementing it in a signal handler. Kind of like this:
declare(ticks = 1);

$openProcesses = 0;  // how many we have open
$max = 3;            // the most we want open at a time

pcntl_signal(SIGCHLD, "childFinished");

while (!time_is_up()) {
    if (there_is_something_to_do()) {
        $pid = pcntl_fork();
        if (!$pid) {      // I am the child
            foo();        // run the long-running task
            exit(0);      // and exit
        } else {          // I am the parent
            ++$openProcesses;
            if ($openProcesses >= $max) {
                pcntl_wait($status); // wait for any child to exit
            }                        // before continuing
        }
    } else {
        sleep(3);
    }
}

function childFinished($signo) {
    global $openProcesses;
    --$openProcesses;
}
This works pretty much OK most of the time, except for when two or more processes finish simultaneously - the signal handler function is only called once, which throws off my counter. The reason for this is explained by "Anonymous" in the notes of the PHP manual:
Multiple children returning fewer SIGCHLD signals than the number of children exiting at a given moment is normal behavior for Unix (POSIX) systems. SIGCHLD might be read as "one or more children changed status -- go examine your children and harvest their status values".
My question is this: How do I examine the children and harvest their status? Is there any reliable way to check how many child processes are open at any given time?
Using PHP 5.2.9
One way is to keep an array of the PIDs of the child processes, and in the signal handler check each PID to see if it's still running. The (untested) code would look like:
declare(ticks = 1);

$openProcesses = 0;
$procs = array();
$max = 3;

pcntl_signal(SIGCHLD, "childFinished");

while (!time_is_up()) {
    if (there_is_something_to_do()) {
        $pid = pcntl_fork();
        if (!$pid) {
            foo();
            exit(0);
        } else {
            $procs[] = $pid; // add the PID to the list
            ++$openProcesses;
            if ($openProcesses >= $max) {
                pcntl_wait($status);
            }
        }
    } else {
        sleep(3);
    }
}

function childFinished($signo) {
    global $openProcesses, $procs;
    // Check each process to see if it's still running
    // If not, remove it and decrement the count
    foreach ($procs as $key => $pid) {
        if (posix_getpgid($pid) === false) {
            unset($procs[$key]);
            $openProcesses--;
        }
    }
}
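Another common way to "harvest" (a related pattern, not what this answer does) is a non-blocking waitpid loop inside the handler itself, which reaps however many children exited since the last signal. Note that if you reap in the handler, the blocking pcntl_wait() in the main loop needs adjusting, since it can return -1 once the handler has already collected the child.
function childFinished($signo) {
    global $openProcesses;
    // Reap every child that has exited, since several SIGCHLDs can be merged into one delivery
    while (pcntl_waitpid(-1, $status, WNOHANG) > 0) {
        --$openProcesses;
    }
}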
You could have children send a SIGUSR1 to the parent when they start, then a SIGUSR2 before they exit. The other thing you are dealing with when using primitive signals is the kernel merging them, which it does not do with RT signals. In theory, ANY non-RT signal could be merged.
You might implement some kind of simple locking using SQLite, where only one child at a time can hold the talking stick. Just make sure that children handle normally-fatal signals so that they remain alive to free the lock.
I know this is about 8 years too late (and I hope you found an answer), but just in case it helps someone else I am going to answer.
The use of the pcntl_w* functions will be your friend here and you will probably want to implement a process reaper. The documentation is not very helpful and still does not contain any useful examples.
This would be a multi-part process:
1 - Use pcntl_signal to send trapped signals to your signal handler
2 - Do your looping/polling, and within that loop:
3 - Iterate through the array of your children (which you will create below) and reap them as necessary
4 - fork(): This will consist of the following:
pcntl_async_signals(true);

$children = array();

while ($looping === true)
{
    reapChildren();

    if (($pid = pcntl_fork()) == -1) exit(1); // error
    elseif ($pid) // parent
    {
        $children[] = $pid;
        // close files/sockets/etc
        posix_setpgid($pid, posix_getpgrp());
    }
    else
    { // child
        posix_setpgid(posix_getpid(), posix_getppid());
        // ... jump to child function/object/code/etc ...
        exit(0); // or whatever code you want to return
    }
} // end of loop
In the reaper, you will need the following:
function reapChildren()
{
    global $children;
    foreach ($children as $idx => $pid)
    {
        $rUsage = array();
        $status = 0; // integer which will be used as the $status pointer
        $ret = pcntl_waitpid($pid, $status, WNOHANG | WUNTRACED, $rUsage);

        if (pcntl_wifexited($status)) // the child exited normally
        {
            $exitCode = pcntl_wexitstatus($status); // returns the child exit status
        }
        if (pcntl_wifsignaled($status)) // the child received a signal
        {
            $signal = pcntl_wtermsig($status); // returns the signal that abended the child
        }
        if (pcntl_wifstopped($status))
        {
            $signal = pcntl_wstopsig($status); // returns the signal that stopped the child
        }
    }
}
The above reaper code will allow you to poll the status of your children, and if you are using PHP 7+, the $signalInfo array that is filled in at your signal handler will contain a lot of useful information you can use - var_dump it and check it out. Also, using pcntl_async_signals(true) in PHP 7+ replaces the need for declare(ticks=1) and manually calling pcntl_signal_dispatch().
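For example, on PHP 7.1+ the handler receives that info array as its second argument (a small illustrative sketch):
pcntl_async_signals(true);
pcntl_signal(SIGCHLD, function ($signo, $signalInfo) {
    // For SIGCHLD this typically includes the child's pid, status and code
    var_dump($signalInfo);
});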
I hope this helps.

linux worker script/queue (php)

I need a binary/script (PHP) that does the following:
Start n processes of X in the background and maintain that number of processes.
An example:
n = 50
initially 50 processes are started
a process exits
49 are still running
so 1 should be started again.
P.S.: I posted the same question on SV, which makes me probably very unpopular.
Could you use a Linux crontab and write the number of currently running processes to a DB or file?
If a DB, the advantage is that you can use a procedure and lock the table while writing the number of processes.
But to background the script you should use & at the end of the call:
# php -f pro.php &
Pseudocode:
for (i=1; i<=50; i++)
    myprocess
endfor

while true
    while ( $(ps --no-headers -C myprocess|wc -l) < 50 )
        myprocess
    endwhile
endwhile
If you translate this to PHP and fix its flaws, it might just do what you want.
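For illustration, a rough, untested translation might look like this ("worker.php" is a placeholder for whatever one process of X is):
<?php
$target = 50;

// Initial launch
for ($i = 0; $i < $target; $i++) {
    exec('php -f worker.php > /dev/null 2>&1 &');
}

// Keep the pool topped up
while (true) {
    $running = (int) trim(shell_exec('ps --no-headers -C php -o args | grep -c worker.php'));
    for ($i = $running; $i < $target; $i++) {
        exec('php -f worker.php > /dev/null 2>&1 &');
    }
    sleep(1);
}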
I would go in the direction that andres suggested. Just put something like this at the top of your pro.php file...
$this_file = __FILE__;
$final_count = 50;

$processes = `ps auwx | grep "php -f $this_file"`;
$processes = explode("\n", $processes);

if (count($processes) > $final_count + 3) {
    exit;
}

//... Remaining code goes here
Have you tried making a PHP Daemon before?
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
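The core of that approach (a bare-bones sketch, not the linked article's code) is to fork, let the parent exit, and detach with posix_setsid() so the remaining process can supervise the workers from the background:
<?php
$pid = pcntl_fork();
if ($pid == -1) {
    exit(1);      // fork failed
} elseif ($pid) {
    exit(0);      // parent exits; the child lives on in the background
}
posix_setsid();   // become session leader, detached from the terminal

// ... from here, start and keep re-spawning the 50 worker processes ...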
Here's something in Perl I have in my library (and hey, let's be honest, I'm not going to rig this up in PHP just to give you something working in that language at this moment; I'm just using what I can copy/paste).
#!/usr/bin/perl

use threads;
use Thread::Queue;

my @workers;
my $num_threads = shift;
my $dbname = shift;

my $queue = new Thread::Queue;

for (0..$num_threads-1) {
    $workers[$_] = new threads(\&worker);
    print "TEST!\n";
}

while ($_ = shift @ARGV) {
    $queue->enqueue($_);
}

sub worker() {
    while ($file = $queue->dequeue) {
        system('./4parser.pl', $dbname, $file);
    }
}

for (0..$num_threads-1) { $queue->enqueue(undef); }
for (0..$num_threads-1) { $workers[$_]->join; }
Whenever one of those system calls finishes up, it moves on to dequeuing. Oh, and damn if I know why I did 0..$num_threads-1 instead of the normal my $i = 0; $i < ... idiom, but I did it that way that time.
I have two solutions to propose. Both reboot a child process on exit, reload child processes on a USR1 signal, wait for the children to exit on SIGTERM, and so on.
The first is based on the swoole PHP extension. It is very performant, async, and non-blocking. Here's the usage example code:
<?php
use Symfony\Component\Process\PhpExecutableFinder;

require_once __DIR__.'/../vendor/autoload.php';

$phpBin = (new PhpExecutableFinder)->find();
if (false === $phpBin) {
    throw new \LogicException('Php executable could not be found');
}

$daemon = new \App\Infra\Swoole\Daemon();

$daemon->addWorker(1, $phpBin, [__DIR__ . '/console', 'quartz:scheduler', '-vvv']);
$daemon->addWorker(3, $phpBin, [__DIR__ . '/console', 'enqueue:consume', '--setup-broker', '-vvv']);

$daemon->run();
The daemon code is here
The other is based on the Symfony Process library. It does not require any extra extensions. The usage example and daemon code can be found here
