I have a worker.php file as below
<?php
$data = $argv[1];
//then some time consuming $data processing
and I run this as a poor man's job queue using GNU Parallel:
while read LINE; do echo $LINE; done < very_big_file_10GB.txt | parallel -u php worker.php
which kind of works by forking 4 PHP processes when I am on a 4-CPU machine.
But it still feels pretty synchronous to me because read LINE is still reading one line at a time.
Since it is a 10GB file, I am wondering whether I can somehow use parallel to read the same file in parallel by splitting it into n parts (where n = the number of my CPUs), which would ideally make my import n times faster.
No need to do the while business:
parallel -u php worker.php :::: very_big_file_10GB.txt
-u Ungroup output. Only use this if you are not going to use the output, as output from different jobs may mix.
:::: File input source. Equivalent to -a.
I think you will benefit from reading at least chapter 2 (Learn GNU Parallel in 15 minutes) of "GNU Parallel 2018". You can buy it at
http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html
or download it at: https://doi.org/10.5281/zenodo.1146014
I'm running a PHP script via CLI, set up as a cron job. The script reads about 10,000 records from the database and launches 10,000 new scripts (via the exec command) without waiting for the previous one to finish. I do this because I want all those tasks to run fast (each one takes about 10 seconds).
When the number of running tasks gets large, CPU usage hits 100% and I can't work with the server (CentOS). How can I handle this?
You need to limit the number of scripts running in parallel at any given time because running 10,000 concurrent scripts is clearly saturating your system. Instead, you should queue up each task and process 25 or 50 (whatever causes a reasonable amount of load) tasks at the same time.
Without much knowledge of how these scripts actually work, I can't really give you much advice code-wise, but you definitely need to have a queue in place to limit the number of concurrent instances of your script running at the same time.
Also check out semaphores; they might be useful for this producer/consumer model.
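Just to illustrate the idea, here is a rough sketch of such a queue using pcntl_fork(); it assumes the pcntl extension is available, and $tasks, task.php and the limit of 25 are placeholders for whatever your real jobs look like:
<?php
$maxConcurrent = 25;              // tune to whatever load the box can handle
$running = 0;

foreach ($tasks as $task) {       // $tasks = your 10,000 jobs (placeholder)
    if ($running >= $maxConcurrent) {
        pcntl_wait($status);      // block until one child finishes
        $running--;
    }
    $pid = pcntl_fork();
    if ($pid === 0) {
        // child: run a single task, then exit
        exec('php task.php ' . escapeshellarg($task));
        exit(0);
    }
    $running++;
}
while ($running-- > 0) {          // reap the remaining children
    pcntl_wait($status);
}
That way at most 25 tasks run at any moment, no matter how many records the cron run picked up.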
I recently wrote a script that handles parallel execution of commands. It basically allows the operator to tune the number of concurrent processes at runtime.
If you feel that CPU usage is too high, just decrease the value in /etc/maxprunning
CMD="path-of-your-php-script"
function getMaxPRunning {
cat /etc/maxprunning
}
function limitProcs {
PRUNNING=`ps auxw | grep -v grep | grep "$CMD" | wc -l`
#echo "Now running $PRUNNING processes, MAX:$MAXPRUNNING"
[ $PRUNNING -ge $MAXPRUNNING ] && {
sleep 1
return
}
#echo "Launching new process"
sleep 0.2
$CMD &
}
MAXPRUNNING=`getMaxPRunning`
C=1
while [ 1 ];do
MAXPRUNNING=`getMaxPRunning`
limitProcs
done
If you want your PHP scripts to ignore an accidental death of the parent, put this line at the top of the PHP script:
pcntl_signal(SIGINT, SIG_IGN);
I have a PHP script running on Debian that calls the ping command and redirects the output to a file using exec():
exec('ping -w 5 -c 5 xxx.xxx.xxx.xxx > /var/f/ping/xxx.xxx.xxx.xxx_1436538580.txt &');
The PHP script then has a while loop that scans the /var/f/ping/ folder and checks to see if the ping has finished writing to it. I tried checking the output using:
exec('lsof | grep /var/f/ping/xxx.xxx.xxx.xxx_1436538580.txt');
to see if the file was still open, but it takes lsof about 10-15 seconds to return its results, which is too slow for what we need. Ideally it should be able to check this within 2 or 3 seconds.
Is there a faster/better way to test if the ping has completed?
Using grep with lsof is probably the slowest way, as lsof will scan everything. You can narrow the scope that lsof covers to one directory by doing:
lsof +D /var/f/ping
or similar.
There's a good and easy-to-read overview of lsof uses here:
http://www.thegeekstuff.com/2012/08/lsof-command-examples/
Alternatively, you could experiment with:
http://php.net/manual/en/function.fam-monitor-file.php
and see if that meets your requirements better.
You need the deferred-queue pattern for this kind of task: run the pings in the background via cron and keep a table or file with the job statuses.
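For example (just a sketch reusing the paths from your question; the ".done" marker file is an invented convention), you could have the background command drop a marker file when ping finishes and poll for that instead of using lsof:
<?php
$out  = '/var/f/ping/xxx.xxx.xxx.xxx_1436538580.txt';
$done = $out . '.done';

// run ping in the background and create the marker once it has finished
exec(sprintf('(ping -w 5 -c 5 xxx.xxx.xxx.xxx > %s; touch %s) > /dev/null 2>&1 &',
    escapeshellarg($out), escapeshellarg($done)));

while (!file_exists($done)) {
    clearstatcache();   // force file_exists() to re-check the filesystem
    usleep(200000);     // poll every 0.2 s
}
// the ping output in $out is now complete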
Let me cover a bit of background here:
I am launching a Ruby script (script_launcher.rb) through PHP using the shell_exec function, and in that Ruby script I am doing this:
spawned_process_id = spawn("ruby actual_script.rb > /dev/null" )
Process.wait spawned_process_id
and after that I respond in my Ruby script according to the Process::Status object.
This starts three processes on the server:
1) Through the PHP script, for "ruby script_launcher.rb"
2) Through the spawn function, something like "sh -c ruby actual_script.rb > /dev/null"
3) Through "ruby actual_script.rb"
Now my question is: if my actual_script.rb is eating up a lot of RAM, which of the above processes will be killed by the kernel?
The actual problem is that the Process::Status object stores the status of spawned_process_id (process no. 2), but if the kernel kills process no. 3 then my logic reports success, which is absolutely the wrong result.
Any solution or reference would be helpful.
I don't think Linux would kill a process automatically unless there is some special configuration in place.
Mostly it's the process that kills itself (maybe by not handling an exception) when it can't get any more memory from the system.
I have 1 cronjob that runs every 60 minutes, but for some reason it has recently been running slowly.
Env: CentOS 5 + Apache 2 + MySQL 5.5 + PHP 5.3.3 / RAID 10 / 10k HDD / 16 GB RAM / 4 Xeon processors
Here's what the cronjob does:
1) Parse the last 60 minutes of data:
a) one process parses user agents and saves the data to the database
b) one process parses impressions/clicks on the website and saves them to the database
2) From the data in step 1:
a) build a small report and send emails to the administrators/business
b) save the report into a daily table (available in the admin section)
I now see 8 processes (of the same file) when I run the command ps auxf | grep process_stats_hourly.php (found this command on Stack Overflow).
Technically I should only have 1, not 8.
Is there any tool in CentOS, or something I can do, to make sure my cronjob runs every hour without overlapping the next one?
Thanks
Your hardware seems to be good enough to process this.
1) Check whether you already have hanging processes. Using ps auxf (see tcurvelo's answer), check whether one or more processes are taking too many resources. Maybe you don't have enough resources left to run your cronjob.
2) Check your network connections:
If your databases and your cronjob are on different servers, you should check the response time between the two machines. Maybe you have network issues that make the cronjob wait for the network to send the packets back.
You can use: Netcat, Iperf, mtr or ttcp
3) Server configuration
Is your server configured correctly? Are your OS and MySQL set up correctly? I would recommend reading these articles:
http://www3.wiredgorilla.com/content/view/220/53/
http://www.vr.org/knowledgebase/1002/Optimize-and-disable-default-CentOS-services.html
http://dev.mysql.com/doc/refman/5.1/en/starting-server.html
http://www.linux-mag.com/id/7473/
4) Check your database:
Make sure your database has the correct indexes and make sure your queries are optimized. Read this article about the EXPLAIN command.
If a query over a few hundred thousand records takes a long time to execute, it will affect the rest of your cronjob; if you have such a query inside a loop, it is even worse.
Read these articles:
http://dev.mysql.com/doc/refman/5.0/en/optimization.html
http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
http://blog.fedecarg.com/2008/06/12/10-great-articles-for-optimizing-mysql-queries/
5) Trace and optimize your PHP code
Make sure your PHP code runs as fast as possible.
Read these articles:
http://phplens.com/lens/php-book/optimizing-debugging-php.php
http://code.google.com/speed/articles/optimizing-php.html
http://ilia.ws/archives/12-PHP-Optimization-Tricks.html
A good technique to validate your cronjob is to trace the cronjob script:
Based on the steps of your cronjob, add some debug trace recording how much memory is in use and how much time it took to execute the last step, e.g.:
<?php
echo "\n-------------- DEBUG --------------\n";
echo "memory (start): " . memory_get_usage(TRUE) . "\n";
$start = microtime(TRUE);
// some process
$end = microtime(TRUE);
echo "\n-------------- DEBUG --------------\n";
echo "memory after some process: " . memory_get_usage(TRUE) . "\n";
echo "executed time: " . ($end - $start) . "\n";
By doing that you can easily see how much memory each step uses and how long it takes to execute.
6) External servers/web service calls
Does your cronjob call external servers or web services? If so, make sure those respond as fast as possible. If you request data from a third-party server and that server takes a few seconds to return an answer, it will affect the speed of your cronjob, especially if these calls are in loops.
Try that and let me know what you find.
The output of ps also shows when each process started (see the STARTED column).
$ ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME COMMAND
root 2 0.0 0.0 0 0 ? S 18:55 0:00 [kthreadd]
^^^^^^^
(...)
Or you can customize the output:
$ ps axfo start,command
STARTED COMMAND
18:55 [kthreadd]
(...)
Thus, you can be sure if they are overlapping.
You should use a lockfile mechanism within your process_stats_hourly.php script. It doesn't have to be anything overly complex; you could have PHP write the PID that started the process to a file like /var/mydir/process_stats_hourly.txt. If it takes longer than an hour to process the stats and cron kicks off another instance of the process_stats_hourly.php script, that instance can check whether the lockfile already exists; if it does, it will not run.
However you are left with the problem of how to "re-queue" the hourly script if it did find the lock file and couldn't start.
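Something along these lines (only a sketch; it assumes the posix extension is available and ignores race conditions between the check and the write):
<?php
$lock = '/var/mydir/process_stats_hourly.txt';

if (file_exists($lock)) {
    $oldPid = (int) trim(file_get_contents($lock));
    // signal 0 doesn't send anything, it only checks whether the PID is alive
    if ($oldPid > 0 && posix_kill($oldPid, 0)) {
        exit("previous run (PID $oldPid) is still active\n");
    }
    // stale lockfile from a crashed run: fall through and take over
}

file_put_contents($lock, getmypid());
register_shutdown_function(function () use ($lock) {
    unlink($lock);   // release the lock when the script finishes
});

// ... the actual hourly stats processing goes here ...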
You might use strace -p 1234, where 1234 is the relevant process id, on one of the processes that is running too long. Perhaps you'll understand why it is so slow, or even where it is blocked.
Is there any tool in Cent OS or something I can do to make sure my cronjob will run every hour and not overlapping the next one?
Yes. CentOS' standard util-linux package provides a command-line convenience for filesystem locking. As Digital Precision suggested, a lockfile is an easy way to synchronize processes.
Try invoking your cronjob as follows:
flock -n /var/tmp/stats.lock process_stats_hourly.php || logger -p cron.err 'Unable to lock stats.lock'
You'll need to edit paths and adjust for $PATH as appropriate. That invocation will attempt to lock stats.lock, spawning your stats script if successful, otherwise giving up and logging the failure.
Alternatively your script could call PHP's flock() itself to achieve the same effect, but the flock(1) utility is already there for you.
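If you go the PHP route, a minimal sketch looks like this (same lock path as above; everything else is up to your script):
<?php
$fp = fopen('/var/tmp/stats.lock', 'c');

if (!$fp || !flock($fp, LOCK_EX | LOCK_NB)) {
    // another instance already holds the lock, so give up quietly
    exit(0);
}

// ... run the hourly stats processing here ...

flock($fp, LOCK_UN);
fclose($fp);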
How often is that logfile rotated?
A log-parsing job suddenly taking longer than usual sounds like the log isn't being rotated and is now too big for the parser to handle efficiently.
Try resetting the logfile and see if the job runs faster. If that solves the problem, I recommend logrotate as a means of preventing the problem in the future.
You could add a step to the cronjob to check the output of your above command:
ps auxf | grep process_stats_hourly.php
Keep looping until the command returns nothing, indicating that the process isn't running, then allow the remaining code to execute.
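A rough sketch of that polling step (note the extra grep -v grep, so the grep process itself doesn't keep the loop spinning forever):
<?php
do {
    $lines = array();
    exec('ps auxf | grep process_stats_hourly.php | grep -v grep', $lines);
    if (count($lines) > 0) {
        sleep(5);   // the previous run is still going, wait and poll again
    }
} while (count($lines) > 0);

// no instance of process_stats_hourly.php is left running; carry on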
I have an application written in C++ that uses OpenCV 2.0, curl and an OpenSURF library. First a PHP script (cron.php) calls proc_open to run the C++ application (called icomparer). When it finishes processing N images it returns groups saying which images are the same. After that the script uses:
shell_exec('php cron.php > /dev/null 2>&1 &');
die;
and starts again. Well, after 800 or 900 iterations my icomparer starts breaking. The system doesn't let me create more files, either in icomparer or in the PHP script:
proc_open(): unable to create pipe Too many open files (2)
shell_exec(): Unable to execute 'php cron.php > /dev/null 2>&1 &'
And curl fails too:
couldn't resolve host name (6)
Everything crashes. I think I'm doing something wrong; for example, I don't know whether starting another PHP process from a PHP process releases resources.
In "icomparer" I'm closing all opened files. Maybe I'm not releasing every mutex with mutex_destroy... but on each iteration the C++ application exits, so I think everything is released, right?
What do I have to watch for? I have tried monitoring opened files with lsof.
Php 5.2
Centos 5.X
1 GB ram
120 gb hard disk (4% used)
4 x intel xeon
It is a VPS (the machine has 16 GB of RAM)
The process opens 10 threads and joins them.
Sounds like you're leaking file descriptors.
On Unix-like systems, child processes inherit the parent's open file descriptors. When the child process exits, it closes its own copies of those descriptors, but not the parent's copies.
So you are opening file descriptors in the parent and not closing them. My bet is that you are not closing the pipes returned by the proc_open() call.
And you'll also need to call proc_close() too.
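A minimal sketch of the cleanup (the './icomparer' path and the descriptor spec are placeholders, not taken from your cron.php):
<?php
$spec = array(
    0 => array('pipe', 'r'),   // child's stdin
    1 => array('pipe', 'w'),   // child's stdout
    2 => array('pipe', 'w'),   // child's stderr
);

$proc = proc_open('./icomparer', $spec, $pipes);
if (is_resource($proc)) {
    $output = stream_get_contents($pipes[1]);

    // close every pipe handle the parent still holds...
    foreach ($pipes as $pipe) {
        fclose($pipe);
    }
    // ...and then release the process handle itself
    $exitCode = proc_close($proc);
}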
Yeah, it looks like you're opening processes but not closing them after use and, as it seems, they are not closed automatically (which may happen in some circumstances).
Make sure you close/terminate your process with proc_close($res) if you don't use the resource anymore.
Your application doesn't close its files/sockets. You can also try the ulimit command; with it you can adjust the number of open files allowed per process. Have a look at man ulimit.