I have a PHP script running as a cron job that makes extensive use of third-party code. The script itself has a few thousand LOC. Basically, it's a data import/processing script (JSON to MySQL, but it also makes a lot of HTTP calls and some SOAP).
Now, performance degrades over time. When testing with a few records (around 100), performance is OK and the job finishes in 10-20 minutes. When running the whole import (about 1600 records), the mean import time per record grows steadily, and the whole thing takes more than 24 hours, so at least 5 times longer than expected.
Memory seems not to be a problem, usage growing as it should, without unexpected peaks.
So, I need to debug it to find the bottleneck. The problem could be in the script, the underlying code base, PHP itself, the database, the OS, or the network. For now I suspect some kind of cache somewhere that is behaving badly with a near-100% miss ratio.
I cannot use Xdebug: the profile file grows too fast to be workable.
So question is: how can I debug this kind of script?
PHP version: 5.4.41
OS: Debian 7.8
I can have root privileges if necessary, and can install tools. But it's a production server, so ideally the debugging should not be too disruptive.
Yes, it's possible: you can use Kint (a PHP debugging script).
What is it?
Kint for PHP is a tool designed to present your debugging data in the absolutely best way possible.
In other words, it's var_dump() and debug_backtrace() on steroids. Easy to use, but powerful and customizable. An essential addition to your development toolbox.
Still lost? You use it to see what's inside variables.
It acts as a debug_backtrace replacement, too.
You can download it here or here.
Full documentation and help is here.
Plus, it supports almost all PHP frameworks:
CodeIgniter
Drupal
Symfony
Symfony 2
WordPress
Yii framework
Zend Framework
All the Best.... :)
There are three things that come to mind:
Set up an IDE so you can debug the PHP script line by line
Add some logging to the script
Look for long running queries in MySQL
Debug option #2 is the easiest. Since this is running as a cron job, you can add a bunch of echo calls to your script:
<?php
function log_message($type, $message) {
    echo '[' . strtoupper($type) . ', ' . date('d-m-Y H:i:s') . '] ' . $message . "\n";
}

log_message('info', 'Import script started');

// ... the rest of your script

log_message('info', 'Import script finished');
Then pipe stdout to a log file in the cron job command.
01 04 * * * php /path/to/script.php >> /path/to/script.log
Now you can add log_message('info|warn|debug|error', 'Message here') all over the script and at least get an idea of where the performance issue lies.
Debug option #3 is straight investigative work in MySQL. One of your queries might be taking a long time, and it would show up in a long-running-query report, such as MySQL's slow query log.
Profiling tool:
There is a PHP profiling tool called Blackfire which is currently in public beta. There is specific documentation on how to profile CLI applications. Once you have collected the profile, you can analyze the application control flow with time measurements in a nice UI.
Suspicious memory consumption:
Memory seems not to be a problem, usage growing as it should, without unexpected peaks.
A growing memory usage actually sounds suspicious! If the current dataset does not depend on all previous datasets of the import, then a growing memory footprint most probably means that all imported datasets are kept in memory, which is bad. PHP may also frequently try to garbage collect, only to find out that there is nothing to remove from memory. Long-running CLI tasks are especially affected, so be sure to read the blog post that discovered the behavior.
Use strace to see what the program is actually doing from the system's perspective. Is it hanging in I/O operations, etc.? strace should be the first thing you try when encountering performance problems with any kind of Linux application. Nobody can hide from it! ;)
If you find that the program hangs in network-related calls like connect, recvfrom, and friends, meaning the network communication hangs at some point while connecting or waiting for responses, then you can use tcpdump to analyze this.
Using the above methods you should be able to find most of the common performance problems. Note that you can even attach to a running task with strace using -p PID.
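For example (1234 stands in for your import script's PID; attaching is non-invasive but adds some overhead while the trace runs):

# Attach and summarize time spent per syscall; Ctrl-C prints the table:
strace -c -f -p 1234

# Or watch network-related calls live, with timestamps and per-call durations:
strace -f -tt -T -e trace=network -p 1234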
If the above methods don't help, I would profile the script using Xdebug. You can analyse the profiler output with tools like KCachegrind.
Although it is not stipulated, if my guess is correct you seem to be dealing with records one at a time, but in one big cron job.
i.e. grab record #1, munge it somehow, add value to it, reformat it, then save it, then move on to record #2.
I would consider breaking the big cron job down, i.e.:
Cron #1: grab all the records, and cache all the salient data locally (to that server). Set a flag if this stage is achieved (see the sketch after this list).
Cron #2: Now you have the data you need, munge and add value, cache that output. Set a flag if this stage is achieved.
Cron #3: Reformat that data and store it. Delete all the files.
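A minimal sketch of the stage-flag idea, assuming simple flag files on disk (the paths and function names are placeholders for your own code):

// Cron #1: do the work only once, then leave a flag for the next stage.
if (!file_exists('/var/tmp/import.stage1.done')) {
    fetch_and_cache_records();                // your own fetch/cache step
    touch('/var/tmp/import.stage1.done');     // signal stage 1 is complete
}

// Cron #2: run only when stage 1 has finished and stage 2 hasn't.
if (file_exists('/var/tmp/import.stage1.done')
    && !file_exists('/var/tmp/import.stage2.done')) {
    munge_cached_records();                   // your own processing step
    touch('/var/tmp/import.stage2.done');
}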
This kind of "divide and conquer" will ease your debugging woes, lead to a better understanding of what is actually going on, and as a bonus give you the opportunity to rerun, say, cron #2.
I've had to do this many times, and for me logging is the key to identifying weaknesses in your code and poor assumptions about data quality, and it can hint at where latency is causing a problem.
I've run into strange slowdowns when doing network-heavy work in the past. Basically, what I found was that during manual testing the system was very fast, but when left to run unattended it would not get as much done as I had hoped.
In my case the issue was that I had default network timeouts in place and many web requests would simply time out.
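If you suspect something similar, it's worth setting explicit timeouts instead of relying on the defaults. A minimal sketch with cURL (the URL and the limits here are placeholders to tune):

$ch = curl_init('http://example.com/api');       // hypothetical endpoint
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // give up connecting after 5s
curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // hard cap on the whole request
$response = curl_exec($ch);
if ($response === false) {
    error_log('Request failed: ' . curl_error($ch));
}
curl_close($ch);

// For plain file_get_contents()/fopen() HTTP calls, the equivalent knob is:
ini_set('default_socket_timeout', 30);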
In general, though not an external tool, you can use the difference between two microtime(TRUE) calls to time sections of code. To keep the logging small, set a flag limit and only test the time while the flag has not yet been decremented to zero, reducing it each time a slow event is logged. You can have individual flags for individual code segments, or even for different time limits within a code segment.
$flag['name'] = 10;  // How many times to fire
$slow['name'] = 0.5; // How long in seconds before it's a problem?

$start = microtime(TRUE);
do_something($parameters);
$used = microtime(TRUE) - $start;

if ( $flag['name'] && $used >= $slow['name'] )
{
    logit($parameters); // your own logging function
    $flag['name']--;
}
If you output which URL, or other data/event, took too long to process, then you can dig into that particular item later to see if you can find out how it is causing trouble in your code.
Of course, this assumes that individual items are causing your problem and not simply a general slowdown over time.
EDIT:
I (now) see it's a production server. This makes editing the code less enjoyable. You'd probably want to make integration with the code a minimal process, keeping the testing logic, and possibly the supported tags/flags and quantities, in an external file.
setStart('flagname');
// Do stuff to be checked for speed here
setStop('flagname',$moredata);
For maximum robustness the methods/functions would have to ensure they handled unknown tags, missing parameters, and so forth.
xdebug_print_function_stack is an option, but what you can also do is create a "function trace". There are three output formats: one is meant as a human-readable trace, another is more suited for computer programs as it is easier to parse, and the last one uses HTML for formatting the trace.
http://www.xdebug.org/docs/execution_trace
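If I recall the Xdebug 2 API correctly (which matches PHP 5.4), you can also scope the trace to just the suspicious section so the trace file stays small; the path and function below are placeholders:

// Trace only the section you suspect, not the whole 24-hour run:
xdebug_start_trace('/tmp/import_trace');   // writes /tmp/import_trace.xt
import_one_record($record);                // hypothetical slow section
xdebug_stop_trace();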
Okay, basically you have two possibilities: either inefficient PHP code or inefficient MySQL usage. Judging by what you say, it's probably that you are inserting a lot of records into an indexed table one at a time, which causes the insertion time to skyrocket. You should either disable the indexes and rebuild them after insertion, or optimize the insertion code.
But, about the tools.
You can configure the system to automatically log slow MySQL queries:
https://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
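The relevant settings look roughly like this (in my.cnf, followed by a mysqld restart; the 1-second threshold is an assumption you should tune):

[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1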
You can also do the same with PHP scripts, but you need a PHP-FPM environment (and you probably have Apache).
https://rtcamp.com/tutorials/php/fpm-slow-log/
These tools are very powerful and versatile.
P.S. 10-20 minutes for 100 records seems like A LOT.
You can use https://github.com/jmartin82/phplapse to record the application activity for a set period of time.
For example, start recording after n iterations with:
phplapse_start();
And stop it in next iteration with:
phplapse_stop();
With this process you create a snapshot of the execution at the point where everything seems to run slowly.
(I'm the author of the project; don't hesitate to contact me to improve the functionality.)
I have a similar thing running each night (a cron job to update my database). I have found the most reliable way to debug is to set up a log table in the database and regularly insert/update a JSON string containing a multi-dimensional array with info about each record and whatever useful info you want to know about each record. This way, if your cron job does not finish, you still have detailed information about where it got up to and what happened along the way. Then you can write a simple page to pull out the JSON string, turn it back into an array and print useful data onto the page, including timing and passed tests etc. When you see something as an issue you can concentrate on putting more info from that area into the JSON string.
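A minimal sketch of that pattern with PDO, assuming a log table that has a unique key on run_id (the table and variable names are placeholders):

$log[] = array(
    'record'  => $recordId,                          // current record
    'step'    => 'http_fetch',                       // what just happened
    'seconds' => round(microtime(true) - $start, 3), // how long it took
);

$stmt = $pdo->prepare(
    'INSERT INTO cron_log (run_id, data) VALUES (:run, :data)
     ON DUPLICATE KEY UPDATE data = :data2'
);
$stmt->execute(array(
    ':run'   => $runId,
    ':data'  => json_encode($log),
    ':data2' => json_encode($log),
));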
Regular "top" command can show you, if CPU usage by php or mysql is bottleneck. If not, then delays may be caused by http calls.
If CPU usage by mysqld is low, but constant, then it may be disk usage bottleneck.
Also, you can check your bandwidth usage by installing and using "speedometer", or other tools.
Let's say just 5 different cURL scripts, each running in a different cmd window at the same time on the same machine. Will that be OK? Will it act the same as running different tabs in a browser?
I would also like to add that some of the scripts use PHP Simple HTML DOM Parser. Would that change anything?
All thoughts on this would be appreciated.
Yes, you can definitely do that. From your code perspective, it is the same as having 5 different users load the site at once. Same from the server perspective. In fact, this is a pretty simplistic way to test simulated load on a page and see how it handles it. Definitely not the best way by far, but quick and dirty.
Just be careful when generating these 5 scripts that you don't end up in a runaway situation where those 5 have not completely finished and exited before you run more. You could quickly overload your own server/site if you end up with dozens or even hundreds of scripts running.
Also be aware that if you're doing heavy processing in those scripts, you may hit PHP/MySQL/Apache timeouts, so keep an eye on that.
Make sure these calls are also thoroughly protected from the public, or someone else will send thousands of calls to them and crash your site.
If you need to do heavy processing it's usually best to do it with a command line PHP script triggered by a cron job. This gives you much more control and security.
I need to run a bunch of long-running processes on a CentOS server.
If I leave the processes (Python/PHP scripts) to run, sometimes they will stop because of trivial errors (e.g. string encoding issues), or sometimes because the process seems to get killed by the server.
I try to use nohup and fire the jobs from the crontab.
Is there any way to keep these processes running in such a way that all the variables are saved and I can restart the script from where it stopped?
I know I can program this into the code, but I would prefer a generalised utility which could just keep these things running so that the script completes even if there are trivial errors.
Perhaps I need some sort of process-management tool?
Many thanks for any suggestions
Is there any way to keep these processes running in such a way that all the variables are saved and I can restart the script from where it stopped?
Yes. It's called creating a "checkpoint" or "memento".
I know I can program this
Good. Get started. Each problem is unique, so you have to create, save, and reload the mementos.
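A minimal checkpoint sketch in PHP, assuming the work is a resumable loop (the path, $total, and process_record() are placeholders):

$checkpoint = '/var/tmp/job.checkpoint';

// On startup, resume from the last saved position if one exists.
$offset = is_file($checkpoint) ? (int) file_get_contents($checkpoint) : 0;

for ($i = $offset; $i < $total; $i++) {
    process_record($i);                      // your own unit of work

    // Save the new position atomically so a kill mid-write can't corrupt it.
    file_put_contents($checkpoint . '.tmp', (string) ($i + 1));
    rename($checkpoint . '.tmp', $checkpoint);
}

unlink($checkpoint);                         // finished cleanly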
but would prefer a generalised utility which could just keep these things running so that the script completed even if there were trivial errors.
It doesn't generalize well. Not all variables can be saved. Only you know what's required to restart your process in a meaningful way.
Perhaps I need some sort of process-management tool?
Not really.
trivial errors, e.g. string encoding issues
Usually, we find these by unit testing. That saves a lot of programming to work around the error. An ounce of prevention is worth a pound of silly work-arounds.
sometimes because the process seems to get killed by the server.
What? You'd better find out why. An ounce of prevention is worth a pound of silly work-arounds.
Why would you prefer to avoid using bash commands via exec() in PHP?
I'm not considering the portability issue (I definitely will not port it to run on Windows). It's just a question of good script-writing practice.
On the one hand:
I need to write many more lines in PHP than in bash to accomplish the same task.
For example, when I need to filter some lines in a file, I just can't imagine using anything instead of cat file | grep string > new_file. This would take much more time and effort to do in PHP.
I do not want to analyze every situation in which something might go wrong. I will just show the bash command's output to the user, so he knows exactly what happened.
I do not need to write yet another wrapper around the filesystem functions and use it. It is much more efficient to leverage the OS for file searching, manipulation, etc.
On the other hand:
Calling a Unix command with exec() might be inefficient in many cases. It is quite expensive to spawn a separate process, not to mention scripts running under Apache, where this is even less efficient than spawning from command-line scripts.
Sometimes the result turns out to be 'black magic', Perl-like scripting, though that can be mitigated with detailed comments.
Maybe I'm just trying to use two different tools together when they are not meant to be. Each tool has its own domain of application and they should not be mixed.
Even though I'm sure users will not try to run the script with malicious purposes, using exec() is a potential security threat. In most cases user data can be escaped with escapeshellarg(), but it is still an issue to take into account.
Another reason to avoid this is that it makes it much easier to create security holes.
For example, if a user manages to sneak
`rm -rf /`
(with backticks) into the input, your bash code might actually nuke the server (or at least nuke something).
This is mostly a religious thing: most developers try to write code that always works, and relying on external commands is a sure way to get your code to fail on some systems (even on the same OS).
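For reference, the standard defense when user input must reach the shell is escapeshellarg(), which the question already mentions; a minimal sketch (the log path is a placeholder):

// escapeshellarg() wraps the value in single quotes and escapes embedded
// quotes, so backticks, semicolons and the like arrive as literal text:
$pattern = escapeshellarg($userInput);   // e.g. a user-supplied search string
$output  = shell_exec("grep $pattern /var/data/app.log");
echo $output;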
What are you trying to achieve? PHP has regex-based functions to find what you need from a file. Yes, you would probably need about 5 lines of code to do it, but it would probably be no more or less efficient.
The main reason against using exec() in PHP is for security. If you're trusting your user to give you a command to exec() in bash, they could easily run malicious commands, such as installing and starting backdoor trojans, removing files, and the like.
As long as you're careful though (use the shell escaping commands to clean user input, restrict the Apache user permissions etc) it shouldn't be a problem. I'm just working on a complete platform at the moment, which relies on the front-end executing shell processes simply because C++ is much faster than PHP, so I've written a lot of the backend logic as a shell application and keep PHP for the front-end logic.
Even though you say portability isn't an issue, you never know for certain what the future holds, so I'd encourage you to reconsider that position. For example, I was once asked to port an editor that was written (by someone else) from Unix to DOS. The original program wasn't expected to be ported and was written with Unix specific calls deeply embedded in the code. After reviewing the amount of work required, we abandoned the task as too time consuming.
I have used exec calls in PHP; however, I had no other way to accomplish what I needed (I had to call another program written in another language with no other bridge between the languages). However, IMO, exec calls which aren't necessary are ugly. As others have said, they can also create security risks and slow your program down.
As you said yourself, you need to document the exec calls well to be sure they'll be understood by programmers. Why create the extra work? Not just now but in the future, when any changes to the exec call will also need to be documented.
Finally, I suggest you learn PHP and its functions a bit better. I'm not that good with PHP, but in just a matter of minutes with Google and php.net, I think I accomplished the same thing you gave as an example with:
// Note: $search_string must be a valid regex with delimiters, e.g. '/string/'
$search_results = preg_grep($search_string, file($file_name));
foreach ($search_results as $result) {
    echo $result . "\n";
}
Yes, it's a bit more code, but not that much, and you can put it in a function if appropriate ... and I wouldn't be surprised if a PHP guru could shorten it.
IMHO, the main concern with using exec() to execute *nix commands via PHP is security, more than performance or even code style.
If you have very good input sanitization (and this is very hard to achieve), you may manage to avoid any security holes.
Personally, if portability isn't an issue, I would totally use *nix commands like grep, locate, etc. anyday over trying to duplicate that functionality in PHP.
It's about using the best tool for the job. In some cases, arguably more often than most people realize, it is much more efficient to leverage the OS for file searching, manipulation, etc. (amongst other things)
A lot of people would descend on you like a ton of bricks for even mentioning exec(). Some people would consider it blasphemy, but not me. I can see nothing wrong with exec() in some situations if your server has been properly configured. The disadvantage, though, is that you are spawning another process.
If you are running your PHP under a web server, the "user" that runs the script may not have permission to run certain shell commands. You said portability is not an issue, but I can guarantee you that it IS an issue (unless you are creating PHP scripts for fun). In the business world, where things and conditions change fast, you never know: you might one day have to run your scripts on other platforms.
It is not secure unless you take extreme precautions to make sure it can't be used by people executing the code.
PHP is not a good executor. PHP spawns the process from Apache, and if that process hangs, your Apache worker hangs too; if your site is also running on the same Apache instance, it will fail.
You can expect to run into silly issues like these as well, and when it happens you can't even restart Apache without killing the spawned process manually from the shell.
http://bugs.php.net/bug.php?id=38915
Therefore, I'm not even talking about security: running Linux commands from PHP fails more often than you'd think. And the worst part of using exec() is that it's not always possible to get error messages back to PHP, or to write a subsequent method that depends on what happened in the exec() call.
consider this pseudo example:
exec('bash myscript.sh', $x);
if (myScriptWasOk == true) then do this
There is no way to get that 'myScriptWasOk' variable right. You just don't know anything about it; $x (and exec()'s optional third parameter, which receives the exit status) will only help you sometimes.
All this being said, if you need something simple, and you have tested your script and it works OK, just go for it.
If you are only aiming for Unix compatibility (which is perfectly fine), I can't see anything wrong with it. Virtually every server operating system available today is a Unix clone, except of course for Windows, which I think is ridiculous to use as a server platform in the first place (and I'm talking from experience here; this is not just Microsoft hatred). Unix compatibility is a perfectly legitimate requirement for any server, in my opinion.
The only real reason I can see to avoid it is performance. You will quickly find that executing external processes in general is extremely slow. It's slow in C, and it's slow in PHP. I would think that's the biggest real, non-religious concern.
EDIT:
Oh, and as for the security problem, that's simply a matter of making sure you are in total control of the variables passed to the operating system. It's a concern you have to address when communicating between processes and languages anyway, for example when you make SQL queries. It's not a big enough reason, in my opinion, to avoid doing something; it's just something that has to be taken into account in this case, as in every case.
If portability really isn't an issue, because you are building a company solution that is always going to run on your own, totally controlled servers, I say go for shell commands as much as you want. There is no inherent security problem as long as you do proper basic sanitization using escapeshellarg() and friends.
At the same time, in my projects portability usually is an issue, and when it is, I try not to use shell commands at all. Only when something can't be done in PHP at all (e.g. MP3 decoding/encoding, ImageMagick, video operations) or not reasonably (i.e. a PHP-based solution is way too slow) will I use external commands.
I wish to create a background process and I have been told these are usually written in C or something of that sort. I have recently found out PHP can be used to create a daemon and I was hoping to get some advice if I should make use of PHP in this way.
Here are my requirements for a daemon.
Continuously check if a row has been added to a MySQL database table
Run FFmpeg commands on what was retrieved from the database
Insert the output into a MySQL table
I am not sure what else I can offer to help make this decision. Just to add, I have not done C before, only Java and PHP, and basic bash scripting.
Does it even make that much of a performance difference?
Please allow for my ignorance, I am learning! :)
Thanks all
As others have noted, various versions of PHP have issues with their garbage collectors. Of course, if you know that your version does not have such issues, you eliminate that problem. The point is, you don't know (for sure) until you write the daemon and run it through valgrind to see whether the installed PHP leaks or not on any given machine. So you may write it just to discover that what Zend thinks is fixed might still be buggy, or that you are dealing with a slightly older version of PHP or some extension. Icky.
The other problem is somewhat buggy signals. In my experience, signal handlers are not always entered correctly with PHP, especially when the signal is queued instead of merged. That may not be an issue for you, i.e. if you just need to handle SIGINT/SIGUSR1/SIGUSR2/SIGHUP.
So, I suggest:
If the daemon is simple, go ahead and use PHP. If it looks like it's going to get rather complex, or allocate lots of memory, you might consider writing it in C after prototyping it in PHP.
I am a pretty die hard C person. However, I see nothing wrong with hammering out something quick using PHP (beyond the cases that I explained). I also see nothing wrong with using PHP to prototype something that may or may not be later rewritten in C. For instance, handling database stuff is going to be much simpler if you use PHP, versus managing callbacks using other interfaces in C. So in that instance, for a 'one off', you will surely get it done much faster.
I would be inclined to perform this task with a cron job, rather than polling the database in a daemon.
It's likely that your FFmpeg command will take a while to do its thing, right? In that case, is it really necessary to be constantly polling the database? Wouldn't a cron job running every minute (or every five, ten or twenty minutes for that matter) be a simpler way to achieve the same thing?
PHP isn't any better or worse for this kind of thing than the other common scripting languages. It has fairly complete access to all of the system calls and library utilities you would need for this sort of work. If you are most comfortable using PHP for scripting, then PHP will do the job for you.
The only downside is that PHP is not quite as ubiquitous as, say, Perl or Python, which are installed on almost every flavor of Unix. PHP tends to be found only on systems that serve dynamic web content. Not that a PHP interpreter is too large or costly to install, but if your biggest concern is getting your program onto many systems, that may be a slight hurdle.
I'll be contrary and recommend you try the PHP daemon. It's apparently the language you know best. You'll presumably incorporate a timer in any case, so you can tune the querying frequency on the database. There's really no penalty as long as you aren't naively looping on a query.
If it's something not executed frequently, you could alternatively run the PHP from cron, letting your code drain the queue and then die.
But don't be afraid to stick with what you know best, as a first approximation.
Try not to use triggers. They'll impose unnecessary coupling, and they're no fun to test and debug.
One problem with properly daemonizing a PHP script is that PHP doesn't have interfaces to the dup() or dup2() syscalls, which are needed for detaching the file descriptors.
A cron job would probably work just fine, if near-instant action is not required.
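For what it's worth, the usual PHP workaround is to fork, detach, and then close and reopen the standard streams; a rough sketch, assuming the pcntl and posix extensions are available (the log paths are placeholders):

if (pcntl_fork() > 0) {
    exit(0);                    // parent exits; the child lives on
}
posix_setsid();                 // new session: detach from the terminal

// No dup2() in PHP, but closing and reopening the standard streams
// achieves much the same effect for daemon purposes.
fclose(STDIN);
fclose(STDOUT);
fclose(STDERR);
$stdin  = fopen('/dev/null', 'r');
$stdout = fopen('/var/log/mydaemon.log', 'a');
$stderr = fopen('/var/log/mydaemon.err', 'a');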
I'm just about to put live a system I've built, based on the queueing daemon 'beanstalkd'. I send various small messages from (in this case, PHP) webpage calls to the daemon, and a PHP script then picks them up from the queue and performs various tasks, such as resizing images or checking databases (often passing info back via a Memcache-based store).
To avoid long-running processes, I've wrapped it in a bash script that, depending on the value returned from the PHP script (via "exit(1);"), will restart it after every (say) 50 tasks it has performed. If it's restarting because I planned it, it will do so instantly; any other exit value (the default is 0, so I don't use that) pauses a few seconds before the restart.
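A rough sketch of such a wrapper, assuming exit code 1 means "planned restart" as described (the path is a placeholder):

#!/bin/bash
# Restart the worker forever; a planned restart (exit 1) relaunches at once,
# any other exit gets a short pause so a crash loop doesn't spin the CPU.
while true; do
    php /path/to/worker.php
    if [ $? -ne 1 ]; then
        sleep 5
    fi
done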
Running as a cron job with sensibly determined periodicity, a PHP script can do the job, and production stability is certainly achievable. You might want to limit the number of simultaneous FFMpeg instances, and be sure to have complete application logging and exception handling. I have implemented continuously running polling processes in Java, as well as the every-ten-minute cron'd PHP script, and both do the job nicely.
You might want to consider making a MySQL trigger that executes a system command (i.e. FFmpeg) instead of a daemon. If some lag isn't a problem, you could also put something in cron that executes every few minutes to check. Cron would be my choice, if it is an option.
To answer your question, php is perfectly fine to run as a daemon. It does not have to be done in C.
If you combine the answers from Kent Fredric, tokenmacguy and Domster, you get something useful.
PHP is probably not good for long execution times, so let's keep every execution cycle short and make sure the OS takes care of the cleanup of any memory leaks.
Cron can be a good tool to start your PHP script.
And if you do it like that, there is not much difference between languages.
However, the question still stands:
Is PHP even capable of running as a normal daemon for a long time (some years)?
Or will assorted memory leaks eat up all your RAM and kill the system?
/Johan
If you do so, pay attention to memory leaks. PHP 5.2 has some problems with its garbage collector, according to this (fixed in 5.3). Perhaps it's better to use cron, so the script starts clean on every run.
For what you've described, I would go with a daemon. Make sure that you stick a sleep in the poll loop, so that you don't bombard the database when there are no new tasks; a cron job works better for workflow/report types of jobs, where there isn't some particular event that triggers the next run.
As mentioned, PHP has some problems with memory management. You need to be sure that you test your code for memory leaks, since these build up over time in a long-running script. PHP doesn't have real garbage collection; it relies on reference counting, which means that cyclic references will cause leaks. If you're aware of this, you can code around it.
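A minimal poll loop along those lines (the helper functions and table layout are assumptions):

while (true) {
    $job = fetch_next_job($pdo);    // e.g. SELECT ... WHERE status = 'new' LIMIT 1
    if ($job === null) {
        sleep(10);                  // idle: don't bombard the database
        continue;
    }
    run_ffmpeg($job);               // your FFmpeg invocation
    mark_job_done($pdo, $job);
    unset($job);                    // drop references between iterations
}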
If you do decide to go down the daemon route, there is a great PEAR module called System_Daemon which I've recently used successfully on a PHP v5.3.0 installation. It is documented on the author's blog: http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php
If you have PEAR installed, you can install this module using:
pear install -f System_Daemon
You will also need to create an initialisation script: /etc/init.d/<your_daemon_name>
Then you can:
Start Daemon: /etc/init.d/projNotifMailDaemon start
Stop Daemon: /etc/init.d/projNotifMailDaemon stop
Logs are kept at: /var/log/<your_daemon_name>.log
I wouldn't recommend it. PHP is not designed for long-term execution. It's designed primarily around short-lived pages.
In my experience PHP can have problems with leaking memory for some of the larger tasks.
A cron job and a little bit of bash scripting should be everything you need by the sounds of it. You can do things like:
# fetch the filename from MySQL, then feed it to ffmpeg
file=$(mysql -N -h server db_name -e "SELECT file FROM table LIMIT 1")
ffmpeg -i "$file" -r 50 output.avi
So bash would be easier to write, port and maintain, IMHO, than PHP.
If you know what you are doing, sure. You need to understand your operating system well. PHP generally isn't suited for most daemons because it isn't threaded and doesn't have a decent event-based system for all tasks. However, if it suits your needs, then no problem. Modern PHP (5.3+) is really stable and doesn't have any memory leaks, as long as you enable the GC, don't implement your own memory leaks, and so on.
Here are the stats for one daemon I am running:
uptime: 17 days (last restart due to PHP upgrade)
bytes written: 200GB
connections: hundreds
connections handled: hundreds of thousands
items/requests processed: millions
node.js is generally better suited, although it has some minor annoyances. Some attempts to improve PHP in the same areas have been made, but they aren't really that great.
Cron job? Yes.
Daemon which runs forever? No.
PHP does not have a garbage collector (or at least, last time I checked it did not). Therefore, if you create a circular reference, it NEVER gets cleaned up, at least not until the main script execution finishes. In a daemon process, that is approximately never.
If they've added a GC in newer versions, then yes, you can.
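A short illustration of the cycle problem, and the PHP 5.3+ collector that addresses it:

// Each call creates a reference cycle that pure refcounting can never free:
function leak() {
    $a = new stdClass();
    $b = new stdClass();
    $a->other = $b;
    $b->other = $a;   // cycle: both refcounts stay above zero forever
}

for ($i = 0; $i < 100000; $i++) {
    leak();           // on PHP < 5.3, memory grows on every iteration
}

// PHP 5.3+ ships a cycle collector; you can force a run with:
gc_collect_cycles();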
Go for it. I had to do it once also.
Like others said, it's not ideal but it'll get-er-done. Using Windows, right? Good.
If you only need it to run occasionally (once per hour, etc.):
Make a new shortcut to your Firefox and place it somewhere relevant.
Open up the properties for the shortcut and change "Target" to:
"C:\Program Files\Mozilla Firefox\firefox.exe" http://localhost/path/to/script.php
Go to Control Panel > Scheduled Tasks
Point your new scheduled task at the shortcut.
If you need it to run constantly or pseudo-constantly, you'll need to spice the script up a bit.
Start your script with
set_time_limit(0);
ob_implicit_flush(true);
If the script uses a loop (like while) you have to clear the buffer:
$i = 0;
while ($i < sizeof($my_array)) {
    // do stuff
    flush();
    ob_clean();
    sleep(17);
    $i++;
}