PHP exec() Throttling

I have a webpage where, if some data is missing, I run an exec() command to gather the data I need. It all works great; the issue I want to avoid is someone spamming my site with the wrong kinds of URLs and triggering this call thousands of times.
I am removing potential items in the call, but I would really like to limit the call to, say, once every 5 seconds, no matter who makes it. I.e., the first call goes through fine, but if someone else tries during that time it would not be allowed for the set amount of time.
I wouldn't mind adding a tarpit to it later (i.e., if the next call is under 5 seconds, increase the wait to 10 seconds, etc.), but for now I just want to add a safety throttle on the call.
Thanks

What you are looking for is called a "lock" or a "mutex".
In computer science, a lock or mutex (from mutual exclusion) is a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution. A lock is designed to enforce a mutual exclusion concurrency control policy.
There are a couple of libraries out there that take care of this for you and I would encourage you to use those, if this is an option.
Symfony Lock
php-lock/lock
PHP has/had a mutex implementation but I believe (if I understand things correctly) it is only for CLI applications because it came from the optional pthreads extension.
If you want to roll your own, you could literally just use a file on disk. If that file exists, you know that a process is running and other requests should abort or do something else. Once the running process is complete, you just delete that special file. You should also register an error handler and/or shutdown function that guarantees the file is deleted, just in case your exec logic dies with a fatal error.
The actual file isn't important; it could just be /tmp/my-app-lock, but it is common on *nix to put these in /var/run or similar, if your process has access to it. You could also put it in your website's folder.
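For illustration, here is a minimal sketch of that roll-your-own file lock, assuming a lock path of /tmp/my-app-lock (purely an example); fopen() in 'x' mode fails if the file already exists, which avoids a check-then-create race:
$lockFile = '/tmp/my-app-lock';

// fopen() in 'x' mode creates the file and fails if it already exists.
$handle = @fopen($lockFile, 'x');
if ($handle === false) {
    // Another request is already running the exec logic.
    exit('Busy, try again later.');
}
fclose($handle);

// Guarantee the lock file is removed even if the exec logic dies fatally.
register_shutdown_function(function () use ($lockFile) {
    @unlink($lockFile);
});

// ... run the expensive exec() call here ...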
Instead of a file, you could also use a database with the same idea: if a row exists, the resource is assumed to be locked; if it doesn't, the caller creates the row in a transaction (to guarantee that it exists). You could even use a shared key together with a unique per-process random value, which lets you double-check that you actually acquired the lock. This is built into the Symfony component, too, as one of its lock stores.
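A rough sketch of that database variant, assuming a hypothetical table locks(name VARCHAR PRIMARY KEY, token VARCHAR); the DSN, credentials and table name are made up for illustration:
// Hypothetical connection and table: locks(name VARCHAR PRIMARY KEY, token VARCHAR).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$token = bin2hex(random_bytes(16)); // unique per-process value

try {
    // The primary key guarantees that only one process can insert this row.
    $pdo->prepare('INSERT INTO locks (name, token) VALUES (?, ?)')
        ->execute(['exec-throttle', $token]);
} catch (PDOException $e) {
    exit('Busy, try again later.');
}

// Double-check that we really own the lock before doing the work.
$check = $pdo->prepare('SELECT token FROM locks WHERE name = ?');
$check->execute(['exec-throttle']);
if ($check->fetchColumn() === $token) {
    // ... run the expensive exec() call here ...
}

// Release the lock, but only if we still own it.
$pdo->prepare('DELETE FROM locks WHERE name = ? AND token = ?')
    ->execute(['exec-throttle', $token]);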
edit
Here's an example using Symfony's lock component:
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\FlockStore;
// the argument is the path of the directory where the locks are created
// if none is given, sys_get_temp_dir() is used internally.
$store = new FlockStore('/var/stores');
$factory = new LockFactory($store);
$lock = $factory->createLock('pdf-invoice-generation');
if ($lock->acquire()) {
    // The resource "pdf-invoice-generation" is locked.
    // You can compute and generate invoice safely here.

    $lock->release();
}
Also, Symfony's Lock component doesn't require the entire Symfony framework; it is standalone.
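To get the 5-second throttle from the original question, the same idea can be combined with a timestamp; here is a rough sketch (the path and interval are arbitrary, and a strict guarantee would still need one of the locks above around the check):
$stampFile = '/tmp/exec-throttle.stamp';
$minInterval = 5; // seconds between allowed exec() calls

clearstatcache();
$last = @filemtime($stampFile);
if ($last !== false && (time() - $last) < $minInterval) {
    exit('Throttled, try again in a few seconds.');
}

// Record this call and run the command.
touch($stampFile);
// exec('...');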

Related

Why are "lock" files used in PHP instead of just counting the processes?

I've seen a lot of examples where a "lock" file is used to keep track of whether a PHP script is currently running.
Example:
script starts
checks if "/tmp/lockfile" is currently locked
If it is locked, exit. If not, lock the file and continue
This way, if a long-running script is started up twice, only the first instance will run. Which is great.
However, it seems like the wrong way to go around it. Why don't we just check if the process is already running like this?
if(exec("ps -C " . basename(__FILE__) . " --no-headers | wc -l") > 1){
echo "Already running.";
exit;
}
Are there any potential pitfalls to this method? Why do I see the "lock" file workaround so often? It definitely seems more accurate to count the processes with the name we're looking for....
Based on comments here and my own observations, I've composed a list of pros and cons of both approaches:
flock method:
pros:
More compatible across operating systems
No knowledge of bash required
More common approach, lots of examples
Works even with exec() disabled
Can use multiple locks in a single file to allow different running "modes" of the same file at the same time
cons:
It's not definite. If your lock file is deleted by an external process / user, you could end up with multiple processes. If you're saving the lock file in the /tmp directory, that's a valid possibility, since everything in this directory is supposed to be "temporary"
Under certain circumstances, when a process dies unexpectedly, the file lock can be transferred to an unrelated process (I didn't believe this at first, but I found instances of it happening (although rarely) across 200+ unix based systems, in 3 different operating systems)
exec("ps -C...") method
pros:
Since you're actually counting the processes, it will work every time, regardless of the state of file locks, etc.
cons:
Only works on Linux
requires "exec" to be enabled
If you change the name of your script, it could cause double processes (so make sure your script name isn't hard-coded)
Assumes that your script only has one running "mode"
EDIT: I ended up using this:
if (exec("pgrep -x " . $scriptName . " -u ". $currentUser . " | wc -l") > 1)
{
echo $scriptName . " is already running.\n";
exit;
}
... because ps doesn't allow you to filter on the owner of the process in addition to the process name, and I wanted to allow this script to run multiple times if a different user was running it.
EDIT 2:
... So, after having that running for a few days, it's not perfect either. Somehow, the process started up multiple times on the same machine under the same user. My only guess is that there was some issue (ran out of memory, etc) that caused the pgrep to return nothing, when it should have returned something.
So that means that NEITHER the flock method NOR the process-counting method is 100% reliable. You'll have to determine which approach will work better for your project.
Ultimately, I'm using another solution that stores the PID of the current task in a "lock" file that isn't actually locked with flock. When the script starts up, it checks whether the lock file exists and, if it does, reads its contents (the PID from the last time the script started). It then checks whether that process is still running by comparing the contents of /proc/#PID#/cmdline with the name of the script that's running.
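For reference, a minimal sketch of that PID-file approach (the path and names are only examples, and /proc is Linux-specific):
$pidFile = '/tmp/myscript.pid';
$scriptName = basename(__FILE__);

if (file_exists($pidFile)) {
    $oldPid = (int) file_get_contents($pidFile);
    // /proc/<pid>/cmdline holds the command line of that process, if it still exists.
    $cmdline = @file_get_contents("/proc/$oldPid/cmdline");
    if ($cmdline !== false && strpos($cmdline, $scriptName) !== false) {
        echo $scriptName . " is already running.\n";
        exit;
    }
    // Otherwise the lock is stale: the recorded process is gone or is something else.
}

// Record our own PID and continue.
file_put_contents($pidFile, getmypid());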
The most important reasons that lock files are used, unfortunately not given in the other answers, are that lock files use a locking mechanism that is atomic, allow you to run multiple instances of your script, work at a higher-level context than the script itself, and are more secure.
Lock files are atomic
Iterating through a process list is inherently prone to race conditions; in the time it takes to retrieve and iterate through the list a second process might have just spawned, and you unintentionally end up with multiple processes.
The file locking mechanism is strictly atomic. Only one process can get an exclusive lock to a file, so when it does have a lock, there is no possibility the same command will be running twice.
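A minimal sketch of such an atomic lock with flock() (the path is just an example):
$fp = fopen('/tmp/myscript.lock', 'c'); // 'c' creates the file if needed without truncating it
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    // Another instance already holds the exclusive lock.
    exit("Already running.\n");
}

// ... do the work; the OS releases the lock if this process dies ...

flock($fp, LOCK_UN);
fclose($fp);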
Lock files allow for multiple instances of the same script
Say you want to have several separate instances of a script running, but each with its own scope/options. If you simply counted the number of times the script appears in ps output, you'd only be able to run a single instance. By using separate lock files you can properly lock each instance's running state in its own scope.
Lock files are more secure
Lock files are simply a much more elegant and secure design. Instead of abusing a process list (which requires a lot more permissions and would have to be accessed and parsed differently on each OS) to deduce if a script is already running, you have a single dedicated system to explicitly lock a state.
Summary
So to reiterate all the reasons lock files are used over other methods:
Lock files allow for atomic locking, negating race conditions.
They allow for running multiple instances/scopes of the same script.
File system access is more universally available than execute permissions or full access to process information.
They are more secure, because no permissions outside of the specific lock file are needed.
Using file locks is more compatible across different operating systems.
It is a more elegant solution; doesn't require parsing through a process list.
As for the cons mentioned in other answers (that the lock file can be deleted, and that sometimes the lock is transferred to another process):
Deletion of lock files is prevented by setting proper permissions and storing the file on non-volatile storage. Lock files are "definite" if used according to spec, unlike a process list.
It's true that a process which forks a child will "bequeath" its lock to any child processes that are still running; however, this is easily remedied by explicitly unlocking the file once the script is done, e.g. using flock --unlock.
Long story short: you should always use lock files over parsing running processes.
Other solutions
There are other solutions, though they usually offer no benefit over simple file locks unless you have additional requirements:
Mutexes in a separate database / store (e.g. redis).
Exclusively listening to a port on a network interface.
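As an illustration of the last option, binding to a fixed local port works as a lock because only one process can own the port at a time; the port number here is arbitrary:
$sock = @stream_socket_server('tcp://127.0.0.1:47123', $errno, $errstr);
if ($sock === false) {
    // The port is already taken, so another instance is running.
    exit("Already running.\n");
}

// ... do the work; the port is released automatically when the process exits ...

fclose($sock);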
First, the command is not right. When I run php test.php, the command
ps -C test.php
returns nothing, because the process name is php, not test.php. You can use ps aux | grep 'test.php' -c to get the process count, but the number returned by exec("ps -aux|grep 'php test.php' -c") must have 2 subtracted from it to give the real process count.
The main reason to use a lock file is that exec() and the other command-execution functions need special permissions and are often disabled via disable_functions in the php.ini configuration.
A test script looks like this:
$count = exec("ps -aux|grep 'php test.php' -c");
if ($count > 3) {
    echo "Already running." . $count;
    exit;
}
while (1) {
    sleep(20);
}
The main reason why "lock files" are used is simply because they can be expected to work in any host-environment. If you can "create a file," and if you can "lock it," such code will work ... anywhere.
Also – "there is nothing to be gained by being 'clever.'" This well-known strategy is known to work well – therefore, "run with it."

Reliable PHP script reentrant lock

I have to make sure a certain PHP script (started by a web request) does not run more than once simultaneously.
With binaries, it is quite easy to check if a process of a certain binary is already around.
However, a PHP script may be run through several pathways (e.g. CGI, FCGI, inside webserver modules, etc.), so I cannot use system commands to find it.
So how can I reliably check whether another instance of a certain script is currently running?
The exact same strategy is used as one would choose with local applications:
The process manages a "lock file".
You define a static location in the file system. Upon script startup you check if a lock file exists in that location; if so, you bail out. If not, you first create that lock file, then proceed. During teardown of your script you delete that lock file again. Such a lock file is a simple passive file; only its existence is of interest, often not its content. That is a standard procedure.
You can win extra candy points if you use the lock file not only as a passive semaphore, but also store the process ID of the generating process in it. That allows subsequent attempts to verify whether that process still exists or has crashed in the meantime. That makes sense, because such a crash would leave a stale lock file behind and thus create a deadlock.
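One possible sketch of that variant, storing the PID in the lock file and detecting a stale lock (this needs the posix extension; paths are examples):
$lockFile = '/tmp/myscript.lock';

if (file_exists($lockFile)) {
    $pid = (int) file_get_contents($lockFile);
    // posix_kill() with signal 0 only tests whether the process still exists.
    if ($pid > 0 && posix_kill($pid, 0)) {
        exit("Another instance (PID $pid) is running.\n");
    }
    // Otherwise the lock file is stale and we may take over.
}

file_put_contents($lockFile, getmypid());
register_shutdown_function(function () use ($lockFile) {
    @unlink($lockFile);
});

// ... do the work ...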
To work around the issue discussed in the comments, which correctly states that in some of the scenarios in which PHP scripts are used in a web environment a process ID by itself may not be enough to reliably test whether a given task has been successfully and completely processed, one could use a slightly modified setup:
The incoming request does not directly trigger the task-performing PHP script itself, but merely a wrapper script. That wrapper manages the lock file whilst delegating the actual task into a sub-request to the HTTP server. That allows the controlling wrapper script to use the additional information of the request state. If the task-performing PHP script really crashes without prior notice, then the requesting wrapper knows about it: each request is terminated with a specific HTTP status code, which allows it to decide whether the task-performing request terminated normally or not. That setup should be reliable enough for most purposes. The chance of the trivial wrapper script crashing or being terminated falls into the area of a system failure, which is something no locking strategy can reliably handle.
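A rough sketch of that wrapper setup, assuming a hypothetical real-task.php that performs the actual work and reports success via its HTTP status code:
// wrapper.php: owns the lock and delegates the real work to a sub-request.
$fp = fopen('/tmp/task-wrapper.lock', 'c');
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    http_response_code(429);
    exit("Task is already running.\n");
}

// Delegate the actual task to the task-performing script.
$ch = curl_init('http://localhost/real-task.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// The wrapper knows from the status code whether the task terminated normally.
echo $status === 200 ? "Task completed.\n" : "Task failed with HTTP status $status.\n";

flock($fp, LOCK_UN);
fclose($fp);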
As PHP does not always provide a reliable way of file locking (it depends on how the script is run, e.g. CGI, FCGI, server modules, and the configuration), some other environment for locking should be used.
The PHP script can, for example, call another PHP interpreter in its CLI variant. That provides a unique PID that can be checked for locking. The PID should then be stored in some lock file, which can be checked for a stale lock by querying whether a process using that PID is still around.
Maybe it is also possible to do all tasks needing the lock inside a shell script. Shell scripts also provide a unique PID and release it reliably after exit. A shell script may also use a unique filename that can be used to check if it is still running.
Semaphores (http://php.net/manual/de/book.sem.php) could also be used; they are explicitly managed by the PHP interpreter to reflect a script's lifetime. They seem to work quite well, but there is not much information around about how reliable they are in case of premature script death.
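A short sketch with those System V semaphores (the sem_* functions need the sysvsem extension, so this is only an illustration):
// Derive a System V IPC key from this script's path.
$key = ftok(__FILE__, 'a');
$sem = sem_get($key, 1); // at most one process may hold the semaphore

if (!sem_acquire($sem, true)) { // true = non-blocking
    exit("Already running.\n");
}

// ... do the work ...

sem_release($sem);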
Also keep in mind that external processes launched by a PHP script may continue executing even if the script ends. For example, a user abort under FCGI releases passthru processes, which carry on working even though the client connection is closed. They may be killed later if enough output accumulates, or not at all.
So such external processes have to be locked as well, which can't be done by the PHP-acquired semaphores alone.

How to avoid file deadlocks when PHP process/server crashes?

I am new to PHP. I understand I can use flock() to lock a file and avoid race conditions when two users reach the same PHP file that adds content to the lockable file.
However, what happens if a php process crashes? What happens to the next user waiting for the lockable file? What happens if the server crashes (someone pulls the plug)? Is the lock automatically released? Will the file remain locked after rebooting the server?
To make it short, does PHP make sure such critical situations (i.e., lock not explicitly released) are handled properly? If not, how should one deal with these situations? How to recover from these?
Locks are handled by the OS. Therefore:
if a process crashes, all locks it held are released (along with any other kind of resource it held)
if the system crashes, locks are meaningless because they do not "carry over" to the next reboot
PHP does not need to do anything special other than use the OS-provided mechanism for locking files, so in general you are perfectly safe.
However, if your web server setup is such that each request is not handled by a new process, then if one request is abnormally terminated (let's say a thread is aborted), the lock will persist and block all further requests for it, quickly resulting in a deadlocked web server. That's one of the many reasons you really, really should not use setups that do not provide process-level isolation among requests (disclaimer: I am not a web server expert, so I could be wrong about the "should not" part, even though I doubt it).

Close connection in PHP but keep executing script

Does anyone know how to close the connection (besides just flush()?) but keep executing some code afterwards?
I don't want the client to see the long process that may occur after the page is done.
You might want to look at pcntl_fork() -- it allows you to fork your current script and run the rest in a separate process.
I used it in a project where a user uploaded a file and then the script performed various operations on it, including communicating with a third-party server, which could take a long time. After the initial upload, the script forked and displayed the next page to the user, and the parent killed itself off. The child then continued executing, and was queried by the returned page for its status using AJAX. It made the application much more responsive, and the user got feedback on the status while it was executing.
This link has more on how to use it:
Thorough look at PHP's pcntl_fork() (Apr 2007; by Frans-Jan van Steenbeek)
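A minimal sketch of that pattern (pcntl_fork() needs the pcntl extension and is only available in the CLI/CGI SAPIs, not under mod_php; do_long_running_work() is a hypothetical stand-in):
$pid = pcntl_fork();

if ($pid === -1) {
    exit("Could not fork.\n");
}

if ($pid > 0) {
    // Parent: respond to the user immediately and exit.
    echo "Upload received, processing has started.\n";
    exit;
}

// Child: carries on with the long-running work after the parent has responded.
do_long_running_work(); // hypothetical stand-in for the slow part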
If you can't use pcntl_fork, you can always fall back to returning a page quickly that fires an AJAX request to execute more items from a queue.
mvds reminds the following (which can apply in a specific server configuration): Don't fork the entire apache webserver, but start a separate process instead. Let that process fork off a child which lives on. Look for proc_open to get full fd interaction between your php script and the process.
I don't want the client to see the long process that may occur after the page is done.
Sadly, the page isn't done until after the long process has finished, so what you ask for is impossible (to implement in the way you imply), I'm afraid.
The key here, pointed to by Jhong's answer and inversely suggested by animusen's comment, is that the whole point of what we do with HTTP as web developers is to respond to a request as quickly as possible /end - that's it, so if you're doing anything else, then it points to some design decision that could perhaps have been a little better :)
Typically, you take the additional task you are doing after returning the 'page' and hand it over to some other process; normally that means placing the task in a job queue and having a CLI daemon or a cron job pick it up and do what's needed.
The exact solution is specific to what you're doing, and the answer to a different (set of) questions; but for this one it comes down to: no, you can't close the connection, and I'd advise you to look at refactoring the long-running process out of that script/page.
Take a look at PHP's ignore_user_abort setting. You can set it using the ignore_user_abort() function.
An example of (optional) use has been given (and has been reported as working by the OP) in the following duplicate question:
close a connection early (Sep 2008)
It basically gives reference to user-notes in the PHP manual. A central one is
Connection Handling user-note #71172 (Nov 2006)
which is also the base for the following two I'd like to suggest you to look into:
Connection Handling user-note #89177 (Feb 2009)
Connection Handling user-note #93441 (Sep 2009)
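The gist of those user notes is roughly the following sketch: send the response with an explicit Content-Length, close the connection, then keep working (the exact behaviour depends on the SAPI and on any buffering in front of PHP):
ignore_user_abort(true); // keep running even if the client disconnects
set_time_limit(0);

ob_start();
echo 'Page content the client should see';
$size = ob_get_length();

header('Connection: close');
header('Content-Length: ' . $size);
ob_end_flush();
flush();

// The client has its response by now; the long process continues below.
sleep(30); // stand-in for the long-running work
On PHP-FPM, fastcgi_finish_request() achieves the same effect more reliably.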
Don't fork the entire apache webserver, but start a separate process instead. Let that process fork off a child which lives on. Look for proc_open to get full fd interaction between your php script and the process.
We solved this issue by inserting the work that needs to be done into a job queue and then having a cron script pick up the backend jobs regularly. It's probably not exactly what you need, but it works very well for data-intensive processes.
(you could also use Zend Server's job queue, if you've got a wad of cash and want a tried-and-tested solution)
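For completeness, a rough sketch of that job-queue split, with a hypothetical jobs table; the web request only records the work, and a cron-run CLI script does it later:
// Hypothetical table: jobs(id AUTO_INCREMENT, payload TEXT, status VARCHAR).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// In the web request: enqueue the work and return immediately.
$pdo->prepare('INSERT INTO jobs (payload, status) VALUES (?, ?)')
    ->execute([json_encode(['file' => 'upload.csv']), 'pending']);

// In the cron worker (a separate CLI script run every minute):
$job = $pdo->query("SELECT * FROM jobs WHERE status = 'pending' LIMIT 1")->fetch();
if ($job) {
    // ... do the heavy work for this job ...
    $pdo->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")
        ->execute([$job['id']]);
}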

PHP Architecture: How do I do that?

I need some help understanding internal workings of PHP.
Remember, in the old days, we used to write TSR (terminate and stay resident) routines (in the pre-Windows era)? Once such a program was executed, it would stay in memory and could be re-executed by some hotkey (an Alt- or Ctrl- key combination).
I want to use a similar concept in web server/applications. Say I have common_functions.php, which consists of functions (like Generate_City_Combo(), Check_Permission() or Generate_User_Permission_list()) common to all the web applications running on that Apache/PHP server.
In all the modules or applications php files, I can write:
require_once('common_functions.php');
which will include that common file in all the modules and applications and works fine.
My question is: How does php handle this internally?
Say I have:
Two applications AppOne and AppTwo.
AppOne has two menu options AppOne_Menu_PQR and AppOne_Menu_XYZ
AppTwo has two menu options AppTwo_Menu_ABC and AppTwo_Menu_DEF
All of these four menu items call functions { like Generate_City_Combo(), or Check_Permission() or Generate_User_Permission_list() } from common_functions.php
Now consider following scenarios:
A) User XXX logs in and clicks on AppOne_Menu_PQR from his personalized Dashboard then s/he follows through all the screens and instructions. This is a series of 8-10 page requests (screens) and it is interactive. After this is over, user XXX clicks on AppTwo_Menu_DEF from his personalized Dashboard and again like earlier s/he follows through all the screens and instructions (about 8-10 pages/screens). Then User XXX Logs off.
B) User XXX logs in and does whatever mentioned in scenario A. At the same time, user YYY also logs in (from some other client machine) and does similar things mentioned in scenario A.
For scenario A, it is same session. For Scenario B, there are two different sessions.
Assume that all the menu options call Generate_User_Permission_list() and Generate_Footer() or many menu options call Generate_City_Combo().
So how many times will PHP execute/include common_functions.php per page request? per session? or per PHP startup/shutdown? My understanding is common_functions.php will be executed once EVERY page request/cycle/load/screen, right? Basically once for each and every interaction.
Remember, functions like Generate_City_Combo() or Generate_Footer() produce the same output or do the same thing irrespective of who is calling or when.
I would like to restrict this to once per Application startup and shutdown.
These are just examples. My actual problem is much more complex and involved. In my applications, I would like to call Application_Startup() routines just once, which will create an ideal environment (like all lookup and reference data structures, read-only data, the security matrix, menu options, context-sensitive business execution logic, etc.). After that, requests coming to the server need not spend any time or resources creating the environment but can instantly refer to the already-created environment.
Is this something feasible in PHP? How? Could you point me to someplace or some books which explain the internal workings of PHP?
Thanks in advance.
PHP processes each HTTP request in a completely separate frame of execution - there is no persistent process running to service them all. (Your webserver is running, but each time it loads a PHP page, a separate instance of the PHP interpreter is invoked.)
If the time it takes for your desired persistent areas to be generated is significant, you may wish to consider caching the output from those scripts on disk and loading the cached version first if it is available (and not out of date).
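A small sketch of that caching idea; the path, TTL and build_environment() are illustrative assumptions:
$cacheFile = '/tmp/app-environment.cache';
$ttl = 3600; // regenerate at most once per hour

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    // Fresh enough: reuse the cached environment.
    $environment = unserialize(file_get_contents($cacheFile));
} else {
    $environment = build_environment(); // hypothetical expensive setup routine
    file_put_contents($cacheFile, serialize($environment), LOCK_EX);
}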
I would say that you are likely prematurely optimizing, but there is hope.
You very frequently want multiple copies of your compiled code in memory since you want stability per request; you don't want separate requests operating in the same memory space and running the risk of race conditions or data corruption!
That said, there are numerous PHP Accelerators out there that will pre-compile PHP code, greatly speeding up include and require calls.
PHP (in almost all cases) is page-oriented. There is no Application_Startup() that will maintain state across HTTP requests.
You can sometimes emulate this by loading/unloading serialized data from a database or $_SESSION, but there is overhead involved. There are also cases where a memcached server can optimize this as well, but you typically can't use one with typical virtual hosting services like cPanel.
If I had to build an app like you are talking about I would serialize the users choices into the session, and then save whatever needs to persist between sessions in a database.
There are several ORM modules for PHP like Doctrine which simplify object serialization to a database.
I'm necromancing here, but with the advent of pthreads it seems there may be the possibility of a stab in the direction of an actual solution for this, rather than just having to say, in effect, "No, you can't do that with PHP."
A person could basically create their own multi-threaded web server in PHP, just with the CLI tools, the socket_* functions and pthreads. Just listen on port 80, add requests to a request queue, and launch some number of worker threads to process the queue.
The number of workers could be managed based on the request queue length and the operating system's run queue length. Every few seconds, the main thread could pass through a function to manage the size of the worker pool. If the web request queue length was greater than some constant times the operating system's run queue length and the number of workers was less than a configured maximum, it could instantiate another worker thread. If the web request queue length was less than some other (lower) constant times the OS's run queue length and the number of workers was greater than a configured minimum, it could tell one of the worker threads to die when it finishes its current request. The constants and configured values could then be tuned to maximize over all throughput for the server. Something like that.
You'd have to do all your own URI parsing, and you'd have to piece together the HTTP response yourself, etc., but the worker threads could instantiate objects that extend Threaded, or reuse previously instantiated Threaded objects.
Voila - PHP Tomcat.
