I have a simple messaging queue set up and running using the Zend_Queue object hierarchy. I'm using a Zend_Queue_Adapter_Db back-end. I'm interested in using this as a job queue, to schedule things for processing at a later time. They're jobs that don't need to happen immediately, but should happen sooner rather than later.
Is there a best-practices/standard way to set up your infrastructure to run jobs? I understand the code for receiving a message from the queue, but what's not so clear to me is how to run the program that does the receiving. A cron job that receives n messages on the command line, run once a minute? A cron job that fires off multiple web requests, each web request running the receiver script? Something else?
Tangential bonus question. If I'm running other queries with Zend_Db, will the message queue queries be considered part of that transaction?
You can do it like a thread pool. Create a command line php script to handle the receiving. It should be started by a shell script that automatically restarts the process if it dies. The shell script should not start the process if it is already running (use a $pid.running file or similar). Have cron run several of these every 1-10 minutes. That should handle the receiving nicely.
I wouldn't have the cron fire a web request unless your cron is on another server for some strange reason.
Another way to use this would be to have some background process creating data, and have web users consume it as they naturally browse the site. A report generator might work this way. Company-wide reports are available to all users, but you don't want them all generating this db/time-intensive report. So you create a queue and process one at a time, possibly removing duplicates. All users can view the report(s) when ready.
According to the docs, it doesn't look like the Zend_Queue DB adapter is even using the same connection as your other Zend_Db queries. But of course the best way to find out is to make a simple test.
EDIT
The multiple lines in the cron are for concurrency. Each line represents a worker for the pool. I was not clear: you don't want the PID as the identifier; you want to pass the identifier as a parameter.
/home/byron/run_queue.sh Process1
/home/byron/run_queue.sh Process2
/home/byron/run_queue.sh Process3
The bash script checks for the $process.running file and exits if it finds it.
Otherwise:
Create the $process.running file.
Start the PHP process. Block/wait until it finishes.
Delete the $process.running file.
This allows the PHP script to die without the pool losing a worker.
If the queue is empty, the PHP script exits immediately and is started again by the next invocation of cron.
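The wrapper steps above could be sketched roughly like this. It is only an illustration under a few assumptions: LOCK_DIR and the run_worker name are made up here, and the marker file is created with the shell's noclobber option so that two cron invocations cannot both claim the same worker slot.

```shell
#!/bin/sh
# LOCK_DIR is a hypothetical location for the "$process.running" marker files.
LOCK_DIR="${LOCK_DIR:-/tmp}"

# run_worker NAME CMD...: run CMD as pool worker NAME unless NAME is already running.
run_worker() {
    name="$1"; shift
    lock="$LOCK_DIR/$name.running"

    # With noclobber (set -C), the redirection fails if the file already exists,
    # so checking for the marker and creating it happen in one step.
    if ! ( set -C; : > "$lock" ) 2>/dev/null; then
        return 0    # this worker is already running; nothing to do
    fi

    "$@"            # block until the worker exits, whether cleanly or by crashing
    status=$?

    rm -f "$lock"   # free the slot so the next cron run can start the worker again
    return "$status"
}
```

run_queue.sh would then end with something like run_worker "$1" php /path/to/receiver.php (the path is a placeholder), and each cron line passes a different worker name. One caveat: if the wrapper itself is killed, for example by a reboot, the marker file is left behind and has to be removed by hand or at boot.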
Related
First of all sorry to post a question that seems to have been flogged to death on SO before. However, none of the questions I have reviewed helped me to solve my specific problem.
I have built a web application that runs an extensive data processing routine in PHP (i.e. MySQL queries, calculations, etc.).
Depending on the amount of data fed to the app this processing can take quite a long time so the script needs to run server-side and independently from the web front-end.
There is a problem, however. It seems I cannot control the script execution time limit as long as the script is invoked via cgi.
When I run the script via SSH and the command line it works fine for however long it takes to process the data.
But if I use the exec() command in a PHP script called via the webserver, I always end up with the error End of script output before headers after approximately 45 seconds.
Rather than having to fiddle with server settings (a nightmare in terms of portability) I would like to find a solution that kicks off the script independently from cgi.
Any suggestions?
Don't execute the long script directly from the website (AKA, directly from Apache) because, as you've mentioned, it will block until it finishes and potentially time out. Instead, use the website to schedule a job (an execution of the long script) to be run immediately.
Here is a basic outline of how you can potentially do this:
Create a new, small database to store job requests, including the fields job_id, processing_status, run_start_time, and any other relevant fields.
Create some Ajax that hits your server and writes a "job request" to this jobs database, set to execute immediately.
Add a crontab script or bot that periodically watches for new jobs. If it finds a job that is yet to be processed but has passed the run_start_time, run it using exec() or some other command executor. This way the command won't time out because it is not being run by Apache, but by the cron daemon.
When the command finishes, update the jobs database saying that processing is finished.
From your website, write a frontend that allows the user to see if the requested job is finished yet. Once it finishes, it displays some kind of "Done" indicator or something similar.
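As a rough illustration of the cron-driven watcher described above, here is a sketch that deliberately swaps the jobs database for a spool directory so it has no dependencies. Everything named here (JOB_DIR, process_pending, the pending/running/done/failed subdirectories) is hypothetical, and the run_start_time check is left out; with a real jobs table the same claim, run, and mark-finished steps would be UPDATE statements instead of file moves.

```shell
#!/bin/sh
# JOB_DIR is a hypothetical spool directory standing in for the jobs table.
JOB_DIR="${JOB_DIR:-/var/spool/myjobs}"

# process_pending HANDLER: run HANDLER on every pending job file and record the outcome.
process_pending() {
    handler="$1"
    mkdir -p "$JOB_DIR/pending" "$JOB_DIR/running" "$JOB_DIR/done" "$JOB_DIR/failed"

    for job in "$JOB_DIR"/pending/*; do
        [ -e "$job" ] || continue      # the glob matched nothing
        id=$(basename "$job")

        # A rename within one filesystem is atomic, so if two pollers overlap,
        # only one of them manages to claim the job.
        mv "$job" "$JOB_DIR/running/$id" 2>/dev/null || continue

        # Record the result, like updating processing_status in the database.
        if "$handler" "$JOB_DIR/running/$id"; then
            mv "$JOB_DIR/running/$id" "$JOB_DIR/done/$id"
        else
            mv "$JOB_DIR/running/$id" "$JOB_DIR/failed/$id"
        fi
    done
}
```

Cron would call process_pending with the long-running script as the handler, and the web frontend would only need to look at which directory (or status column) a job is in.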
I have a PHP script that processes my email subscriptions.
It does something like:
foreach email to be sent:
mailer->send-email
print "Email sent to whoever."
I'm now encountering rate-limiting by my web host. The mailing library has a built-in throttler that will sleep to ensure I stay under the rate. However, this could result in the web page taking multiple hours to actually load.
Will the client side browser ever give up on the page loading? Any suggested better solutions to this?
Why is this being done on a webpage load? This should be an off-line back-end process which is scheduled to run. (Look into cron for scheduling tasks.)
Any long running process should be delegated to a back-end service to handle that process. Application interfaces (such as a web page) should respond back to the user as quickly as possible instead of forcing the user to wait (for upwards of an hour?) for a response.
The application can track progress, usually by means of some shared data source (a simple database, for example), of the back-end process and present that progress to the user. That's fine. But the process itself should happen outside of the application.
For example, at a high level...
Have a PHP script scheduled to run to process the emails.
When the script starts, save a record to a database indicating that it's started.
Each time the script reaches a milestone of some kind, update the database record to indicate this.
When the script finishes, update the database record to indicate this.
Have a web application which checks for that database record and shows the user the current status of the back-end process.
You may not care, but even if you coerce this script into staying alive, you shouldn't purposely run a long-running script through the webserver. Web servers use resource-heavy threads or processes to run your script, and they have a finite number of them available to serve web requests. A long-running script basically takes one of them out of the pool of processes that can be used to serve web visitors.
Instead, use a cron job which executes the php binary directly. Specifically, do not use wget or lynx or any other browser-like program as part of the cron job, because those methods run the script through the webserver. The cron command should include something like
php /full/path/to/the/script.php
I'm trying to create a webpage that will allow me to start and stop a PHP script. The script is part of the backend of a site, and will need to access, read data from, process that data, and update a database on the same server the script exists on. Additionally, I'd like the webpage to allow an administrator to view the current status (running or stopped), and even link to logs created by the script.
I'm starting to go down the path of learning about PHP's exec, passthru, and related functions. Is this the correct path to take? Are there other ways to do this that would be more suitable? Is it possible to do this in a platform agnostic way? I'm developing on a LAMPP+CakePHP stack in Windows, and would like the functionality to exist on any webhost I choose.
I've done this in a recent job but it's probably overkill for you. I did a job processor and it basically sets 2 tables in the database, 2 objects at a minimum and 2 controllers at a minimum.
The first part is the job processing unit. It is composed of a job processor controller that manages the request to start or continue a job, and it comes with two activerow models, JobQueue and Job. You can remove the queue, but it's always practical to have queuing in such systems so you can say that 2, 3, or 4 jobs can execute at once.
The queue is only that, it's a slot that gets several jobs attached to it and it has a queue status to determine if it is running right now or not.
The job is a virtual object that maps to a job table describing what has to be done. In my implementation, I created an interface that must be implemented by the called controller, plus a field and a type in the database. The Job instantiates the controller class to call (not the job processor controller, but another controller that manages the operation to do) and calls a method on it to start the task processing.
Now, to get tricky, I forced my system to run on a dedicated server just for that portion because I didn't want the task to load the main server or jam the processing queue of Apache. So I had two servers, and my Queue class was in charge of calling, via an IP address, a page on the other server to run the job on that server specifically. When the job was done, it called itself back using an HTTP request to restart processing and do the next task. If no task was left, it would simply die normally.
The advantage of doing it this way is that it doesn't require a cron job (as long as your script is super stable and cannot crash): you trigger it when you want, then you can let it go, and it calls itself back with fsockopen to trigger another page view that runs the next job.
Work units
It is important to understand that if your jobs are very large, you should segment them. I used the principle of a "work unit" to describe one part of the job that has to be done any number of times. The Queue Processor then became a time manager too: if it detected that a job had taken more than X seconds, it would defer the rest of the steps, call itself back, and continue where it left off. That way, you don't need set_time_limit() and you don't jam your server while a 30s script gets executed.
I hope this helps!
To run a script continuously, keep the following in mind:
Your PHP script should be launched from the CLI (command line) by a job scheduler like cron or something similar. Don't forget that your web server configuration defines a timeout on executed scripts.
To run 24 hours a day, you might think of implementing an infinite loop. In that case, you can write a check such as jobIsActive that reads from a file or the database, on every iteration, whether the job should run. Clicking a button then just changes the job status (update the file, the db ...). Your two buttons can pause or resume the processing, but they don't stop the infinite loop itself.
An infinite loop isn't the most elegant solution, though. Why not add a crontab entry that executes the job each night, and let a click on a button fire it manually?
Recently I've been researching the use of Beanstalkd with PHP. I've learned quite a bit but have a few questions about the setup on a server, etc.
Here is how I see it working:
1) I install Beanstalkd and any dependencies (such as libevent) on my Ubuntu server. I then start the Beanstalkd daemon (which should basically run at all times).
2) Somewhere in my website (such as when a user performs some actions, etc.) tasks get added to various tubes within the Beanstalkd queue.
3) I have a bash script (such as the following one) that is run as a daemon and basically executes a PHP script.
#!/bin/sh
php worker.php
4) The worker script would have something like this to execute the queued up tasks:
while(1) {
    $job = $this->pheanstalk->watch('test')->ignore('default')->reserve();
    $job_encoded = json_decode($job->getData(), false);
    $done_jobs[] = $job_encoded;
    $this->log('job:'.print_r($job_encoded, 1));
    $this->pheanstalk->delete($job);
}
Now here are my questions based on the above setup (correct me if I'm wrong about any of it):
Say I have the task of importing an RSS feed into a database or something. If 10 users do this at once, they'll all be queued up in the "test" tube. However, they'd then only be executed one at a time. Would it be better to have 10 different tubes all executing at the same time?
If I do need more tubes, does that then also mean that I'd need 10 worker scripts? One for each tube, all running concurrently with basically the same code except for the string literal in the watch() function.
If I run that script as a daemon, how does that work? Will it constantly be executing the worker.php script? That script loops until the queue is empty theoretically, so shouldn't it only be kicked off once? How does the daemon decide how often to execute worker.php? Is that just a setting?
Thanks!
If the worker isn't taking too long to fetch the feed, it will be fine. You can run multiple workers if required to process more than one at a time. I've got a system (currently using Amazon SQS, but I've done similar with BeanstalkD before), with up to 200 (or more) workers pulling from the queue.
A single worker script (the same script running multiple times) should be fine - the script can watch multiple tubes at the same time, and the first job available will be reserved. You can also use the stats-job command to see which tube a particular $job came from, or put some meta-information into the message if you need to tell one type from another.
A good example of running a worker is described here. I've also added supervisord (also, a useful post to get started) to easily start and keep running a number of workers per machine (I run shell scripts, as in the first link). I would limit the number of times it loops, and also pass a timeout to reserve() so that it waits a few seconds or more for the next job to become available, rather than spinning out of control in a tight loop that never pauses even when there is nothing to do.
Addendum:
The shell script would be run as many times as you need. (The link shows how to have it re-run as required with exec $0 $@.) Whenever the PHP script exits, it re-runs the PHP.
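The linked script is not reproduced here, so the following is only a guess at the general shape: a wrapper that starts the worker again each time it exits. It uses a plain loop rather than re-executing itself, and restart_loop, the run cap, and RESTART_DELAY are names invented for this sketch.

```shell
#!/bin/sh
# restart_loop MAX CMD...: run CMD again every time it exits.
# MAX caps the number of runs (0 means no cap); it is a hypothetical knob,
# mostly useful for trying the loop out without it running forever.
restart_loop() {
    max="$1"; shift
    runs=0
    while [ "$max" -eq 0 ] || [ "$runs" -lt "$max" ]; do
        "$@"                              # the worker; returns when it exits or crashes
        runs=$((runs + 1))
        sleep "${RESTART_DELAY:-1}"       # short pause so a broken worker cannot spin
    done
}
```

Something like restart_loop 0 php worker.php would keep one worker alive indefinitely. Running the wrapper several times (or under supervisord) gives several workers.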
Apparently there's a Django app to show some stats, but it's trivial enough to connect to the daemon, get a list of tubes, and then get the stats for each tube - or just counts.
I have a website written in PHP (CakePHP) where certain resource intensive tasks are handled by a background process. This is done through the Beanstalkd message queue. I need some way to retrieve the status of that background process so I can monitor it with Monit.
The background process is a CakePHP Shell (just a PHP CLI script) that communicates with Beanstalkd. It simply does a reserve() on Beanstalkd and waits for a new message. When it gets a message, it processes it. I want some way of monitoring this process with Monit so that it can restart the background process if something has gone wrong.
What I have been thinking about so far is writing a PHP CLI script that drops a message in Beanstalkd. The background process picks up the message and somehow communicates its internal status back to the CLI script. But how? Sockets? Shared memory? Some other IPC method?
Or am I perhaps being too complicated here and is there a much easier way to monitor such a process with Monit?
Thanks in advance!
Here's what I ended up doing.
The CLI script connects to beanstalkd, creates a new queue (tube) and starts watching it. Then it drops a highest priority message in the queue that the background daemon is watching. That message contains the name of the new queue that the CLI script is monitoring.
The background process receives this message almost immediately (because it is highest priority), generates a status message and puts it in the queue that the CLI script is watching. The CLI script receives it and then closes the queue.
When the CLI script does not get a response in 30 seconds it will exit with an error indicating the background daemon is (most likely) hung.
I tied all this into Monit. Monit can now check that the background daemon is running (via the pidfile and process list) and verify that it is actually still processing messages (by using the CLI tool to test that it responds to status requests)
There probably is a plugin for Monit or Nagios to connect, run the stats and report if there are 'too many'. There isn't a 'protocol' written already for that, but it doesn't appear to be exceedingly difficult to modify an existing text-based one (like nntp or smtp) to do what you want. It does mean writing it in C though, by the looks of it.
From a CLI-PHP script, I would go about it through one (or both) of two different methods.
1/ drop a (low-ish) priority message into the queue, and make sure it comes back within a few seconds. Putting it into a dedicated queue and making sure there's nothing there before you put it in there would be a good addition as well.
2/ perform a 'stats' and see how many are waiting: 'current-jobs-ready'.
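For the second method, a check does not necessarily need PHP at all, since beanstalkd speaks a plain text protocol. A small sketch, assuming the stats reply is piped in on stdin; backlog_ok and its limit argument are made-up names, and how you fetch the reply (nc below) depends on which tools your system has.

```shell
#!/bin/sh
# backlog_ok LIMIT: read a beanstalkd "stats" reply on stdin and succeed only if
# current-jobs-ready is present and no larger than LIMIT.
backlog_ok() {
    limit="$1"
    # The reply is a YAML-style list of "name: value" lines; strip any CRs first.
    ready=$(tr -d '\r' | awk -F': *' '$1 == "current-jobs-ready" { print $2 }')
    [ -n "$ready" ] && [ "$ready" -le "$limit" ]
}
```

It could be fed with something like printf 'stats\r\nquit\r\n' | nc localhost 11300 | backlog_ok 100, assuming beanstalkd is on its default port and your nc keeps reading until the server closes the connection. The exit status is then what a Monit or Nagios check would act on.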
To get the information back to a website (either way), you can write to a file, or into something like Memcached, which gets read and acted upon.