beanstalkd - what happens to reserved, but not completed jobs? - php

I've created a PHP script that reads from beanstalkd and processes the jobs. No problems there.
The last thing I've got to do is just write an init script for it, so it can run as a service.
However, this has raised another question for me. When stopping the service, the obvious approach would be to kill the process. But if I do that while the PHP script is halfway through processing a job, what happens to that job? It was reserved, but the script never got to delete it (on success) or bury it (on failure), so what becomes of it?
My guess is that the TTR will expire, and then it gets put back to the ready queue?
And bonus 2nd question, any hints on how to better manage stopping the PHP service?

When a worker process (beanstalk client) opens up a connection with beanstalkd and reserves a job, the job will be in "reserved" state until the client issues delete/release command (or) job times out.
If the worker process terminates abruptly, its connection with beanstalkd is closed and the server immediately releases all jobs that were reserved on that particular connection.
Ref: http://groups.google.com/group/beanstalk-talk/browse_thread/thread/232d0cac5bebe30f?hide_quotes=no#msg_efa0109e7af4672e

Any job that runs out of time, and is not buried or touched goes back into the ready queue to be reserved.
I've posted elsewhere about using Supervisord and shell scripts to run workers. The advantage is that, most of the time, you probably don't mind waiting a little while for jobs to finish cleanly. You can have supervisord kill the bash scripts that run the worker scripts; when a worker script itself finishes, it simply exits, since there is nothing left to restart it.
Another way is to put a highest-priority (0) message into a tube that the workers listen to; a worker that receives it first deletes the message and then exits. I also set up the shell scripts to check for a specific return value (from exit($val);) and exit their own loop when they see it.
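A rough sketch of that exit-message idea (the tube name, message body and exit code are made up for the example, and the Pheanstalk client is assumed):

<?php
// Worker side of the "highest-priority exit message" approach described above.
// The 'test' tube, the 'exit' body and the exit code 97 are arbitrary choices.
require 'vendor/autoload.php';

$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');

while (true) {
    $job = $pheanstalk->watch('test')->ignore('default')->reserve();

    if ($job->getData() === 'exit') {
        $pheanstalk->delete($job);  // remove the control message so it stops only this worker
        exit(97);                   // the special return value the wrapper shell script checks for
    }

    // ... normal job processing ...
    $pheanstalk->delete($job);
}

// Elsewhere, to stop a worker, put the control message in at priority 0 (the highest)
// so it jumps ahead of any ordinary jobs:
//   $pheanstalk->useTube('test')->put('exit', 0);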
I've used these techniques for Beanstalkd and also AWS:SQS queue runners for some time, dealing with millions of jobs per day running through the system.

If your job is too valuable to lose, you can also use pcntl to catch termination signals, wait until the current job finishes, and then restart or shut down your worker. I've managed to handle all suitable pcntl signals to release the job back to the tube.
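One possible shape of that signal handling, assuming the Pheanstalk client (the tube name and bootstrap are illustrative, and Pheanstalk's API differs a little between versions):

<?php
// Sketch: catch SIGTERM/SIGINT, finish the job in hand, then exit. On failure
// the job is released straight back into the ready queue rather than waiting
// for the TTR to expire.
require 'vendor/autoload.php';

$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');
$shutdown   = false;

pcntl_async_signals(true); // PHP 7.1+; on older PHP use declare(ticks=1)
$stop = function () use (&$shutdown) { $shutdown = true; };
pcntl_signal(SIGTERM, $stop);
pcntl_signal(SIGINT, $stop);

while (!$shutdown) {
    $job = $pheanstalk->watch('test')->ignore('default')->reserve();

    try {
        // ... process the job ...
        $pheanstalk->delete($job);   // finished cleanly
    } catch (Exception $e) {
        $pheanstalk->release($job);  // put it back in the ready queue immediately
    }
    // If a signal arrived while working, $shutdown is now true and the loop
    // ends only after the current job has been dealt with.
}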

Related

RabbitMQ basic_get with multiple consumers

I'm moving some resource-intensive functionality that currently runs on a cron to a RabbitMQ queue. I'm wary of having long-running PHP consumer scripts, so I'm thinking of doing the following:
Jobs are added to the queue at the start of the day.
A cron runs a command which starts a consumer.
The consumer uses basic_get to get a job, processes the job, acknowledges the job and then exits.
The cron runs again and the next job is processed.
I have a couple of questions around how well this will work.
If I decide to fire up 2 workers via the cron (running the command twice) and the first job is still being processed, and hasn't been acknowledged, would RabbitMQ ever send the same job to the second worker?
I've noticed that basic_consume will be more performant since there's no round trip when receiving each job. Is it possible to use basic_consume rather than basic_get without having to worry about the workers being left to run for too long?
The first part:
No, it would not. That would happen only if the first consumer dies without acking the message; then the message gets requeued and the next consumer gets it.
The second part:
You should use basic_consume because it's faster, asynchronous and generally better. Which retrieval method you use has nothing to do with how long the consumers run.
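For what it's worth, here is a rough sketch of a consumer that uses basic_consume but still exits after a bounded amount of work, assuming php-amqplib (the queue name, batch limit and credentials are invented for the example):

<?php
// Sketch only: a basic_consume consumer that processes a fixed number of
// messages and then exits, so it never becomes a long-lived daemon.
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Exception\AMQPTimeoutException;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();

$processed = 0;
$limit     = 50;                      // stop after this many messages

$channel->basic_qos(null, 1, null);   // one unacked message at a time

$channel->basic_consume('jobs', '', false, false, false, false,
    function (AMQPMessage $msg) use (&$processed, $channel) {
        // ... do the actual work here ...
        $channel->basic_ack($msg->delivery_info['delivery_tag']);
        $processed++;
    }
);

while ($processed < $limit) {
    try {
        $channel->wait(null, false, 30);  // 30 s idle timeout so an empty queue doesn't block forever
    } catch (AMQPTimeoutException $e) {
        break;                            // nothing arrived for a while; exit cleanly
    }
}

$channel->close();
$connection->close();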

How can I create, monitor and run/stop a process in linux

I have a cron job that runs every minute, but it queues up every request from the past minute and then executes the tasks one after another. I want to run a background process that runs indefinitely instead: it will check whether any new request has come in and process it immediately.
do {
    // do my stuff
} while (true);
I need to know how to check whether that process is running: if it isn't, start it; otherwise do nothing.
FYI - I'm not a Linux guy and don't know anything about bash or shell. I need PHP code I can add to the every-minute cron that will just check whether this process is running.
What you are looking for is a service supervisor and/or a watchdog. You can use D. J. Bernstein's daemontools or similar software.
Also, if you want to do it in PHP, you could have the start-up part of your daemon (which is what you are building) raise a flag (write a file), and then have a cron job run another program every N minutes to check whether the flag is raised (the file exists).
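A minimal sketch of that check in PHP, run from the every-minute cron (the pid-file path and worker command are placeholders, and it assumes the POSIX extension is available for posix_kill):

<?php
// Sketch: run from cron every minute. It checks a pid file written below and
// restarts the worker if the process is no longer running.
$pidFile = '/var/run/my_worker.pid';

$running = false;
if (is_file($pidFile)) {
    $pid = (int) trim(file_get_contents($pidFile));
    // posix_kill with signal 0 sends nothing; it only tests whether the
    // process exists and can be signalled.
    $running = $pid > 0 && posix_kill($pid, 0);
}

if (!$running) {
    // Start the worker in the background and record its pid.
    $pid = (int) shell_exec('nohup php /path/to/worker.php > /dev/null 2>&1 & echo $!');
    file_put_contents($pidFile, $pid . PHP_EOL);
}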

Running Cron jobs in parallel (PHP)

In the past, I ran a bunch of scripts each as a separate cron job. Now I'd like to run a controller script with one cron job, then have that call the scripts separately (and in parallel, all at the same time), so I don't have to create a new cron job every time I add another script.
I looked up pcntl_fork() but we don't have that installed. Can fsockopen() do this as well?
A few questions:
I saw this example, http://phplens.com/phpeverywhere/?q=node/view/254, that uses fsockopen(). Will this allow me to run PHP scripts in parallel? Note, the scripts don't interact, but I would still like to know if any of them exited prematurely with an error.
Secondly, the scripts I'm running aren't externally accessible; they are internal only. The script was previously run like so: php -f /path/to/my/script1.php. It's not a web-accessible path. Would the example in #1 work with this, or only with web-accessible paths?
Thanks for any advice you can offer.
You can use proc_open to run multiple processes without waiting for each process to finish.
You will have a process handle, you can terminate each process at any time and you can read the standard output of each process.
You can also communicate via pipes, which is optional.
Passing something like php /your/path/to/script.php param1 "param2 x" as the first parameter starts a separate PHP process.
proc_open (see Example #1)
Ultimately you will want an infinite while loop plus usleep (or sleep) to avoid maxing out the CPU. Break when all processes have finished, or after you've killed them.
Edit: you can also tell whether a process has exited prematurely.
Edit 2: a simpler way of doing the above is popen().
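As a rough illustration of the approach (the script paths are placeholders), a controller that launches several PHP scripts in parallel and reports any that exit with an error could look like this:

<?php
// Sketch only: launch several PHP scripts in parallel with proc_open, then
// poll until they all finish and report non-zero exit codes.
$scripts = [
    '/path/to/my/script1.php',
    '/path/to/my/script2.php',
];

$spec = [
    1 => ['pipe', 'w'],   // child stdout
    2 => ['pipe', 'w'],   // child stderr
];

$procs = [];
foreach ($scripts as $script) {
    $pipes = [];
    $procs[$script] = proc_open('php -f ' . escapeshellarg($script), $spec, $pipes);
    // In real code you would keep $pipes and drain them, otherwise a child
    // that writes a lot of output can block on a full pipe buffer.
}

// Poll until every child has exited, sleeping so we don't max out the CPU.
do {
    $stillRunning = false;
    foreach ($procs as $script => $proc) {
        if ($proc === null || $proc === false) {
            continue;                 // already finished (or failed to start)
        }
        $status = proc_get_status($proc);
        if ($status['running']) {
            $stillRunning = true;
        } else {
            if ($status['exitcode'] != 0) {
                echo "$script exited prematurely with code {$status['exitcode']}\n";
            }
            proc_close($proc);
            $procs[$script] = null;   // mark as finished
        }
    }
    usleep(200000);                   // 0.2 seconds
} while ($stillRunning);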
Please correct me if I'm wrong, but if I understand things correctly, the solution Tiberiu-Ionut Stan proposed implies that starting the processes with proc_open and waiting for them to finish will not be run as a cron script, but is part of a running program/service, right?
As far as I understand the cron jobs, the controller script user920050 was thinking of using would be started by cron on a schedule and each new instance would launch the processes all over again, do the waiting for them to finish and probably run in parallel with other cron-launched instances of the controller script.

How to set up Beanstalkd with PHP

Recently I've been researching the use of Beanstalkd with PHP. I've learned quite a bit but have a few questions about the setup on a server, etc.
Here is how I see it working:
1) I install Beanstalkd and any dependencies (such as libevent) on my Ubuntu server. I then start the Beanstalkd daemon (which should basically run at all times).
2) Somewhere in my website (such as when a user performs some actions, etc.) tasks get added to various tubes within the Beanstalkd queue.
3) I have a bash script (such as the following one) that is run as a daemon and basically executes a PHP script:
#!/bin/sh
php worker.php
4) The worker script would have something like this to execute the queued up tasks:
while (1) {
    $job = $this->pheanstalk->watch('test')->ignore('default')->reserve();
    $job_encoded = json_decode($job->getData(), false);
    $done_jobs[] = $job_encoded;
    $this->log('job:'.print_r($job_encoded, 1));
    $this->pheanstalk->delete($job);
}
Now here are my questions based on the above setup (which correct me if I'm wrong about that):
Say I have the task of importing an RSS feed into a database or something. If 10 users do this at once, they'll all be queued up in the "test" tube. However, they'd then only be executed one at a time. Would it be better to have 10 different tubes all executing at the same time?
If I do need more tubes, does that then also mean that I'd need 10 worker scripts? One for each tube, all running concurrently with basically the same code except for the string literal in the watch() function?
If I run that script as a daemon, how does that work? Will it constantly be executing the worker.php script? That script loops until the queue is empty theoretically, so shouldn't it only be kicked off once? How does the daemon decide how often to execute worker.php? Is that just a setting?
Thanks!
If the worker isn't taking too long to fetch the feed, it will be fine. You can run multiple workers if required to process more than one at a time. I've got a system (currently using Amazon SQS, but I've done similar with BeanstalkD before), with up to 200 (or more) workers pulling from the queue.
A single worker script (the same script running multiple times) should be fine - the script can watch multiple tubes at the same time, and the first one available will be reserved. You can also use the job-stat command to see where a particular $job came from (which tube), or put some meta-information into the message if you need to tell each type from another.
A good example of running a worker is described here. I've also added supervisord (also, a useful post to get started) to easily start and keep a number of workers running per machine (I run shell scripts, as in the first link). I would limit the number of times the loop runs, and also pass a timeout to reserve() so it waits a few seconds, or more, for the next job to become available instead of spinning out of control in a tight loop that never pauses, even when there is nothing to do.
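A rough sketch of that bounded loop, assuming the Pheanstalk client from the question (the loop limit, timeout and tube name are arbitrary, and the timeout argument to reserve() is version-dependent - newer Pheanstalk releases call it reserveWithTimeout()):

<?php
// Sketch: a worker that waits a few seconds for each job and gives up after a
// fixed number of iterations, so the wrapper shell script can restart it.
require 'vendor/autoload.php';

$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');
$pheanstalk->watch('test')->ignore('default');

for ($i = 0; $i < 100; $i++) {
    $job = $pheanstalk->reserve(5);   // block for at most 5 seconds
    if ($job === false) {
        continue;                     // nothing ready yet; loop (or exit) without busy-waiting
    }

    // ... process the job ...
    $pheanstalk->delete($job);
}
// Falling out of the loop ends the process; the wrapper script starts a fresh one.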
Addendum:
The shell script would be run as many times as you need (the link shows how to have it re-run as required with exec $#). Whenever the PHP script exits, the shell script runs it again.
Apparently there's a Django app to show some stats, but it's trivial enough to connect to the daemon, get a list of tubes, and then get the stats for each tube - or just counts.
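With the Pheanstalk client that amounts to something like the sketch below (listTubes()/statsTube() are the method names I'd expect; check them against the version you install):

<?php
// Sketch: dump per-tube counts straight from the daemon. The stat names are
// the standard beanstalkd protocol stats.
require 'vendor/autoload.php';

$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');

foreach ($pheanstalk->listTubes() as $tube) {
    $stats = $pheanstalk->statsTube($tube);
    printf(
        "%s: ready=%d reserved=%d delayed=%d buried=%d\n",
        $tube,
        $stats['current-jobs-ready'],
        $stats['current-jobs-reserved'],
        $stats['current-jobs-delayed'],
        $stats['current-jobs-buried']
    );
}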

Infrastructure for Running your Zend Queue Receiver

I have a simple messaging queue set up and running using the Zend_Queue object hierarchy. I'm using a Zend_Queue_Adapter_Db back-end. I'm interested in using this as a job queue, to schedule things for processing at a later time. They're jobs that don't need to happen immediately, but should happen sooner rather than later.
Is there a best-practices/standard way to set up your infrastructure to run jobs? I understand the code for receiving a message from the queue, but what's not so clear to me is how to run the program that does the receiving. A cron that receives n messages on the command line, run once a minute? A cron that fires off multiple web requests, each web request running the receiver script? Something else?
Tangential bonus question. If I'm running other queries with Zend_Db, will the message queue queries be considered part of that transaction?
You can do it like a thread pool. Create a command line php script to handle the receiving. It should be started by a shell script that automatically restarts the process if it dies. The shell script should not start the process if it is already running (use a $pid.running file or similar). Have cron run several of these every 1-10 minutes. That should handle the receiving nicely.
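A minimal sketch of that command-line receiver, assuming Zend_Queue with the Db adapter as in the question (the connection options and batch size are placeholders):

<?php
// Sketch: the command-line receiver described above, using Zend_Queue with
// the Db adapter. Driver options and the batch size are illustrative only.
require_once 'Zend/Queue.php';

$queue = new Zend_Queue('Db', array(
    'name'          => 'jobs',
    'driverOptions' => array(
        'host'     => 'localhost',
        'username' => 'queue_user',
        'password' => 'secret',
        'dbname'   => 'queue_db',
        'type'     => 'pdo_mysql',
    ),
));

// Pull up to 50 messages, handle each one, and exit; cron and the shell
// wrapper take care of starting the next worker.
foreach ($queue->receive(50) as $message) {
    // ... process $message->body ...
    $queue->deleteMessage($message);
}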
I wouldn't have the cron fire a web request unless your cron is on another server for some strange reason.
Another way to use this would be to have some background process creating data, and web users consume it as they naturally browse the site. A report generator might work this way. Company-wide reports are available to all users, but you don't want them all generating this db/time-intensive report, so you create a queue and process the requests one at a time, possibly removing duplicates. All users can view the report(s) when ready.
According to the docs, it doesn't look like Zend_Queue even uses the same connection as your other Zend_Db queries. But of course the best way to find out is to run a simple test.
EDIT
The multiple lines in the cron are for concurrency; each line represents a worker in the pool. I was not clear earlier: you don't want the pid as the identifier, you want to pass an identifier as a parameter.
/home/byron/run_queue.sh Process1
/home/byron/run_queue.sh Process2
/home/byron/run_queue.sh Process3
The bash script would check for the $process.running file; if it finds it, it exits.
Otherwise it would:
Create the $process.running file.
Start the php process and block/wait until it finishes.
Delete the $process.running file.
This allows the php script to die without the pool losing a worker.
If the queue is empty, the php script exits immediately and is started again by the next invocation of cron.
