I have implemented a command in my Symfony setup which grabs a job from the DB and then processes it.
How can I run multiple instances of the command at once, to get through the jobs more quickly? I know that multithreading is not supported in PHP, but since the command is called from the shell, I was wondering if there is a workaround.
I call the command with:
app/console job:process
The way I would solve this is to use a work queue with multiple workers. It's easier to manage and scale than manually running multiple processes and worrying about concurrency.
The simplest general-purpose queue I've found for working with PHP/Symfony is beanstalkd, which you can integrate into Symfony2 with the LeezyPheanstalkBundle.
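To give a feel for it, the producer/consumer round-trip with Pheanstalk (the client that bundle wraps) looks roughly like the sketch below; the tube name and payload are made up, and the exact calls differ slightly between Pheanstalk versions:

<?php
// Producer side, e.g. wherever the job row is created.
// Assumes Pheanstalk 4.x; older versions use new Pheanstalk('127.0.0.1') instead.
use Pheanstalk\Pheanstalk;

$pheanstalk = Pheanstalk::create('127.0.0.1');
$pheanstalk->useTube('jobs');
$pheanstalk->put(json_encode(['job_id' => 42]));   // hypothetical payload

// Worker side: a long-running console command, one per worker process.
$worker = Pheanstalk::create('127.0.0.1');
$worker->watch('jobs');
$worker->ignore('default');

while (true) {
    $job = $worker->reserve();                      // blocks until a job is available
    $data = json_decode($job->getData(), true);
    // ... load the job from the DB by $data['job_id'] and process it ...
    $worker->delete($job);                          // remove it once it has been handled
}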
In general, I'd suggest using the enqueue library. You can choose from a variety of available transports, from the simplest like filesystem and Doctrine DBAL to real ones like RabbitMQ and Amazon SQS.
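As a rough illustration only (paths and names are placeholders, not taken from the enqueue docs verbatim), the filesystem transport lets you try the producer/consumer split without any extra service:

<?php
use Enqueue\Fs\FsConnectionFactory;

// Producer: push a job onto a filesystem-backed queue (directory is a placeholder).
$context = (new FsConnectionFactory('file:///tmp/enqueue'))->createContext();
$queue   = $context->createQueue('jobs');
$context->createProducer()->send($queue, $context->createMessage(json_encode(['job_id' => 42])));

// Consumer: run this inside a worker command, one per process.
$consumer = $context->createConsumer($queue);
while ($message = $consumer->receive(5000)) {        // wait up to 5 seconds for a message
    $payload = json_decode($message->getBody(), true);
    // ... process the job ...
    $consumer->acknowledge($message);
}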
Regarding the consumers, you need some sort of process manager. There are several options:
http://supervisord.org/ - you need an extra service, and it has to be configured properly.
A pure PHP process manager like this one, based on the Symfony Process component and pure PHP code. It can handle process reboot, correct exit on the SIGTERM signal and a lot more.
A php\swoole process manager like this one. It requires the swoole PHP extension, but its performance is amazing.
I have written a blog post on how to solve this exact problem. https://plume.baucum.me/~/Absolutely/running-multiple-processes-simultaneously-in-a-symfony-command
It is much too long to rehash everything here, but the basic concept is that your command optionally takes in the job's ID. The command checks whether the ID was given. If not, it grabs all the jobs from the DB, loops over them, and re-invokes itself with the job ID as a parameter. As each child command is kicked off you store it in an array, and if the array grows too big you sleep, for rate throttling. As commands finish you remove them from the array.
When the command is run with a job ID it creates a lock using Symfony's Lock component so that a job cannot accidentally be processed twice at once. It is important that you release the lock when the job either finishes or errors out. Once it has the ID and the lock, it calls whatever code you have written to actually process the job.
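A stripped-down sketch of that master/child pattern, assuming the Symfony Process and Lock components; the command name, argument handling, repository call and the throttle of five children are placeholders, not the exact code from the post:

<?php
// Inside the command's execute() method.
use Symfony\Component\Process\Process;
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\FlockStore;

$jobId = $input->getArgument('job-id');

if (null === $jobId) {
    // Master mode: re-invoke this command once per job, at most five at a time.
    $running = [];
    foreach ($this->jobRepository->findPendingJobIds() as $id) {   // hypothetical repository call
        $process = new Process(['php', 'bin/console', 'job:process', (string) $id]);
        $process->start();
        $running[] = $process;
        while (count($running) >= 5) {                             // throttle
            usleep(250000);
            $running = array_filter($running, fn ($p) => $p->isRunning());
        }
    }
    while ($running = array_filter($running, fn ($p) => $p->isRunning())) {
        usleep(250000);                                            // wait for the stragglers
    }
    return 0;
}

// Child mode: lock the job so it can never be processed twice at once.
$lock = (new LockFactory(new FlockStore()))->createLock('job-'.$jobId);
if (!$lock->acquire()) {
    return 0;                                                      // another worker already has it
}
try {
    $this->processJob($jobId);                                     // your actual job logic
} finally {
    $lock->release();                                              // released on success or error
}
return 0;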
Using this technique I have taken commands that took hours to run, because they went through each task synchronously, down to only minutes. Make sure to try different throttle values to balance resource utilization against the time it takes to execute your task.
Would appreciate some help understanding typical best practices in carrying out a series of tasks using Gearman in conjunction with PHP (among other things).
Here is the basic scenario:
A user uploads a set of image files through a web-based interface. The php code responding to the POST request generates an entry in a database for each file, mostly with null entries in the columns, queues a job for each to do analysis using Gearman, generates a status page and exits.
The Gearman worker gets a job for a file and starts a relatively long-running analysis. The result of that analysis is a set of parameters that need to be inserted back into the database record for that file.
My question is, what is the generally accepted method of doing this? Should I use a callback that will ultimately kick off a different php script that is going to do the modification, or should the worker function itself do the database modification?
Everything is currently running on the same machine; I'm planning on using Gearman for background scheduling, rather than for scaling by farming out to different machines, but in any case any of the functions could connect to the database wherever it is.
Any thoughts appreciated; just looking for some insights on how this typically gets structured and what might be considered best practice.
Are you sure you want to use Gearman? I only ask because it was the de facto PHP job server about 15 years ago but hasn't been a reliable solution for quite some time. I am not sure if things have drastically improved in the last 12 months, but the last time I evaluated Gearman it wasn't production-capable.
Now, on to the questions.
what is the generally accepted method of doing this? Should I use a callback that will ultimately kick off a different php script that is going to do the modification, or should the worker function itself do the database modification?
You are going to follow this general pattern with any job queue:
Collect a unit of work. In your case, it will be one of the images plus any information about who that image belongs to (user ID, etc.).
Submit the work to the job queue with this information.
The job queue's worker process picks up the work and starts processing it. This is where I would create the records in the database, since you can then opt not to create them when a job fails.
The job queue is going to track which jobs have completed and usually their completion status. If you are using Gearman, this is the gearmand process. You also need something to pick up work and process it; I will refer to this as the job worker. The job worker is where the concurrency happens, which is what I think you were referring to when you said "kick off a different php script." You can just kick off a PHP script at an interval (with supervisord or a cron job) for a kind of poll & fork approach. It's not the most efficient approach, but it doesn't sound like that will really matter for your application's use case.
You could also use pcntl_fork or pthreads in PHP to get more control over your concurrent processes and implement a worker pool pattern, but it is much more complicated than just firing off a script (there is a sketch of that after the links below). If you are interested in trying to implement some concurrency in PHP, I have a proof-of-concept job worker for beanstalkd available on GitHub that implements a worker pool with both fork and pthreads. I have also included a couple of other resources on the subject of concurrency.
Job Worker (pthreads)
Job Worker (fork)
PHP Daemon Example
PHP IPC Example
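For reference, the poll & fork worker pool boils down to something like the sketch below; fetchPendingJobs() and processJob() are hypothetical stand-ins for your own code, and it assumes the pcntl extension is loaded (CLI on Linux/macOS):

<?php
// Simple fork-based worker pool: at most $maxWorkers children at any time.
$maxWorkers = 4;
$children   = [];

foreach (fetchPendingJobs() as $job) {             // hypothetical: read work from the DB/queue
    // If the pool is full, block until one child exits.
    while (count($children) >= $maxWorkers) {
        $pid = pcntl_wait($status);
        unset($children[$pid]);
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('Could not fork');
    }
    if ($pid === 0) {
        // Child: do the work and exit. Re-open DB connections here, because a
        // forked child shares (and can break) the parent's connection.
        processJob($job);                          // hypothetical job handler
        exit(0);
    }
    $children[$pid] = true;                        // parent: remember the child
}

// Wait for the remaining children before exiting.
while ($children) {
    $pid = pcntl_wait($status);
    unset($children[$pid]);
}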
This side of PHP is rather new to me.
I am interested in firing off a large number (25-50) separate processes from a parent script. I would like for the parent script to not wait for these other scripts to complete AND I would like for these other scripts to run in parallel.
Each script would run for a specified amount of time calling a webservice.
Can anyone give me some direction with this? I'm not asking for a coded answer specifically, but I just need some guidance.
Much thanks.
It really depends on what you want to achieve. @Julien's forking method could work, but this is not preferable if your web service calls are data intensive. I am not saying that forking is bad; on the contrary, it works, but with the amount of different web services you want to call you should have a way to manage things better.
Another thing you can do is base this on cron jobs. For example, if you're calling these web services for some users in your app, create a queue: a DB table to which you add records that need to be processed. If you are using Cake, use the Cake Shells. Then set up cron jobs that call the shells that process these records every now and then. Divide the services into separate queues, at least for those that are very different in logic. This way you also spread your risk, because a failure in one of the web service calls will not jeopardise the others. Have separate logging for each queue, which will let you track down problems quickly; when consuming web services, problems are very often external to your application.
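A bare-bones version of such a cron-driven consumer might look like this; the table, columns and callWebService() are invented, and in CakePHP the same logic would live in a Shell:

<?php
// queue_worker.php - run from cron, e.g. every minute. It claims a small batch of
// pending rows, calls the web service for each one and records the outcome.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');   // placeholder credentials
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Claim a batch so two overlapping cron runs don't pick the same rows.
$workerId = uniqid('worker_', true);
$pdo->prepare("UPDATE service_queue SET status = 'processing', worker_id = ?
               WHERE status = 'pending' ORDER BY id LIMIT 25")
    ->execute([$workerId]);

$claimed = $pdo->prepare("SELECT * FROM service_queue WHERE worker_id = ?");
$claimed->execute([$workerId]);

foreach ($claimed->fetchAll(PDO::FETCH_ASSOC) as $row) {
    try {
        callWebService($row);                      // hypothetical: the actual web service call
        $status = 'done';
    } catch (Exception $e) {
        error_log($e->getMessage());               // keep a log per queue to track failures
        $status = 'failed';
    }
    $pdo->prepare("UPDATE service_queue SET status = ? WHERE id = ?")
        ->execute([$status, $row['id']]);
}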
I have a web-app, written in PHP, and I am using Symfony2 as the core framework.
I need to regularly run thousands of small network requests every 10 minutes or so and I am trying to find the best solution for asynchronously running these jobs, without conflicting or doubling up.
Currently I have a very basic and inelegant solution where a cron job executes a PHP command script. The command synchronously works through each entry in the database and sends a network request. When that request completes (or fails), it moves on to the next one. When it has iterated over all entries, it exits, to be executed again by the cron job.
For the rewrite, I have looked at php-resque and pcntl_fork as solutions for running jobs in parallel, thereby speeding up the execution significantly. I have also looked at running multiple non-blocking socket requests from PHP but, so far, have preferred the simplicity of isolated jobs.
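For context, the php-resque model is basically a job class with a perform() method that a pool of worker processes picks up later; a rough sketch against the chrisboulton/php-resque API, with made-up class, queue and Redis address:

<?php
// Enqueue side: called from the cron/controller that discovers work to do.
Resque::setBackend('localhost:6379');
Resque::enqueue('network_requests', 'SendNetworkRequestJob', ['entry_id' => 123]);

// Job class: executed later by a php-resque worker process.
class SendNetworkRequestJob
{
    public function perform()
    {
        $entryId = $this->args['entry_id'];
        // ... load the entry and send the network request ...
    }
}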
PHP can't do threading in the traditional sense, so what you're trying to do isn't really possible in the way that you're thinking. You've looked at the best solution (probably) in pcntl_fork, but that still won't be truly asynchronous. Have you considered using cron to accomplish this instead?
http://php.net/pthreads
http://github.com/krakjoe/pthreads
There are examples on GitHub (and included in the distribution); use the code on GitHub, as it contains fixes not yet released.
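A minimal example of the Thread API looks like this; note that pthreads needs a thread-safe (ZTS) build of PHP and is no longer maintained for PHP 7.4+, so treat it as illustrative:

<?php
// Each task runs in its own thread; start() returns immediately, join() waits.
class WebserviceCall extends Thread
{
    public $response;
    private $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function run()
    {
        // Executed in parallel with the creating context.
        $this->response = file_get_contents($this->url);
    }
}

$threads = [];
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {   // placeholder URLs
    $threads[] = $thread = new WebserviceCall($url);
    $thread->start();
}
foreach ($threads as $thread) {
    $thread->join();   // $thread->response now holds the result
}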
I have a simple messaging queue set up and running using the Zend_Queue object hierarchy. I'm using a Zend_Queue_Adapter_Db back-end. I'm interested in using this as a job queue, to schedule things for processing at a later time. They're jobs that don't need to happen immediately, but should happen sooner rather than later.
Is there a best-practices/standard way to set up your infrastructure to run jobs? I understand the code for receiving a message from the queue, but what's not so clear to me is how to run the program that does the receiving. A cron job that receives n messages on the command line, run once a minute? A cron job that fires off multiple web requests, each web request running the receiver script? Something else?
Tangential bonus question. If I'm running other queries with Zend_Db, will the message queue queries be considered part of that transaction?
You can do it like a thread pool. Create a command line php script to handle the receiving. It should be started by a shell script that automatically restarts the process if it dies. The shell script should not start the process if it is already running (use a $pid.running file or similar). Have cron run several of these every 1-10 minutes. That should handle the receiving nicely.
I wouldn't have the cron fire a web request unless your cron is on another server for some strange reason.
Another way to use this would be to have some background process creating data, and web users consume it as they naturally browse the site. A report generator might work this way. Company-wide reports are available to all users, but you don't want them all generating this db/time-intensive report. So you create a queue and process one at a time, possibly removing duplicates. All users can view the report(s) when ready.
According to the docs it doesn't look like Zend_Queue even uses the same connection as your other Zend_Db queries. But of course the best way to find out is to make a simple test.
EDIT
The multiple lines in the cron are for concurrency: each line represents a worker for the pool. I was not clear before; you don't want the PID as the identifier, you want to pass an identifier as a parameter.
/home/byron/run_queue.sh Process1
/home/byron/run_queue.sh Process2
/home/byron/run_queue.sh Process3
The bash script would check for the $process.running file; if it finds it, exit.
otherwise:
Create the $process.running file.
Start the PHP process and block/wait until it finishes.
Delete the $process.running file.
This allows the PHP script to die without causing the pool to lose a worker.
If the queue is empty the PHP script exits immediately and is started again by the next invocation of cron.
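If you would rather keep the guard inside PHP instead of a .running file, an exclusive flock() on a per-worker lock file does the same thing; a sketch, with the lock path and default worker name chosen arbitrarily:

<?php
// run_queue.php <worker-name> - exits immediately if this worker is already running.
$worker = isset($argv[1]) ? $argv[1] : 'Process1';
$fp = fopen("/tmp/{$worker}.lock", 'c');

if (!flock($fp, LOCK_EX | LOCK_NB)) {
    exit(0);                       // another instance of this worker holds the lock
}

// ... receive() messages from the Zend_Queue and process them here ...

flock($fp, LOCK_UN);               // the lock is also released automatically if the script dies
fclose($fp);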
Greetings All!
I am having some trouble working out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make a API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
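In sketch form (not my exact code; field names, URLs and options are placeholders), the loop is:

<?php
$mh = curl_multi_init();
$handles = [];

foreach ($items as $id => $item) {                  // $items comes from the database
    $ch = curl_init($item['api_url']);              // placeholder field name
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    curl_multi_add_handle($mh, $ch);
    $handles[$id] = $ch;
}

// Drive all transfers until they finish.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh);                     // avoid busy-waiting
    }
} while ($active && $status === CURLM_OK);

foreach ($handles as $id => $ch) {
    $body = curl_multi_getcontent($ch);
    // ... process $body and update the row for $id ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);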
This was OK; around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items actually got processed, so I am thinking I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the return data from the web service... And this needs to be done in at least 5 minutes.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up, because the script spends most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup php /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId); // $processId[0] holds the PID of the background process
You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for the process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
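The sleep/check cycle can be as simple as the following sketch; it assumes a POSIX system and that $pids holds the process IDs captured from the exec() calls above:

<?php
// $pids: one process ID per child script launched in the background.
while ($pids) {
    foreach ($pids as $i => $pid) {
        // Signal 0 sends nothing; it only checks whether the process still exists.
        if (!posix_kill($pid, 0)) {
            unset($pids[$i]);      // that child has finished
        }
    }
    if ($pids) {
        sleep(5);                  // wait before polling again
    }
}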
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
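For example (illustrative only; credentials, table and columns are placeholders, and real code should validate or escape the values before building the SQL):

<?php
// Batch several UPDATEs into one round-trip with mysqli::multi_query().
$mysqli = new mysqli('localhost', 'user', 'secret', 'app');

$sql = '';
foreach ($results as $id => $data) {        // $results = processed web service responses
    $sql .= sprintf(
        "UPDATE items SET price = %.2f, updated_at = NOW() WHERE id = %d;",
        $data['price'],
        $id
    );
}

if ($mysqli->multi_query($sql)) {
    // Drain every result set, otherwise the connection is left unusable.
    do {
    } while ($mysqli->more_results() && $mysqli->next_result());
}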
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with a part written in another language?
If you cannot go for another language, try to perform this update as a PHP script that runs in the background, not through Apache.
You can follow Brent Baisley's advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a database table that will be your process queue;
set up a script that pops this queue and processes your actions;
set up a cron job that runs this script every X minutes.
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:
the number of requests one PHP script makes;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron job runs.
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off, and it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one you are describing. It runs in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
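As a small illustration of the IPC side, here is a sketch of a parent and a forked child talking over a System V message queue; it assumes the sysvmsg and pcntl extensions, and the message type and payload are arbitrary:

<?php
$queueKey = ftok(__FILE__, 'q');
$queue    = msg_get_queue($queueKey);

$pid = pcntl_fork();
if ($pid === 0) {
    // Child: do some work and report the result back to the parent.
    $result = ['job_id' => 42, 'status' => 'done'];   // placeholder result
    msg_send($queue, 1, $result);                     // message type 1 = "job result"
    exit(0);
}

// Parent: block until the child reports back, then reap it.
msg_receive($queue, 1, $receivedType, 65536, $result);
pcntl_waitpid($pid, $status);
print_r($result);
msg_remove_queue($queue);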