I have a PHP script which I use to make about 1 million requests every day to a specific web service.
The problem is that in a "normal" workflow the script works almost the whole day to complete the job.
Therefore I've worked on an additional component. Basically, I developed a script which accesses the main script using multi-curl GET requests to generate a random tempid for each batch of 500 records, and finally makes another multi-curl POST request with all the generated tempids.
However, I don't feel this is the right way, so I would like some advice/solutions for adding multithreading capabilities to the main script without using additional/external applications (e.g. the curl script that I'm currently using).
Here is the main script: http://pastebin.com/rUQ6pwGS
If you want to do it right you should install a message queue. My preference goes to Redis because it is a "data structure server, since keys can contain strings, hashes, lists, sets and sorted sets". Redis is also extremely fast.
Use BLPOP (spawning a couple of worker processes via php <yourscript> to process work concurrently) to listen for new messages (work), and RPUSH to push new messages onto the queue. Spawning processes is (relatively) expensive, and when using a message queue this only has to be done once, when the process is created.
I would go for phpredis if you can (it needs to be compiled/installed as a PHP extension) because it is an extension written in C and therefore going to be a lot faster than the pure PHP clients. Otherwise, Predis is also a pretty mature library you could use.
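For instance, a minimal sketch of that producer/worker pattern using the phpredis extension might look like this (the queue name 'work_queue' and the payload format are just assumptions):

// Producer: push work onto the queue (RPUSH).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->rPush('work_queue', json_encode(['record_id' => 123]));

// Worker (start a few with `php worker.php`): block until work arrives (BLPOP).
while (true) {
    // blPop returns [queueName, payload]; timeout 0 means block indefinitely
    $item = $redis->blPop(['work_queue'], 0);
    $job  = json_decode($item[1], true);
    // ... process the record, e.g. make the web-service request ...
}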
You could also use BLPOP/RPUSH as some sort of lock (if you need to). This is because:
Multiple clients can block for the same key. They are put into a queue, so the first to be served will be the one that started to wait earlier, in a first-BLPOP first-served fashion.
I would advise you to have a look at Simon's redis tutorial to get an impression of the sheer power that redis has to offer.
This is a background process, correct? In which case, you should not run it via a web server. Run it from the command line, either as a daemon or as a cron job.
My preference is a "cron" job because you get automatic restart for free. Be sure that you don't have more instances of the program running than desired (you can achieve this by locking a file in the filesystem, doing something atomic in a database, etc.).
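As a rough sketch of the file-locking approach (the lock file path is just an example), a single-instance guard could look like:

$lock = fopen('/tmp/myscript.lock', 'c');            // example lock file path
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit("Another instance is already running\n");   // bail out instead of running twice
}
// ... do the work; the lock is released when the process exits ...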
Then you just need to start the number of processes you want, and have them read work from a queue.
Normally the pattern for doing this is having a table containing columns to store who is currently executing a given task:
CREATE TABLE sometasks (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,   -- an ID of some kind
  -- ... other columns required to do the task ...
  -- ... some data telling us whether the task is due yet / complete ...
  locked_by_host VARCHAR(64) NULL,
  locked_by_pid INT NULL
)
Then the process will do the following pseudo-query to lock a set of tasks (batch_size is how many per batch, can be 1):
UPDATE sometasks SET locked_by_host=my_hostname, locked_by_pid=my_pid
WHERE not_done_already AND locked_by_host IS NULL ORDER BY ID LIMIT batch_size
Then select the rows back out to find the current process's tasks, process them, mark them as "done", and clear out the lock.
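A hedged sketch of that claim-then-select cycle with PDO against a sometasks table like the one above (the column names such as done and the connection details are illustrative):

$pdo  = new PDO('mysql:host=localhost;dbname=jobs', 'user', 'pass');
$host = gethostname();
$pid  = getmypid();
$batchSize = 10;

// Claim a batch of unclaimed, not-yet-done tasks for this host/pid.
$stmt = $pdo->prepare(
    "UPDATE sometasks SET locked_by_host = ?, locked_by_pid = ?
     WHERE done = 0 AND locked_by_host IS NULL ORDER BY id LIMIT $batchSize"
);
$stmt->execute([$host, $pid]);

// Read back the rows this process just claimed.
$stmt = $pdo->prepare(
    "SELECT * FROM sometasks WHERE locked_by_host = ? AND locked_by_pid = ? AND done = 0"
);
$stmt->execute([$host, $pid]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $task) {
    // ... process the task ...
    $pdo->prepare("UPDATE sometasks SET done = 1, locked_by_host = NULL, locked_by_pid = NULL WHERE id = ?")
        ->execute([$task['id']]);
}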
I'd opt for a cron job with a controller process which starts up N child processes and monitors them. The child processes could periodically die (remember PHP does not have good GC, so it can easily leak memory) and be respawned to prevent resource leaks.
If the work is all done, the parent could quit, and wait to be respawned by cron (the next hour or something).
NB: locked_by_host can store the host name (PIDs aren't unique across different hosts) to allow for distributed processing, but maybe you don't need that, so you can omit it.
You can make this design more robust by adding a locked_time column and detecting when a task has been taking too long - you can alert, kill the process, and try again or something.
Related
I am currently evaluating Gearman to farm out some expensive data import jobs in our backend. So far this looks very promising. However there is one piece missing that I just can't seem to find any info about. How can I get a list of scheduled jobs from Gearman?
I realize I can use the admin protocol to get the number of currently queued jobs for each function, but I need info about the actual jobs. There is also the option of using a persistent queue (e.g. MySQL) and querying the database for the jobs, but it feels pretty wrong to me to circumvent Gearman for this kind of information. Other than that, I'm out of ideas.
Probably I don't need this at all :) So here's some more background on what I want to do; I'm open to better suggestions. Both the client and the worker run in PHP. In our admin interface the admins can trigger a new import for a client; as the import takes a while it is started as a background task. Now the simple questions I want to be able to answer: When was the last import run for this client? Is an import already queued for this client (in that case triggering a new import should have no effect)? Nice to have: at which position in the queue is this job (so I can estimate when it will run)?
Thanks!
The Admin protocol is what you'd usually use, but as you've discovered, it won't list the actual tasks in the queue. We've solved this by keeping track of the current tasks we've started in our application layer, and having a callback in our worker telling the application when the task has finished. This allows us to perform cleanup, notification etc. when the task has finished, and allows us to keep this logic in the application and not the worker itself.
Regarding progress, the best way is to just use the built-in progress mechanics in Gearman itself; in the PHP module you can do this with $job->sendStatus(percentDone, 100). A client can then retrieve this value from the server using the task handle (which is returned when you start the job). That will allow you to show the current progress to users in your interface.
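A small sketch of that status round-trip (the function name 'import' and the payload are only for illustration):

// Worker side: report progress while processing.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('import', function (GearmanJob $job) {
    $total = 100;
    for ($done = 0; $done <= $total; $done++) {
        // ... do a slice of the import ...
        $job->sendStatus($done, $total);   // numerator / denominator
    }
});
while ($worker->work());

// Client side: start the job and poll its status via the job handle.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$handle = $client->doBackground('import', json_encode(['client_id' => 42]));
$status = $client->jobStatus($handle);     // [known, running, numerator, denominator]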
As long as you have the current running tasks in your application, you can use that to answer whether there are similar tasks already running, but you can also use Gearman's built-in job coalescing / de-duplication; see the $unique parameter when adding the task.
The position in the current queue will not be available through Gearman, so you'll have to do this in your application as well. I'd stay away from asking the Gearman persistence layer for this information.
You have pretty much given yourself the answer: use an RDBMS (MySQL or Postgres) as the persistence backend and query the gearman_queue table.
For instance, we developed a hybrid solution: we generate a unique id for the job, which we pass as the third parameter to doBackground() (http://php.net/manual/en/gearmanclient.dobackground.php) when queuing the job.
Then we use this id to query the Gearman table and check the job status by looking at the 'unique_key' field. You can also get the queue position, as the records are already ordered.
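Roughly, and assuming the stock MySQL persistence schema (a gearman_queue table with a unique_key column; verify the table and column names against your install), the idea looks like this:

// Queue the job with our own unique id as the third doBackground() parameter.
$payload  = ['client_id' => 42];               // example payload
$uniqueId = uniqid('import_', true);
$client = new GearmanClient();
$client->addServer();
$client->doBackground('import', json_encode($payload), $uniqueId);

// Later: check whether the job is still queued by querying the persistence table.
$pdo  = new PDO('mysql:host=localhost;dbname=gearman', 'user', 'pass');
$stmt = $pdo->prepare('SELECT * FROM gearman_queue WHERE unique_key = ?');
$stmt->execute([$uniqueId]);
$stillQueued = (bool) $stmt->fetch(PDO::FETCH_ASSOC);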
Pro bonus: we also catch exceptions inside the worker. If a job fails we write the job payload (a JSON-serialized object) to a file, then pick up the file and requeue the job via a cron job, incrementing an internal 'retry' counter so we retry a single job 3 times at most and can inspect the job later if it still fails.
I have an application where I intend users to be able to add events at any time, that is, chunks of code that should only run at a specific time in the future determined by user input. Similar to cron jobs, except at any point there may be thousands of these events that need to be processed, each at its own specific due time. As far as I understand, crontab would not be able to handle them, since it is not meant to hold a massive number of cron jobs; additionally, I need precision to the second, not the minute. I am aware it is possible to programmatically add cron jobs to crontab, but again, it would not be enough for what I'm trying to accomplish.
Also, I need these to be real-time; faking them by simply checking for due items whenever pages are visited is not a solution, as they should fire even if no pages are visited by their due time. I've been doing some research looking for a sane solution. I read a bit about queue systems such as Gearman and RabbitMQ, but a FIFO system would not work for me either (the order in which the events are added is irrelevant, since it's perfectly possible to add an event that fires in 1 hour, and right after it another that is supposed to trigger in 10 seconds).
So far the best solution I have found is to build a daemon, that is, a script that will run continuously checking for new events to fire. I'm aware PHP is the devil, leaks memory and whatnot, but I'm still hoping it is possible to have a PHP daemon running stably for weeks with occasional restarts, as long as I spawn new independent processes to do the "heavy lifting": the actual processing of the events when they fire.
So anyway, the obvious questions:
1) Does this sound sane? Is there a better way that I may be missing?
2) Assuming I do implement the daemon idea, the code naturally needs to retrieve which events are due. Here's pseudocode for how it could look:
// get_due_events() / delete_event() are placeholders for your own DB access code
while (true) {
    // read the event list and get only the events that are due
    foreach (get_due_events() as $event) {
        // spawn a new independent PHP process to run the event in the background
        shell_exec('php run_event.php ' . escapeshellarg($event['id']) . ' > /dev/null 2>&1 &');
        // delete the event entry so that it is not run twice
        delete_event($event['id']);
    }
    usleep(50000); // 50 ms; note that sleep() only takes whole seconds
}
If I were to store this list in a MySQL DB (and it certainly seems the best way, since I need to be able to query the list with something along the lines of "SELECT * FROM eventlist WHERE duetime <= NOW();"), is it crazy to have the daemon doing a SELECT every 50 or 100 milliseconds? Or am I just being overly paranoid, and the server should be able to handle it just fine? The amount of data retrieved in each iteration should be relatively small, perhaps a few hundred rows; I don't think it will amount to more than a few KBs of memory. Also, the daemon and the MySQL server would run on the same machine.
3) If I do use everything described above, including the table on a MySQL DB, what are some things I could do to optimize it? I thought about storing the table in memory, but I don't like the idea of losing its contents whenever the server crashes or is restarted. The closest thing I can think of would be to have a standard InnoDB table where writes and updates are done, and another 1:1 mirror MEMORY table where reads are performed. Using triggers it should be doable to have the memory table mirror everything, but on the other hand it does sound like a pain in the ass to maintain (FUBAR situations can easily happen if for some reason the tables get desynchronized).
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should be run (which may be months into the future). Jobs can be cancelled at any time by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system where a cron script runs every minute or so and gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that comes in via HTTP.
The reason we use an async HTTP POST for multithreading, and not something like pcntl_fork, is so that we can add other servers and POST the data to those instead, having them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will become available to be reserved, and run, in X seconds (the time you want it to run minus the current time).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). This could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
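For example, with the Pheanstalk client (constructor and method signatures vary between Pheanstalk versions, so treat this as a sketch; the tube name and variables are illustrative), a delayed put and a worker would look roughly like:

// Producer: put the job with a delay so it becomes reservable at its due time.
$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');
$delay = $dueTimestamp - time();                             // seconds until the job should run
$pheanstalk->useTube('imap_jobs')
           ->put(json_encode($jobData), 1024, $delay, 120);  // priority, delay, ttr

// Worker: block until a job becomes available, process it, then delete it.
$job  = $pheanstalk->watch('imap_jobs')->ignore('default')->reserve();
$data = json_decode($job->getData(), true);
// ... open the IMAP connection, do the conditional insert ...
$pheanstalk->delete($job);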
Edited to add: You can't delete a job without knowing its ID, though you could store that elsewhere. OTOH, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance, so you could still have just one DB reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they can bring.
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.
I have a challenge I don't seem to get a good grip on.
I am working on an application that generates reports (big analysis from database but that's not relevant here). I have 3 identical scripts that I call "process scripts".
A user can select multiple variables to generate a report. Once that's done, I need one of the three scripts to pick up the task and start generating the report. I use multiple servers so all three of them can work simultaneously. When there is too much work, a queue builds up so that the first "process script" to become ready can pick up the next task, and so on.
I don't want to have these scripts go to the database all the time, so I have a small file, "thereiswork.txt". I want the three scripts to read the file and, if there is something to do, go do it. If not, do nothing.
At first, I just let a "process script" be chosen randomly, and they all have their own queue. However, I now see that in some cases one process script has a queue of hours while the other two are doing nothing, just because they had the "luck" of not getting very big reports to generate. So I need a fairer solution to balance the work equally.
How can I do this? Have a queue multiple scripts can work on?
PS
I use set_time_limit(0); for these scripts, and they are all currently in a while() loop, calling sleep(5) all the time...
No, no, no.
PHP does not have the kind of sophisticated lock management facilities to support concurrent raw file access. Few languages do. That's not to say it's impossible to implement them (most easily with mutexes).
I don't want to have these scripts go to the database all the time
DBMSs provide great support for concurrent access. And while there is an overhead in performing an operation on the DB, it's very small in comparison to the amount of work which each request will generate. It's also a very convenient substrate for managing the queue of jobs.
they all have their own queue
Why? Using a shared queue on a first-come, first-served basis will ensure the best use of resources.
At first, I just randomly let a "process script" to be chosen
This is only going to distribute work evenly with a very large number of jobs and a good random number generator. One approach is to shard the data (e.g. instance 1 picks up jobs where mod(job_number, number_of_instances)=0, instance 2 picks up jobs where mod(job_number, number_of_instances)=1, ...) - but even then it doesn't make the best use of available resources.
they all currently are in a while() loop, and sleep(5) all the time
No - this is wrong too.
It's inefficient to have the instances constantly polling an empty queue - so you implement a back-off plan, e.g.:
// get_available_job_from_db_queue(), process_job(), mark_job_finished() are placeholders for your own code
$maxsleeptime = 100;
$sleeptime = 0;
while (true) {
    $next_job = get_available_job_from_db_queue();
    if (!$next_job) {
        // back off: double the wait (starting at 1 second) up to the maximum
        $sleeptime = min(max(1, $sleeptime * 2), $maxsleeptime);
        sleep($sleeptime);
    } else {
        $sleeptime = 0;
        process_job($next_job);
        mark_job_finished($next_job);
    }
}
No job is destined for a particular processor until that processor picks it up from the queue. By logging the sleep time (or the start and end of processing) it's also a lot easier to see when you need to add more processor scripts - and if you handle the concurrency on the database, then you don't need to worry about configuring each script to know about the number of other scripts running - you can add and retire instances as required.
For this task, I use the Gearman job server. Your PHP code sends out jobs and you have a background script running to pick them up. It comes down to a solution similar to symcbean's, but the dispatching does not require arbitrary sleeps. It waits for events instead and essentially wakes up exactly when needed.
It comes with an excellent PHP extension and is very well documented. Most examples are in PHP, and it works transparently with other languages too.
http://gearman.org/
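A minimal sketch of that setup using the Gearman PHP extension (the function name 'generate_report' and the payload are illustrative):

// Dispatcher: queue a report request (non-blocking).
$client = new GearmanClient();
$client->addServer();
$client->doBackground('generate_report', json_encode($reportParams));

// Worker ("process script"): blocks inside work() until the server hands it a job.
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('generate_report', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... generate the report ...
});
while ($worker->work());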
I have a PHP script that grabs data from an external service and saves it to my database. I need this script to run once every minute for every user in the system (of which I expect thousands). My question is, what's the most efficient way to run this per user, per minute? At first I thought I would have a function that grabs all the user IDs from my database, iterates over the IDs and performs the task for each one, but I think that as the number of users grows, this will take longer and no longer fall within 1-minute intervals. Perhaps I should queue the user IDs and perform the task individually for each one? In which case, I'm actually unsure of how to proceed.
Thanks in advance for any advice.
Edit
To answer Oddthinking's question:
I would like to start the processes for each user at the same time. When the process for each user completes, I want to wait 1 minute, then begin the process again. So I suppose each process for each user should be asynchronous - the process for user 1 shouldn't care about the process for user 2.
To answer sims' question:
I have no control over the external service, and the users of the external service are not the same as the users in my database. I'm afraid I don't know any other scripting languages, so I need to use PHP to do this.
Am I summarising correctly?
You want to do thousands of tasks per minute, but you are not sure if you can finish them all in time?
You need to decide what do when you start running over your schedule.
Do you keep going until you finish, and then immediately start over?
Do you keep going until you finish, then wait one minute, and then start over?
Do you abort the process, wherever it got to, and then start over?
Do you slow down the frequency (e.g. from now on, just every 2 minutes)?
Do you have two processes running at the same time, and hope that the next run will be faster (this might work if you are clearing up a backlog the first time, so the second run will run quickly.)
The answers to these questions depend on the application. Cron might not be the right tool for you depending on the answer. You might be better having a process permanently running and scheduling itself.
So, let me get this straight: You are querying an external service (what? SOAP? MYSQL?) every minute for every user in the database and storing the results in the same database. Is that correct?
It seems like a design problem.
If the users on the external service are the same as the users in your database, perhaps the two should be more closely integrated. I don't know if PHP is the way to go for syncing this data. If you give more detail, we could think about another solution. If you are in control of the external service, you may want to have that service dump its data or even write directly to the database. Some other syncing mechanism might be better.
EDIT
It seems that you are making an application that stores data for a user that can then be viewed chronologically. Otherwise you may as well just fetch the data when the user requests it.
Fetch all the user IDs in one go.
Iterate over them one by one (assuming that the data being fetched is unique to each user) and (you'll have to be creative here, as PHP threads do not exist AFAIK) launch a separate process for each request, as you want them all to be executed at the same time and not delayed if one user does not return data.
Said process should insert the data returned into the db as soon as it is returned.
As for cron being right for the job: As long as you have a powerful enough server that can handle thousands of the above cron jobs running simultaneously, you should be fine.
You could get creative with several PHP scripts. I'm not sure, but if every CLI call to PHP starts a new PHP process, then you could do it like that.
foreach ($users as $user)
{
    // escape the argument and run each fetch in its own PHP process
    shell_exec("php fetchdata.php " . escapeshellarg($user));
}
This is all very heavy and you should not expect to get it done snappily with PHP. Do some tests. Don't take my word for it.
Databases are made to process records in BULK. If you're processing them one by one, you're looking for trouble. You need to find a way to batch up your "every minute" task, so that by executing a SINGLE (complicated) query, all of the affected users' info is retrieved; then you do the PHP processing on the result; then, in another single query, you PUSH the results back into the DB.
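As a rough illustration of that batch-up approach with PDO (the table and column names are invented for the example):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// One query to pull everything this minute's run needs, instead of one query per user.
$rows = $pdo->query('SELECT user_id, api_token FROM users WHERE active = 1')
            ->fetchAll(PDO::FETCH_ASSOC);

// PHP-side processing: call the external service for each row and collect the outcomes.
$results = [];
foreach ($rows as $row) {
    // ... fetch data for $row from the external service ...
    $results[] = ['user_id' => $row['user_id'], 'value' => 'fetched-data'];
}

// One multi-row INSERT to push all results back at once.
$placeholders = implode(',', array_fill(0, count($results), '(?, ?)'));
$flat = [];
foreach ($results as $r) {
    $flat[] = $r['user_id'];
    $flat[] = $r['value'];
}
$pdo->prepare("INSERT INTO results (user_id, value) VALUES $placeholders")
    ->execute($flat);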
Based on your big-picture description it sounds like you have a dead-end design. If you are able to get it working right now, it'll most likely be very fragile and it won't scale at all.
I'm guessing that if you have no control over the external service, then that external service might not be happy about getting hammered by your script like this. Have you approached them with your general plan?
Do you really need to do all users every time? Is there any sort of timestamp you can use to be more selective about which users need "updates"? Perhaps if you could describe the goal a little better we might be able to give more specific advice.
Given your clarification of wanting to run the processing of users simultaneously...
The simplest solution that jumps to mind is to have one thread per user. On Windows, threads are significantly cheaper than processes.
However, whether you use threads or processes, having thousands running at the same time is almost certainly unworkable.
Instead, have a pool of threads. The size of the pool is determined by how many threads your machine can comfortable handle at a time. I would expect numbers like 30-150 to be about as far as you might want to go, but it depends very much on the hardware's capacity, and I might be out by another order of magnitude.
Each thread would grab the next user due to be processed from a shared queue, process it, and put it back at the end of the queue, perhaps with a date before which it shouldn't be processed.
(Depending on the amount and type of processing, this might be done on a separate box to the database, to ensure the database isn't overloaded by non-database-related processing.)
This solution ensures that you are always processing as many users as you can, without overloading the machine. As the number of users increases, they are processed less frequently, but always as quickly as the hardware will allow.