Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 3 years ago.
The scenario:
TL;DR - I need a queue system that triggers jobs based on a future timestamp, NOT on insertion order.
I have a MySQL database of entries that describe particular events that need to be performed (mostly a series of arithmetic calculations and a database insert/update) in a precise sequence based on timestamps. The time an entry is inserted has no correlation with when its event will be "performed"; that is determined by outside factors. The table also contains a second column of milliseconds, which increases the timing precision.
This table is part of a job "queue" that will contain entries set to execute anywhere from a few seconds to a few days in the future, and can potentially have up to thousands of entries added every second. The queue needs to be parsed constantly (every second?) - perhaps by selecting all timestamps that have expired during the current second, sorting by the milliseconds column, and then executing the event described by each entry.
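As a sketch, the per-second parse might boil down to a query like the following (the table and column names here are assumptions, not from the question):

```sql
-- Fetch every job whose timestamp has expired, ordered first by the
-- timestamp and then by the milliseconds column for sub-second precision.
SELECT id, payload
FROM job_queue
WHERE due_at <= NOW()
ORDER BY due_at ASC, due_ms ASC;
```

A composite index on (due_at, due_ms) would let MySQL satisfy both the filter and the sort cheaply.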
The problem
Currently the backend is written entirely in PHP on an Apache server with MySQL (i.e. standard LAMP architecture). Right now, the only way I can think of to achieve what I've specified is to write a custom PHP job-queue script that does the parsing and execution, looped every second using this method. I'm not aware of any other job system that can queue jobs according to a specified timestamp/millisecond rather than the entry time.
Even on paper, however, this method sounds infeasible CPU-wise - I have to run a huge MySQL query every second and execute some sort of function for each row retrieved, and if one pass runs longer than a second it will start introducing delays to the parsing time and throw off the looping script.
I am of course attempting to create a solution that will scale under heavy traffic, and this one fails miserably: it will fall further and further behind as the number of entries grows.
The questions
I'd prefer to stick to the standard LAMP architecture, but is there any other technology I can integrate nicely into the stack that is better equipped to deal with what I'm attempting to do here?
Is there another method entirely to accurately trigger events at a specified future date, without the messy fiddling about with constant queue checking?
If neither of the above options is suitable, is there a better way to loop the PHP script in the background? In the worst case I can accept the long execution times and split the task up between multiple 'workers'.
Update
RabbitMQ was a good suggestion, but unfortunately it doesn't execute a task as soon as it 'expires' - the task has to move through the queue first, waiting behind any tasks in front of it that have yet to expire. The expiry times range from a few seconds to a few days, so the queue would need to be re-sorted each time a new event is added to keep the expiry times in order. As far as I'm aware that isn't possible in RabbitMQ, and it doesn't sound very efficient either. Is there an alternative or a programmatic fix?
Sometimes, making a square peg fit into a round hole takes too much effort. While using MySQL to create queues can be effective, it gets much trickier to scale. I would suggest that this might be an opportunity for RabbitMQ.
Basically, you would set up a message queue that you can put the events into. You would then have a "fanout" architecture with your workers processing each queue. Each worker would listen to its queue and check whether a particular event needs to be processed. I imagine that a combination of the "Work Queues" and "Routing" techniques available in Rabbit would achieve what you are looking for in a scalable and reliable way.
I would envision a system that works something like this:
spawn workers to listen to queues, using routing keys to prune down how many messages they get
each worker checks the messages to see if they are to be performed now
if the message is to be performed, perform it and acknowledge -- otherwise, re-dispatch the message for future processing. There are some simple techniques available for this.
As you need more scale, you add more workers. RabbitMQ is extremely robust, and it is also easy to cluster when you eventually max out a single queue server. There are also cloud-based queuing systems such as Iron.IO and StormMQ.
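A minimal sketch of the worker side of this idea, using the php-amqplib client (the queue name, message shape, and performEvent() are assumptions, and exact API details vary between library versions):

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('events', false, true, false, false);

$callback = function (AMQPMessage $msg) use ($channel) {
    $event = json_decode($msg->getBody(), true);
    if ($event['due_at'] <= time()) {
        performEvent($event); // your actual processing
        $msg->ack();
    } else {
        // Not due yet: re-dispatch to the back of the queue for a later pass.
        $channel->basic_publish(new AMQPMessage($msg->getBody()), '', 'events');
        $msg->ack();
    }
};

$channel->basic_consume('events', '', false, false, false, false, $callback);
while (count($channel->callbacks)) {
    $channel->wait();
}
```

The re-dispatch branch is the "simple technique" alluded to above; RabbitMQ's dead-letter/TTL features offer more refined variants of the same idea.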
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I want to run more than 800 PHP scripts in the background simultaneously on Linux. Each PHP script will run forever, meaning it will not stop once it has started. Each script will send requests to and get responses from a server. How much RAM do I need for that? Will it be possible to run more than 800 scripts? What kind of hardware do I need?
You're probably doing it wrong. Since your scripts are I/O bound instead of CPU bound, an event loop will help you. That way you just need as many workers as CPU cores.
This approach not only lowers the required resources in terms of memory and CPU cycles, but also reduces the number of scripts you have to monitor.
There are various PHP implementations, here are the three most popular ones:
Amp
Icicle
React
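As a minimal sketch of the event-loop approach using ReactPHP (the 1-second interval and checkForWork() are illustrative, not part of the question):

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use React\EventLoop\Loop;

// One process like this per CPU core replaces hundreds of
// individual always-on scripts.
Loop::addPeriodicTimer(1.0, function () {
    checkForWork(); // send requests / handle responses without blocking
});

Loop::run();
```

All the I/O waiting happens inside the loop, so the process stays idle (and cheap) between events instead of each of 800 scripts blocking on its own socket.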
Well, I'm sure the hardware you seek exists, but you will need a time machine to access it ... do you have a time machine ??
I'm going to assume you do not have access to, or plans to build a time machine, and say that this is not sensible.
In case humour didn't do it for you: there is no hardware capable of executing that many processes concurrently, and setting out to create an architecture that requires more threads than any commonly available hardware can execute is clearly a bad idea.
If all you are doing is I/O, then you should use non-blocking, asynchronous I/O.
Figuring out how much RAM you will need is simple: the amount of data stored in memory during execution, times 800.
You can improve memory usage by setting variables to null as soon as you are done with their data; even if you will re-use the variables later, I would highly recommend this. That way execution will not turn into a memory leak that fills up RAM and crashes your server.
$myVariable = null; //clears memory
The second part of your question, "execute forever", is easy too: you simply need to tell PHP to allow the script to run for a long time. Personally, though, I would do the following:
Set up 800 cron entries for your script, all running every 1 hour.
I assume your script runs an infinite loop. Note the time into a variable before the loop, and inside the loop check whether 1 hour has passed; if it has, end the loop and the process (a new one will replace it).
Doing the above ensures each process is recycled every hour; also, if for some reason a process gets killed by the server due to resource or security checks, it will spring back up within the hour.
Of course you could lower this to 30 mins, 15 mins, 5 mins depending on how heavy each loop is and how often you want to re-establish the processes.
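The hourly-recycling loop described above might look like this (doOneUnitOfWork() is a placeholder for the script's real job):

```php
<?php
set_time_limit(0); // let the script run well past PHP's default limit

$started = time(); // note the time before entering the infinite loop

while (true) {
    doOneUnitOfWork();

    // If an hour has passed, end the process; the hourly cron entry
    // will spawn a fresh replacement.
    if (time() - $started >= 3600) {
        exit(0);
    }
}
```

Swapping 3600 for 1800, 900, or 300 gives the 30/15/5-minute variants mentioned above.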
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
PHP newbie here. My question is related to the logic/best practice for executing a certain task.
I have a news event/announcement displayed on a website that needs to be removed/expire after a certain date/time. I have set an expiration date/time in the database; when a user visits the website after that date/time has passed, a query is triggered and the news event/announcement is set to "0" (which is hidden).
I would like to know if this is the best practice for accomplishing this, or is there a better way?
Thanks.
The method you mention is usually effective, especially for small applications, but it is not a best practice, for these reasons:
Issues
You are making the user wait for task execution
If there is no activity on your website, these tasks will not be done
If anything happens while executing a task, your user will receive the errors
The first two might not seem to matter much, but that's only because your current tasks are very fast and non-critical. However, if a task took a second or two, the user would now have to wait 2 extra seconds before seeing the page, which is bad.
Likewise, if nobody visits your site for a week and a list of 15 tasks piles up, the next user would have to wait 30 seconds. It might even time out the whole page, which would mean your tasks are left unfinished and the user is annoyed by a timeout for seemingly no reason.
In addition, if one of your tasks is time critical, it still won't be done. For example if the task is to send someone a reminder email after 24 hours but nobody logs in, the mail won't be sent.
The last one is also a problem; both because this makes it hard to see when a task fails (as the error is logged as a user problem, if at all) and because your user (again for no reason) is now looking at an error screen.
Solution
If you want to follow best practice, move all these sorts of tasks to either a Scheduled Task (under Windows) or a cron job (under Unix). This means you have a system service that periodically starts up and executes a PHP script that performs maintenance on your site, such as removing these news messages, sending out emails, or other things.
This has a number of advantages:
The server will always be there on time to run the tasks when they need to be run
You can disable timeouts and upgrade memory availability to run intensive tasks
Users will not have to wait for anything to complete
You can add special logging to the server, so that you know when these important but hidden tasks fail
Most providers allow you to set these kinds of tasks even on cheap hosting packages.
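As a sketch, a crontab entry such as `*/5 * * * * php /var/www/cron/maintenance.php` would run a maintenance script every five minutes; the script itself could look like this (all paths, names, and credentials here are illustrative):

```php
<?php
// maintenance.php: periodic housekeeping run by cron, not by a visitor.
set_time_limit(0); // cron jobs may run longer than a normal web request

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

// Hide announcements whose expiration date/time has passed.
$pdo->exec("UPDATE announcements SET visible = 0 WHERE expires_at <= NOW()");

// Other deferred tasks (reminder emails, cleanup, ...) would go here.
```

Because no user is waiting on this script, failures can be logged and retried without ever showing an error screen to a visitor.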
By far the simplest way to do this is to have a published_until field with a datetime. Each time you get a list of events to show on a page, check this field, e.g.
SELECT * FROM my_table WHERE published = TRUE AND published_until > NOW();
This is a very simple way of ensuring that events disappear when they should.
Hello fellow programmers! :)
I want to be able to set up a PHP script to run after certain events triggered by a user. Let's say a user creates a forum thread that should be closed automatically after 48 hours. That is equivalent to an update to a MySQL row:
UPDATE threads SET closed = '1' WHERE threads.id = 'x'.
Hence, this problem should not necessarily be solved exclusively with php.
This kind of question pops up from time to time, but everything I found suggests setting up a cron job that runs every 'x' amount of time and checks whether it is time to close a thread. The problem is that running these checks frequently causes a higher system load than scheduling a script to run once at a given time. Don't forget that there could be hundreds or even thousands of threads, each with its own closing time. We can avoid checking every single thread by creating some sort of queue, for instance in MySQL, so that the script selects entries with "time_to_close < NOW()" and closes those. Another drawback is that I would like each thread to be closed exactly 48 hours after creation. In that case the script would have to run every second and complete its work in very little time.
Alternatively to cron job I think following method can also be useful:
check at every access to the Thread if it should be closed. This also causes higher load, especially if the thread is accessed very often.
So is there an efficient way to schedule a (PHP) script to run at the time of a specific event? While writing this question I stumbled upon the MySQL event scheduler. Together with stored procedures, which can provide additional flow control (close the thread only if there has been no activity for 48 hours), I think my idea can be implemented. I am not familiar with these MySQL features, so I would appreciate any help on this topic.
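A hedged sketch of that MySQL event scheduler idea (the thread id, the last_activity column, and the no-activity condition are assumptions): a one-off event is created when the thread is posted and fires exactly 48 hours later.

```sql
-- The scheduler must be enabled once (requires sufficient privileges).
SET GLOBAL event_scheduler = ON;

-- One-off event: fires 48 hours from now, then drops itself.
CREATE EVENT close_thread_42
ON SCHEDULE AT CURRENT_TIMESTAMP + INTERVAL 48 HOUR
DO
  UPDATE threads
  SET closed = '1'
  WHERE id = 42
    AND last_activity <= NOW() - INTERVAL 48 HOUR; -- only if no recent activity
```

By default a one-time event is dropped after it executes (ON COMPLETION NOT PRESERVE), so the event table does not accumulate stale entries.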
With best regards,
system__failure.
I know a lot of websites compensate for this kind of behaviour by handling it on a per-user-request basis. The overhead is not that bad, and your records always display correctly (unless you have a design problem). This also works because most hosts don't give you cron access. It is very rare that you will need to schedule a job in PHP; there are a few exceptions, like report generation every hour. But trying to catch user actions with cron is not a good idea.
I have an application where I intend users to be able to add events at any time, that is, chunks of code that should only run at a specific time in the future determined by user input. Similar to cron jobs, except that at any point there may be thousands of these events that need to be processed, each at its own specific due time. As far as I understand, crontab could not handle them, since it is not meant to hold a massive number of cron jobs; additionally, I need precision to the second, not the minute. I am aware it is possible to programmatically add cron jobs to crontab, but again, that would not be enough for what I'm trying to accomplish.
Also, I need these to be real time; faking them by checking for due items whenever pages are visited is not a solution, as they should fire even if no pages are visited by their due time. I've been doing some research looking for a sane solution. I read a bit about queue systems such as Gearman and RabbitMQ, but a FIFO system would not work for me either (the order in which events are added is irrelevant: one might add an event that fires in 1 hour, and right after it another that triggers in 10 seconds).
So far the best solution I have found is to build a daemon, that is, a script that runs continuously checking for new events to fire. I'm aware PHP is the devil, leaks memory and whatnot, but I'm still hoping it is possible to have a PHP daemon running stably for weeks with occasional restarts, as long as I spawn new independent processes to do the "heavy lifting", that is, the actual processing of the events when they fire.
So anyway, the obvious questions:
1) Does this sound sane? Is there a better way that I may be missing?
2) Assuming I do implement the daemon idea, the code naturally needs to retrieve which events are due. Here's roughly how it could look in PHP (run_event.php is an illustrative worker name):

while (true) {
    // read the event list and keep only events that are due
    $due = $db->query("SELECT * FROM eventlist WHERE duetime <= NOW()");
    foreach ($due as $event) {
        // spawn a new independent PHP process to run the event
        exec('php run_event.php ' . (int)$event['id'] . ' > /dev/null 2>&1 &');
        // delete the entry so that it is not run twice
        $db->exec('DELETE FROM eventlist WHERE id = ' . (int)$event['id']);
    }
    usleep(50000); // sleep 50 ms
}
If I were to store this list in a MySQL DB (and that certainly seems the best way, since I need to be able to query the list with something along the lines of "SELECT * FROM eventlist WHERE duetime <= NOW();"), is it crazy to have the daemon doing a SELECT every 50 or 100 milliseconds? Or am I just being over-paranoid, and the server should handle it just fine? The amount of data retrieved in each iteration should be relatively small, perhaps a few hundred rows; I don't think it will amount to more than a few KB of memory. Also, the daemon and the MySQL server would run on the same machine.
3) If I do use everything described above, including the table in a MySQL DB, what are some things I could do to optimize it? I thought about storing the table in memory, but I don't like the idea of losing its contents whenever the server crashes or restarts. The closest thing I can think of would be to have a standard InnoDB table where writes and updates are done, and another 1:1 mirror MEMORY table where reads are performed. Using triggers it should be doable to have the memory table mirror everything, but on the other hand it does sound like a pain in the ass to maintain (fubar situations can easily happen if for some reason the tables get desynchronized).
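A sketch of that trigger-based mirroring (table and column names are assumptions; note that MEMORY tables do not support BLOB/TEXT columns, so this only works if the payload fits in fixed-size types):

```sql
-- Durable table takes writes; in-memory copy serves the fast reads.
CREATE TABLE eventlist_mem LIKE eventlist;
ALTER TABLE eventlist_mem ENGINE = MEMORY;

DELIMITER //
CREATE TRIGGER eventlist_mirror_insert
AFTER INSERT ON eventlist
FOR EACH ROW
BEGIN
  INSERT INTO eventlist_mem (id, duetime, payload)
  VALUES (NEW.id, NEW.duetime, NEW.payload);
END//
DELIMITER ;
-- Matching AFTER UPDATE / AFTER DELETE triggers would be needed as well,
-- which is exactly the maintenance burden the question anticipates.
```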
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' are stored in a MySQL DB with a timestamp for when the job should be run (may be months into the future). Jobs can be cancelled at anytime by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that comes in via HTTP.
The reason we use async HTTP POSTs for multithreading, rather than something like pcntl_fork, is so that we can add other servers, POST the data to them instead, and have them run the jobs rather than the current server.
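For reference, this kind of batched fire-and-wait dispatch is typically done with curl_multi; a rough sketch (the URL and payload shape are assumptions):

```php
<?php
// Dispatch each batch of Z jobs as a concurrent POST request.
$mh = curl_multi_init();

foreach ($batches as $batch) {
    $ch = curl_init('https://worker.example.com/process-jobs.php');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($batch));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}

// Drive all transfers concurrently from this single process.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // avoid busy-waiting
} while ($running > 0);

curl_multi_close($mh);
```

Pointing curl_init at a different host is all it takes to push a batch onto another server, which is the distribution property the HTTP approach is chosen for.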
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with messages that will only become reservable, and run, in X seconds (the time you want each to run minus the time now).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to reserve. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP); that could barely manage 20 requests per second at the very most, whereas Beanstalkd accepted over a thousand per second with barely any effort.
Edited to add: you can't delete a job without knowing its ID, though you could store that outside the queue. On the other hand, do users really have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance: you could still have a single DB reader that runs every, say, 1 to 5 minutes and puts only the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
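A sketch of put-with-delay using the Pheanstalk client (the tube name and job shape are assumptions, and the put() signature differs between Pheanstalk versions):

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = Pheanstalk::create('127.0.0.1');

// Delay = desired run time minus now; the job only becomes
// reservable by workers once the delay has elapsed.
$delay = max(0, $job['run_at'] - time());

$pheanstalk->useTube('imap-jobs')->put(
    json_encode($job),
    Pheanstalk::DEFAULT_PRIORITY,
    $delay
);
```

Keeping the canonical job record in MySQL (for the management UI) and enqueueing only near-due jobs keeps cancellation simple: a cancelled job is just never handed to beanstalkd.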
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.