I want to create a feature on my website that allows users to schedule a post at a specific time. The posts will be placed in a queue and then posted at the specified time. What is the proper method for handling this? Do I need to create a separate cron job for each individual scheduled post?
Typically this is done with a single cron job acting as a "trigger". That trigger (a PHP script) checks the list of pending jobs in the database and executes one or more of them.
It might spawn a sub-request for each job to improve robustness, but this adds load to the system.
It is important to mark started jobs in the database so that, in case of concurrent trigger processes (for whatever reason, e.g. system load), a single job is not started twice. This is especially true for jobs that crash...
You might also want to implement a locking strategy, so that only one concurrent trigger request is possible.
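A minimal sketch of such a trigger, assuming a hypothetical scheduled_posts table with status, publish_at and claimed_by columns (those names, and the publishPost() helper, are illustrative, not part of your existing schema):

<?php
// trigger.php — run from a single cron entry, e.g.:  * * * * * php /path/to/trigger.php

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$runId = uniqid('run_', true); // identifies this particular trigger run

// Atomically mark due posts as started, so a concurrent trigger run cannot pick them up again.
$claim = $pdo->prepare(
    "UPDATE scheduled_posts
     SET status = 'processing', claimed_by = :run
     WHERE status = 'pending' AND publish_at <= NOW()"
);
$claim->execute(['run' => $runId]);

// Process only the rows this run actually claimed.
$rows = $pdo->prepare("SELECT * FROM scheduled_posts WHERE claimed_by = :run");
$rows->execute(['run' => $runId]);

foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $post) {
    publishPost($post); // your own publishing logic (hypothetical helper)
    $done = $pdo->prepare("UPDATE scheduled_posts SET status = 'done' WHERE id = :id");
    $done->execute(['id' => $post['id']]);
}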
Related
I have a very big database, and my users can pull samples from it.
They build very large queries that join about 30-40 tables, and a query sometimes takes up to 2 minutes to finish. I have optimized the server as much as possible, but the response is still very slow.
So I changed the interface so that the user can save the request, and the result is sent to their browser once the query has finished executing.
But there is one problem: I do not know how best to scan the database for requests that are waiting to be executed.
I created an event system: I record events in the database and then process them. Separately, I scan the database via cron.
But the problem with cron is that a run does not finish within 1 minute, a new cron run is launched anyway, and this increases the load on the server and leads to overlapping runs.
I want to create a PHP task so that, after the user saves a request, it starts executing, but only after the event for its execution has been created.
Could you please tell me how best to do this, and which approaches could help?
Thanks
I would use a framework such as Laravel and take advantage of its queue system.
https://laravel.com/docs/5.6/queues#job-events
There is already a queue driver implemented for databases.
"Using the before and after methods on the Queue facade, you may specify callbacks to be executed before or after a queued job is processed.".
I guess this can give you an idea about what to do after the query is processed.
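A minimal sketch of that idea, assuming Laravel's database queue driver and a hypothetical App\Jobs\RunUserQuery job class that executes the saved query ($savedQueryId and the class name are illustrative):

<?php
// e.g. in AppServiceProvider::boot()

use Illuminate\Support\Facades\Queue;
use Illuminate\Queue\Events\JobProcessing;
use Illuminate\Queue\Events\JobProcessed;
use App\Jobs\RunUserQuery;

Queue::before(function (JobProcessing $event) {
    // Mark the saved query as "running" in your own table.
});

Queue::after(function (JobProcessed $event) {
    // Mark the saved query as "finished" so the browser (polling or broadcasting)
    // can pick up the result for the user.
});

// In the controller, when the user saves a query: dispatch the job instead of running it inline.
RunUserQuery::dispatch($savedQueryId);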
When a user submits a form on my site, I have to do a job based on the form which is essentially:
Check for user locks (in Redis; this prevents the user from doing naughty things); if there are no locks, continue and put a job-queue lock in place, otherwise quit the job and give an error to the user
Update row(s) in a MySQL table, potentially delete some rows in the same table, and do at least 1 insert (potentially across different tables)
Remove the job-queue lock
I would like to queue these jobs up as they come in, with the queue always processing new jobs that get put into it.
I am using PHP and MySQL. I have looked at Gearman and also Resque for PHP. Gearman seems like it might be overkill. I also want to be able to handle potentially thousands of these jobs per second, so speed is important.
It's crucial that these jobs in the queue occur sequentially and in the order they come in. It would also be a bonus if every half a second I could insert a job to the front of the queue (it's a different job but kind of related).
I've never done anything like this before.
Since you're already using PHP and Redis, it looks like Resque may work for you.
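A minimal sketch with the chrisboulton/php-resque library (the queue name, job class and arguments are illustrative); running a single worker on one queue keeps the jobs sequential, in the order they were enqueued:

<?php
// On form submission: enqueue instead of doing the work inline.
require 'vendor/autoload.php';

Resque::setBackend('localhost:6379');
Resque::enqueue('form_jobs', 'ProcessFormJob', ['user_id' => 123, 'form' => $_POST]);

// The job class the worker process will run (autoloaded by the worker).
class ProcessFormJob
{
    public function perform()
    {
        $userId = $this->args['user_id'];
        // 1. check the Redis lock for this user, bail out if it is set
        // 2. update/delete/insert the MySQL rows
        // 3. remove the job-queue lock
    }
}

// Start exactly one worker for strict ordering, e.g.:
//   QUEUE=form_jobs php vendor/chrisboulton/php-resque/resque.php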
I have 5 cron jobs running a PHP file. The PHP file checks the MySQL database for items that require processing. Since cron launches the scripts all at the same time, it seems that some of the items are processed twice, or even sometimes up to five times.
Upon SELECTing an item in one of the scripts, it immediately sends an UPDATE query so that the other jobs shouldn't process it again. But it looks like items are still being double processed.
What can I do to prevent the other scripts from processing an item that was previously selected by the other cron jobs?
This issue is called a "race condition". In this case it happens because the SELECT and the UPDATE, though called one after another, are not a single atomic operation. Therefore, there is a chance that two jobs SELECT the same job, then the first does the UPDATE, and then the second does the UPDATE, and so they both proceed to run this job simultaneously.
There is a workaround, however.
You could add a field to your table containing the ID of the current cron job worker (if you run it all on one machine, this may be the PID). In the worker you do the UPDATE first, trying to reserve a job for it:
UPDATE jobs
SET worker = $PID, status = 'processing'
WHERE worker IS NULL AND status = 'awaiting' LIMIT 1
Then you verify you successfully reserved a job for this worker:
SELECT * FROM jobs WHERE worker = $PID
If it did not return a row, it means another worker was first to reserve it. You can try again from step 1 to acquire another job. If it did return a row, do all your processing, and then the final UPDATE at the end:
UPDATE jobs
SET status = 'done', worker = NULL
WHERE id = $JOB_ID
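A sketch of how a worker could wrap that pattern in PHP with PDO (the processJob() helper is illustrative; the table and column names follow the SQL above):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$pid = getmypid(); // identifies this cron worker

while (true) {
    // Step 1: try to reserve exactly one awaiting job for this worker.
    $reserve = $pdo->prepare(
        "UPDATE jobs SET worker = :pid, status = 'processing'
         WHERE worker IS NULL AND status = 'awaiting' LIMIT 1"
    );
    $reserve->execute(['pid' => $pid]);

    // Step 2: verify the reservation actually succeeded.
    $check = $pdo->prepare("SELECT * FROM jobs WHERE worker = :pid AND status = 'processing'");
    $check->execute(['pid' => $pid]);
    $job = $check->fetch(PDO::FETCH_ASSOC);
    if ($job === false) {
        break; // no awaiting jobs left (or another worker got there first)
    }

    processJob($job); // your ~30 second processing of one item (hypothetical helper)

    // Step 3: mark the job as done and release it.
    $done = $pdo->prepare("UPDATE jobs SET status = 'done', worker = NULL WHERE id = :id");
    $done->execute(['id' => $job['id']]);
}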
I think you have a typical problem that can be solved with semaphores. Take a look at this article:
http://www.re-cycledair.com/php-dark-arts-semaphores
The idea would be that, at the start of each script, you ask for the same semaphore and wait until it is free. Then SELECT and UPDATE the DB as you do now, free the semaphore, and start the processing. This is the only way you can be sure that no more than one script is reading the DB while another one is about to write to it.
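A minimal sketch using PHP's sysvsem functions (the ones the linked article covers); this only works when all the cron scripts run on the same machine, and selectAndMarkNextItem() / processItem() stand in for your existing code:

<?php
$key = ftok(__FILE__, 'q'); // every copy of this script derives the same key, so they share one semaphore
$sem = sem_get($key, 1);    // at most one process may hold it

sem_acquire($sem);          // blocks until the semaphore is free
try {
    // Critical section: SELECT the next item and immediately UPDATE it as taken,
    // so no other cron job can pick the same item.
    $item = selectAndMarkNextItem(); // your existing SELECT + UPDATE against MySQL
} finally {
    sem_release($sem);      // let the next script in
}

// Do the slow ~30 second processing outside the critical section.
if ($item !== null) {
    processItem($item);
}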
I would start again. This train of thought:
it takes time to process one item. about 30 seconds. if i have five cron jobs, five items are processed in 30 seconds
This is just plain wrong and you should not write your code with this in mind.
By that logic why not make 100 cron jobs and do 100 per 30 seconds? Answer, because your server is not RoadRunner and it will fall over and fail.
You should:
1. Optimise your code so that it does not take 30 seconds.
2. Segment your code so that each job is only doing one task at a time, which will make it quicker and also ensure that you do not get this 'double processing' effect.
3. Rethink your problem; this is the most important step, as it will help with 1 and 2.
EDIT
Even with the new knowledge that this runs on a third-party server, my logic still stands: do not start multiple calls that you are not in control of. In fact, this is now even more important.
If you do not know what they are doing with the calls, then you cannot be sure they are processed in the right order, or when, or whether they are processed at all. So just make one call to ensure you do not get double processing.
A technical solution would be for them to improve the processing time or for you to cache the responses - but that may not be relevant to your situation.
I am currently evaluating Gearman to farm out some expensive data import jobs in our backend. So far this looks very promising. However, there is one piece missing that I just can't seem to find any info about. How can I get a list of scheduled jobs from Gearman?
I realize I can use the admin protocol to get the number of currently queued jobs for each function, but I need info about the actual jobs. There is also the option of using a persistent queue (e.g. MySQL) and querying the database for the jobs, but it feels pretty wrong to me to circumvent Gearman for this kind of information. Other than that, I'm out of ideas.
Probably I don't need this at all :) So here's some more background on what I want to do; I'm open to better suggestions. Both the client and the worker run in PHP. In our admin interface the admins can trigger a new import for a client; as the import takes a while, it is started as a background task. Now the simple questions I want to be able to answer: When was the last import run for this client? Is an import already queued for this client (in which case triggering a new import should have no effect)? Nice to have: at which position in the queue is this job (so I can estimate when it will run)?
Thanks!
The Admin protocol is what you'd usually use, but as you've discovered, it won't list the actual tasks in the queue. We've solved this by keeping track of the current tasks we've started in our application layer, and having a callback in our worker telling the application when the task has finished. This allows us to perform cleanup, notification etc. when the task has finished, and allows us to keep this logic in the application and not the worker itself.
Regarding progress, the best way is to use the built-in progress mechanics in Gearman itself; in the PHP module you can report it by calling $job->sendStatus($percentDone, 100). A client can then retrieve this value from the server using the task handle (which is returned when you start the job). That will allow you to show the current progress to users in your interface.
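A sketch of that with the pecl gearman extension (the function name and payload are illustrative):

<?php
// worker.php — report progress while importing.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');
$worker->addFunction('client_import', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    $total  = 100; // e.g. number of chunks for this import
    for ($i = 1; $i <= $total; $i++) {
        // ... import one chunk for $params['client_id'] ...
        $job->sendStatus($i, $total); // progress: numerator / denominator
    }
});
while ($worker->work());

<?php
// client side — start the import in the background and keep the handle.
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$handle = $client->doBackground('client_import', json_encode(['client_id' => 42]));
// Store $handle (plus the client id and a timestamp) in your application's own table.

// Later, e.g. from the admin interface:
list($known, $running, $numerator, $denominator) = $client->jobStatus($handle);
// $known/$running tell you if the job is still on the server; numerator/denominator is the progress.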
As long as you keep the currently running tasks in your application, you can use that to answer whether similar tasks are already running, but you can also use Gearman's built-in job coalescing / de-duplication; see the $unique parameter when adding the task.
The position in the current queue will not be available through Gearman, so you'll have to do this in your application as well. I'd stay away from asking the Gearman persistence layer for this information.
You have pretty much given yourself the answer: use an RDBMS (MySQL or Postgres) as the persistence backend and query the gearman_queue table.
For instance, we developed a hybrid solution: we generate a unique id for the job and pass it as the third parameter to doBackground() (http://php.net/manual/en/gearmanclient.dobackground.php) when queuing the job.
Then we use this id to query the gearman_queue table and check the job status by looking at the 'unique_key' field. You can also get the queue position, as the records are already ordered.
As a bonus, we also catch exceptions inside the worker. If a job fails, we write the job payload (a JSON-serialized object) to a file, and a cron job then picks up the file and requeues the job, incrementing an internal 'retry' counter so that a single job is retried 3 times at most; if it still fails, we can inspect it later.
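A sketch of that hybrid approach (the function name and payload are illustrative; the gearman_queue table and its unique_key column are the ones this answer refers to, so adjust them to match your gearmand persistent-queue configuration):

<?php
$client = new GearmanClient();
$client->addServer('127.0.0.1');

// Queue the job with our own unique id as the third parameter of doBackground().
$uniqueId = 'import-client-42';
$client->doBackground('client_import', json_encode(['client_id' => 42]), $uniqueId);

// Later: is this import still queued, and at which position?
$pdo  = new PDO('mysql:host=localhost;dbname=gearman', 'user', 'secret');
$keys = $pdo->query("SELECT unique_key FROM gearman_queue")->fetchAll(PDO::FETCH_COLUMN);

$position = array_search($uniqueId, $keys); // rows come back in queue order, as noted above
if ($position === false) {
    // not in the queue any more: already running or finished
} else {
    $position = $position + 1; // 1-based position in the queue
}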
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should be run (which may be months into the future). Jobs can be cancelled at any time by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system where we have cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them up into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that come in via HTTP.
The reason we use an async HTTP POST for multithreading, and not something like pcntl_fork, is so that we can add other servers and have the data POSTed to them instead, so that they run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will become reservable, and run, in X seconds (the time you want it to run minus the time now).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP), which could barely do 20 QPS at the very most, while Beanstalkd accepted over a thousand per second with barely any effort.
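A sketch of that setup with the pda/pheanstalk client (a v3-style API is assumed; the tube name, TTR value and the $dueJobs rows are illustrative):

<?php
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

// Feeder (cron, every few minutes): put jobs due soon into beanstalkd with a delay.
foreach ($dueJobs as $job) {                       // rows from your MySQL table due in the next few minutes
    $delay = max(0, $job['run_at'] - time());
    $pheanstalk->useTube('imap-inserts')->put(
        json_encode(['job_id' => $job['id']]),     // keep the payload small; the details stay in MySQL
        Pheanstalk::DEFAULT_PRIORITY,
        $delay,                                    // seconds until the job becomes reservable
        120                                        // TTR: seconds a worker may hold it before it is re-queued
    );
}

// Worker (long-running process): reserve jobs as their delay expires.
while (true) {
    $reserved = $pheanstalk->watch('imap-inserts')->ignore('default')->reserve();
    $payload  = json_decode($reserved->getData(), true);

    // Re-check the job in MySQL first: if the user cancelled it, just delete and skip.
    // Otherwise open the IMAP connection and insert the message.

    $pheanstalk->delete($reserved);
}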
Edited to add: You can't delete a job without knowing its ID, though you could store that outside the queue. OTOH, do users have to be able to delete jobs at any time, right up to the last minute? You don't have to put a job into the queue weeks or months in advance; you could still have a single DB reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.