What is the best way to perform cron-job automation for multiple users?
Example:
A cron job needs to run every 10 minutes and call a PHP script that connects to an external API (via curl) and collects data (site visitors and other data) a user has received on an external web property. Every 10 minutes we need to check via the API whether there is any new data available for that user and fetch it -- and do that for EACH user in the web app.
Such an API call from the PHP script usually takes 1-3 seconds per user, but can occasionally take 30 seconds or more to complete for a single user (in exceptional situations).
Question...
What is the best way to perform this procedure and collect external data like that for MULTIPLE users? Hundreds, even thousands of users?
For each user, we need to check for data every 10 minutes.
Originally I was thinking of processing 10 users in a row in a loop within one cron-job call, but since collecting data for a single user can take up to 30 seconds, a loop over 10 users could take several minutes and the script could time out. Correct?
Do you have some tips and suggestions on how to perform this procedure for many users most efficiently? Should separate cron jobs be used for each user? Instead of a loop?
Thank you!
=== EDIT ===
Let's say one PHP script can call the API for 10 users within 1 minute... Could I create 10 cron-jobs that essentially call the same PHP script simultaneously, but each one collecting a different batch of 10 users? This way I could potentially get data for 100 users within one minute? No?
It could look like this:
/usr/local/bin/php -q get_data.php?users_group=1
/usr/local/bin/php -q get_data.php?users_group=2
/usr/local/bin/php -q get_data.php?users_group=3
and so on...
Is this going to work?
=== NOTE ===
Each user has a unique Access Key with the external API service, so one API call can only be for one user at a time. But the API could receive multiple simultaneous calls for different users at once.
If it takes 30 seconds per user and you have more than 20 users, you won't finish before you need to start again. I would consider using Gearman or another job server to handle each of these requests asynchronously. Gearman can also wait for jobs to complete, so you should be able to loop over all the requests you need to make and then wait for them to finish. You could probably accomplish the same thing with PHP's pthreads extension, but that's going to be significantly more difficult.
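A minimal sketch of that approach using PHP's gearman extension (the fetch_user_data function name and the server address are assumptions, not anything from the question):

<?php
// Client side (run from cron): queue one task per user and wait for all of them.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$userIds = [101, 102, 103]; // ... load the real user IDs from your DB ...

foreach ($userIds as $userId) {
    // addTask() registers a task to run in parallel; nothing runs until runTasks().
    $client->addTask('fetch_user_data', (string) $userId);
}
$client->runTasks(); // blocks until every queued task has finished

And the worker (a long-running process started separately, e.g. under supervisord):

<?php
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('fetch_user_data', function (GearmanJob $job) {
    $userId = $job->workload();
    // call the external API with this user's unique Access Key and store the result
});
while ($worker->work()) {
    // handles one job per iteration, forever
}

Run several workers if you want several API calls in flight at once.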
Related
I am trying to create a website monitoring webapp using PHP. At the minute I'm using curl to collect headers from different websites and update a MySQL database when a website's status changes (e.g. if a site that was 'up' goes 'down').
I'm using curl_multi (via the Rolling Curl X class which I've adapted slightly) to process 20 sites in parallel (which seems to give the fastest results) and CURLOPT_NOBODY to make sure only headers are collected and I've tried to streamline the script to make it as fast as possible.
It is working OK and I can process 40 sites in approx. 2-4 seconds. My plan has been to run the script via cron every minute... so it looks like I will be able to process about 600 websites per minute. Although this is fine at the minute, it won't be enough in the long term.
So how can I scale this? Is it possible to run multiple crons in parallel or will this run into bottle-necking issues?
Off the top of my head I was thinking that I could maybe break the database into groups of 400 and run a separate script for these groups (e.g. ids 1-400, 401-800, 801-1200 etc. could run separate scripts) so there would be no danger of database corruption. This way each script would be completed within a minute.
However it feels like this might not work since the one script running curl_multi seems to max out performance at 20 requests in parallel. So will this work or is there a better approach?
Yes, the simple solution is to use the same PHP CLI script and pass it two arguments, i.e. the min and max of the ID range of the database records (each record containing one site's information) to process.
Ex. crontab list
* * * * * php /user/script.php 1 400
* * * * * php /user/script.php 401 800
Or, using a single script, you can use multi-threading (multi-threading in PHP with pthreads). In that case the cron interval should be based on a benchmark of how long it takes to complete all 800 sites.
Ref: How can one use multi threading in PHP applications
Ex. if the multi-threaded script completes in 3 minutes, then set the interval to */3.
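A rough sketch of what /user/script.php could look like (the sites table and its columns are assumptions for illustration):

<?php
// Usage from cron: php /user/script.php 1 400
// $argv[1] and $argv[2] are the min and max IDs of the sites to process in this run.
if ($argc < 3) {
    fwrite(STDERR, "Usage: php script.php <min_id> <max_id>\n");
    exit(1);
}
$minId = (int) $argv[1];
$maxId = (int) $argv[2];

$pdo  = new PDO('mysql:host=localhost;dbname=monitor', 'user', 'pass');
$stmt = $pdo->prepare('SELECT id, url FROM sites WHERE id BETWEEN :min AND :max');
$stmt->execute(['min' => $minId, 'max' => $maxId]);
$sites = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Hand this slice of sites to the existing curl_multi / Rolling Curl X code
// (20 requests in parallel at a time), exactly as before; only the ID range differs.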
I currently have a scheduled console command that runs every 5 minutes without overlap like this:
$schedule->command('crawler')
->everyFiveMinutes()
->withoutOverlapping()
->sendOutputTo('../_laravel/storage/logs/scheduler-log.txt');
So it works great, but I currently have about 220 pages that take about 3 hours to finish in increments of 5 minutes, because I force it to crawl only 10 pages at each interval since each page takes 20-30 seconds to crawl due to various factors. Each page is a record in the database. If I end up having 10,000 pages to crawl, this method would not work because it would take more than 24 hours, and each page is supposed to be re-crawled once a day.
My vendor allows up to 10 concurrent requests (or more on higher plans), so what's the best way to run this concurrently? If I just duplicate the scheduler code, does it run the same command twice (or 10 times if I duplicate it 10 times)? Would that cause any issues?
And then I would need to pass parameters to the console command, such as 1, 2, 3, etc., which I could use to determine which pages to crawl, i.e. 1 would be records 1-10, 2 would be the next records 11-20, and so on.
Using this StackOverflow answer, I think I know how to pass it along, like this:
$schedule->command('crawler --sequence=1')
But how do I read that parameter within my Command class? Does it just become a regular PHP variable, i.e. $sequence?
Better to use a queue for job processing:
On each cron run, add all the jobs to the queue.
Run multiple queue workers, which will process the jobs in parallel.
Tip: this happened to us.
It might happen that a job added earlier has not completed yet, but cron adds the same task to the queue again (queues are processed sequentially). To protect yourself from that situation, record in the database when each task was last completed, so you know when to execute the job again (and can tell when it has been seriously delayed).
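A rough sketch of that setup in Laravel (the Page model, the last_crawled_at column and the CrawlPage job class are assumptions for illustration, not part of the question):

<?php
// In App\Console\Kernel::schedule(): enqueue the pages that are due instead of crawling them inline.
$schedule->call(function () {
    \App\Page::where('last_crawled_at', '<', \Carbon\Carbon::now()->subDay())
        ->chunk(100, function ($pages) {
            foreach ($pages as $page) {
                dispatch(new \App\Jobs\CrawlPage($page)); // CrawlPage implements ShouldQueue
            }
        });
})->everyFiveMinutes();

// The CrawlPage job crawls one page and updates last_crawled_at when it finishes,
// which is exactly the "record when a task was last completed" tip above.
// Then start as many workers as the vendor's concurrency limit allows, e.g.:
//   php artisan queue:work
//   php artisan queue:work
//   ... (typically managed by supervisord)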
I found this on the documentation, I hope this is what you're looking for:
Retrieving Input
While your command is executing, you will obviously need to access the
values for the arguments and options accepted by your application. To
do so, you may use the argument and option methods:
Retrieving The Value Of A Command Argument
$value = $this->argument('name');
Retrieving All Arguments
$arguments = $this->argument();
Retrieving The Value Of A Command Option
$value = $this->option('name');
Retrieving All Options
$options = $this->option();
source
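Tying that back to the question: assuming a Laravel version that uses the $signature property, the --sequence option could be declared and read like this (the Page model and the chunk size of 10 are only taken from the question for illustration):

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

class Crawler extends Command
{
    // The command name plus an optional --sequence option with a default of 1.
    protected $signature = 'crawler {--sequence=1}';

    public function handle()
    {
        // It does not become a plain variable automatically; you read it explicitly:
        $sequence = (int) $this->option('sequence');

        // e.g. sequence 1 -> records 1-10, sequence 2 -> records 11-20, and so on
        $offset = ($sequence - 1) * 10;
        $pages  = \App\Page::orderBy('id')->skip($offset)->take(10)->get();

        // ... crawl $pages ...
    }
}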
Here's what I'm trying to accomplish in high-level pseudocode:
query db for a list of names (~100)
for each name (using php) {
    query a 3rd party site for xml based on the name
    parse/trim the data received
    update my db with this data
    wait 15 seconds (the 3rd party site has restrictions and I can only make 4 queries / minute)
}
So this was running fine. The whole script took ~25 minutes (99% of the time was spent waiting 15 seconds after every iteration). My web host then made a change so that scripts time out after 70 seconds (understandable). This completely breaks my script.
I assume I need to use cron jobs or the command line to accomplish this. I only understand the basic use of cron jobs. Any high-level advice on how to split up this work with cron jobs? I am not sure how a cron job could work through a dynamic list.
Cron itself has no idea of your list or of what has already been done, but you can use two kinds of cron jobs.
The first cron job - which runs, for example, once a day - could add your 100 items to a job queue.
The second cron job - which runs, for example, once every minute during a certain period - can check whether there are items in the queue, execute one (or a few) and remove them from the queue.
Note that both cron jobs are just triggers that start a PHP script in this case, and you have two different scripts - one to fill the queue and one to process part of the queue - so almost everything is still done in PHP; a rough sketch of both follows below.
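A minimal sketch of those two scripts, assuming a plain MySQL table (here called job_queue) is used as the queue; the table and column names are placeholders:

<?php
// fill_queue.php -- triggered by the first cron job, e.g. once a day.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->exec('INSERT INTO job_queue (name) SELECT name FROM names');

And the processor, triggered by the second cron job every minute:

<?php
// process_queue.php -- takes a few queued names and processes them, respecting the rate limit.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query('SELECT id, name FROM job_queue ORDER BY id LIMIT 3')->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $i => $row) {
    if ($i > 0) {
        sleep(15); // stay under the 4 queries / minute limit of the 3rd party site
    }
    // query the 3rd party site for xml based on $row['name'], parse/trim it, update the main table ...
    $pdo->prepare('DELETE FROM job_queue WHERE id = :id')->execute(['id' => $row['id']]);
}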
In short, there is not much that is different. Instead of executing the script via mod_php or FastCGI, you are going to execute it from the command line: php /path/to/script.php.
Because this is a different environment than HTTP, some things obviously don't work: sessions, cookies, GET and POST variables. Output gets sent to stdout instead of the browser.
You can pass arguments to your script by using $argv.
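For example (a trivial sketch):

<?php
// Called as: php /path/to/script.php foo 42
// $argv[0] is the script path itself; the real arguments start at $argv[1].
echo $argv[1]; // prints "foo"
echo $argv[2]; // prints "42"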
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' are stored in a MySQL DB with a timestamp for when the job should be run (may be months into the future). Jobs can be cancelled at anytime by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system where we have a cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them up into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that come in via HTTP.
The reason we use an async HTTP POST for multithreading and not something like pcntl_fork is so that we can add other servers and have them POST the data to those instead, and have them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will be able to be reserved, and run, in X seconds (the time you want it to run minus the time now).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). This could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
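A rough sketch of the put-with-delay idea using the Pheanstalk client (assuming the Pheanstalk 4.x API; the tube name and payload format are arbitrary choices):

<?php
use Pheanstalk\Pheanstalk;

// Producer (the DB-reader cron): queue each due job with a delay until its run time.
$pheanstalk = Pheanstalk::create('127.0.0.1');

$row   = ['id' => 123, 'run_at' => strtotime('+2 hours')]; // e.g. one due row from the MySQL jobs table
$delay = max(0, $row['run_at'] - time()); // the time you want it to run minus the time now

$pheanstalk->useTube('imap-jobs');
$pheanstalk->put(json_encode(['job_id' => $row['id']]), Pheanstalk::DEFAULT_PRIORITY, $delay);

// Worker: block until a delayed job becomes ready, process it, then delete it.
$pheanstalk->watch('imap-jobs');
$pheanstalk->ignore('default');
while ($job = $pheanstalk->reserve()) {
    $data = json_decode($job->getData(), true);
    // re-check the DB row here first: if the user has cancelled it, skip the IMAP work entirely
    $pheanstalk->delete($job);
}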
Edited to add: You can't delete a job without knowing its ID, though you could store that outside. On the other hand, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance, so you could still have just one DB reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you would need, with the efficiencies they can bring.
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.
Hello, I have some problems with my PHP AJAX script.
I'm using PHP/MySQL.
I have a field in my accounts table that saves the time of the last request from a user; I will use that to kick idle users out of the chat. I will write a PHP function that deletes all rows whose time field is older than the time limit, but where should I call this function? Is it okay to fire it every time a new request is sent to my index.php? I think that will put a huge load on the server, won't it? Do you have a better solution?
Thanks
There are two viable solutions:
either create a small PHP script that makes this deletion in an infinite loop (and of course sleeps for a specified amount of time before doing it again), and then start it via PHP CLI,
or create one that makes the deletion only once, then exits, and call it from cron (if you're using a UNIXish server) or Task Scheduler (on Windows).
The second one is simpler, but its drawback is that you can't make the interval between the deletions shorter than 60 seconds.
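For illustration, the second option is just a one-shot script plus a crontab entry such as:

* * * * * php /path/to/cleanup.php

while the first option is a small long-running script (the path and the 30-second interval are placeholders):

<?php
// cleanup_loop.php -- started once from the command line: php /path/to/cleanup_loop.php
while (true) {
    // ... run the deletion query here ...
    sleep(30); // unlike cron, the interval can be shorter than 60 seconds
}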
A solution could be to fire the deletion function just once every few requests.
Using rand() you could give it a 1 in 100 (for example) chance of running the function, so that about one page request in 100 will clean up the expired data.
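For example, something like this near the top of index.php (the table and column names are placeholders, and the PDO connection stands in for whatever database handle you already use):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');

// Roughly 1 request in 100 triggers the cleanup of idle users.
if (rand(1, 100) === 1) {
    $timeout = 300; // seconds of inactivity before a user counts as idle (adjust as needed)
    $stmt = $pdo->prepare('DELETE FROM accounts WHERE last_request < :cutoff');
    $stmt->execute(['cutoff' => time() - $timeout]);
}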