How to effectively execute this cron job? - php

I have a table with 200 rows. I'm running a cron job every 10 minutes to perform an insert/update operation on the table. Only 5 rows should be processed per run: records 1-5 in the first 10 minutes, records 6-10 by the 20th minute, and so on.
After the 20th run, every record in the table will have been updated exactly once, which is the goal. The next run should then start the process over.
The problem:
is that each run should actually operate on N rows, not just 5. So if N is 100, all records would be updated after just 2 runs, and the next run would start the process over.
Here's an example:
This is the table I currently have (200 records). On each run, the cron job needs to pick N records (N is a variable I set in PHP) and update the time_md5 field with the MD5 hash of the current time.
+-----+----------------------------------+
| id  | time_md5                         |
+-----+----------------------------------+
| 10  | 971324428e62dd6832a2778582559977 |
| 72  | 1bd58291594543a8cc239d99843a846c |
| 3   | 9300278bc5f114a290f6ed917ee93736 |
| 40  | 915bf1c5a1f13404add6612ec452e644 |
| 599 | 799671e31d5350ff405c8016a38c74eb |
| 56  | 56302bb119f1d03db3c9093caf98c735 |
| 798 | 47889aa559636b5512436776afd6ba56 |
| 8   | 85fdc72d3b51f0b8b356eceac710df14 |
| ..  | .......                          |
| ..  | .......                          |
| 340 | 9217eab5adcc47b365b2e00bbdcc011a | <-- 200th record
+-----+----------------------------------+
So the first record (id 10) should not be updated again until all 200 records have been updated once; only then should the process start over.
I have some idea on how this could be achieved, but I'm sure there are more efficient ways of doing it.
Any suggestions?

You could use a red/black system (as used in cluster management).
Basically, all your rows start out black. Each cron run marks the rows it updates as red. Once all rows are red, you switch direction and start turning red rows black again. Alternating like this effectively marks rows so that you never update one twice in a pass. (Store the current target color in a file or similar so it is shared between cron runs.)
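A minimal sketch of the red/black rotation, assuming a hypothetical `items` table with an added `color` flag column and a small state file that holds the current target color (the connection details, table name, and file path are all placeholders):

```php
<?php
// Hypothetical schema change:
//   ALTER TABLE items ADD COLUMN color ENUM('black','red') NOT NULL DEFAULT 'black';
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$n = 5; // rows to process per run (the N from the question)

// The color we are currently turning rows INTO, shared between runs via a file.
$stateFile = '/tmp/cron_color';
$target = file_exists($stateFile) ? trim(file_get_contents($stateFile)) : 'red';
$source = ($target === 'red') ? 'black' : 'red';

// Update up to N rows that still have the old color, flipping them as we go.
$stmt = $pdo->prepare(
    "UPDATE items SET time_md5 = MD5(NOW()), color = ?
     WHERE color = ? ORDER BY id LIMIT ?");
$stmt->bindValue(1, $target);
$stmt->bindValue(2, $source);
$stmt->bindValue(3, $n, PDO::PARAM_INT);
$stmt->execute();

// If nothing was left to flip, the pass is complete: reverse direction.
if ($stmt->rowCount() === 0) {
    file_put_contents($stateFile, $source);
}
```

Because each run only touches rows of the old color, no row can be updated twice within one pass, regardless of N.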

I would just run the PHP script every 5 or 10 minutes with cron, and then use PHP's date and time functions for the rest of the logic. If timing alone isn't reliable enough, store a position marker in a small file.
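A sketch of the position-marker variant: keep a rotating offset in a small file and advance it on every run. The `items` table name, connection details, and file path are assumptions:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$n = 5; // rows per run
$file = '/tmp/cron_offset';

// Load the saved position (0 on the very first run).
$offset = file_exists($file) ? (int) file_get_contents($file) : 0;

// Update the next N rows in id order, starting at the saved offset.
// The derived table works around MySQL's restriction on LIMIT inside IN().
$stmt = $pdo->prepare(
    "UPDATE items SET time_md5 = MD5(NOW())
     WHERE id IN (SELECT id FROM (SELECT id FROM items ORDER BY id LIMIT ?, ?) t)");
$stmt->bindValue(1, $offset, PDO::PARAM_INT);
$stmt->bindValue(2, $n, PDO::PARAM_INT);
$stmt->execute();

// Advance the pointer; wrap around once every row has been visited.
$total = (int) $pdo->query("SELECT COUNT(*) FROM items")->fetchColumn();
$offset = ($offset + $n) % max($total, 1);
file_put_contents($file, (string) $offset);
```

This is simpler than the color flag but assumes only one cron process runs at a time; with concurrent runs the file read/write would need locking.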


Executing tasks in various intervals

I have a table in MySQL as follows:
+----+---------------+--------------+----------------------+---------------------+
| id | report_name | report_id | rinterval | last_run |
+----+---------------+--------------+----------------------+---------------------+
| 1 | test report 1 | 434234234234 | every morning | 2016-05-20 12:55:07 |
| 2 | test report 2 | 3434232 | every sunday morning | 2016-05-20 12:55:07 |
| 3 | test report 3 | 342423423 | never | 2016-05-20 12:55:07 |
| 4 | test report 4 | 4324234 | every morning | 2016-05-20 12:55:07 |
+----+---------------+--------------+----------------------+---------------------+
I am trying to create a PHP script (preferably) that, when called, runs the appropriate reports. I would like some suggestions on the best way to do that.
Let's assume I set a cron job to call the script every morning, and the intervals are as above (+similar): every morning, every Sunday morning, twice a month, etc. Also assume a report should not run automatically if less than 24 hours have passed since its last run, and that a manual call can be initiated at any time.
I was thinking something like this:
Call the script
Find the current day and time
Select all reports from the above table whose last_run is more than 24 hours ago
Iterate over those records and run the reports (a report is run via http://example.com/report_name/report_id)
If rinterval reads "every morning" - run the report
If rinterval reads "every Sunday morning" - run the report only if it is Sunday (and similarly for other days, using a case statement)
If rinterval reads "never" - do not run the report
If rinterval reads "twice a month" - check the last run date and run the report if more than 15 days have passed (or similar)
In all the above cases, on a successful run, update the last_run timestamp.
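The steps above could be sketched roughly like this; the table name `reports` and the connection details are assumptions, while the column names and URL scheme come from the question:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Only consider reports whose last run was more than 24 hours ago.
$rows = $pdo->query(
    "SELECT * FROM reports
     WHERE last_run < NOW() - INTERVAL 24 HOUR")->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    $run = false;
    switch ($row['rinterval']) {
        case 'every morning':
            $run = true;
            break;
        case 'every sunday morning':
            $run = ((int) date('w') === 0); // 0 = Sunday
            break;
        case 'twice a month':
            $run = (time() - strtotime($row['last_run'])) >= 15 * 86400;
            break;
        case 'never':
        default:
            $run = false;
    }
    if ($run) {
        // Fire the report, then record the successful run.
        file_get_contents("http://example.com/{$row['report_name']}/{$row['report_id']}");
        $pdo->prepare("UPDATE reports SET last_run = NOW() WHERE id = ?")
            ->execute([$row['id']]);
    }
}
```

A real version would check the HTTP response before updating last_run, so a failed report is retried on the next call.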
One of my problems is handling manual calls - for example, running 2 manual calls 2 minutes apart for testing. If I run a report manually on, say, Monday afternoon, I still want it to run on Tuesday morning. Should I introduce another column indicating whether a run was manual or automatic? I still need a record that a manual run was made, but it must not break the schedule, since the report must run before 08:00 in the morning.
What are your thoughts? I am sure there is a more efficient way to do this. I am open to all suggestions, I am doing this from scratch.
One of my problems is what happens if I run a manual call - or if I want to run 2 manual calls 2 minutes apart for testing
That depends on what your scripts are doing.
I still need to know that the run was made
Then simply keep a log of invocations in a separate table, with e.g. the script name, the date of invocation, and how it was invoked (manual or automatic).
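A minimal invocation log might look like this; the table name `report_runs` and the column layout are assumptions, chosen so manual runs are recorded without touching the last_run schedule:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// One row per invocation; "manual" distinguishes hand-triggered runs.
$pdo->exec(
    "CREATE TABLE IF NOT EXISTS report_runs (
        id INT AUTO_INCREMENT PRIMARY KEY,
        report_id BIGINT NOT NULL,
        run_at DATETIME NOT NULL,
        manual TINYINT(1) NOT NULL DEFAULT 0  -- 1 = triggered by hand
    )");

function logRun(PDO $pdo, $reportId, $manual = false)
{
    $stmt = $pdo->prepare(
        "INSERT INTO report_runs (report_id, run_at, manual) VALUES (?, NOW(), ?)");
    $stmt->execute([$reportId, $manual ? 1 : 0]);
}

// A manual test call is logged but leaves last_run alone, so the
// scheduled morning run still happens:
logRun($pdo, 434234234234, true);
```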

PHP, MySQL and Cron Jobs - Does a query store all rows to start, or go through sequentially?

I have a MySQL table:
Col 1 | Col 2 | Col 3 | Status
... | ... | ... | 0
... | ... | ... | 1
... | ... | ... | 2
etc
It is important that the table have the most up-to-date information in it, and so a cron job is run every minute, to update the table.
The Status column is to store whether the row needs to be updated, or is currently being updated. If the row needs to be updated, the status is 0. If the row is currently being updated, the status is 1. If the row has already been updated, the status is 2.
Once all rows have a status of 2, they are all reset to 0, and the process starts over.
The cron job runs every minute, but sometimes updating a row might take multiple minutes, meaning multiple cron jobs will be running simultaneously.
My question is, if I have a query like:
UPDATE * FROM table WHERE status=0
does the query go through one at a time, to the next 0? Or does the query look at all the rows first, and store which ones it will eventually visit?
EXAMPLE
Say that the following table is set up:
Col 1 | Col 2 | Col 3 | Status
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
At t=0, the first cron job (cj1) begins. It enters the first row, and sets the status to 1.
Col 1 | Col 2 | Col 3 | Status
... | ... | ... | 1
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
This process takes more than a minute, and so a second cron job (cj2) begins at t=1m.
cj2 sees that the first row is already being updated, and so goes to the second row.
Col 1 | Col 2 | Col 3 | Status
... | ... | ... | 1
... | ... | ... | 1
... | ... | ... | 0
... | ... | ... | 0
... | ... | ... | 0
Let's say that cj2 is busy updating that row for a few minutes. When cj1 finishes with the first row, will it skip to the 3rd row, because it sees that row 2 has a status of 1? Or will cj2 go to the second row, because it initially had a status of 0 when the query was called?
Since you're using InnoDB, each query will be performed as a transaction by default. So when you do
UPDATE table
SET <whatever>
WHERE status = 0
it will lock all the rows that match the status value. Other processes that perform a similar query will be blocked if they try to access any of these rows.
The specific way it does this depends on whether there's an index on the status column. If there is, it locks the matching index entries and then updates the rows they refer to.
If there's no index, it has to scan the table sequentially. Whenever it encounters a row with status = 0 it locks that row and then updates it. Other clients may scan the table in a different order, so they might reach rows this query will update before it gets to them. If they update the status at the same time as they update other columns, you shouldn't have a problem: when this query reaches those rows, they will no longer match the status = 0 criterion.
It depends on how you have implemented it. If, after finishing a row, the cron job fetches just the next single row with status = 0 and processes only that record, you can achieve your desired goal.
But if you fetch all rows with status = 0 up front, store them in a $results variable, and loop over it, then cron job 1 will reprocess record 2 without knowing that cron job 2 has already processed it, because $results does not reflect the updated information.
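A sketch of the "claim one row at a time" variant described above: each iteration flips status 0 -> 1 in a single UPDATE, so two overlapping cron runs can never grab the same row. The table name `jobs` and the `owner` column (added so a run can find the row it just claimed) are hypothetical:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$token = uniqid('cron', true); // identifies this cron run

while (true) {
    // Atomically mark one pending row as in-progress and tag it as ours.
    $claim = $pdo->prepare(
        "UPDATE jobs SET status = 1, owner = ? WHERE status = 0 LIMIT 1");
    $claim->execute([$token]);
    if ($claim->rowCount() === 0) {
        break; // nothing left to claim
    }

    // Fetch the row this run just claimed.
    $stmt = $pdo->prepare("SELECT * FROM jobs WHERE owner = ? AND status = 1");
    $stmt->execute([$token]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    // ... do the slow update work on $row here ...

    // Mark it done; once every row reaches status 2 they can all be reset to 0.
    $pdo->prepare("UPDATE jobs SET status = 2 WHERE owner = ? AND status = 1")
        ->execute([$token]);
}
```

Because the claim is a single UPDATE statement, InnoDB's row locking guarantees only one process wins each row, no matter how many cron jobs overlap.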

Questions about Queue System

I have a mysql queue that manages tasks for several php workers that run every minute via cron job.
I'll simplify everything to make it more understandable.
For the mysql part I have 2 tables:
worker_info
worker_id | name | hash | last_used
1 | worker1 | d8f9zdf8z | 2014-03-03 13:00:01
2 | worker2 | odfi9dfu8 | 2014-03-03 13:01:01
3 | worker3 | sdz7std74 | 2014-03-03 13:02:03
4 | worker4 | duf8s763z | 2014-03-03 13:02:01
...
tasks
id | times_run | task_id | workers_used
1 | 3 | 2932 | 1,6,3
2 | 2 | 3232 | 6,8
3 | 6 | 5321 | 3,2,6,10,5,20
4 | 1 | 8321 | 3
...
Tasks is a table to keep track of the tasks:
id identifies each row, and times_run is the number of times a task has been successfully executed. task_id is a number the PHP script needs for its routines.
workers_used is a text field that holds the ids of all worker_infos that have been processed for this task. I don't want the same worker_info multiple times per task, only one time.
worker_info is a table that holds some infos the php script needs to do its job along with last_used which is a global indicator for when this worker was last used.
Several php scripts work on the same tasks and I need the values to be precise as each worker_info should be used only 1 time for each task.
The PHP cron jobs include all the same routines:
the script performs a mysql query to get a task.
1. SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1 (we always work with one task at a time)
The script locks the worker_info table to prevent the same worker_info from being selected by multiple task queries:
2. LOCK TABLES worker_info WRITE
Then it gets the next worker_info not yet used for this task, sorted by last_used:
3. SELECT * FROM worker_info WHERE worker_id NOT IN($workers_used) ORDER BY last_used ASC LIMIT 1
Then it updates the last_used column so the same worker_info won't be selected again while the task is still running:
4. UPDATE worker_info SET last_used = NOW() WHERE worker_id = $id
Finally the lock gets released
5. UNLOCK TABLES
The php script performs its routines and if the task was successful it gets updated
6. UPDATE tasks SET times_run = times_run + 1, workers_used = IF(workers_used = '', '$worker_id', CONCAT(workers_used,', $worker_id'))
I know it's very bad practice to store workers_used this way instead of using a second table to declare the dependencies, but I'm a bit scared of the space that would take. One task can have several thousand workers_used entries and I have several thousand tasks, so such a table would quickly grow past a million rows, and I feared that would slow things down a lot; that's why I chose this storage format.
The script then performs steps 2-6 ten times for each task before going back to step 1, selecting a new task, and doing everything again.
This setup has served me well for about a year, but now that I need 50+ PHP scripts active on this queue system, I'm running into more and more performance problems.
Queries take up to 20 seconds, and I cannot scale any further as I need to: if I just run more PHP scripts, the MySQL server crashes.
I want no data loss if the system crashes, so I write every change to the DB as it happens. When I first built the system I also had problems with workers_used: with 10 PHP scripts working on one task, it occurred very often that one worker_info was used multiple times in the same task, which I do not want.
So I introduced the LOCK, which fixed this, but I suspect it is the bottleneck of the system. While one worker holds the table lock to perform its actions, the other 49 PHP workers have to wait, which is bad.
Now my questions are:
Is this implementation even good? Should I stick to it or throw it over and do something else?
Is the LOCK really my problem, or might something else be slowing down the system?
How can I improve this setup to make it a lot faster?
//Edit As suggested by jeremycole:
I suppose I need to update the worker_info table in order to implement the changes:
worker_info
worker_id | name | hash | tasks_owner | last_used
1 | worker1 | d8f9zdf8z | 1 | 2014-03-03 13:00:01
2 | worker2 | odfi9dfu8 | NULL | 2014-03-03 13:01:01
3 | worker3 | sdz7std74 | NULL | 2014-03-03 13:02:03
4 | worker4 | duf8s763z | NULL | 2014-03-03 13:02:01
...
And then change the routine to:
SET autocommit=0 (so the queries won't be autocommitted)
1. SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1 Select a Task to process
2. START TRANSACTION
3. SELECT * FROM worker_info WHERE worker_id NOT IN($workers_used) AND tasks_owner IS NULL ORDER BY last_used ASC LIMIT 1 FOR UPDATE
4. UPDATE worker_info SET last_used = NOW(), tasks_owner = $task_id WHERE worker_id = $worker_id
5. COMMIT
Do PHP routine and if successful:
6. UPDATE tasks SET times_run = times_run + 1, workers_used = IF(workers_used = '', '$worker_id', CONCAT(workers_used,', $worker_id'))
That should be it or am I wrong at some point?
Is the tasks_owner really needed or would it be sufficient to change the last_used date?
It may be useful to read my answer to another question about how to implement a job queue in MySQL here:
MySQL deadlocking issue with InnoDB
In short, using LOCK TABLES for this is quite unnecessary and unlikely to yield good results.
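A PDO sketch of steps 2-5 of the edited routine above (the NOT IN($workers_used) filter is omitted for brevity; connection details and the $taskId placeholder are assumptions):

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$taskId = 1; // placeholder for the task picked in step 1

$pdo->beginTransaction();

// Row-level lock on one free worker; other transactions block only on
// this single row, not on the whole table as LOCK TABLES would.
$stmt = $pdo->prepare(
    "SELECT worker_id FROM worker_info
     WHERE tasks_owner IS NULL
     ORDER BY last_used ASC
     LIMIT 1
     FOR UPDATE");
$stmt->execute();
$workerId = $stmt->fetchColumn();

if ($workerId !== false) {
    $pdo->prepare(
        "UPDATE worker_info SET last_used = NOW(), tasks_owner = ?
         WHERE worker_id = ?")
        ->execute([$taskId, $workerId]);
}
$pdo->commit();
```

Keeping the transaction this short means each of the 50 workers only waits for one row claim, not for a full table lock.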

How does the concept of repeat mode work?

In the project (in CodeIgniter) I am working on, a user can create a task and set its repeat mode to (Once/Daily/Weekly), where
Daily - Task will appear for the same time everyday in future
Weekly - Task will appear every Monday (say if task is being added on Monday)
Once - Task will get added only for today
Now, every task created by a user creates a record in the database.
For example, suppose a task is created today (13-01-2014) from 2:00-3:00 with repeat mode Daily. This creates a record against today's date (13-01-2014), but I can't add the same task at that time for all future dates.
Also, the user can change/edit the repeat mode of a task at any time, after which it should no longer repeat.
Can anyone please explain the concept of how this repeat mode works? I mean, when should tasks actually be created for future dates, and how should this be maintained in the database?
"Explain the concept of repeat mode" is a pretty vague request. However, I think I understand what piece is missing.
I assume you have some kind of taskId, which is a unique key for each task. What you need is a batchId as well. Your end result would look something like this:
+----------+----------+----------------------+
|taskId |batchId |description |
|----------|----------|----------------------|
| 1 | | Some meeting |
| 2 | | Another meeting |
| 3 | 1 | Daily meeting |
| 4 | 1 | Daily meeting |
| 5 | 1 | Daily meeting |
| 6 | 2 | Go to the gym! |
| 7 | 2 | Go to the gym! |
| 8 | 2 | Go to the gym! |
| 9 | 2 | Go to the gym! |
| 10 | | Yet another meeting |
+----------+----------+----------------------+
Having a batchId lets you group these events in the case you need to modify all the tasks at once, but still lets you modify each task individually if need be, thanks to the taskId.
The actual implementation of this batchId is up to you. For example, it can be:
a random string generated on-the-fly
a hash of the first taskId, to ensure they're always unique
a foreign key in a separate table that auto-generates a batchId as its key
Use the one that best suits your needs, or make one up yourself.
I just made up taskId and batchId. Replace those with whatever makes sense to you.
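With that structure, editing one occurrence versus the whole series uses the same query shape; the table name `tasks` and the connection details are assumptions:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Edit a single occurrence by its taskId:
$pdo->prepare("UPDATE tasks SET description = ? WHERE taskId = ?")
    ->execute(['Daily meeting (moved)', 5]);

// Edit every task in a recurring series at once, via its batchId:
$pdo->prepare("UPDATE tasks SET description = ? WHERE batchId = ?")
    ->execute(['Daily standup', 1]);
```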

Multi-cURL 5000 URLs

I need to check database entries for broken images. Right now I select all the items from the table and use cURL to check whether each one is broken or not. I have almost 5000 items in the DB, and cURL is taking a lot of time - about 0.07 seconds per result. My table structure is the following:
+----+----------------------------------------+
| id | image_url |
+----+----------------------------------------+
| 1 | http://s3.xxx.com/images/imagename.gif |
| 2 | http://s3.xxx.com/images/imagename.gif |
| 3 | http://s3.xxx.com/images/imagename.gif |
| 4 | http://s3.xxx.com/images/imagename.gif |
+----+----------------------------------------+
So is there any other way to check for broken images? I don't think I can use LIMIT here, because I need to check all items and then print the result. I have also used file_get_contents(), but it takes a lot of time too.
What you can do here is the following:
Use curl_multi to cURL the images in parallel.
Request headers only (you're not interested in the image data); if the status code is anything but 200 OK (or a 302 Found redirect), the image does not exist.
Chunk the 5000 items first; don't run them all through curl_multi at once. About 50-100 items at a time is fine.
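A sketch of all three points together: header-only checks run in parallel chunks of 50. The connection details and the `images` table name are assumptions; the column names come from the question:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$allUrls = $pdo->query("SELECT id, image_url FROM images")
               ->fetchAll(PDO::FETCH_KEY_PAIR); // id => image_url

function checkChunk(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $id => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: skip the image body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // a 302 to a live image counts as OK
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }

    // Drive all transfers in this chunk to completion in parallel.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    $broken = [];
    foreach ($handles as $id => $ch) {
        if (curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
            $broken[] = $id; // row id of a missing/broken image
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $broken;
}

// 50 URLs per batch keeps the number of open connections sane.
$broken = [];
foreach (array_chunk($allUrls, 50, true) as $chunk) {
    $broken = array_merge($broken, checkChunk($chunk));
}
```

One caveat: some servers reject HEAD requests, so if you see false positives, retry the flagged URLs with a normal GET before declaring them broken.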
