Questions about Queue System - php

I have a MySQL queue that manages tasks for several PHP workers that run every minute via cron job.
I'll simplify everything to make it more understandable.
On the MySQL side I have 2 tables:
worker_info
worker_id | name | hash | last_used
1 | worker1 | d8f9zdf8z | 2014-03-03 13:00:01
2 | worker2 | odfi9dfu8 | 2014-03-03 13:01:01
3 | worker3 | sdz7std74 | 2014-03-03 13:02:03
4 | worker4 | duf8s763z | 2014-03-03 13:02:01
...
tasks
id | times_run | task_id | workers_used
1 | 3 | 2932 | 1,6,3
2 | 2 | 3232 | 6,8
3 | 6 | 5321 | 3,2,6,10,5,20
4 | 1 | 8321 | 3
...
tasks is a table to keep track of the tasks:
id is the primary key, times_run is the number of times a task has been successfully executed, and task_id is a number the PHP script needs for its routines.
workers_used is a text field that holds the IDs of all worker_info rows that have been processed for this task. I don't want the same worker_info to appear multiple times per task, only once.
worker_info is a table that holds some info the PHP script needs to do its job, along with last_used, a global indicator of when this worker was last used.
Several PHP scripts work on the same tasks, and I need the values to be precise, as each worker_info may be used only once per task.
The PHP cron jobs all follow the same routine:
The script performs a MySQL query to get a task.
1. SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1 We always work with one job at a time
The script locks the worker_info table so that the same worker_info cannot be selected by several scripts at once
2. LOCK TABLES worker_info WRITE
Then it gets the least recently used worker_info that has not yet been used for this task
3. SELECT * FROM worker_info WHERE worker_id NOT IN($workers_used) ORDER BY last_used ASC LIMIT 1
Then it updates last_used so the same worker_info won't get selected again while the task is still running
4. UPDATE worker_info SET last_used = NOW() WHERE worker_id = $id
Finally the lock gets released
5. UNLOCK TABLES
The PHP script performs its routines, and if the task was successful, the task gets updated
6. UPDATE tasks SET times_run = times_run + 1, workers_used = IF(workers_used = '', '$worker_id', CONCAT(workers_used, ',$worker_id')) WHERE task_id = $task_id I know it's very bad practice to store workers_used this way instead of using a second table to declare the dependencies, but I'm a bit scared of the space that would take.
One task can have several thousand workers_used entries, and I have several thousand tasks, so such a table would quickly grow beyond a million rows. I feared this could slow things down a lot, so I went with this way of storage.
Then the script performs steps 2-6 ten times for each task before going back to step 1, selecting a new task, and doing everything again.
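Put together, the loop looks roughly like the following sketch (mysqli; connection details are placeholders and do_work() stands in for the actual task routine):

<?php
// Sketch of the worker loop described above (steps 1-6).
$db = new mysqli('localhost', 'user', 'pass', 'queue');

// 1. Pick the task that has run the fewest times.
$task = $db->query("SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1")->fetch_assoc();

for ($i = 0; $i < 10; $i++) {
    // NOT IN () is invalid SQL, so fall back to a dummy id for an empty list.
    $used = $task['workers_used'] !== ''
        ? $db->real_escape_string($task['workers_used']) : '0';

    // 2. Lock worker_info so no other script grabs the same row.
    $db->query("LOCK TABLES worker_info WRITE");

    // 3. Least recently used worker not yet used for this task.
    $worker = $db->query("SELECT * FROM worker_info
                          WHERE worker_id NOT IN ($used)
                          ORDER BY last_used ASC LIMIT 1")->fetch_assoc();

    if ($worker) {
        // 4. Mark it as used right away...
        $db->query("UPDATE worker_info SET last_used = NOW()
                    WHERE worker_id = " . (int)$worker['worker_id']);
    }

    // 5. ...then release the lock.
    $db->query("UNLOCK TABLES");

    // 6. Do the actual work; on success, record the worker on the task.
    if ($worker && do_work($task, $worker)) {
        $wid = (int)$worker['worker_id'];
        $db->query("UPDATE tasks
                    SET times_run = times_run + 1,
                        workers_used = IF(workers_used = '', '$wid',
                                          CONCAT(workers_used, ',$wid'))
                    WHERE task_id = " . (int)$task['task_id']);
        // Keep the local copy in sync for the next iteration.
        $task['workers_used'] .= ($task['workers_used'] === '' ? '' : ',') . $wid;
    }
}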
Now this setup has served me well for about a year, but now that I need to have 50+ PHP scripts active on this queue system, I am getting more and more performance problems.
Queries take up to 20 seconds, and I cannot scale any further like I need to; if I just run more PHP scripts, the MySQL server crashes.
I want no data loss if the system crashes, therefore I write every change to the DB as it happens. Also, when I created the system, I had problems with workers_used: when 10 PHP scripts worked on one task, it occurred very often that one worker_info was used multiple times for the same task, which I do not want.
Therefore I introduced the LOCK, which fixed this, but I suspect it is the bottleneck of the system. If one worker locks the table to perform its actions, all 49 other PHP workers have to wait for it, which is bad.
Now my questions are:
Is this implementation even good? Should I stick to it or throw it over and do something else?
Is this LOCK even my problem, or might something else be slowing the system down?
How can I improve this setup to make it a lot faster?
//Edit As suggested by jeremycole:
I suppose I need to update the worker_info table in order to implement the changes:
worker_info
worker_id | name | hash | tasks_owner | last_used
1 | worker1 | d8f9zdf8z | 1 | 2014-03-03 13:00:01
2 | worker2 | odfi9dfu8 | NULL | 2014-03-03 13:01:01
3 | worker3 | sdz7std74 | NULL | 2014-03-03 13:02:03
4 | worker4 | duf8s763z | NULL | 2014-03-03 13:02:01
...
And then change the routine to:
SET autocommit=0 Set autocommit to 0 so the queries won't get autocommitted
1. SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1 Select a Task to process
2. START TRANSACTION
3. SELECT * FROM worker_info WHERE worker_id NOT IN($workers_used) AND tasks_owner IS NULL ORDER BY last_used ASC LIMIT 1 FOR UPDATE
4. UPDATE worker_info SET last_used = NOW(), tasks_owner = $task_id WHERE worker_id = $worker_id
5. COMMIT
Do the PHP routine and, if successful:
6. UPDATE tasks SET times_run = times_run + 1, workers_used = IF(workers_used = '', '$worker_id', CONCAT(workers_used, ',$worker_id')) WHERE task_id = $task_id
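In PHP, the whole revised routine might look roughly like this (a sketch; connection details are placeholders, do_work() stands in for my actual routine, and I assume tasks_owner gets reset once the work is done, since something has to release the claim):

<?php
// Sketch of the revised routine: InnoDB row locks instead of LOCK TABLES.
$db = new mysqli('localhost', 'user', 'pass', 'queue');
$db->query("SET autocommit = 0");

// 1. Pick a task.
$task = $db->query("SELECT * FROM tasks ORDER BY times_run ASC LIMIT 1")->fetch_assoc();
$taskId = (int)$task['task_id'];
$used = $task['workers_used'] !== ''
    ? $db->real_escape_string($task['workers_used']) : '0';

// 2.-5. Claim a free worker in a short transaction. FOR UPDATE locks
// only the selected row, so the other scripts keep running.
$db->query("START TRANSACTION");
$worker = $db->query("SELECT * FROM worker_info
                      WHERE worker_id NOT IN ($used) AND tasks_owner IS NULL
                      ORDER BY last_used ASC LIMIT 1 FOR UPDATE")->fetch_assoc();
if (!$worker) {
    $db->query("COMMIT");   // no free worker right now
    exit;
}
$wid = (int)$worker['worker_id'];
$db->query("UPDATE worker_info SET last_used = NOW(), tasks_owner = $taskId
            WHERE worker_id = $wid");
$db->query("COMMIT");

// 6. Work outside the transaction; on success, record it and release the claim.
if (do_work($task, $worker)) {
    $db->query("UPDATE tasks
                SET times_run = times_run + 1,
                    workers_used = IF(workers_used = '', '$wid',
                                      CONCAT(workers_used, ',$wid'))
                WHERE task_id = $taskId");
    $db->query("UPDATE worker_info SET tasks_owner = NULL WHERE worker_id = $wid");
    $db->query("COMMIT");
}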
That should be it, or am I wrong at some point?
Is tasks_owner really needed, or would it be sufficient to change the last_used date?

It may be useful to read my answer to another question about how to implement a job queue in MySQL here:
MySQL deadlocking issue with InnoDB
In short, using LOCK TABLES for this is quite unnecessary and unlikely to yield good results.

Related

Proper way to create a sequence based on multiple fields

I'm using Laravel and Migrations to build my entire database structure.
Problem description
In my schema, I have a pack table that belongs to user and group, and it needs to keep a kind of unique "index" for each different combination of these tables.
That means: a sequential number that increments per distinct user_id and group_id. For example:
| id | user_id | group_id | sequence |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 1 |
| 3 | 1 | 3 | 1 |
| 4 | 1 | 1 | 2 |
| 5 | 1 | 2 | 2 |
| 6 | 1 | 3 | 2 |
| 7 | 2 | 1 | 1 |
| 8 | 2 | 2 | 1 |
| 9 | 2 | 3 | 1 |
This will be used to reference a pack on the view layer:
user 1, this is your pack 1 of group 1.
user 1, this is your pack 2 of group 1.
user 1, this is your pack 1 of group 2.
I designed my migration (on up) like:
Schema::create('pack', function (Blueprint $table) {
$table->increments('id');
$table->integer('user_id')->unsigned();
$table->foreign('user_id')->references('id')->on('user');
$table->integer('group_id')->unsigned();
$table->foreign('group_id')->references('id')->on('group');
$table->integer('sequence')->unsigned();
});
And I use this business logic to fill the $pack->sequence field on the model layer.
Question 1:
Theoretically, should this be considered the best strategy for the described scenario?
Question 2:
Is there some pattern/approach that can be used to fill the sequence field on the database layer?
It appears you already have an auto-increment column id. MySQL does not support more than one auto-increment column per table.
In general, you can't get the behavior you're describing while allowing concurrent inserts to the table. The reason is that you have to read the max sequence value for some user/group pair, then insert the next value as you insert a new row.
But this creates a race condition, because some other concurrent session could be doing the same thing, and it will sneak in and insert a row with the next sequence value in between your session's steps of reading and inserting.
The solution is to use locking in a way that prevents a concurrent insert of rows with the same user_id and group_id. InnoDB's gap locks help with this.
Example:
Open two MySQL clients. In the first session, try this:
mysql> begin;
mysql> select max(sequence) from pack where user_id=1 and group_id=1 for update;
+---------------+
| max(sequence) |
+---------------+
| 2 |
+---------------+
The FOR UPDATE locks the rows examined, and it locks the "gap" which is the place where other rows with the same user_id and group_id would be inserted.
To prove this, try in the second session:
mysql> begin;
mysql> insert into pack set user_id=1, group_id=1, sequence=3;
It hangs. It can't do the insert, because that conflicts with the gap lock still held by the first session. The race-condition has been avoided.
Now in the first session, finish the work.
mysql> insert into pack set user_id=1, group_id=1, sequence=3;
mysql> commit;
Notice that session 1's locks are released immediately after the commit. The second session resolves its blocked INSERT, but it correctly gets an error:
ERROR 1062 (23000): Duplicate entry '1-1-3' for key 'user_id'
Of course, session 2 should have done the same SELECT...FOR UPDATE. That would have also been blocked until it could resolve the lock conflict. Once it resolved, it would have returned the correct new max sequence value.
The locks apply only to a given user_id/group_id combo, provided you have a suitable index. I used:
ALTER TABLE pack ADD UNIQUE KEY (user_id, group_id, sequence);
Once you have that key, the SELECT...FOR UPDATE can lock exactly the right set of rows.
What this means is that even if user_id=1, group_id=1 is locked, you can still insert a new entry for any other values of user_id or group_id. They lock distinct parts of the index, so there's no conflict.
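For what it's worth, the same key can also be declared from a Laravel migration (a sketch using Blueprint's standard unique() method):

// Composite unique key the gap locking relies on; this is the
// migration equivalent of the ALTER TABLE above.
Schema::table('pack', function (Blueprint $table) {
    $table->unique(['user_id', 'group_id', 'sequence']);
});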
I encourage you to do some experiments yourself to prove to yourself you understand how it works. You can do this without writing any PHP code. I just opened two Terminal windows, ran the mysql command-line client, and started writing at the mysql> prompt. You can too!
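And once you wire it into PHP, the pattern might look roughly like this (a PDO sketch; the DSN, credentials, and example IDs are placeholders):

<?php
// Allocate the next sequence value for one user/group combo.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$userId  = 1;
$groupId = 1;

$pdo->beginTransaction();
try {
    // Lock the matching rows and the insert gap for this combo.
    $stmt = $pdo->prepare(
        'SELECT MAX(sequence) FROM pack
         WHERE user_id = ? AND group_id = ? FOR UPDATE');
    $stmt->execute([$userId, $groupId]);
    $next = (int)$stmt->fetchColumn() + 1;

    // No concurrent session can insert this combo until we commit.
    $pdo->prepare('INSERT INTO pack (user_id, group_id, sequence)
                   VALUES (?, ?, ?)')
        ->execute([$userId, $groupId, $next]);

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}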

How does the concept of repeat mode work

In the project (in CodeIgniter) I am working on, a user can create a task and set its repeat mode as (Once/Daily/Weekly), where
Daily - Task will appear for the same time everyday in future
Weekly - Task will appear every Monday (say if task is being added on Monday)
Once - Task will get added only for today
Now every task created by user creates a record in database,
For example, suppose a task is created today (13-01-2014) from 2:00-3:00 with repeat mode Daily. This creates a record against this date (13-01-2014), but I can't add the same task at that time for all future dates.
Also, the user can change/edit the mode of a task at any time, and from then on it should not repeat anymore.
Can anyone please explain the concept of how this repeat mode works? I mean, when do I actually create the tasks for future dates, and how do I maintain this in the database?
"Explain the concept of repeat mode" is a pretty vague request. However, I think I understand what piece is missing.
I assume you have some kind of taskId, which is a unique key for each task. What you need is a batchId as well. Your end result would look something like this:
+----------+----------+----------------------+
|taskId |batchId |description |
|----------|----------|----------------------|
| 1 | | Some meeting |
| 2 | | Another meeting |
| 3 | 1 | Daily meeting |
| 4 | 1 | Daily meeting |
| 5 | 1 | Daily meeting |
| 6 | 2 | Go to the gym! |
| 7 | 2 | Go to the gym! |
| 8 | 2 | Go to the gym! |
| 9 | 2 | Go to the gym! |
| 10 | | Yet another meeting |
+----------+----------+----------------------+
Having a batchId lets you group these events in the case you need to modify all the tasks at once, but still lets you modify each task individually if need be, thanks to the taskId.
The actual implementation of this batchId is up to you. For example, it can be:
a random string generated on-the-fly
a hash of the first taskId, to ensure they're always unique
a foreign key in a separate table that auto-generates a batchId as its key
Use the one that best suits your needs, or make one up yourself.
I just made up taskId and batchId. Replace those with whatever makes sense to you.
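As a rough illustration, here is one way the batchId idea could play out (a PDO sketch; the table and column names are made up, just like taskId and batchId above):

<?php
// Materialize a window of future occurrences for a "Daily" task.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// One batchId per repeating task; here a random string generated on the fly.
$batchId = bin2hex(random_bytes(8));

$insert = $pdo->prepare(
    'INSERT INTO tasks (batch_id, description, start_time, end_time, task_date)
     VALUES (?, ?, ?, ?, ?)');

// Create the next 30 occurrences of a daily 2:00-3:00 task.
$date = new DateTime('2014-01-13');
for ($i = 0; $i < 30; $i++) {
    $insert->execute([$batchId, 'Daily meeting', '02:00:00', '03:00:00',
                      $date->format('Y-m-d')]);
    $date->modify('+1 day');
}

// Editing every future occurrence at once is then a single statement,
// while a single occurrence can still be edited via its taskId.
$pdo->prepare('UPDATE tasks SET description = ?
               WHERE batch_id = ? AND task_date >= CURDATE()')
    ->execute(['Daily standup', $batchId]);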

How to distribute rows to concurrent processes in the order defined on DB?

I have a DB table which is a kind of spool for performing tasks:
| id | status | owner | param1 |
+----+--------+-------+--------+
| 1 | used | user1 | AAA1 |
| 2 | free | user2 | AAA2 |
| 3 | free | user1 | AAA3 |
| 4 | free | user1 | AAA4 |
| 5 | free | user3 | AAA2 |
This table is accessed by many parallel processes. What would be the best way to ensure that each row from the table is "used" by just a single process, while at the same time rows are given out in the same order as they appear in the table (sorted by the id column)?
My first idea was to simply mark the next row in the queue with a simple update:
UPDATE table
SET status = "used"
WHERE owner = "userX"
AND status <> "used"
ORDER BY id
LIMIT 1
and then fetch the marked row.
This did not perform at all - with some data (e.g. 3,000,000 rows) and bigger loads, the process list filled up with UPDATE statements and MySQL crashed with an "Out of sort memory" error...
So my next idea is the following steps/queries:
step1
get the first unused row:
SELECT id
FROM table
WHERE owner = "userX"
AND status = "free"
ORDER BY id
LIMIT 1
step2
try to mark it as used if it is still free:
UPDATE table
SET status = "used"
WHERE id = <id from SELECT above>
AND status = "free"
step3
go to step1 if the row was NOT updated (because some other process already used it), or go to step4 if it was
step4
do the required work with the successfully claimed row
The disadvantage is that with many concurrent processes there will always be a lot of jumping between steps 1 and 2 until each process finds its "own" row. So to be sure the system works stably, I would need to limit the number of tries each process makes, and accept the risk that a process reaches the limit and finds nothing while there are still entries in the table.
Maybe there is some better way to solve this problem?
P.S. everything is done at the moment with PHP+MySQL
Just a suggestion: instead of sorting and limiting to 1, maybe just grab MIN(id):
SELECT MIN(id)
FROM table
WHERE owner = "userX"
AND status = "free"
I am also using a MySQL database to choose rows that need to be enqueued for lengthy processing, preferring to do them in the order of the primary index ID column, also using optimistic concurrency control as shown above (no transactions needed). Thank you to #sleeperson for the answer using MIN(id); it is far superior to ORDER BY / LIMIT 1.
I am posting one additional suggestion that allows for graceful restart. I implemented the following step, which is done only at startup time:
step0
get lingering rows:
SELECT id
FROM table
WHERE owner = "userX"
AND status = "used"
call step4
Etc. After a crash or other unwelcome (yet oh so common) event, this distributes out for processing the rows that should have been done previously, instead of leaving them marked 'used' in the database to trip me up later.
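For completeness, the whole claim loop (steps 1-4 with the MIN(id) variant) might look like this (a PDO sketch; the table is named spool here and process_row() stands in for the real work):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=spool', 'user', 'pass');
$owner = 'userX';

while (true) {
    // Step 1: oldest free row for this owner.
    $stmt = $pdo->prepare(
        "SELECT MIN(id) FROM spool WHERE owner = ? AND status = 'free'");
    $stmt->execute([$owner]);
    $id = $stmt->fetchColumn();
    if ($id === null || $id === false) {
        break;  // nothing left to do
    }

    // Step 2: claim the row only if it is still free.
    $upd = $pdo->prepare(
        "UPDATE spool SET status = 'used' WHERE id = ? AND status = 'free'");
    $upd->execute([$id]);

    // Step 3: if another process won the race, try again.
    if ($upd->rowCount() === 0) {
        continue;
    }

    // Step 4: the row is ours.
    process_row($pdo, $id);
}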

How to effectively execute this cron job?

I have a table with 200 rows. I'm running a cron job every 10 minutes to perform some kind of insert/update operation on the table. The operation needs to be performed on only 5 rows at a time per cron run. So in the first 10 minutes records 1-5 are updated, records 6-10 in the 20th minute, and so on.
When the cron job runs for the 40th time, all the records in the table will have been updated exactly once. This is what is to be achieved, at least. And the next cron job should start the process over again.
The problem:
is that every time a cron job runs, the insert/update operation should be performed on N rows (not just 5). So if N is 100, all records would've been updated after just 2 cron jobs, and the next cron job would repeat the process again.
Here's an example:
This is the table I currently have (200 records). Every time a cron job executes, it needs to pick N records (which I set as a variable in PHP) and update the time_md5 field with the current time's MD5 value.
+---------+-------------------------------------+
| id | time_md5 |
+---------+-------------------------------------+
| 10 | 971324428e62dd6832a2778582559977 |
| 72 | 1bd58291594543a8cc239d99843a846c |
| 3 | 9300278bc5f114a290f6ed917ee93736 |
| 40 | 915bf1c5a1f13404add6612ec452e644 |
| 599 | 799671e31d5350ff405c8016a38c74eb |
| 56 | 56302bb119f1d03db3c9093caf98c735 |
| 798 | 47889aa559636b5512436776afd6ba56 |
| 8 | 85fdc72d3b51f0b8b356eceac710df14 |
| .. | ....... |
| .. | ....... |
| .. | ....... |
| .. | ....... |
| 340 | 9217eab5adcc47b365b2e00bbdcc011a | <-- 200th record
+---------+-------------------------------------+
So, the first record (id 10) should not be updated more than once until all 200 records have been updated once - the process should start over only after all the records have been updated.
I have some idea on how this could be achieved, but I'm sure there are more efficient ways of doing it.
Any suggestions?
You could use a red/black system (like for cluster management).
Basically, all your rows start out as black. When your cron runs, it marks the rows it updated as red. Once all the rows are red, you switch and start turning the red rows black again. Keeping up this alternation lets you mark rows so that you never update one twice in the same pass. (You can store the current color goal in a file or something so that it is shared between cron runs.)
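A rough sketch of that alternation (PDO; the table name, the extra color column, and the goal-file location are all assumptions):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$n = 5;   // rows to process per cron run

// The shared color goal lives in a small file between cron runs.
$goalFile = '/tmp/cron_color_goal';
$goal  = file_exists($goalFile) ? trim(file_get_contents($goalFile)) : 'red';
$other = ($goal === 'red') ? 'black' : 'red';

// Process up to N rows that do not yet have the goal color.
$upd = $pdo->prepare(
    "UPDATE items SET time_md5 = MD5(NOW()), color = ? WHERE color = ? LIMIT $n");
$upd->execute([$goal, $other]);

// Nothing left to flip means the pass over all rows is complete:
// switch the goal so the next runs start turning them back.
if ($upd->rowCount() === 0) {
    file_put_contents($goalFile, $other);
}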
I would just run the PHP script every 10/5 minutes with cron and then use PHP's time and date functions to perform the rest of the logic. If you cannot time it, you could store a position-marking variable in a small file.

Which is the best way to bi-directionally synchronize dynamic data in real time using MySQL

Here is the scenario: two web servers in two separate locations, each with a MySQL database having identical tables. The data within the tables is also expected to be identical in real time.
Here is the problem: users in both locations simultaneously enter a new record into identical tables, as illustrated in the first two tables below, where the third record in each table has been entered simultaneously by different people. The data in the tables is no longer identical. What is the best way to ensure the data remains identical in real time, as illustrated in the third table below, regardless of where the updates take place? That way, instead of ending up with 3 rows in each table, the new records are replicated bi-directionally and inserted into both tables, creating two identical tables again, with 4 rows this time.
Server A in Location A
==============
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
|-----------|
| 3 | John |
|-----------|
Server B in Location B
==============
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
|-----------|
| 3 | Peter |
|-----------|
Expected Scenario
===========
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
| 3 | Peter |
| 4 | John |
|-----------|
There isn't much performance to be gained from replicating your database across two masters. However, there is a nifty bit of failover if you code your application correctly.
Master-Master setup is essentially the same as the Slave-Master setup but has both Slaves started and an important change to your config files on each box.
Master MySQL 1:
auto_increment_increment = 2
auto_increment_offset = 1
Master MySQL 2:
auto_increment_increment = 2
auto_increment_offset = 2
These two parameters ensure that when the two servers are fighting over a primary key for some reason, they do not duplicate it and kill the replication. Instead of incrementing by 1, any auto-increment field will by default increment by 2. On one box it will start at offset 1 and run the sequence 1 3 5 7 9 11 13 etc. On the second box it will start at offset 2 and run along 2 4 6 8 10 12 etc. From current testing, the auto-increment appears to take the next free number, not one that was skipped earlier.
E.g. if server 1 inserts the first 3 records (1, 3 and 5), then when server 2 inserts the 4th, it will be given the key of 6 (not 2, which is left unused).
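For reference, a minimal sketch of the relevant my.cnf sections (server-id and log-bin are the usual replication prerequisites; the values are just examples):

# Master 1 - my.cnf
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 1

# Master 2 - my.cnf
[mysqld]
server-id                = 2
log-bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2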
Once you've set that up, start both of them up as Slaves.
Then, to check that both are working OK, connect to both machines and run SHOW SLAVE STATUS; Slave_IO_Running and Slave_SQL_Running should both say "YES" on each box.
Then, of course, create a few records in a table and ensure one box is only inserting odd-numbered primary keys and the other only even-numbered ones.
Then do all the tests to ensure that you can perform all the standard applications on each box with it replicating to the other.
It's relatively simple once it's going.
But as has been mentioned, MySQL does discourage it and advises that you be mindful of this functionality when writing your application code.
Edit: I suppose it's theoretically possible to add more masters if you ensure that the offsets are correct and so on. More realistically, though, you might add some additional slaves.
MySQL does not support synchronous replication, and even if it did, you would probably not want to use it (you can't take the performance hit of waiting for the other server to sync on every transaction commit).
You will have to consider more appropriate architectural solutions to it - there are third party products which will do a merge and resolve conflicts in a predetermined way - this is the only way really.
Expecting your architecture to function in this way is naive - there is no "easy fix" for any database, not just MySQL.
Is it important that the UIDs are the same? Or would you entertain the thought of having a table or column mapping the remote UID to the local UID and writing custom synchronisation code for objects you wish to replicate across that does any necessary mapping of UIDs for foreign key columns, etc?
The only way to ensure your tables are synchronized is to set up two-way replication between the databases.
But MySQL only permits one-way replication, so you can't simply solve your problem with this configuration.
To be clear, you can "set up" two-way replication, but MySQL AB discourages it.
