Implementing a simple queue with PHP and MySQL?

I have a PHP script that retrieves rows from a database and then performs work based on the contents. The work can be time consuming (but not necessarily computationally expensive) and so I need to allow multiple scripts to run in parallel.
The rows in the database look something like this:
+---------------------+------------+------+-----+---------------------+----------------+
| Field               | Type       | Null | Key | Default             | Extra          |
+---------------------+------------+------+-----+---------------------+----------------+
| id                  | bigint(11) | NO   | PRI | NULL                | auto_increment |
| ...                 |            |      |     |                     |                |
| date_update_started | datetime   | NO   |     | 0000-00-00 00:00:00 |                |
| date_last_updated   | datetime   | NO   |     | 0000-00-00 00:00:00 |                |
+---------------------+------------+------+-----+---------------------+----------------+
My script currently selects rows with the oldest dates in date_last_updated (which is updated once the work is done) and does not make use of date_update_started.
If I were to run multiple instances of the script in parallel right now, they would select the same rows (at least some of the time) and duplicate work would be done.
What I'm thinking of doing is using a transaction to select the rows, update the date_update_started column, and add a WHERE condition so that only rows whose date_update_started is older than some threshold get selected (to ensure another script isn't already working on them). E.g.
$sth = $dbh->prepare('
    START TRANSACTION;
    SELECT * FROM table WHERE date_update_started < UTC_TIMESTAMP() - INTERVAL 1 DAY
        ORDER BY date_last_updated LIMIT 1000;
    UPDATE table SET date_update_started = UTC_TIMESTAMP()
        WHERE id IN (SELECT id FROM table WHERE date_update_started < UTC_TIMESTAMP() - INTERVAL 1 DAY
            ORDER BY date_last_updated LIMIT 1000);
    COMMIT;
');
$sth->execute(); // in real code some values will be bound
$rows = $sth->fetchAll(PDO::FETCH_ASSOC);
From what I've read, this is essentially a queue implementation and seems to be frowned upon in MySQL. All the same, I need to find a way to allow multiple scripts to run in parallel, and after the research I've done this is what I've come up with.
Will this type of approach work? Is there a better way?

I think your approach could work, as long as you also add some kind of identifier to the selected rows to mark that they are currently being worked on. It could be as @JuniusRendel suggested, and I would even think about using another string key (random, or an instance id) for cases where the script hits an error and does not complete gracefully, as you will have to clean those fields once you update the rows back after your work.
The problem I see with this approach is the chance that two scripts run at the same moment and select the same rows before they are marked as locked. How bad that is really depends on the kind of work you do on the rows: if the end result from both scripts would be the same, the only problems are wasted time and server memory (not small issues, but I will put them aside for now...). If the work results in different updates from the two scripts, you could end up with the wrong update in the table.
@Jean has mentioned the second approach you can take, which involves MySQL locks. I am not an expert on the subject, but it seems like a good approach, and using a 'SELECT ... FOR UPDATE' statement could give you what you are looking for, as you can do the select and the update in the same call. That will be faster than two separate queries and reduces the risk of other instances selecting these rows, as they will be locked.
'SELECT ... FOR UPDATE' lets you run a select statement and lock those specific rows for updating them, so your statement could look like:
START TRANSACTION;
SELECT * FROM tb WHERE field = 'value' LIMIT 1000 FOR UPDATE;
UPDATE tb SET lock_field = '1' WHERE field = 'value' LIMIT 1000;
COMMIT;
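For reference, a minimal PDO sketch of the same flow (assuming a PDO connection in $dbh, InnoDB tables, and the placeholder names tb, field and lock_field from the statements above):
$dbh->beginTransaction();
// claim up to 1000 rows under a row lock (mirrors the SELECT above)
$rows = $dbh->query(
    "SELECT * FROM tb WHERE field = 'value' LIMIT 1000 FOR UPDATE"
)->fetchAll(PDO::FETCH_ASSOC);
// flag them so other instances skip them, then release the lock
$dbh->exec("UPDATE tb SET lock_field = '1' WHERE field = 'value' LIMIT 1000");
$dbh->commit();
// ... work on $rows, then clear lock_field for the processed ids ...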
Locks are powerful, but be careful that they don't affect your application in other sections. Check whether the rows that are locked for update are requested somewhere else in your application (maybe for the end user), and what happens in that case.
Also, the tables must be InnoDB, and it is recommended that the fields in your WHERE clause are indexed; if they are not, MySQL may lock the whole table, or you may encounter gap locks.
There is also a possibility that the locking, especially when running parallel scripts, will be heavy on your CPU and memory.
Here is another read on the subject: http://www.percona.com/blog/2006/08/06/select-lock-in-share-mode-and-for-update/
Hope this helps; I would like to hear how you progressed.

We have something like this implemented in production.
To avoid duplicates, we do a MySQL UPDATE like this (I modified the query to resemble your table):
UPDATE queue SET id = LAST_INSERT_ID(id), date_update_started = ...
WHERE date_update_started IS NULL AND ...
LIMIT 1;
We do this UPDATE in a single transaction, and we leverage the LAST_INSERT_ID function. When called like that, with a parameter, it stores the given value in the session; in this case, that is the ID of the single (LIMIT 1) queue row that has just been updated (if there was one).
Just after that, we do:
SELECT LAST_INSERT_ID();
When called without a parameter, it retrieves the previously stored value, giving us the ID of the queue item that has to be processed.
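Putting the two statements together, a minimal sketch of the pattern (assuming a PDO connection in $dbh and the queue table above):
// claim a single queue row and remember its id in the session
$dbh->beginTransaction();
$claimed = $dbh->exec(
    "UPDATE queue
     SET id = LAST_INSERT_ID(id), date_update_started = UTC_TIMESTAMP()
     WHERE date_update_started IS NULL
     LIMIT 1"
);
$dbh->commit();

if ($claimed > 0) {
    // retrieve the id stored by LAST_INSERT_ID(id) in the UPDATE above;
    // the value is per-connection, so other workers cannot clobber it
    $id = $dbh->query("SELECT LAST_INSERT_ID()")->fetchColumn();
    // ... load the row by $id and do the work ...
}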

Edit: Sorry, I totally misunderstood your question
You could just put a "locked" column on your table, set its value to true on the entries your script is working with, and set it back to false when it's done.
In my case I have added 3 other timestamp (integer) columns: target_ts, start_ts and done_ts.
You then run:
UPDATE table SET locked = TRUE WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(done_ts) AND ISNULL(start_ts);
and then
SELECT * FROM table WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(start_ts) AND locked=TRUE;
Do your jobs and update each entry one by one (to avoid data inconsistencies), setting the done_ts property to the current timestamp (you can also unlock them at that point). You can update target_ts to the time of the next update you wish, or you can ignore this column and just use done_ts for your select.
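As a sketch, the per-row completion step described above (assuming the table is named jobs and has an id primary key) would be something like:
-- mark one processed entry as done and release it
UPDATE jobs
SET done_ts = UNIX_TIMESTAMP(), locked = FALSE
WHERE id = 123;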

Each time the script runs, I would have it generate a uniqid:
$scriptInstance = uniqid();
I would add a script_instance column to hold this value as a varchar, and put an index on it. When the script runs, I would use SELECT ... FOR UPDATE inside a transaction to select the rows based on whatever logic, excluding rows that already have a script instance, and then update those rows with the script instance. Something like:
START TRANSACTION;
SELECT * FROM table
    WHERE script_instance = '' AND date_update_started < UTC_TIMESTAMP() - INTERVAL 1 DAY
    ORDER BY date_last_updated LIMIT 1000 FOR UPDATE;
UPDATE table SET date_update_started = UTC_TIMESTAMP(), script_instance = '{$scriptInstance}'
    WHERE script_instance = '' AND date_update_started < UTC_TIMESTAMP() - INTERVAL 1 DAY
    ORDER BY date_last_updated LIMIT 1000;
COMMIT;
Now those rows will be excluded from other instances of the script. Do your work, then update the rows to set the script instance back to null or blank, and also update your date_last_updated column.
You could also use the script instance to write to another table called "current instances" or something like that, and have the script check that table to get a count of running scripts to control the number of concurrent scripts. I would add the PID of the script to the table as well. You could then use that information to create a housekeeping script to run from cron periodically to check for long running or rogue processes and kill them, etc.
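A rough sketch of that housekeeping idea (the current_instances table, its columns, and the PDO connection in $dbh are all assumptions here):
// register this run so other instances (and a cron housekeeper) can see it
$scriptInstance = uniqid();
$dbh->prepare("INSERT INTO current_instances (instance_id, pid, started_at)
               VALUES (?, ?, UTC_TIMESTAMP())")
    ->execute(array($scriptInstance, getmypid()));

// bail out if the concurrency cap (here 5, counting this instance) is exceeded
$running = $dbh->query("SELECT COUNT(*) FROM current_instances")->fetchColumn();
if ($running > 5) {
    $dbh->prepare("DELETE FROM current_instances WHERE instance_id = ?")
        ->execute(array($scriptInstance));
    exit;
}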

I have a system working exactly like this in production. We run a script every minute to do some processing, and sometimes that run can take more than a minute.
We have a table column for status, which is 0 for NOT RUN YET, 1 for FINISHED, and any other value for a run in progress.
The first thing the script does is update the table, setting one or more lines to a value meaning that we are working on them. We use getmypid() to stamp the lines that we want to work on and that are still unprocessed.
When we finish the processing, the script updates the lines that have the same process ID, marking them as finished (status 1).
This way we prevent each of the scripts from trying to process a line that is already being processed, and it works like a charm. This doesn't mean that there isn't a better way, but it does get the work done.
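A minimal sketch of that cycle (assuming a PDO connection in $dbh; queue_lines is an assumed table name, with status 0 for not run yet, 1 for finished, and the PID while a line is being worked on):
// claim up to 100 unprocessed lines by stamping them with our PID
$pid = getmypid();
$dbh->exec("UPDATE queue_lines SET status = $pid WHERE status = 0 LIMIT 100");

// work on exactly the lines we claimed
$rows = $dbh->query("SELECT * FROM queue_lines WHERE status = $pid")
            ->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $row) {
    // ... process $row ...
}

// mark our lines as finished
$dbh->exec("UPDATE queue_lines SET status = 1 WHERE status = $pid");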

I have used a stored procedure for very similar reasons in the past. We used a FOR UPDATE lock on the rows while a "selected" flag was updated, which removed those entries from any future selects. It looked something like this:
DELIMITER $$
CREATE PROCEDURE `select_and_lock`()
BEGIN
    START TRANSACTION;
    SELECT your_fields FROM a_table
    WHERE some_stuff = something AND selected = 0
    FOR UPDATE;
    UPDATE a_table SET selected = 1
    WHERE some_stuff = something AND selected = 0;
    COMMIT;
END$$
DELIMITER ;
No reason it has to be done in a stored procedure, though, now that I think about it.
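Calling it from PHP is then a single statement (a sketch, assuming a PDO connection in $dbh):
// the procedure returns the locked rows as its result set
$sth = $dbh->query("CALL select_and_lock()");
$rows = $sth->fetchAll(PDO::FETCH_ASSOC);
$sth->closeCursor(); // free the result set before issuing further queries after a CALL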

Related

MariaDB row lock for read

I have two scripts using PHP 7 with MariaDB 10.4.14. Both update the same value in the database.
Script 1 uses a transaction; script 2 does not. Script 1 is executed slightly earlier than script 2.
The pseudo-code for both is:
Script 1:
$objDb->startTransaction();
$objDb->query("SELECT ID, name FROM table1 WHERE name = 'nameB' LIMIT 1 FOR UPDATE");
if ($objDb->totalRows() > 0) {
    $objDb->get();
    $objDb->query("UPDATE table1 SET name = 'nameBB' WHERE ID = " . $objDb->row['ID']);
}
sleep(3);
$objDb->commit();
Script 2:
$objDb->query("select ID,name from table1 where name='nameB' limit 1");
if($objDb->totalRows()>0)
{
$objDb->get();
$objDb->query("update table1 set name ='nameCC' where ID=".$objDb->row['ID']." ");
}
If I execute script 2 with a transaction as well, the final database value is 'nameBB', since script 2 waits until script 1 has committed, as expected.
However, in the current script 2 example (without a transaction), the final database value is 'nameCC'. I expected it to also be 'nameBB'. Apparently no read lock is placed for that ID of table1.
How can I make sure that regular select queries (without a transaction / with autocommit) are put under a read lock?
Help appreciated.
Script 1 starts a transaction and updates name to 'nameBB'. This happens inside the transaction, which means the change is not visible to other processes until it is committed.
Script 2 is free to read the "old" data, but it is blocked from updating the row until the transaction from script 1 is either committed or rolled back.
When script 1 commits, the lock is released, and script 2 performs its update, resulting in 'nameCC' as the name column value.
Note that the two scripts are independent of each other. Script 2's read could just as well have happened before the row was locked by script 1. The result would have been the same, so locking the read is not the answer.
What you should do is avoid the separate SELECT/UPDATE and, when possible, do:
update table1 set name ='nameCC' where name='nameB' limit 1
If you have two processes updating the same data simultaneously, you need to decide which of the updates is the valid one.
If you want to use a separate SELECT/UPDATE, you can, for example, use an updated_at datetime column to make sure your update matches the read.
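A sketch of that updated_at approach (the ID and timestamp values are illustrative):
-- read the row together with its version marker
SELECT ID, name, updated_at FROM table1 WHERE name = 'nameB' LIMIT 1;

-- write only if the row is unchanged since the read; an affected-row
-- count of 0 means another process updated it first, so retry or give up
UPDATE table1
SET name = 'nameCC', updated_at = NOW()
WHERE ID = 42 AND updated_at = '2021-01-01 12:00:00';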

Two incrementing columns in the same table

I have a table that contains invoices for several companies, and each company needs its own incrementing invoice number system.
id | invoiceId | companyId
--------------------------
1  | 1         | 1
2  | 2         | 1
3  | 1         | 2
4  | 1         | 3
I was hoping to achieve this with a unique compound key similar to this approach for MyISAM outlined here, but it seems it is not possible with InnoDB.
I need to return the new ID immediately after insertion, and I have concerns about creating a race condition if I try to achieve this in PHP.
Is my best option to create a trigger, and if so, what would that look like? I have no experience with triggers, and my research into using an AFTER INSERT trigger has me worried because of this quote from the MariaDB documentation:
RESTRICTIONS
You can not create an AFTER trigger on a view. You can not update the NEW values. You can not update the OLD values.
Thanks for any advice.
You need a unique index in addition to a way of getting your next value. The next value is best obtained by querying the table in a trigger, or by some procedure within a transaction; the quoted restriction about AFTER triggers on a view is not relevant in that case.
The unique index on (companyId, invoiceId) is required to prevent two insert processes running for the same company from both ending up with the same invoiceId. Even better is to rely on transactions (which InnoDB gives you): two processes started at virtually the same time then benefit from transaction isolation, with the result that they are serialized and you get two unique, incrementing invoice ids returned, without having to handle the unique-index exception in your code.
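A sketch of that transactional approach (assuming the table is named invoices and is InnoDB; the FOR UPDATE serializes concurrent inserts for the same company):
START TRANSACTION;
-- lock the company's invoice rows and compute the next number
SELECT COALESCE(MAX(invoiceId), 0) + 1 INTO @next
FROM invoices
WHERE companyId = 3
FOR UPDATE;
INSERT INTO invoices (invoiceId, companyId) VALUES (@next, 3);
SELECT @next; -- the new invoice number, available immediately
COMMIT;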
As far as I know, MySQL's last insert id is connection-based, not global and shared between processes.
Using this simple script I've validated my concerns
(note this is CodeIgniter syntax, but it's relatively easy to understand).
I accessed the following function twice, within 10 seconds of each other (in different browser windows):
function test() {
    $arr['var'] = rand(0, 100000000);
    $this->db->insert('test_table', $arr);
    sleep(30);
    $id = $this->db->insert_id();
    var_dump($id);
}
Interestingly, instead of getting "2" as the response in both of them, I got 1 and 2 respectively. This makes even more sense when you look at the underlying function:
function insert_id()
{
    return @mysqli_insert_id($this->conn_id);
}
That settles the returned ID. Your race condition is the product of the underlying query, which is basically "SELECT MAX(invoiceId) ... WHERE companyId = X", adding +1 to that, and inserting the result.
This should be possible with a table lock before the insert; however, that depends on how many times per second you expect this table to be updated.
Note: with persistent connections, the last insert id might work differently. I haven't tested it.
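A sketch of the table-lock variant (table name assumed; note that under LOCK TABLES a table read and written in the same statement must also be locked under an alias):
LOCK TABLES invoices WRITE, invoices AS src READ;
INSERT INTO invoices (invoiceId, companyId)
SELECT COALESCE(MAX(src.invoiceId), 0) + 1, 3
FROM invoices AS src
WHERE src.companyId = 3;
UNLOCK TABLES;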

parallel cron jobs picking up the same SQL row

I've basically got a cron file which sends a multi-curl request to one file several times at once, so the requests run in parallel.
My cron file looks like this (it sends the parallel requests):
<?php
require "files/bootstrap.php";

$amount = array("10", "11", "12", "13", "14");

$urls = array();
foreach ($amount as $cron_id) {
    $urls[] = Config::$site_url . "single_cron.php?cron_id=" . $cron_id;
}

$pg = new ParallelGet($urls);
?>
Then inside my single_cron.php I've got the following query
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
ORDER BY uuid()
LIMIT 1
Even though I've got the uuid() in the query, they still appear to be picking up the same row somehow. What's the best way to prevent this? I've heard something about transactions.
The framework I'm using is PHP, so any solution in that would work; I'm open to suggestions.
Check out the SELECT ... FOR UPDATE command. It prevents other parallel queries from selecting the same row by blocking them until you do a commit. So your select should include some condition like last_process_time > 60, and you should update the row after selecting it, setting last_processed_time to the current time. Maybe you have a different mechanism to detect whether a row has been recently selected/processed; you can use that as well. The important thing is that SELECT ... FOR UPDATE places a lock on the row, so even if you run your queries in parallel, they will be serialized by the MySQL server.
This is the only way to be sure you don't have two queries selecting the same row: even if your ORDER BY uuid() worked correctly, you'd still select the same row in two parallel queries every now and then.
The correct way to do this with transactions is:
START TRANSACTION;
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
LIMIT 1 FOR UPDATE;
(assume you have a column 'ID' in your accounts table that identifies rows uniquely)
UPDATE accounts
SET last_used = NOW(), .... whatever else ....
WHERE id = <insert the id you selected here>;
COMMIT;
The query that reaches the server first will be executed, and the returned row will be locked. All the other queries will block at that point. Now you update whatever you want to. After the commit, the queries from the other processes get executed. They won't find the row you just changed, because the last_used < ... condition isn't true for it anymore. One of those queries will find a row, lock it, and the others will be blocked again until that second process commits. This continues until everything is finished.
Instead of START TRANSACTION, you can also set autocommit to 0 in your session. And don't forget this only works if you use InnoDB tables. Check the link I gave you if you need more details.
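For completeness, the same flow from PHP with PDO might look roughly like this (assuming a PDO connection in $dbh and the id column mentioned above):
$dbh->beginTransaction();
// lock one eligible row; parallel workers block here until we commit
$row = $dbh->query(
    "SELECT C.* FROM accounts C JOIN proxies P ON C.proxy_id = P.proxy_id
     WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
       AND C.status = 1 AND C.running = 0
       AND P.proxy_status = 1 AND C.test_account = 0
     LIMIT 1 FOR UPDATE"
)->fetch(PDO::FETCH_ASSOC);
if ($row) {
    // stamp the row so the other workers' WHERE clause no longer matches it
    $sth = $dbh->prepare("UPDATE accounts SET last_used = NOW() WHERE id = ?");
    $sth->execute(array($row['id']));
}
$dbh->commit();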

Why is this query abnormally long?

I'm timing various parts of the site's "initialisation" code (including such things as verifying the user is logged in, connecting to the database, importing functions...).
This query is currently taking up over half the total initialisation time all by itself:
$sql = "update `users` set `lastclick`=now(),".(substr($_SERVER['PHP_SELF'],0,6) == "/ajax/" ? "" : " `lastactive`=now(),")." `lastip`='".addslashes($_SERVER['REMOTE_ADDR'])."' where `id`=".$userdata['id'];
Generating the query takes no time at all; it's the running that's the problem. An example resulting query:
update `users` set `lastclick`=now(), `lastactive`=now(), `lastip`='192.168.0.1' where `id`=1
Simple enough query, right? I am the only user on the server right now; there is literally nothing else running. So why does a simple UPDATE take up more time than connecting to the database, SELECTing the user data in the first place, validating the cookies, and defining a bunch of functions, all combined?
(I just tried replacing now() with a literal value, but that made no difference - in fact it ended up taking 13ms the first time instead of 4...)
EDIT: As requested:
explain select * from `users` where `id`=1
1 row returned:

id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | Extra
---+-------------+-------+-------+---------------+---------+---------+-------+------+------
1  | SIMPLE      | users | const | PRIMARY       | PRIMARY | 4       | const | 1    |
Solved my own mystery. It turns out one of the fields being updated (lastactive) was in an index, and the slowness was coming from rebuilding that index on every update.
Since the only place that index might be used is in updating the list of users who are online, and that only happens via cron at a set interval, I've dropped the index and now the query runs a heck of a lot faster.
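For anyone in the same situation, the fix amounts to something like this (the index name is hypothetical; check SHOW INDEX FROM `users` for the real one):
ALTER TABLE `users` DROP INDEX `idx_lastactive`;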
Thanks to those who tried to help - you did help me find the problem, indirectly!

Best way to update user rankings without killing the server

I have a website where user ranking is a central feature, but the user count has grown to over 50,000, and looping through all of them to update the ranks every 5 minutes is putting a strain on the server. Is there a better method that can update the ranks at least every 5 minutes? It doesn't have to be in PHP; it could be run as a Perl script or something if that would do the job better (though I'm not sure why it would; I'm just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i = 0;
while ($a = mysql_fetch_array($get_users)) {
    $i++;
    mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than half a second to execute and update all 50,000 rows (make rank an auto-incrementing primary key, as suggested by Tom Haigh):
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.userid");
Make userRanks.rank an auto-incrementing primary key. If you then insert userids into userRanks in descending rank order, it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.userid;
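For clarity, the userRanks table this relies on would look something like the following sketch; rank is the auto-incrementing primary key, so insertion order becomes the rank (backticks because rank is a reserved word in newer MySQL versions):
CREATE TABLE userRanks (
    `rank` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    userid INT UNSIGNED NOT NULL,
    PRIMARY KEY (`rank`)
);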
My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming the "status = '1'" indicates that a user's rank has changed, so rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.
A simple alternative for bulk update might be something like:
SET @rnk = 0;
UPDATE users
SET month_rank = (@rnk := @rnk + 1)
ORDER BY month_score DESC;
This code uses a user variable (@rnk) that is incremented on each row update. Because the update is done over the ordered list of rows, the month_rank column will be set to the incremented value for each row.
Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before), but here's a sample of the syntax used in MS SQL Server 2000:
DECLARE @tmp TABLE
(
    [MonthRank] [INT] IDENTITY(1,1) NOT NULL,
    [UserId] [INT] NOT NULL
)

INSERT INTO @tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC

UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM @tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.
Any time you have a loop of any significant size that executes queries inside, you've got a very likely antipattern. We could look at the schema and processing requirement with more info, and see if we can do the whole job without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?
Your problem can be handled in a number of ways, and honestly more details from your server might point you in a totally different direction. But doing it that way, you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition step: inserts into a table no one is reading from are probably going to be better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but it might improve your situation. Then again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, MySQL config, database connections, etc.
Possibly you could use shards by time or other category. But read this carefully before...
You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.
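A rough sketch of that split, reusing the question's code (mysql_* kept for consistency with the question; the point is that all SELECT-side processing finishes before any UPDATE runs):
// pass 1: compute all ranks without touching the table
$updates = array();
$i = 0;
while ($a = mysql_fetch_array($get_users)) {
    $i++;
    $updates[] = "UPDATE users SET month_rank = '$i' WHERE id = '{$a['id']}'";
}

// pass 2: run the cached updates once processing is complete
foreach ($updates as $sql) {
    mysql_query($sql);
}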
