parallel cron jobs picking up the same SQL row - php

I've basically got a cron file which fires a multi_curl request at the same single file several times at once, so the requests run in parallel.
My cron file looks like this (it sends the parallel requests):
<?php
require "files/bootstrap.php";

$amount = array("10", "11", "12", "13", "14");

$urls = array();
foreach ($amount as $cron_id) {
    $urls[] = Config::$site_url."single_cron.php?cron_id=".$cron_id;
}

$pg = new ParallelGet($urls);
?>
Then inside my single_cron.php I've got the following query
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
ORDER BY uuid()
LIMIT 1
Even though I've got uuid() inside the query, the parallel requests still appear to pick up the same row somehow. What's the best way to prevent this? I've heard something about transactions.
The current language I'm using is PHP, so if any solution works in that, I'm open to it.

Check the SELECT ... FOR UPDATE command. This prevents other parallel queries from selecting the same row by blocking them until you do a commit. So your select should include some condition like last_processed_time being older than 60 seconds, and you should update the row right after selecting it, setting last_processed_time to the current time. Maybe you have a different mechanism to detect whether a row has been recently selected/processed; you can use that as well. The important thing is that SELECT ... FOR UPDATE will place a lock on the row, so even if you run your queries in parallel, they will be serialized by the MySQL server.
This is the only way to be sure you don't have two queries selecting the same row - even if your ORDER BY uuid() worked correctly, you'd still select the same row in two parallel queries every now and then.
The correct way to do this with transactions is:
START TRANSACTION;
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
LIMIT 1
FOR UPDATE;
(assume you have a column 'ID' in your accounts table that identifies rows uniquely)
UPDATE accounts
set last_used=now(), .... whatever else ....
where id=<insert the id you selected here>;
COMMIT;
The query that reaches the server first will be executed, and the returned row locked. All the other queries will be blocked at that point. Now you update whatever you want to. After the commit, the other queries from other processes will be executed. They won't find the row you just changed, because the last_used < ... condition isn't true anymore. One of these queries will find a row, lock it, and the others will get blocked again, until the second process does the commit. This continues until everything is finished.
Instead of START TRANSACTION, you can also set autocommit to 0 in your session. And don't forget this only works with InnoDB tables. Check the MySQL documentation on SELECT ... FOR UPDATE if you need more details.
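For reference, here is a minimal PHP sketch of that pattern using PDO, roughly matching the question's schema (the account_id column name, the running flag handling and the $pdo connection are assumptions, not part of the original post):
<?php
// Hypothetical worker: claim one account row, mark it, then work on it.
$pdo->beginTransaction();

$sql = "SELECT C.account_id
        FROM accounts C JOIN proxies P ON C.proxy_id = P.proxy_id
        WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
          AND C.status = 1 AND C.running = 0
          AND P.proxy_status = 1 AND C.test_account = 0
        LIMIT 1
        FOR UPDATE";   // blocks other workers on this row until COMMIT
$row = $pdo->query($sql)->fetch(PDO::FETCH_ASSOC);

if ($row) {
    // Mark the row so the other workers' WHERE clause no longer matches it.
    $upd = $pdo->prepare("UPDATE accounts SET last_used = NOW(), running = 1 WHERE account_id = ?");
    $upd->execute([$row['account_id']]);
}

$pdo->commit();

if ($row) {
    // ... do the actual work for this account here ...
}
?>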

Related

MariaDB row lock for read

I have two scripts using PHP 7 / 10.4.14-MariaDB. Both update the same value in the database.
Script 1 uses a transaction; script 2 does not. Script 1 is executed slightly earlier than script 2.
The pseudo-code for both are:
Script 1:
$objDb->startTransaction();
$objDb->query("select ID, name from table1 where name='nameB' limit 1 FOR UPDATE");
if ($objDb->totalRows() > 0) {
    $objDb->get();
    $objDb->query("update table1 set name='nameBB' where ID=".$objDb->row['ID']);
}
sleep(3);
$objDb->commit();
Script 2:
$objDb->query("select ID,name from table1 where name='nameB' limit 1");
if($objDb->totalRows()>0)
{
$objDb->get();
$objDb->query("update table1 set name ='nameCC' where ID=".$objDb->row['ID']." ");
}
If I executed script 2 with a transaction as well, the final database value would be 'nameBB', since script 2 waits until script 1 has committed, as expected.
However, in the current script 2 example (without a transaction) the final database value is 'nameCC'. I expected it to be 'nameBB' as well. Apparently no read lock is placed for that ID of table1.
How can I make sure that regular select queries (without a transaction / with autocommit) are put under a read lock?
Help appreciated.
Script 1 starts a transaction and updates name to 'nameBB'. This happens inside the transaction, which means the change is not visible to other processes until it is committed.
Script 2 is free to read the "old" data, but it is blocked from updating the row until the transaction from script 1 is either committed or rolled back.
When script 1 commits, the lock is released and script 2 performs its update, resulting in 'nameCC' as the name column value.
Note that the two scripts are independent of each other. Script 2's read could just as well have happened before the row was locked by script 1. The result would have been the same, so locking the read is not the answer.
What you should do is avoid the separate SELECT/UPDATE and, when possible, do:
update table1 set name ='nameCC' where name='nameB' limit 1
If you have two processes updating the same data simultaneously, you need to decide which of the updates is the valid one.
If you want to use a separate SELECT/UPDATE, you can, for example, use an updated_at datetime column to make sure your update matches the read.
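A rough sketch of that optimistic check (the updated_at column, the ID value 42 and the timestamp literal are placeholders, not from the original post):
-- Read the row together with its current version marker
SELECT ID, name, updated_at FROM table1 WHERE name = 'nameB' LIMIT 1;

-- Later, only apply the update if nobody changed the row in the meantime
UPDATE table1
SET name = 'nameCC', updated_at = NOW()
WHERE ID = 42
  AND updated_at = '2021-01-01 12:00:00';   -- the value read above

-- An affected-rows count of 0 means another process modified the row first.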

very slow search and update database operation

I have a table "table1" which has almost 400,000 records. There is another table "table2" which has around 450,000 records.
I need to delete all the rows in table1 which are duplicated in table2. I've been trying to do it with PHP, and the script has been running for hours without completing. Does it really take that much time?
Field asin is varchar(20) in table1.
Field ASIN is indexed and char(10) in table2.
$duplicat = 0;
$sql = "SELECT asin FROM asins";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
    while ($row = $result->fetch_assoc()) {
        $ASIN = $row['asin'];
        $sql2 = "SELECT id FROM asins_chukh WHERE ASIN='$ASIN' LIMIT 1";
        $result2 = $conn->query($sql2);
        if ($result2->num_rows > 0) {
            $duplicat++;
            $sql3 = "UPDATE `asins` SET `duplicate` = '1' WHERE `asins`.`asin` = '$ASIN';";
            $result3 = $conn->query($sql3);
            if ($result3) {
                echo "duplicate = $ASIN <br/>";
            }
        }
    }
}
echo "totaal :$duplicat";
You can run one single SQL command instead of a loop, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id);
Warning! I didn't test the SQL above, so you may need to verify the syntax.
For this kind of database operation, using PHP to loop and join is never a good idea. Most of the time will be wasted on network data transfer between your PHP server and your MySQL server.
If even the above SQL takes too long, you can consider limiting the query set to some id range. Something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id
and t2.id > [range_start] and t2.id < [range_end] );
This way, you can kick off several updates running in parallel.
Yes, processing RBAR (Row By Agonizing Row) is going to be slow. There is overhead associated with each of those individual SELECT and UPDATE statements that get executed... sending the SQL text to the database, parsing the tokens for valid syntax (keywords, commas, expressions), validating the semantics (table references and column references valid, user has required privileges, etc.), evaluating possible execution plans (index range scan, full index scan, full table scan), converting the selected execution plan into executable code, executing the query plan (obtaining locks, accessing rows, generating rollback, writing to the innodb and mysql binary logs, etc.), and returning the results.
All of that takes time. For a statement or two, the time isn't that noticeable, but put thousands of executions into a tight loop, and it's like watching individual grains of sand falling in an hour glass.
MySQL, like most relational databases, is designed to efficiently operate on sets of data. Give the database work to do, and let the database crank, rather than spend time round tripping back and forth to the database.
It's like you've got a thousand tiny items to deliver, all to the same address. You can individually handle each item. Get a box, put the item into the box with a packing slip, seal the package, address the package, weigh the package and determine postage, affix postage, and then put it into the car, drive to the post office, drop the package off. Then drive back, and handle the next item, put it into a box, ... over and over and over.
Or, we could handle a lot of tiny items together, as a larger package, and reduce the amount of overhead work (time) packaging and round trips to and from the post office.
For one thing, there's really no need to run a separate SELECT statement, to find out if we need to do an UPDATE. We could just run the UPDATE. If there are no rows to be updated, the query will return an "affected rows" count of 0.
(Running the separate SELECT is like making another round trip in the car to the post office, to check the list of packages that need to be delivered, before each round trip to the post office to drop off a package. Instead of two round trips, we can take the package with us on the first trip.)
So, that could improve things a bit. But it doesn't really get to the root of the performance problem.
The real performance boost comes from getting more work done in fewer SQL statements.
How would we identify ALL of the rows that need to be updated?
SELECT t.asin
FROM asins t
JOIN asins_chukh s
ON s.asin = t.asin
WHERE NOT ( t.duplicate <=> '1' )
(If asin isn't unique, we need to tweak the query a bit, to avoid returning "duplicate" rows. The point is, we can write a single SELECT statement that identifies all of the rows that need to be updated.)
For non-trivial tables, for performance, we need to have suitable indexes available. In this case, we'd want an index with a leading column of asin. If such an index doesn't exist, we'd want to create one, for example...
... ON asins_chukh (asin)
If that query doesn't return a huge number of rows, we can handle the UPDATE in one fell swoop:
UPDATE asins t
JOIN asins_chukh s
ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
We need to be careful about the number of rows. We want to avoid holding blocking locks for a long time (impacting concurrent processes that may be accessing the asins table), and we want to avoid generating a huge amount of rollback.
We can break the work up into more manageable chunks.
(Referring back to the shipping tiny items analogy... if we have millions of tiny items, and putting all of those into a single shipment would create a package larger and heaver than a container ship container... we can break the shipment into manageably sized boxes.)
For example, we could handle the UPDATE in "batches" of 10,000 id values (assuming id is unique or nearly unique, is the leading column in the cluster key, and the id values are grouped fairly well into mostly contiguous ranges). That way we can get the update activity localized into one section of blocks, and not have to revisit most of those same blocks again...
The WHERE clause could be something like this:
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 0
AND t.id < 0 + 10000
For the next batch...
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 10000
AND t.id < 10000 + 10000
Then
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 20000
AND t.id < 20000 + 10000
And so on, repeating that until we're past the maximum id value. (We could run a SELECT MAX(id) FROM asins as the first step, before the loop.)
(We want to test these statements as SELECT statements first, before we convert to an UPDATE.)
Using the id column might not be the most appropriate way to create our batches.
Our objective is to create manageable "chunks" we can put into a loop, where the chunks don't overlap the same database blocks... we won't need to revisit the same block over and over, with multiple statements, to make changes to rows within the same block multiple times.
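A rough PHP sketch of that batching loop (the chunk size and the duplicate condition come from above; the mysqli $conn handle, the id column and the exact table names are assumptions carried over from the question):
<?php
// Sketch: run the set-based UPDATE in id-range chunks.
$batchSize = 10000;
$maxId = (int) $conn->query("SELECT MAX(id) FROM asins")->fetch_row()[0];

for ($start = 0; $start <= $maxId; $start += $batchSize) {
    $end = $start + $batchSize;
    $sql = "UPDATE asins t
            JOIN asins_chukh s ON s.ASIN = t.asin
            SET t.duplicate = '1'
            WHERE NOT ( t.duplicate <=> '1' )
              AND t.id >= $start
              AND t.id < $end";
    $conn->query($sql);
    // Optionally pause briefly between chunks to reduce lock pressure.
}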

Implementing a simple queue with PHP and MySQL?

I have a PHP script that retrieves rows from a database and then performs work based on the contents. The work can be time consuming (but not necessarily computationally expensive) and so I need to allow multiple scripts to run in parallel.
The rows in the database looks something like this:
+---------------------+---------------+------+-----+---------------------+----------------+
| Field               | Type          | Null | Key | Default             | Extra          |
+---------------------+---------------+------+-----+---------------------+----------------+
| id                  | bigint(11)    | NO   | PRI | NULL                | auto_increment |
.....
| date_update_started | datetime      | NO   |     | 0000-00-00 00:00:00 |                |
| date_last_updated   | datetime      | NO   |     | 0000-00-00 00:00:00 |                |
+---------------------+---------------+------+-----+---------------------+----------------+
My script currently selects rows with the oldest dates in date_last_updated (which is updated once the work is done) and does not make use of date_update_started.
If I were to run multiple instances of the script in parallel right now, they would select the same rows (at least some of the time) and duplicate work would be done.
What I'm thinking of doing is using a transaction to select the rows, update the date_update_started column, and then add a WHERE condition to the SQL statement selecting the rows to only select rows with date_update_started greater than some value (to ensure another script isn't working on it). E.g.
$sth = $dbh->prepare('
START TRANSACTION;
SELECT * FROM table WHERE date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;
UPDATE table DAY SET date_update_started = UTC_TIMESTAMP() WHERE id IN (SELECT id FROM table WHERE date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;);
COMMIT;
');
$sth->execute(); // in real code some values will be bound
$rows = $sth->fetchAll(PDO::FETCH_ASSOC);
From what I've read, this is essentially a queue implementation and seems to be frowned upon in MySQL. All the same, I need to find a way to allow multiple scripts to run in parallel, and after the research I've done this is what I've come up with.
Will this type of approach work? Is there a better way?
I think your approach could work, as long as you also add some kind of identifier to the selected rows to mark that they are currently being worked on. It could be as @JuniusRendel suggested, and I would even think about using another string key (random or an instance id) for cases where the script hits an error and does not complete gracefully, since you will have to clean those fields once you have updated the rows back after your work.
The problem with this approach, as I see it, is the chance that two scripts run at the same moment and select the same rows before they are marked as locked. It really depends on what kind of work you do on the rows: if the end result of both scripts will be the same, the only problems are wasted time and server memory (which are not small issues, but I will put them aside for now...). If your work results in different updates from the two scripts, the problem is that you could end up with the wrong update in the table.
@Jean has mentioned the second approach you can take, which involves using MySQL locks. I am not an expert on the subject, but it seems like a good approach, and using the SELECT ... FOR UPDATE statement could give you what you are looking for: you can do the select and the update in the same call, which is faster than two separate queries and reduces the risk of other instances selecting these rows, as they will be locked.
SELECT ... FOR UPDATE lets you run a select statement and lock the specific rows it returns for updating, so your statement could look like:
START TRANSACTION;
SELECT * FROM tb where field='value' LIMIT 1000 FOR UPDATE;
UPDATE tb SET lock_field='1' WHERE field='value' LIMIT 1000;
COMMIT;
Locks are powerful, but be careful that they don't affect your application elsewhere. Check whether the rows that are currently locked for the update are requested somewhere else in your application (maybe for the end user) and what will happen in that case.
Also, the tables must be InnoDB, and it is recommended that the fields in your WHERE clause have a MySQL index; if not, you may lock the whole table or run into gap locks.
There is also a possibility that the locking process, especially when running parallel scripts, will be heavy on your CPU and memory.
here is another read on the subject: http://www.percona.com/blog/2006/08/06/select-lock-in-share-mode-and-for-update/
Hope this helps, and would like to hear how you progressed.
We have something like this implemented in production.
To avoid duplicates, we do a MySQL UPDATE like this (I modified the query to resemble your table):
UPDATE queue SET id = LAST_INSERT_ID(id), date_update_started = ...
WHERE date_update_started IS NULL AND ...
LIMIT 1;
We do this UPDATE in a single transaction, and we leverage the LAST_INSERT_ID function. When used like that, with a parameter, it stores the value in the connection's session; in this case, it is the ID of the single (LIMIT 1) queue row that has been updated (if there is one).
Just after that, we do:
SELECT LAST_INSERT_ID();
When used without a parameter, it retrieves the previously stored value, giving us the ID of the queue item that has to be processed.
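In PHP with PDO, the claim step could look roughly like this (a sketch; the $pdo connection and the UTC_TIMESTAMP() value for date_update_started are assumptions):
<?php
// Sketch of the claim pattern: the single UPDATE atomically marks one row
// and remembers its id via LAST_INSERT_ID(id).
$claimed = $pdo->exec(
    "UPDATE queue
     SET id = LAST_INSERT_ID(id), date_update_started = UTC_TIMESTAMP()
     WHERE date_update_started IS NULL
     LIMIT 1"
);

if ($claimed === 1) {
    $jobId = (int) $pdo->query("SELECT LAST_INSERT_ID()")->fetchColumn();
    // ... process the queue row identified by $jobId ...
}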
Edit: Sorry, I totally misunderstood your question
You should just put a "locked" column on your table, set the value to true on the entries your script is working with, and set it back to false when it's done.
In my case I have put three other timestamp (integer) columns: target_ts, start_ts, done_ts.
You run:
UPDATE table SET locked = TRUE WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(done_ts) AND ISNULL(start_ts);
and then
SELECT * FROM table WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(start_ts) AND locked=TRUE;
Do your jobs and update each entry one by one (to avoid data inconsistencies), setting the done_ts property to the current timestamp (you can also unlock them at this point). You can update target_ts to the next time you wish the row to be processed, or you can ignore that column and just use done_ts for your select.
Each time the script runs I would have the script generate a uniqid.
$scriptInstance = uniqid();
I would add a script instance column to hold this value as a varchar and put an index on it. When the script runs I would use select for update inside of a transaction to select your rows based on whatever logic, excluding rows with a script instance, and then update those rows with the script instance. Something like:
START TRANSACTION;
SELECT * FROM table WHERE script_instance = '' AND date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000 FOR UPDATE;
UPDATE table SET date_update_started = UTC_TIMESTAMP(), script_instance = '{$scriptInstance}' WHERE script_instance = '' AND date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;
COMMIT;
Now those rows will be excluded from other instances of the script. Do your work, then update the rows to set the script instance back to null or blank, and also update your date_last_updated column.
You could also use the script instance to write to another table called "current instances" or something like that, and have the script check that table to get a count of running scripts to control the number of concurrent scripts. I would add the PID of the script to the table as well. You could then use that information to create a housekeeping script to run from cron periodically to check for long running or rogue processes and kill them, etc.
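A rough PHP sketch of that flow (PDO assumed; the `table` name, the column names and the 1000-row batch size mirror the question, and the extra date_update_started condition is left out for brevity):
<?php
$scriptInstance = uniqid();

$pdo->beginTransaction();

// Lock a batch of unclaimed rows.
$ids = $pdo->query(
    "SELECT id FROM `table`
     WHERE script_instance = ''
     ORDER BY date_last_updated
     LIMIT 1000
     FOR UPDATE"
)->fetchAll(PDO::FETCH_COLUMN);

if ($ids) {
    $in = implode(',', array_map('intval', $ids));
    $pdo->exec("UPDATE `table`
                SET date_update_started = UTC_TIMESTAMP(),
                    script_instance = " . $pdo->quote($scriptInstance) . "
                WHERE id IN ($in)");
}

$pdo->commit();

// ... do the work on those ids, then release the rows ...
$pdo->exec("UPDATE `table`
            SET script_instance = '', date_last_updated = UTC_TIMESTAMP()
            WHERE script_instance = " . $pdo->quote($scriptInstance));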
I have a system working exactly like this in production. We run a script every minute to do some processing, and sometimes that run can take more than a minute.
We have a table column for status, which is 0 for NOT RUN YET, 1 for FINISHED, and any other value means the row is being worked on.
The first thing the script does is to update the table, setting a line or multiple lines with a value meaning that we are working on that line. We use getmypid() to update the lines that we want to work on, and that are still unprocessed.
When we finish the processing, the script updates the lines that have the same process ID, marking them as finished (status 1).
This way each of the scripts avoids trying to process a line that is already being processed, and it works like a charm. This doesn't mean that there isn't a better way, but this does get the work done.
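In code, the idea is roughly this (a sketch; the jobs table name, the mysqli $conn handle and the batch size are assumptions, status values as described above):
<?php
$pid = getmypid();

// Claim up to 50 unprocessed lines for this process (0 = not run yet).
$conn->query("UPDATE jobs SET status = $pid WHERE status = 0 LIMIT 50");

// Work only on the lines this process claimed.
$result = $conn->query("SELECT * FROM jobs WHERE status = $pid");
while ($row = $result->fetch_assoc()) {
    // ... process $row ...
}

// Mark them as finished (1 = finished).
$conn->query("UPDATE jobs SET status = 1 WHERE status = $pid");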
I have used a stored procedure for very similar reasons in the past. We used the FOR UPDATE read lock to lock the table while a selected flag was updated to remove that entry from any future selects. It looked something like this:
DELIMITER $$
CREATE PROCEDURE `select_and_lock`()
BEGIN
  START TRANSACTION;
  SELECT your_fields FROM a_table WHERE some_stuff=something
    AND selected = 0 FOR UPDATE;
  UPDATE a_table SET selected = 1
    WHERE some_stuff=something AND selected = 0;
  COMMIT;
END$$
DELIMITER ;
No reason it has to be done in a stored procedure though now I think about it.

MYSQL SELECT WITH PAUSE

We are making a PHP emailer which works perfectly.
Selecting all the users from the database and sending them emails works fine.
But since we have a huge amount of emails that have to be sent, we would like to start and pause the sending in batches of 1000, so we don't overload the server.
Example:
SELECT: 1000;
PAUSE MYSQL
SELECT ANOTHER 1000;
PAUSE MYSQL
ETC.
I read about the START TRANSACTION, COMMIT and ROLLBACK functions, and I think I implemented this right.
Can someone help me include a pause of 100 seconds before the ROLLBACK of the transaction?
I don't know what to do.
What I've got until now [prefixed code]:
$max = 1000;
$send = 0;
$rollback = false;

mysql_query('START TRANSACTION;');
$query = mysql_query("SELECT DISTINCT mail_id, customers_email_address FROM newsletters ORDER BY mail_id ASC");

while ($result = mysql_fetch_array($query)) {
    if ($rollback == true) {
        $rollback = false;
        mysql_query("ROLLBACK;");
    }

    // ------ script to send the emails ------

    $send++;
    if ($max == $send) {
        mysql_query("COMMIT;");
        $rollback = true;
    }
}
Cheers Jay
There is no need for transactions here at all - you're not updating anything. In fact, the overhead of transactions is entirely pointless here, so I'd advise you take that out.
You could simply (in theory, you can write the code for this)
Select the first 1000 rows from the database: SELECT ... LIMIT 0, 1000
Increment your offset by 1000
Select the next 1000 rows: SELECT ... LIMIT 1000, 1000
Rinse and repeat, until you get less than 1000 rows back from your query.
Please note that in order for that method to work, you'll want to ORDER BY the primary key in ASC order or something, to be sure you don't get the same row twice.
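A rough sketch of that loop, including the 100-second pause the question asked about (mysqli is assumed here instead of the old mysql_* functions; table and column names are taken from the question):
<?php
$batchSize = 1000;
$offset = 0;

do {
    $result = $conn->query(
        "SELECT DISTINCT mail_id, customers_email_address
         FROM newsletters
         ORDER BY mail_id ASC
         LIMIT $offset, $batchSize"
    );

    while ($row = $result->fetch_assoc()) {
        // ... send the email for $row['customers_email_address'] ...
    }

    $offset += $batchSize;
    $rows = $result->num_rows;

    if ($rows === $batchSize) {
        sleep(100);   // pause before the next batch so the server isn't overloaded
    }
} while ($rows === $batchSize);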
All you need is to schedule your sender script with cron, for example, and send some amount of emails on each run (in SQL, use LIMIT).
It will then send N emails every M minutes and the server will be happy ;)
A few options:
1) You can implement a cron job.
2) There is a small open-source PHP application, PHPList, which can be integrated in a few seconds (I already use this one).
3) You can use PHP's sleep function (I am not sure about this one).

Execution order of mysql queries from php script when same script is launched quickly twice

I have a php script that executes mysql pdo queries. There are a few reads and writes to the same table in this script.
For the sake of example, let's say that there are 4 queries (a read, a write, another read, another write), each read takes 10 seconds to execute, and each write takes 0.1 seconds to execute.
If I execute this script from the CLI with nohup php execute_queries.php & twice within 1/100th of a second, what would the execution order of the queries be?
Would all the queries from the first instance of the script need to finish before the queries from the 2nd instance begin to run, or would the first read from both instances start and finish before the table is locked by the write?
NOTE: assume that I'm using myisam and that the write is an update to a record (IE, entire table gets locked during the write.)
Since you are not using transactions, no: the queries from one script won't wait for all the queries of the other instance to finish, so the queries may get interleaved.
There is an entire field of study called concurrent programming that teaches this.
In databases it's about transactions, isolation levels and data locks.
Typical (simple) race condition:
$visits = $pdo->query('SELECT visits FROM articles WHERE id = 44')->fetch()['visits'];
/*
* do some time-consuming thing here
*
*/
$visits++;
$pdo->exec('UPDATE articles SET visits = '.$visits.' WHERE id = 44');
The above race condition can easily turn sour if 2 PHP processes read the visits from the database one millisecond after the other, and assuming the initial value of visits was 6, both would increment it to 7 and both would write 7 back into the database even though the desired effect was that 2 visits increment the value by 2 (final value of visits should've been 8).
The solution to this is using atomic operations (because the operation is simple and can be reduced to one single atomic operation).
UPDATE articles SET visits = visits+1 WHERE id = 44;
Atomic operations are guaranteed by the database engines to take place uninterrupted by other processes/threads. Usually the database has to queue incoming updates so that they don't affect each other. Queuing obviously slows things down because each process has to wait for all processes before it until it gets the chance to be executed.
In a less simple operation we need more than one statement:
SELECT @visits := visits FROM articles WHERE ID = 44;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
But again, even at the database level, three separate atomic statements are not guaranteed to yield an atomic result. They can overlap with other operations, just like in the PHP example.
To solve this you have to do the following:
START TRANSACTION;
SELECT @visits := visits FROM articles WHERE ID = 44 FOR UPDATE;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
COMMIT;
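The same pattern from PHP with PDO would look roughly like this (a sketch; it assumes InnoDB tables and an open $pdo connection):
<?php
$pdo->beginTransaction();

// FOR UPDATE locks the row, so concurrent increments are serialized.
$visits = (int) $pdo->query(
    "SELECT visits FROM articles WHERE id = 44 FOR UPDATE"
)->fetchColumn();

/* ... time-consuming work here ... */

$visits++;
$stmt = $pdo->prepare("UPDATE articles SET visits = ? WHERE id = 44");
$stmt->execute([$visits]);

$pdo->commit();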
