mysql - deleting large number of rows with a limit (php nightly cron) - php

Not sure what the best way to handle this is. For my particular situation I have numerous tables where I want to delete any rows with a timestamp older than 3 months... i.e. only keep records from the last 3 months.
Very simply, it would be something like this:
//set binding cutoff timestamp
$binding = array(
'cutoff_time' => strtotime('-3 months')
);
//##run through all the logs and delete anything before the cutoff time
//app
$stmt = $db->prepare("
DELETE
FROM app_logs
WHERE app_logs.timestamp < :cutoff_time
");
$stmt->execute($binding);
//more tables after this
Every table I am going to be deleting from has a timestamp column, which is indexed. I am concerned that down the road the number of rows to delete will be large. What would be the best practice for deleting in limited chunks in a loop? All I can think of is doing an initial select to find whether there are any rows which need to be deleted, then running the delete if there are... and repeating until the initial select doesn't find any results. This adds an additional count query for each iteration of the loop.
What is the standard/recommended practice here?
EDIT:
A quick write-up of what I was thinking:
//set binding cutoff timestamp
$binding = array(
'cutoff_time' => strtotime('-3 months')
);
//set limit value
$binding2 = array(
'limit' => 1000
);
//##run through all the logs and delete anything before the cutoff time
//get the total count
$stmt = $db->prepare("
SELECT
COUNT(*)
FROM app_logs
WHERE app_logs.timestamp < :cutoff_time
");
$stmt->execute($binding);
//get total results count from above
$found_count = $stmt->fetchColumn();
// loop deletes
$stmt = $db->prepare("
DELETE
FROM app_logs
WHERE app_logs.timestamp < :cutoff_time
LIMIT :limit
");
// LIMIT must be bound as an integer, so bindValue() is used instead of passing everything through execute()
$stmt->bindValue(':cutoff_time', $binding['cutoff_time'], PDO::PARAM_INT);
$stmt->bindValue(':limit', $binding2['limit'], PDO::PARAM_INT);
while($found_count > 0)
{
    $stmt->execute();
    $found_count = $found_count - $binding2['limit'];
}

It depends on your table size and its workload, so you can try a few iterations:
Just delete everything that is older than 3 months. Check whether the timing is good enough. Is there performance degradation, or are there table locks? How does your app behave while the data is being deleted?
In case that is too slow, consider deleting with a limit of 10k rows or so per pass. Check it as above, and add proper indexes.
If it's still bad, consider selecting the primary keys first and then deleting by PK in 10k chunks, with pauses between the queries.
Still bad? Add a new "to delete" column and perform the operation on it, with all of the requirements above.
There are a lot of tricks for rotating tables. Try something and see what fits your needs; a sketch of the chunked approach is below.
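For example, something like this (a minimal sketch assuming the $db PDO handle and app_logs table from the question; the chunk size is arbitrary). No COUNT(*) is needed: keep deleting until a pass removes fewer rows than the chunk size.
$cutoff = strtotime('-3 months');
$chunk = 10000; // arbitrary chunk size
$stmt = $db->prepare("
DELETE FROM app_logs
WHERE app_logs.timestamp < :cutoff_time
LIMIT :limit
");
// LIMIT has to be bound as an integer
$stmt->bindValue(':cutoff_time', $cutoff, PDO::PARAM_INT);
$stmt->bindValue(':limit', $chunk, PDO::PARAM_INT);
do {
    $stmt->execute();
    $deleted = $stmt->rowCount(); // rows removed in this pass
    // optional: usleep(100000); // pause so other queries get a turn
} while ($deleted === $chunk);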

Related

Randomly Generate 4 digit Code & Check if exist, Then re-generate

My code randomly generates a 4 or 5 digit code, prepends a 3 digit pre-defined keyword, and checks the database; if the code already exists, it regenerates it and then saves it to the database.
But sometimes the queries get stuck and become slower once a pre-defined keyword has around 1000 records.
Let's take one keyword, "XYZ", with Deal ID = 100, and say it already has 8000 records in the database. The do-while loop takes a lot of time.
$keyword = "XYZ"; // It is unique for each deal id.
dealID = 100; // It is Foreign key of another table.
$initialLimit = 1;
$maxLimit = 9999;
do {
$randomNo = rand($initialLimit, $maxLimit);
$coupon = $keyword . str_pad($randomNo, 4, '0', STR_PAD_LEFT);
$findRecord = DB::table('codes')
->where('code', $coupon)
->where('deal_id', $dealID)
->exists();
} while ($findRecord == 1);
As soon as the do-while loop ends, the record is inserted into the database. But the code above takes too much time.
The queries it produces look like the following in MySQL. For the example deal id, which already has over 8000 records, the code keeps querying until it finds a free code, and when traffic is high the app becomes slower.
select exists(select * from `codes` where `code` = 'XYZ1952' and `deal_id` = '100');
select exists(select * from `codes` where `code` = 'XYZ2562' and `deal_id` = '100');
select exists(select * from `codes` where `code` = 'XYZ7159' and `deal_id` = '100');
Multiple queries like this get stuck in the database. The codes table has around 500,000 records across multiple deal ids, but each deal id has fewer than 10,000 records; only a few have more than that.
Any suggestions on how I can improve the above code?
Or should I use the MAX() function to find the highest code, add 1, and insert that into the db?
When 80% of the numbers are taken, it takes time to find one that is not taken. The first test loses 80% of the time; the second loses 64% of the time, then 51%, etc. Once in a while, it will take a hundred tries to find a free number. And if the random number generator is "poor", it could literally never finish.
Let's make a very efficient way to generate the numbers.
Instead of looking for a free one, pre-determine all 9999 values. Do this by taking all the numbers 1..9999 and shuffling them. Store that list somewhere and "pick the next one" (instead of scrounging for an unused one).
This could be done with an extra column (indexed). Or an extra table. Or any number of other ways.
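A rough sketch of that idea without the extra table, as a variation: fetch the codes already used for the deal in one query, diff them against the full range, shuffle what is left and pick one (table and column names are the ones from the question; a UNIQUE index on (deal_id, code) is assumed as the final guard against races):
$keyword = "XYZ";
$dealID = 100;
// all possible codes for this keyword, zero-padded to 4 digits
$all = array_map(function ($n) use ($keyword) {
    return $keyword . str_pad($n, 4, '0', STR_PAD_LEFT);
}, range(1, 9999));
// one query instead of one query per guess
$used = DB::table('codes')
    ->where('deal_id', $dealID)
    ->pluck('code')
    ->all();
$free = array_values(array_diff($all, $used));
if (empty($free)) {
    throw new RuntimeException('No free codes left for this deal.');
}
shuffle($free);
$coupon = $free[0]; // unused at the time of the query; the unique index catches rare collisions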

How to limit mysql rows to select newest 50 rows

How do I limit a MySQL query to select the newest 50 rows, and add a next button so that the next 50 rows are selected, without knowing the exact number of rows?
The number of rows in the table keeps growing. To explain it clearly: I am developing a web app for a document management system as my project, using PHP, MySQL and HTML. Everything is set up, but there may be thousands of documents to retrieve.
Right now every document in my info table is retrieved at once on the home page, which does not look good. So I would like to add paging, so that only the newest 50 documents are placed on the first page, the next 50 on the second, and so on.
But how can I know the exact number of rows every time? I cannot change the code every time a new document is added, so num_rows may not be useful, I think...
Help me out please...
What you are looking for is called pagination, and the easiest way to implement simple pagination is using LIMIT x, y in your SQL queries.
You don't really need the total amount of rows you have, you just need two numbers:
The amount of elements you have already queried, so you know where the next query has to continue.
The amount of elements you want to list per query (for example 50, as you suggested).
Let's say you want to query the first 50 elements: you append LIMIT 0,50 to your query. After that you need to store somewhere the fact that you have already queried 50 elements, so that the next time you change the limit to LIMIT 50,50 (start from element number 50 and query the following 50 elements).
The order depends on the fields you populate when the entries are inserted. Normally you can alter your table and add the field created TIMESTAMP DEFAULT CURRENT_TIMESTAMP and then just use ORDER BY created, because from then on your entries will store the exact time they were created, which lets you find the most recent ones (if you have an AUTO_INCREMENT id you can look for the greatest values as well).
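For example, something like this (note that existing rows will get the time of the ALTER as their created value):
ALTER TABLE your_table ADD COLUMN created TIMESTAMP DEFAULT CURRENT_TIMESTAMP;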
This could be an example of this system using php and MySQL:
$page = 1;
if (!empty($_GET['page'])) {
    $page = filter_input(INPUT_GET, 'page', FILTER_VALIDATE_INT);
    if (false === $page || $page < 1) {
        $page = 1;
    }
}
// set the number of items to display per page
$items_per_page = 50;
// build query (newest first, using the created column described above)
$offset = ($page - 1) * $items_per_page;
$sql = "SELECT * FROM your_table ORDER BY created DESC LIMIT " . $offset . "," . $items_per_page;
I found this post really useful when I first tried to build this pagination system, so I recommend you check it out (it is the source of the example as well).
Hope this helped you, and sorry I couldn't provide a better example since I don't have your code.
Search for pagination using PHP & MySQL. That may come in handy with your problem.
To limit a MySQL query to fetch 50 rows, use the LIMIT keyword. You may need to find and store the last row id (the 50th row) so that you can continue with the 51st to 100th rows on the next page.
Post what you have done with your code. Please refer to whathaveyoutried[dot]com
Check this example from another post: https://stackoverflow.com/a/2616715/6257039. You could add an ORDER BY id or creation_date DESC to your query.

Delete row by row or a bulk

I would like to delete a bulk of data. The table has approximately 11,207,333 rows right now.
The data to be deleted is approximately 300k rows. I have two methods of doing this, but I am unsure which one performs faster.
My first option:
$start_date = "2011-05-01 00:00:00";
$end_date = "2011-05-31 23:59:59";
$sql = "DELETE FROM table WHERE date>='$start_date' and date <='$end_date'";
$mysqli->query($sql);
printf("Affected rows (DELETE): %d\n", $mysqli->affected_rows);
second option:
$query = "SELECT count(*) as count FROM table WHERE date>='$start_date' and date <='$end_date'";
$result = $mysqli->query($query);
$row = $result->fetch_array(MYSQLI_ASSOC);
$total = $row['count'];
if ($total > 0) {
$query = "SELECT * FROM table WHERE date>='$start_date' and date <='$end_date' LIMIT 0,$total";
$result = $mysqli->query($query);
while ($row = $result->fetch_array(MYSQLI_ASSOC)) {
$table_id = $row['table_id']; // primary key
$query = "DELETE FROM table where table_id = $table_id LIMIT 0,$total";
$mysqli->query($query);
}
}
This table's data is displayed to clients, and I am afraid that if the deletion goes wrong it will affect them.
I was wondering whether there is any method better than mine.
If you need more info from me, just let me know.
Thank you
In my opinion, the first option is faster.
The second option contains a loop, which I think will be slower because it keeps issuing a query for each table id.
As long as you provide the right start and end dates, I think you're safe with either option, but option 1 is faster in my opinion.
And yeah, option 2 just ends up doing the same deletion, only row by row inside the loop.
Option one is your best bet.
If you are afraid something will "go wrong" you could protect yourself by backing up the data first, exporting the rows you plan to delete, or implementing a logical delete flag.
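For example, copying the rows you plan to delete into a scratch table first is a one-statement safety net (the backup table name is just an example):
CREATE TABLE table_backup_2011_05 AS
SELECT * FROM `table`
WHERE date >= '2011-05-01 00:00:00' AND date <= '2011-05-31 23:59:59';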
Assuming that there is indeed a DELETE query in it, the second method is not only slower, it may break if another connection deletes one of the rows you intend to delete in your while loop, before it had a chance to do it. For it to work, you need to wrap it in a transaction:
mysqli_query("START TRANSACTION;");
# your series of queries...
mysql_query("COMMIT;");
This will allow the correct processing of your queries in isolation of the rest of the events happening in the db.
At any rate, if you want the first query to be faster, you need to tune your table definition by adding an index on the column used for the deletion, namely `date` (however, recall that this new index may hamper other queries in your app if there are already several indexes on that table).
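For example (idx_date is an arbitrary index name; `table` and `date` are the placeholder names from the question):
ALTER TABLE `table` ADD INDEX idx_date (`date`);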
Without that index, mysql will basically process the query more or less the same way as in method 2, but without:
PHP interpretation,
network communication and
query analysis overhead.
You don't need any SELECTs to run the delete in a loop. Just use LIMIT in your DELETE query and check whether any rows were affected:
$start_date = "2011-05-01 00:00:00";
$end_date = "2011-05-31 23:59:59";
$deletedRecords = 0;
$sql = "DELETE FROM table WHERE date>='$start_date' and date <='$end_date' LIMIT 100";
do {
    $mysqli->query($sql);
    $deletedRecords += $mysqli->affected_rows;
} while ($mysqli->affected_rows > 0);
printf("Affected rows (DELETE): %d\n", $deletedRecords);
Which method is better depends on the storage engine you are using.
If you are using InnoDB, this is the recommended way. The reason is that the DELETE statement runs in a transaction (even in auto-commit mode, every SQL statement is run in a transaction in order to be atomic; if it fails in the middle, the whole delete is rolled back and you won't end up with half of the data deleted). That means you will have a long running transaction and a lot of locked rows during it, which will block anyone who wants to update that data (it can block inserts too if there are unique indexes involved), and reads will be done via the rollback log. In other words, for InnoDB, large deletes are faster if performed in chunks.
In MyISAM, however, the delete locks the entire table. If you do it in lots of small chunks, you will have too many LOCK/UNLOCK commands executed, which will actually slow the process down. I would still do it in a loop for MyISAM, to give other processes a chance to use the table, but in larger chunks compared to InnoDB. I would never do it row by row for a MyISAM-based table because of the LOCK/UNLOCK overhead.

parallel cron jobs picking up the same SQL row

I've basically got a cron file which sends a multi_curl request to one file several times at once, so the requests run in parallel.
My cron file looks like this (it sends the parallel requests):
<?php
require "files/bootstrap.php";
$amount = array(
"10","11","12","13","14"
);
$urls = array();
foreach($amount as $cron_id) {
$urls[] = Config::$site_url."single_cron.php?cron_id=".$cron_id;
}
$pg = new ParallelGet($urls);
?>
Then inside my single_cron.php I've got the following query
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
ORDER BY uuid()
LIMIT 1
Even though I've got the uuid() ordering inside the query, the parallel requests still appear to pick up the same row somehow. What's the best way to prevent this? I've heard something about transactions.
I'm using PHP, so I'm open to any solution that works there.
Check out the SELECT ... FOR UPDATE command. It prevents other parallel queries from selecting the same row by blocking them until you commit. So your select should include some condition like last_processed_time > 60, and you should update the row right after selecting it, setting last_processed_time to the current time. Maybe you have a different mechanism to detect whether a row has recently been selected/processed; you can use that as well. The important thing is that SELECT ... FOR UPDATE places a lock on the row, so even if you run your queries in parallel, they will be serialized by the MySQL server.
This is the only way to be sure you don't have two queries selecting the same row; even if your ORDER BY uuid() worked correctly, you'd still select the same row in two parallel queries every now and then.
The correct way to do this with transactions is:
START TRANSACTION;
SELECT *
FROM accounts C JOIN proxies P
ON C.proxy_id = P.proxy_id
WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
AND C.status = 1
AND C.running = 0
AND P.proxy_status = 1
AND C.test_account = 0
LIMIT 1
FOR UPDATE;
(assume you have a column 'ID' in your accounts table that identifies rows uniquely)
UPDATE accounts
set last_used=now(), .... whatever else ....
where id=<insert the id you selected here>;
COMMIT;
The query that reaches the server first will be executed, and the returned row locked. All the other queries will be blocked at that point. Now you update whatever you want to. After the commit, the other queries from other processes will be executed. They won't find the row you just changed, because the last_used < ... condition isn't true anymore. One of these queries will find a row, lock it, and the others will get blocked again, until the second process does the commit. This continues until everything is finished.
Instead of START TRANSACTION, you can set autocommit to 0 in your session as well. And don't forget this only works if you use InnoDB tables. Check the link I gave you if you need more details.
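A rough PHP sketch of that flow, assuming a mysqli connection in $mysqli, the id column mentioned above, and that setting running = 1 is how you mark a row as taken (adjust the UPDATE to whatever your schema actually uses):
$mysqli->begin_transaction();
$row = $mysqli->query("
    SELECT C.id
    FROM accounts C JOIN proxies P ON C.proxy_id = P.proxy_id
    WHERE C.last_used < DATE_SUB(NOW(), INTERVAL 1 MINUTE)
      AND C.status = 1 AND C.running = 0
      AND P.proxy_status = 1 AND C.test_account = 0
    LIMIT 1
    FOR UPDATE
")->fetch_assoc();
if ($row) {
    $id = (int)$row['id'];
    // mark the row so the other parallel crons skip it once we commit
    $mysqli->query("UPDATE accounts SET last_used = NOW(), running = 1 WHERE id = $id");
    // ... do the actual work for this account here ...
}
$mysqli->commit();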

Time Taken for each row to be inserted to MySQL database

Hi, I have data to upload to a MySQL database and I need the time taken for each row to be inserted so that I can feed that number to a progress bar. I have already tried accomplishing this by determining the number of rows affected by the insertion and then computing a percentage from that, which is not the correct way to do it.
Here is the code:
$result3=mysql_query("INSERT INTO dest_table.create_info SELECT * from Profusion.source_cdr") or die(mysql_error());
$progress=mysql_affected_rows();
// Total processes
$total = $progress;
// Loop through process
for($i=1; $i<=$total; $i++){
    // Calculate the percentage
    $percent = intval($i/$total * 100)."%";
    echo $percent;
}
This just divides the loop counter by the total number of rows and multiplies by 100 to get a percentage, which is wrong.
I need the time taken for each row to be inserted and then find the percentage of that.
Your help will be highly appreciated.
It is not easy to determine the time of one insertion in this case, because a normal insertion like INSERT INTO my_table VALUES(2,4,5) is different from INSERT INTO my_table SELECT foo,bar,zoo FROM my_other_table.
In your case, the mysql server will try to keep both tables in memory, which needs a lot of memory if the tables are big. Check this blog post: Even faster: loading half a billion rows in MySQL revisited, for some details.
Anyway, in the above script, by the time mysql_query returns, the query has ALREADY EXECUTED, which means your percentage is computed after the query has finished.
EDIT: A solution is to copy the entries in chunks. Pseudo code of the solution would be like this:
mysql_query("SELECT count(*) FROM my_other_table");
$total_count = get_count_from_query_result();
$one_percent = intval($total_count/100);
for($i=0; $i<$total_count; $i += $one_percent)
{
mysql_query("INSERT INTO my_Table SELECT foo,bar,zoo FROM my_other_table LIMIT $i, $one_percent");
increment_progress_bar();
}
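If the old mysql_* functions are replaced with mysqli, a runnable version of the same idea against the tables from the question could look like this (the connection credentials and the ORDER BY column are assumptions):
$mysqli = new mysqli('localhost', 'user', 'pass', 'Profusion');
$total = (int)$mysqli->query("SELECT COUNT(*) FROM Profusion.source_cdr")->fetch_row()[0];
$chunk = max(1, (int)ceil($total / 100)); // roughly 1% per pass, never zero
for ($done = 0; $done < $total; $done += $chunk) {
    // the ORDER BY keeps the chunks stable between passes (assumes an id column)
    $mysqli->query(
        "INSERT INTO dest_table.create_info
         SELECT * FROM Profusion.source_cdr
         ORDER BY id
         LIMIT $done, $chunk"
    );
    $percent = (int)min(100, ($done + $chunk) / $total * 100);
    echo $percent . "%\n"; // feed this to the progress bar instead of echoing
}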
