I have to run an update query on 800k rows and I'm looking for the best way to do this. All rows are updated with the same values except one field (D in my example). This field can be 1 or 0. I use the update() method of Zend_Db.
I can think of 3 methods to do this:
Method 1: Update each row, one after another (with a foreach).
Method 2: Use an IF in the update to set the value of the field.
Method 3: Divide the rows into two groups (one with field = 1 and another with field = 0) and run two updates (UPDATE ... WHERE id IN (...)), one for each group.
The query looks like this:
$a_data = array(
    'A' => 'foo',
    'B' => 99,
    'C' => 0,
    'D' => $d_value, // 0 or 1, depending on the group
);
$where = array('id IN (?)' => $a_id);
$update = $this->_db->update($this->_name, $a_data, $where);
Which method is the best way to do this? Thanks.
For the record, updating 800k rows on a live production server isn't a good plan. Unless it is done at the MySQL level itself, the chances of this update stalling your server are high.
Now, that being said, and assuming you're running MySQL:
Method 1 isn't feasible, if for no other reason than that you have 800k rows => 800k queries. max_execution_time in php.ini will not allow the script to run that long. If you still want to try it, split the results into batches of 50-100-200 (depending on your server configuration) and run each batch with a time gap between them: do a batch, wait a second, do a batch, wait a second, and so on...
Method 2: I'd guess this depends on your particular problem, but it will be quicker.
Method 3: see the answer for Method 1, except it's not 800k rows at once; it depends on the ratio between your 0's and 1's. It will be 2 queries, each pretty large.
Usually, when there's a large batch update like this, I'd say: use MySQL from the command line.
If this is a PHP update script that you're running, the best results come from splitting the result set and updating 50-100-whatever rows at a time. It is time consuming, though (800,000 rows / 100 rows at a time = 8,000 runs of the script + a pause of a second after every updated batch).
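If you do go with Method 3 from a PHP script, a minimal sketch with Zend_Db could look like the following. It is only illustrative: $a_id_zero and $a_id_one are hypothetical arrays holding the ids whose D should become 0 and 1 respectively, and the chunk size and pause are just examples.

// Shared values for every row; only D differs between the two groups.
$base_data = array('A' => 'foo', 'B' => 99, 'C' => 0);

foreach (array(0 => $a_id_zero, 1 => $a_id_one) as $d_value => $a_ids) {
    // Chunk the ids so no single UPDATE touches hundreds of thousands of rows at once.
    foreach (array_chunk($a_ids, 1000) as $id_chunk) {
        $this->_db->update(
            $this->_name,
            $base_data + array('D' => $d_value),
            array('id IN (?)' => $id_chunk)
        );
        sleep(1); // give the server some breathing room between batches
    }
}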
Related
I'm using Laravel 5.7 to fetch large amounts of data (around 500k rows) from an API server and insert it into a table (call it Table A) quite frequently (at least every six hours, 24/7). However, only the changed rows need to be inserted on subsequent runs (though at least 60-70% of the items will change), so this table will quickly have tens of millions of rows.
I came up with the idea to make a helper table (call it Table B) to store all the new data into it. Before inserting everything into Table A, I want to compare it to the previous data (with Laravel, PHP) from Table B - so I will only insert the records that need to be updated. Again it will usually be around 60-70% of the records.
My first question is whether the above-mentioned approach is the preferred way of doing it in this situation (obviously I want to make it happen as fast as possible). I assume that searching for and updating the records in the table would take a lot more time and would keep the table busy / lock it. Is there a better way to achieve the same thing (meaning updating the records in the DB)?
The second issue I'm facing is the slow insert times. Right now I'm using a local environment (16GB RAM, i7-6920HQ CPU) and MySQL is inserting the rows very slowly (about 30-40 records at a time). The size of one row is around 50 bytes.
I know it can be made a lot faster by fiddling around with InnoDB's settings. However, I'd also like to think that I can do something on Laravel's side to improve performance.
Right now my Laravel code looks like this (only inserting 1 record at a time):
foreach ($response as $key => $value)
{
    DB::table('table_a')->insert([
        'test1' => $value['test1'],
        'test2' => $value['test2'],
        'test3' => $value['test3'],
        'test4' => $value['test4'],
        'test5' => $value['test5'],
    ]);
}
$response is an array.
So my second question: is there any way to increase the insert speed to something like 50k records/second - both at the Laravel application layer (by doing batch inserts) and at the MySQL InnoDB level (by changing the config)?
Current InnoDB settings:
innodb_buffer_pool_size = 256M
innodb_log_file_size = 256M
innodb_thread_concurrency = 16
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = normal
innodb_use_native_aio = true
MySQL version is 5.7.21.
If I forgot to tell/add anything, please let me know in a comment and I will do it quickly.
Edit 1:
The server that I'm planning to use will have an SSD - if that makes any difference. I assume MySQL inserts will still count as I/O.
Disable autocommit and manually commit at the end of the insertion
According to the MySQL 8.0 docs (8.5.5 Bulk Data Loading for InnoDB Tables), you can increase the INSERT speed by turning off autocommit:
When importing data into InnoDB, turn off autocommit mode, because it performs a log flush to disk for every insert. To disable autocommit during your import operation, surround it with SET autocommit and COMMIT statements:
SET autocommit=0;
... SQL import statements ...
COMMIT;
Another way to do it in Laravel is using database transactions:
DB::beginTransaction();
// Your inserts here
DB::commit();
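For example, wrapping the insert loop from the question in a transaction could look roughly like this (a sketch; table_a and $response are taken from the question, the error handling is an assumption):

use Illuminate\Support\Facades\DB;

DB::beginTransaction();

try {
    foreach ($response as $value) {
        DB::table('table_a')->insert([
            'test1' => $value['test1'],
            'test2' => $value['test2'],
            'test3' => $value['test3'],
            'test4' => $value['test4'],
            'test5' => $value['test5'],
        ]);
    }

    DB::commit(); // a single log flush instead of one per row
} catch (\Throwable $e) {
    DB::rollBack();
    throw $e;
}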
Use INSERT with multiple VALUES
Also according to the MySQL 8.0 docs (8.2.5.1 Optimizing INSERT Statements), you can optimize INSERT speed by using multiple VALUES lists in a single INSERT statement.
To do it with Laravel, you can just pass an array of values to the insert() method:
DB::table('your_table')->insert([
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
    [
        'column_a' => 'value',
        'column_b' => 'value',
    ],
]);
According to the docs, it can be many times faster.
Read the docs
Both MySQL docs links in this post have tons of tips on increasing INSERT speed.
Avoid using Laravel/PHP for inserting it
If your data source is (or can be) a CSV file, you can import the data a lot faster using mysqlimport.
Using PHP and Laravel to import data from a CSV file adds overhead, unless you need to do some data processing before inserting.
Thanks @Namoshek, I had the same problem as well. The solution is like this:
$users = array_chunk($data, 500, true);

foreach ($users as $key => $user) {
    Model::insert($user);
}
Depending on your data, you can also make use of array_push() and then insert.
Don't call insert() inside a foreach(), because it will execute n queries against the database when you have n rows of data.
First create an array of data matching the database column names, and then pass that array to the insert() function.
This will execute only one query against the database, regardless of how much data you have.
This is way, way faster.
$data_to_insert = [];

foreach ($response as $key => $value)
{
    array_push($data_to_insert, [
        'test1' => $value['test1'],
        'test2' => $value['test2'],
        'test3' => $value['test3'],
        'test4' => $value['test4'],
        'test5' => $value['test5'],
    ]);
}

DB::table('table_a')->insert($data_to_insert);
You need to do multi-row inserts, but you also need to chunk your inserts so you don't exceed your DB limits.
You can do this by chunking your array:
foreach (array_chunk($response, 1000) as $responseChunk)
{
    $insertableArray = [];

    foreach ($responseChunk as $value) {
        $insertableArray[] = [
            'test1' => $value['test1'],
            'test2' => $value['test2'],
            'test3' => $value['test3'],
            'test4' => $value['test4'],
            'test5' => $value['test5'],
        ];
    }

    DB::table('table_a')->insert($insertableArray);
}
You can increase the chunk size from 1000 until you approach your DB configuration limit. Make sure to leave some safety margin (0.6 times your DB limit).
You can't go any faster than this using Laravel.
When I make a call to an MSSQL database using ad-hoc SQL (for example, "SELECT foo FROM tablename"), I can give a batch size for that call. This is very useful when I expect a lot of data returned.
In my case, I have a table with over 200 million rows, and I'm getting them all. Yes, I have reasons for being a big slurpy data hog like this.
My DB guys said, "Hey, stop using ad-hoc SQL, here, use this nifty SP. It does the same thing."
So I'm using it with the mssql_execute() function call, but there's no way to specify a batch size when doing this as there is with mssql_query().
Not only do I have to do ini_set('memory_limit', '64G'); to make this work, I also have to sweat it out as the SP call takes upwards of half an hour to run. Once it runs, I can loop over mssql_fetch_row(), no problem, but that initial call is a nail-biter!
And once I'm done, I have a process taking up 57G of memory (on a 96G box) that then takes a full hour at 80% CPU just to unwind and garbage collect. Yeah, I could kill the process, but that's a hack.
There has to be a better way!
With ad-hoc SQL, I call mssql_query() with a batch size of 10,000 rows and process them and then go back for more. I can then do something like echo "Yes, indeed, I'm on row $i right now..." and salve my paranoia that everything is running right.
So... what's the appropriate way to do this if I'm forced to use the SP that my DB guys want me to use?
Assuming the table has a primary key, I suggest you ask the DB guys to add 2 parameters to the stored procedure, one for the number of rows to be returned and another for the starting key value. Pass NULL for the initial batch and the last key value returned for each subsequent batch. This will provide efficient forward-only pagination. For example:
CREATE PROCEDURE dbo.usp_select_tablename
    @NumRows int
    , @StartKey int = NULL
AS
IF @StartKey IS NULL
BEGIN
    SELECT TOP(@NumRows) foo, [Key]
    FROM tableName
    ORDER BY [Key];
END
ELSE
BEGIN
    SELECT TOP(@NumRows) foo, [Key]
    FROM tableName
    WHERE [Key] > @StartKey
    ORDER BY [Key];
END;
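On the PHP side, a paging loop against that procedure could look roughly like the sketch below, using the legacy mssql extension the question already mentions. The connection handling, the batch size, and the [Key] column name are assumptions; [Key] stands in for the table's primary key returned by the procedure above.

// $conn is an open mssql_connect() link; 10,000 rows per batch is just an example.
$batchSize = 10000;
$lastKey   = null;

do {
    $stmt = mssql_init('dbo.usp_select_tablename', $conn);
    mssql_bind($stmt, '@NumRows', $batchSize, SQLINT4);
    mssql_bind($stmt, '@StartKey', $lastKey, SQLINT4, false, $lastKey === null);

    $result   = mssql_execute($stmt);
    $rowCount = 0;

    while ($row = mssql_fetch_assoc($result)) {
        // ... process $row['foo'] here ...
        $lastKey = $row['Key']; // remember where this batch ended
        $rowCount++;
    }

    mssql_free_statement($stmt);
    echo "Processed a batch; last key so far: $lastKey\n";
} while ($rowCount === $batchSize);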
I have a cron task running every x seconds on n servers. It will SELECT ... FROM table WHERE time_scheduled < CURRENT_TIME and then perform a lengthy task on the result set.
My problem is now: how do I avoid having two separate servers perform the same task at the same time?
The idea is to update time_scheduled with a set interval after selecting it. But if two servers happen to run the query at the same time, that will be too late, no?
All ideas are welcome. It doesn't have to be a strictly MySQL solution.
Thanks!
I am guessing you have a single MySQL instance, and connections from your n servers to run this processing job. You're implementing a job queue here.
The table you mention needs to use the InnoDB access method (or one of the other transaction-friendly access methods offered by Percona or MariaDB).
Do these items in your table need to be processed in batches? That is, are they somehow inter-related? Or is it possible for your server processes to handle them one-by-one? This is an important question, because you'll get better load balancing between your server processes if you can handle them individually or in small batches. Let's assume the small batches.
The idea is to prevent any server process from grabbing onto a row in your table if some other server process has that row. I've had to do this kind of thing a lot, and here is my suggestion; I know this works.
First, add an integer column to your table. Call it "working" or some such thing. Give it a default value of zero.
Second, assign a permanent id number to each server. The last part of the server's IP address (for example, if the server's IP address is 10.1.0.123, the id number is 123) is a good choice, because it's probably unique in your environment.
Then, when a server's grabbing work to do, use these two SQL queries.
UPDATE table
SET working = :this_server_id
WHERE working = 0
AND time_scheduled < CURRENT_TIME
ORDER BY time_scheduled
LIMIT 1;

SELECT table_id, whatever, whatever
FROM table
WHERE working = :this_server_id;
The first query will consistently grab a batch of rows to work on. If another server process comes in at the same time, it won't ever grab the same rows, because no process can grab rows unless working = 0. Notice that the LIMIT 1 will limit your batch size. You don't have to do this, but you can. I also threw in ORDER BY to process the rows first that have been waiting the longest. That's probably a useful way to do things.
The second query retrieves the information you need to do the work. Don't forget to retrieve the primary key values (I called them table_id) for the rows you're working on.
Then, your server process does whatever it needs to do.
When it's done, it needs to throw the row back into the queue for a later time. To do that, the server process needs to set the time_scheduled to whatever it needs to be, then to set working = 0. So, for example, you could run this query for each row you're processing.
UPDATE table
SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
working = 0
WHERE table_id = ?table_id_from_previous_query
That's it.
Except for one thing. In the real world these queuing systems get fouled up sometimes. Server processes crash. Etc. Etc. See Murphy's Law. You need a monitoring query. That's easy in this system.
This query will give a list of all jobs that are more than five minutes overdue, along with the server that's supposed to be working on them.
SELECT working, COUNT(*) stale_jobs
FROM table
WHERE time_scheduled < CURRENT_TIME - INTERVAL 5 MINUTE
GROUP BY working;
If this query comes up empty, all is well. If it comes up with lots of jobs with working set to zero, your servers aren't keeping up. If it comes up with jobs with working set to some server's id number, that server is taking a lunch break.
You can reset all the jobs assigned to the server that's gone to lunch with this query, if need be.
UPDATE table
SET working=0
WHERE working=?server_id_at_lunch
By the way, a compound index on (working, time_scheduled) will probably help this perform well.
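Putting the pieces together, one polling pass of a worker could look roughly like this (a sketch using PDO; the table name jobs and the columns job_id / payload are stand-ins for your own schema, and $pdo / $serverId are assumed to already exist):

// 1. Claim at most one due row that no other server is working on.
$claim = $pdo->prepare(
    'UPDATE jobs
        SET working = :server_id
      WHERE working = 0
        AND time_scheduled < CURRENT_TIME
      ORDER BY time_scheduled
      LIMIT 1'
);
$claim->execute(['server_id' => $serverId]);

// 2. Fetch whatever this server has claimed.
$fetch = $pdo->prepare('SELECT job_id, payload FROM jobs WHERE working = :server_id');
$fetch->execute(['server_id' => $serverId]);

foreach ($fetch->fetchAll(PDO::FETCH_ASSOC) as $job) {
    // ... perform the lengthy task for $job ...

    // 3. Reschedule the row and release it back to the queue.
    $release = $pdo->prepare(
        'UPDATE jobs
            SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
                working = 0
          WHERE job_id = :job_id'
    );
    $release->execute(['job_id' => $job['job_id']]);
}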
I have to update 100,000+ rows in a MySQL database from PHP, pulling data from an API. It fails if I try to do more than 5,000 at a time.
I'm thinking the best approach might be to do 5,000 at a time by using an update query with LIMIT 0, 5000 and then timestamping those records with the time they were updated. Then, select the next 5,000 where the time last updated is more than 20 minutes before the current time.
Can anyone please offer any help on how to construct this query? Or is this approach not optimal?
So this is the solution I have gone with; rightly or wrongly, it works. To recap the problem: I have 100k rows, and I need to loop through these and pass a userid to an API that returns a JSON feed.
I use the data returned to update each record. For some reason this fails, either because of a timeout or a server 500 error, which I believe to be due to the API. So instead of selecting all 100k records, I just select 5k (LIMIT 0, 5000) and add a column called 'updated', marking it as true once a record has been updated.
I keep doing this until all records are updated. When that happens, I set the updated column to false and start the process again. This script runs on a cron job every 30 minutes and seems to work fine. I guess I could find out why it was timing out in the first place, but I suspect it's a php.ini issue (the timeout setting) which I don't have access to.
Thanks
Jonathan
Create a temporary table, multi-insert the update data, and then:
UPDATE `table`, `tmp`
SET `table`.`column` = `tmp`.`column`
WHERE `table`.`id` = `tmp`.`id`;
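From PHP, that flow could look roughly like this sketch using PDO (the tmp table layout, the chunk size, and the column name are placeholders; $pdo and $rows, an array of ['id' => ..., 'column' => ...] pairs, are assumed to exist):

// Stage the new values in a temporary table, then apply them with one joined UPDATE.
$pdo->exec('CREATE TEMPORARY TABLE `tmp` (
    `id` INT PRIMARY KEY,
    `column` VARCHAR(255)
)');

// Multi-insert the update data in chunks to keep each statement a reasonable size.
foreach (array_chunk($rows, 1000) as $chunk) {
    $placeholders = rtrim(str_repeat('(?, ?),', count($chunk)), ',');
    $params = [];
    foreach ($chunk as $row) {
        $params[] = $row['id'];
        $params[] = $row['column'];
    }
    $stmt = $pdo->prepare("INSERT INTO `tmp` (`id`, `column`) VALUES $placeholders");
    $stmt->execute($params);
}

// One multi-table UPDATE instead of thousands of single-row updates.
$pdo->exec('UPDATE `table`, `tmp`
               SET `table`.`column` = `tmp`.`column`
             WHERE `table`.`id` = `tmp`.`id`');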
I would like to do a lot of inserts, but would it be possible to only update MySQL after a while?
For example, if there is a query such as:
UPDATE views_table SET views = views + 1 WHERE id = 12;
Could it not be possible to store this query until the views have gone up by 100, and then run the following instead of running the query above 100 times:
UPDATE views_table SET views = views + 100 WHERE id = 12;
Now, let's say that is done; then comes the problem of data integrity. Let's say there are 100 PHP processes which are all about to run the same query. Unless there is a locking mechanism on incrementing the cached views, multiple processes may hold the same cached value: say process 1 has 25 cached views, process 2 has 25 views, and process 3 has 27 views. Now process 3 finishes and increments the counter to 28. Then process 2 finishes just after process 3, which means the counter would be brought back down to 26.
So do you have any solutions that are fast but also keep the data safe?
Thanks
As long as your queries use relative values (views = views + 5), there should be no problem.
Only if you store the value somewhere in your script and then calculate the new value yourself might you run into trouble. But why would you want to do that? Actually, why do you want to do all of this in the first place? :)
If you don't want to overload the database, you could use UPDATE LOW_PRIORITY table SET ...; the LOW_PRIORITY keyword will put the update action in a queue and wait until the table is no longer being used by reads or inserts.
First of all, with these queries, regardless of when a process starts, UPDATE .. SET col = col + 1 is a safe operation, so it will never 'decrease' the counter.
Regarding 'store this query until the views have gone up by 100 and then run the following instead of running the query above 100 times': not really. You can store a counter in faster memory (memcached comes to mind), with a process that transfers it to the database once in a while, or store it in another table with an AFTER UPDATE trigger, but I don't really see the point of doing that.
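If you did want to buffer the counter outside MySQL anyway, a minimal sketch with the Memcached extension could look like this (the cache key, the threshold of 100, and the hard-coded id are illustrative; $memcached is a connected \Memcached instance and $pdo an open PDO connection):

$key = 'views:12';      // hypothetical cache key for the row with id = 12
$flushThreshold = 100;

// Memcached::increment() is atomic, so concurrent PHP processes can't lose counts.
$count = $memcached->increment($key, 1);
if ($count === false) {
    // Key didn't exist yet: seed it, or retry the increment if another process just did.
    if ($memcached->add($key, 1)) {
        $count = 1;
    } else {
        $count = $memcached->increment($key, 1);
    }
}

// Each process gets a unique value back from the atomic increment, so exactly one
// of them sees each multiple of the threshold and flushes that slice to MySQL.
if ($count !== false && $count % $flushThreshold === 0) {
    $stmt = $pdo->prepare('UPDATE views_table SET views = views + ? WHERE id = 12');
    $stmt->execute([$flushThreshold]);
}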