How to update 100,000 record MySQL database efficiently - php

I have to update a 100,000+ row MySQL database from PHP that pulls data from an API. It fails if I try to do more than 5,000 at a time.
I'm thinking the best approach might be to do 5,000 at a time using an update query with LIMIT 0, 5000, timestamping those records with the time they were updated. Then, select the next 5,000 where the last-updated time is more than 20 minutes before the current time.
Can anyone please offer any help on how to construct this query? Or is this approach not optimal?

This is the solution I have gone with; rightly or wrongly, it works. To recap the problem: I have 100k rows, and I need to loop through these and pass a userid to an API that returns a JSON feed.
I use the data returned to update each record. For some reason this fails, either because of a timeout or a server 500 error, which I believe is due to the API. So instead of selecting all 100k records, I select just 5k (LIMIT 0, 5000), and I have added a column called 'updated' which is set to true once a record has been updated.
I keep doing this until all records are updated. When that happens, I set the updated column back to false and start the process again. The script runs on a cron job every 30 minutes and seems to work fine. I guess I could track down why it was timing out in the first place, but I suspect it is a php.ini issue (the timeout setting), which I don't have access to.
Thanks
Jonathan
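For reference, a minimal sketch of that batching approach in PDO might look like the following. The table name (users), the columns (userid, data, updated) and the fetch_from_api() helper are assumptions for illustration, not taken from the original script.
<?php
// One cron run: process up to 5,000 rows that haven't been updated yet,
// then reset the flag once every row has been processed.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rows = $pdo->query("SELECT id, userid FROM users WHERE updated = 0 LIMIT 5000")
            ->fetchAll(PDO::FETCH_ASSOC);

if (count($rows) === 0) {
    // Everything has been processed; clear the flag so the next cycle starts over.
    $pdo->exec("UPDATE users SET updated = 0");
    exit;
}

$update = $pdo->prepare("UPDATE users SET data = :data, updated = 1 WHERE id = :id");

foreach ($rows as $row) {
    $apiData = fetch_from_api($row['userid']);   // hypothetical API call returning decoded JSON
    $update->execute([':data' => json_encode($apiData), ':id' => $row['id']]);
}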

Create a temporary table, multi-insert the update data, and then run:
UPDATE `table`, `tmp`
SET `table`.`column` = `tmp`.`column`
WHERE `table`.`id` = `tmp`.`id`;
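Fleshed out a little, that pattern might look like the sketch below in PDO. The temporary table layout, the real table and column names, and the $updates array (id => new value pairs pulled from the API) are assumptions for illustration.
<?php
// Sketch: bulk-load the new values into a temporary table, then apply them
// to the real table with a single multi-table UPDATE.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec("CREATE TEMPORARY TABLE tmp (id INT PRIMARY KEY, `column` VARCHAR(255))");

// Multi-row insert; for very large batches you would chunk this into groups.
$placeholders = implode(',', array_fill(0, count($updates), '(?, ?)'));
$stmt = $pdo->prepare("INSERT INTO tmp (id, `column`) VALUES $placeholders");
$params = [];
foreach ($updates as $id => $value) {
    $params[] = $id;
    $params[] = $value;
}
$stmt->execute($params);

// One UPDATE applies every change at once.
$pdo->exec("UPDATE `table`, `tmp`
            SET `table`.`column` = `tmp`.`column`
            WHERE `table`.`id` = `tmp`.`id`");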

Related

Mysql unable to update a row, when multiple selects are in process or taking too much time

I have a table called Settings which has only one row. The settings are critical to my program, and that row is read by 200 to 300 users every second. I haven't used any caching yet. The problem is that I cannot update the Settings table through an API to change a value such as Limit, or anything else.
For example, changing the product limit from 5 to 10: the update query runs forever.
From Workbench I can update the record, but from the admin panel through the API it either doesn't update or takes far too long. The table is InnoDB.
1. Already tried locking with read/write.
2. Tried a transaction.
3. Made a view of the table and tried to update through it; the same issue remains.
4. The update query is fine from Workbench, but through the API it runs all day.
Is there any way I can lock the read operations on the table while I update it? I have only one row in the table.
Any help would be highly appreciated. Thanks in advance.
This sounds like a really good use case for using query cache.
The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again. The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.
The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries.
To enable the query cache, you can run:
SET GLOBAL query_cache_size = 1000000;
And then edit your mysql config file (typically /etc/my.cnf or /etc/mysql/my.cnf):
query_cache_size=1000000
query_cache_type=2
query_cache_limit=100000
And then for your query you can change it to:
SELECT SQL_CACHE * FROM your_table;
And that should make it so you are able to update the table (as it won't be constantly locked).
You would need to restart the server.
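If you want to confirm the cache is actually being used, the query cache variables and Qcache status counters can be inspected, for example from PHP (this assumes a MySQL version before 8.0, which removed the query cache; connection details are placeholders):
<?php
// Check whether the query cache is enabled and whether SELECTs are hitting it.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

foreach ($pdo->query("SHOW VARIABLES LIKE 'query_cache%'") as $row) {
    echo $row['Variable_name'] . ' = ' . $row['Value'] . PHP_EOL;
}
foreach ($pdo->query("SHOW STATUS LIKE 'Qcache%'") as $row) {
    // Qcache_hits rising over time means cached results are being served.
    echo $row['Variable_name'] . ' = ' . $row['Value'] . PHP_EOL;
}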
As an alternative, you could implement caching in your PHP application. I would use something like memcached, but as a very simplistic solution you could do something like:
// Refresh the cached settings from MySQL at most once per minute.
$cacheFile = "/path/to/settings.json";
$settings  = file_exists($cacheFile) ? json_decode(file_get_contents($cacheFile), true) : [];
$minute    = intval(date('i'));
if (!isset($settings['minute']) || $settings['minute'] !== $minute) {
    $settings = get_settings_from_mysql();
    $settings['minute'] = $minute;
    file_put_contents($cacheFile, json_encode($settings), LOCK_EX);
}
Are the queries being run in the context of transactions, with, say, a REPEATABLE READ transaction isolation level? It sounds like the update isn't able to complete due to a lock on the table, in which case caching isn't likely to help you, as the cache is purged on every write. More information on REPEATABLE READ can be found at https://www.percona.com/blog/2012/08/28/differences-between-read-committed-and-repeatable-read-transaction-isolation-levels/.

Unique Codes - Given to two users who hit script in same second

I have a bunch of unique codes in a database which should only be used once.
Two users hit the script that assigns them at the same time and got the same codes!
The script is in Magento and a user can order multiple codes. The issue is that if one customer orders 1,000 codes, the script grabs the top 1,000 codes from the DB into an array and then runs through them, setting them to "Used" and assigning them to the order. If a second user hits the same script at a similar time, it grabs the top 1,000 codes in the DB at that point, which overlap with the first batch because the first script hasn't had a chance to finish assigning them.
This is unfortunate but has happened quite a few times!
My idea was to create a new table: when a user hits the script, a row is created with "order_id" and "code_type". Then, in the same script, a check is done; if a row already exists in this new table and its "code_type" matches what the user is ordering, the script waits 60 seconds and checks again until the previous codes are issued and the table is empty, at which point it creates its own row and off it goes.
I am not sure if this is the best way, or whether, if two users hit at the same second again, two rows will just be inserted and we are off with the same problem!
Any advice is much appreciated!
The correct answer depends on the database you use.
For example, in MySQL with InnoDB, a possible solution is a transaction with SELECT ... LOCK IN SHARE MODE.
Schematically, it works by firing the following queries:
START TRANSACTION;
SELECT * FROM codes WHERE used = 0 LIMIT 1000 LOCK IN SHARE MODE;
-- save ids
UPDATE codes SET used=1 WHERE id IN ( ...ids....);
COMMIT;
More information at http://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
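In PHP with PDO, that transaction might look roughly like the sketch below (table and column names are assumptions). Note that the same manual page also documents SELECT ... FOR UPDATE, which takes an exclusive lock rather than a shared one.
<?php
// Sketch: reserve a batch of unused codes inside a single InnoDB transaction.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
try {
    $codes = $pdo->query(
        "SELECT id, code FROM codes WHERE used = 0 LIMIT 1000 LOCK IN SHARE MODE"
    )->fetchAll(PDO::FETCH_ASSOC);

    $ids = array_column($codes, 'id');
    if (count($ids) > 0) {
        $placeholders = implode(',', array_fill(0, count($ids), '?'));
        $stmt = $pdo->prepare("UPDATE codes SET used = 1 WHERE id IN ($placeholders)");
        $stmt->execute($ids);
    }

    $pdo->commit();
    // $codes now holds the codes reserved for this order only.
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}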

SELECT+UPDATE to avoid returning the same result

I have a cron task running every x seconds on n servers. It will "SELECT FROM table WHERE time_scheduled<CURRENT_TIME" and then perform a lengthy task on this result set.
My problem is now: How do I avoid having two separate servers perform the same task at the same time?
The idea is to update *time_scheduled* with a set interval after selecting it. But if two servers happen to run the query at the same time, that will be too late, no?
All ideas are welcome. It doesn't have to be a strict MySQL solution.
Thanks!
I am guessing you have a single MySQL instance, and connections from your n servers to run this processing job. You're implementing a job queue here.
The table you mention needs to use the InnoDB access method (or one of the other transaction-friendly access methods offered by Percona or MariaDB).
Do these items in your table need to be processed in batches? That is, are they somehow inter-related? Or is it possible for your server processes to handle them one-by-one? This is an important question, because you'll get better load balancing between your server processes if you can handle them individually or in small batches. Let's assume the small batches.
The idea is to prevent any server process from grabbing onto a row in your table if some other server process has that row. I've had to do this kind of thing a lot, and here is my suggestion; I know this works.
First, add an integer column to your table. Call it "working" or some such thing. Give it a default value of zero.
Second, assign a permanent id number to each server. The last part of the server's IP address (for example, if the server's IP address is 10.1.0.123, the id number is 123) is a good choice, because it's probably unique in your environment.
Then, when a server's grabbing work to do, use these two SQL queries.
UPDATE table
SET working = :this_server_id
WHERE working = 0
AND time_scheduled < CURRENT_TIME
ORDER BY time_scheduled
LIMIT 1
SELECT table_id, whatever, whatever
FROM table
WHERE working = :this_server_id
The first query will consistently grab a batch of rows to work on. If another server process comes in at the same time, it won't ever grab the same rows, because no process can grab rows unless working = 0. Notice that the LIMIT 1 will limit your batch size. You don't have to do this, but you can. I also threw in ORDER BY to process the rows first that have been waiting the longest. That's probably a useful way to do things.
The second query retrieves the information you need to do the work. Don't forget to retrieve the primary key values (I called them table_id) for the rows you're working on.
Then, your server process does whatever it needs to do.
When it's done, it needs to throw the row back into the queue for a later time. To do that, the server process needs to set the time_scheduled to whatever it needs to be, then to set working = 0. So, for example, you could run this query for each row you're processing.
UPDATE table
SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE,
working = 0
WHERE table_id = ?table_id_from_previous_query
That's it.
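Put together in PHP, one pass of that claim/work/release cycle might look like the sketch below (PDO; the table name jobs, the payload column, and the do_the_work() helper are assumptions, while the other column names match the queries above).
<?php
// Sketch of one worker pass: claim a row, fetch it, do the work, reschedule it.
$serverId = 123;  // this server's permanent id, e.g. the last octet of its IP address
$pdo = new PDO('mysql:host=dbhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Claim: only rows with working = 0 can be grabbed, so two servers never collide.
$claim = $pdo->prepare(
    "UPDATE jobs SET working = :id
      WHERE working = 0 AND time_scheduled < CURRENT_TIME
      ORDER BY time_scheduled LIMIT 1"
);
$claim->execute([':id' => $serverId]);

// Fetch whatever was just claimed (may be zero rows if the queue was empty).
$rows = $pdo->prepare("SELECT table_id, payload FROM jobs WHERE working = :id");
$rows->execute([':id' => $serverId]);

// Release: reschedule the row and hand it back to the queue.
$release = $pdo->prepare(
    "UPDATE jobs
        SET time_scheduled = CURRENT_TIME + INTERVAL 5 MINUTE, working = 0
      WHERE table_id = :table_id"
);

foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $job) {
    do_the_work($job);                                    // hypothetical processing step
    $release->execute([':table_id' => $job['table_id']]);
}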
Except for one thing. In the real world these queuing systems get fouled up sometimes. Server processes crash. Etc. Etc. See Murphy's Law. You need a monitoring query. That's easy in this system.
This query will give a list of all jobs that are more than five minutes overdue, along with the server that's supposed to be working on them.
SELECT working, COUNT(*) stale_jobs
FROM table
WHERE time_scheduled < CURRENT_TIME - INTERVAL 5 MINUTE
GROUP BY working
If this query comes up empty, all is well. If it comes up with lots of jobs with working set to zero, your servers aren't keeping up. If it comes up with jobs with working set to some server's id number, that server is taking a lunch break.
You can reset all the jobs assigned to the server that's gone to lunch with this query, if need be.
UPDATE table
SET working=0
WHERE working=?server_id_at_lunch
By the way, a compound index on (working, time_scheduled) will probably help this perform well.

What is the maximum records I should fetch from a MySQL database?

My server is running slowly as I am trying to fetch 200 records from a MySQL database (using PHP). They are posts that I need to display, and I know this is the cause because fetching 1 record is fast, while fetching 200 slows it down tremendously.
Is this a known problem, that fetching too many entries causes a slowdown?
Your PHP code is probably running a complicated function in a loop for each record, so it runs 200 times. That will slow the page response time. Fetching 200 records in MySQL is not a problem at all; it will run instantly if you run it in the MySQL terminal.
There are three possibilities that might slow your server down from your side.
Your database is not optimized. Optimizing your database can give you a tremendous performance increase
Your query is doing something wrong. We need to see what query you are running to get the 200 rows.
You are running an individual query for each row in a loop.
What I would suggest, though, is to base your query on this, e.g.
SELECT fields FROM table WHERE condition = required condition LIMIT 200
Also, if that query runs slowly, then do an EXPLAIN to see what indexing it's using:
EXPLAIN SELECT fields FROM table WHERE condition = required condition LIMIT 200
Fetching 200 rows should take only milliseconds.
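For example, fetching all 200 posts in one round trip and only formatting them inside the loop (assumed table and column names) is far cheaper than issuing one query per post:
<?php
// One query for all 200 posts, instead of one query per post inside the loop.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');

$stmt = $pdo->query(
    "SELECT id, title, body FROM posts ORDER BY created_at DESC LIMIT 200"
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $post) {
    // Only display work happens here; no further queries per row.
    echo htmlspecialchars($post['title']) . PHP_EOL;
}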
You can fetch as many records as you can store in your table.
For an unsigned INT the largest value is 4,294,967,295.
For an unsigned BIGINT the largest value is 18,446,744,073,709,551,615.
To access records fast, you need to define a LIMIT in the query.
You should fetch the records needed for displaying, no more no less. Do not fetch records for (simple) calculations, as that can be done in the query.
I would say that displaying 50 to 100 records is the max a user's brain can scan while still taking in all the info in the records.
I am an exception: when seeing more than 15 records, my brain tilts :)

Cached mysql inserts - Preserving Data integrity

I would like to do a lot of inserts, but would it be possible to only update MySQL after a while?
For example, if there is a query such as
Update views_table SET views = views + 1 WHERE id = 12;
Would it be possible to store this query until the views have gone up by 100, and then run the following once instead of running the query above 100 times?
Update views_table SET views = views + 100 WHERE id = 12;
Now, let's say that is done; then comes the problem of data integrity. Say there are 100 PHP processes open which are all about to run the same query. Unless there is a locking mechanism around incrementing the cached views, multiple processes may end up holding the same value of the cached count: say process 1 has 25 cached views, process 2 has 25 views, and process 3 has 27 views read from the file. Now process 3 finishes and increments the counter to 28. Then process 2 finishes just after process 3, which means the counter would be brought back down to 26.
So, do you have any solutions that are fast but preserve data integrity as well?
Thanks
As long as your queries use relative values (views = views + 5), there should be no problems.
Only if you store the value somewhere in your script and then calculate the new value yourself might you run into trouble. But why would you want to do that? Actually, why do you want to do all of this in the first place? :)
If you don't want to overload the database, you could use UPDATE LOW_PRIORITY table SET ...; the LOW_PRIORITY keyword will put the update in a queue and wait until the table is no longer being used by reads or inserts.
First of all, with these queries: regardless of when a process starts, UPDATE ... SET col = col + 1 is a safe operation, so it will never 'decrease' the counter.
Regarding 'store this query until the views have gone up by 100 and then run the following instead of running the query above 100 times': not really. You can store a counter in faster memory (memcached comes to mind), with a process that transfers it to the database once in a while, or store it in another table with an AFTER UPDATE trigger, but I don't really see the point in doing that.
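As a rough illustration of the memcached idea (the key name, the 100-view flush threshold, and the connection details are all assumptions): each page view increments an in-memory counter, and the counter is folded into MySQL in one UPDATE whenever it reaches the threshold.
<?php
// Sketch: count views in memcached and flush them to MySQL in batches.
$mc = new Memcached();
$mc->setOption(Memcached::OPT_BINARY_PROTOCOL, true);  // needed for increment()'s initial value
$mc->addServer('127.0.0.1', 11211);

$postId = 12;
$key    = "views_pending_$postId";

// Atomic increment; the third argument seeds the key at 1 if it does not exist yet.
$pending = $mc->increment($key, 1, 1);

if ($pending !== false && $pending >= 100) {
    $pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
    $stmt = $pdo->prepare("UPDATE views_table SET views = views + :n WHERE id = :id");
    $stmt->execute([':n' => $pending, ':id' => $postId]);

    // Subtract what was just flushed; other requests may have incremented in the meantime.
    $mc->decrement($key, $pending);
}
This stays approximate under heavy concurrency (two requests can both cross the threshold and flush), which is one reason the simple relative UPDATE is usually good enough.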
