MySQL InnoDB Delete/Purge rows from very large databases - php

I am having some issues with deleting data from InnoDB tables. From what I am reading, most people say the only way to free up space is to export the wanted data, create a new table and import it. This seems a very rubbish way of doing it, especially on a database which is nearly 3 TB.
The issue I am having is deleting data older than 3 months to try and free up disk space; once the data is deleted, the disk space does not seem to be freed up. Is there a way to purge or permanently delete rows/data to free up disk space?
Is there a more reliable way to free up disk space without dropping the database and restarting the service?
Please could somebody advise me on the best approach to handling deletion from a large database.
Much appreciate your time in advance.
Thanks :)

One relatively efficient approach is using database partitions and dropping old data by deleting partitions. It certainly requires more complicated maintenance, but it does work.
First, enable innodb_file_per_table so that each table (and partition) goes to its own file instead of a single huge ibdata file.
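As a rough sketch (exact behaviour depends on your MySQL version, and existing tables are not moved until they are rebuilt), the setting goes into my.cnf under [mysqld] as innodb_file_per_table = 1, and on recent versions it can also be toggled at runtime:
SET GLOBAL innodb_file_per_table = ON;        -- affects tables created from now on
SHOW VARIABLES LIKE 'innodb_file_per_table';  -- confirm the setting took effect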
Then, create a partitioned table, having one partition per range of time (day, month, week, you pick it), which results in files of some sensible size for your data set.
CREATE TABLE foo (
  tid INT(7) UNSIGNED NOT NULL,
  yearmonth INT(6) UNSIGNED NOT NULL,
  data VARBINARY(255) NOT NULL,
  PRIMARY KEY (tid, yearmonth)
) ENGINE=InnoDB
PARTITION BY RANGE (yearmonth) (
  PARTITION p201304 VALUES LESS THAN (201304),
  PARTITION p201305 VALUES LESS THAN (201305),
  PARTITION p201306 VALUES LESS THAN (201306)
);
Looking in the database data directory you'll find a file for each partition. In this example, partition 'p201304' will contain all rows having yearmonth < 201304, 'p201305' will have rows for 2013-04, 'p201306' will contain all rows for 2013-05.
In practice I have actually used an integer column containing a UNIX timestamp as the partitioning key - that way it's easier to adjust the size of the partitions as time goes by. The partition edges do not need to match any calendar boundaries; they can happen every 100000 seconds or whatever results in a sensible number of partitions (tens of them) while still keeping the files small enough for your data set.
Then, set up a maintenance process which creates new partitions for new data: ALTER TABLE foo ADD PARTITION (PARTITION p201307 VALUES LESS THAN (201307)) and deletes old partitions: ALTER TABLE foo DROP PARTITION p201304. Deletion of a large partition is almost as fast as deleting the file, and it'll actually free up disk space. Also, it won't fragment the other partitions by leaving empty space scattered inside them.
If possible, make sure your frequent queries only access one or a few partitions by specifying the partition key (yearmonth in the example above), or a range of it, in the WHERE clause - that'll make them run much faster as the database won't need to look inside all the partitions to find your data.
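As an illustration with the foo table above, you can check that pruning actually happens with EXPLAIN PARTITIONS on older MySQL (newer versions show the partitions in plain EXPLAIN):
-- only partition p201306 (rows with yearmonth 201305) should be listed here
EXPLAIN PARTITIONS
SELECT data FROM foo WHERE yearmonth = 201305 AND tid = 42;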

Even if you use the file_per_table option you will still have this issue. The only way to "fix" it is to rebuild individual tables:
OPTIMIZE TABLE bloated_table
Note that this will lock the table during the rebuild operation, and you must have enough free space to accommodate the new table. On some systems this is impractical.
If you're frequently deleting data, you probably need to rotate the entire table periodically. Dropping a table under InnoDB with file_per_table will liberate the disk space almost immediately. If you have one table per month, you can simply drop tables representing data from three months ago.
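A minimal sketch of that rotation, assuming one table per month and purely illustrative table names:
CREATE TABLE logs_2013_07 LIKE logs_2013_06;   -- next month's table, same structure
DROP TABLE logs_2013_04;                       -- with file_per_table, its .ibd file is released almost immediately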
Is it ugly to work with these? Yes. Is there an alternative? Not really. You can try going down the table partitioning rabbit hole, but that often ends up more trouble than it's worth.

Related

Database with 40000+ records per day

I am creating a database for keeping track of water usage per person for a city in South Florida.
There are around 40000 users, each one uploading daily readouts.
I was thinking of ways to set up the database and it would seem easier to give each user a separate table. This should ease the download of data because the server will not have to sort through a table with tens of millions of entries.
Am I wrong in my logic?
Is there any way to index table names?
Are there any other ways of setting up the DB to both raise the speed and keep the layout simple enough?
-Thank you,
Jared
p.s.
The essential data for the readouts are:
-locationID (table name in my idea)
-Reading
-ReadDate
-ReadTime
p.p.s. During this conversation, I uploaded 5k tables and the server froze. ~.O
Thanks for your help, y'all
Setting up thousands of tables is not a good idea. You should maintain one table and put all entries in that table. MySQL can handle a surprisingly large amount of data. The biggest issue that you will encounter is the number of queries that you can handle at a time, not the size of the database. For columns holding numbers use INT with the UNSIGNED attribute, and for columns holding text use VARCHAR of an appropriate size (or TEXT if the text is large).
Handling users
If you need to identify records with users, set up another table that might look something like this:
CREATE TABLE users (
  user_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);
When you need to link a record to the user, just reference the user's user_id. For the record information I would set up the SQL something like:
CREATE TABLE readings (
  id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  u_id INT(10) UNSIGNED NOT NULL,
  reading INT,        -- I'm not sure what your reading looks like; if it's a number use INT, if it's text use VARCHAR
  read_time TIMESTAMP
);
You can also consolidate the date and time of the reading to a TIMESTAMP.
Do NOT create a separate table for each user.
Keep indexes on the columns that identify a user and any other common constraints such as date.
Think about how you want to query the data at the end. How on earth would you sum up the data from ALL users for a single day?
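With a single table, that kind of question is one query. A sketch, reusing the hypothetical readings table from the earlier answer:
-- city-wide total for one day
SELECT SUM(reading) AS total_usage
FROM readings
WHERE read_time >= '2013-06-01' AND read_time < '2013-06-02';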
If you are worried about primary key, I would suggest keeping a LocationID, Date composite key.
Edit: Lastly, (and I do mean this in a nice way) but if you are asking these sorts of questions about database design, are you sure that you are qualified for this project? It seems like you might be in over your head. Sometimes it is better to know your limitations and let a project pass by, rather than implement it in a way that creates too much work for you and folks aren't satisfied with the results. Again, I am not saying don't, I am just saying have you asked yourself if you can do this to the level they are expecting. It seems like a large amount of users constantly using it. I guess I am saying that learning certain things while at the same time delivering a project to thousands of users may be an exceptionally high pressure environment.
Generally speaking, tables should represent sets of things. In your example, it's easy to identify the sets you have: users and readouts; so the theoretically best structure would be having those two tables, where the readout entries have a reference to the id of the user.
MySQL can handle very large amounts of data, so your best bet is to just try the user-readouts structure and see how it performs. Alternatively you may want to look into a document-based NoSQL database such as MongoDB or CouchDB, since storing readout reports as individual documents could be a good choice as well.
If you create a summary table that contains the monthly total per user, surely that would be the primary usage of the system, right?
Every month, you crunch the numbers and store the totals into a second table. You can prune the log table on a rolling 12 month period. i.e., The old data can be stuffed in the corner to keep the indexes smaller, since you'll only need to access it when the city is accused of fraud.
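A rough sketch of that rollup, again assuming the single hypothetical readings table and illustrative names:
CREATE TABLE monthly_usage (
  u_id INT UNSIGNED NOT NULL,
  month DATE NOT NULL,               -- first day of the month
  total INT UNSIGNED NOT NULL,
  PRIMARY KEY (u_id, month)
);
-- crunch one month's numbers into the summary table
INSERT INTO monthly_usage (u_id, month, total)
SELECT u_id, '2013-06-01', SUM(reading)
FROM readings
WHERE read_time >= '2013-06-01' AND read_time < '2013-07-01'
GROUP BY u_id;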
So exactly how you store the daily readouts isn't that big of a concern that you need to be freaking out about it. Giving each user his own table is not the proper solution. If you have tons and tons of data, then you might want to consider sharding via something like MongoDB.

When to Disable Keys

I have 35 large (1M plus rows with 35 columns) databases and each one gets updated with per row imports based on the primary key.
I am thinking about grouping these updates into blocks, disabling the keys and then re-enabling them.
Does anyone know when disabling the keys is recommended. i.e. If I was going to update a single record it'd be a terrible idea but if I wanted to update every record, it would be a good idea. Are there any mathematical formulae to follow for this or should I just keep benchmarking?
I would disable my keys when I notice that there are particular performance effects on inserts / updates. These are the most prone to getting bogged down in foreign-key problems. Inserting a row into a fully keyed/indexed table with tens of millions of records can be a nightmare if there are a lot of columns and non-null attributes in the insert. I wouldn't worry about keys/indices in a small table --- in smaller tables (let's say ~500,000 rows or less with maybe 6 or 7 columns) the keys probably aren't going to kill you.
As hinted above, you must also consider disabling the real-time management of indices when you are doing this. Indices, if maintained by the database in real-time, will slow down operations that change the tables in the database as well.
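For reference, the MySQL bulk-load pattern usually looks like the sketch below; note that DISABLE KEYS only defers maintenance of non-unique indexes and only has an effect on MyISAM tables, and big_table is just a placeholder name:
ALTER TABLE big_table DISABLE KEYS;   -- stop updating non-unique indexes per row
-- ... run the bulk INSERTs / LOAD DATA INFILE here ...
ALTER TABLE big_table ENABLE KEYS;    -- rebuild the non-unique indexes in one pass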
Regarding mathematical formulae: you can look at the trends in your insert/update speed when you do / do not have indices, with respect to database size. At some point (i.e. once your db reaches a certain size) you might find that the time for an insert starts increasing geometrically, or that it takes a steep "jump". If you can find these points in your system, you'll know when you are pushing it to the limit --- and a good admin might even be able to tell you WHY, at those points, the system performance is dropping.
Ironically -- sometimes keys/indices speed things up! Indices and keys can speed up some updates and inserts by making any subqueries or other lookups extremely fast (an index lookup rather than a full scan). So if an operation is slow you might ask yourself, "Is there some static data that I can index to speed this lookup up?"

Keeping track of on/offs without index-only tables

I'm looking for the best, most scaleable way of keeping track of a large number of on/offs. The on/offs apply to items, numbering from 1 to about 60 million. (In my case the on/off is whether a member's book has been indexed or not, a separate process.)
The on/offs must be searched rapidly by item number. They change constantly, so re-indexing costs can't be high. New items are added to the end of the table less often.
The ideal solution would, I think, be an index-only table--a table where every field is part of the primary key. I gather Oracle has this, but no engine for MySQL has it.
If I use MySQL I think my choice is between:
a two-field table--the item and the "on/off" field. Changes would be handled with UPDATE.
a one-field table--the item. Being in the table means being "on." Changes are handled with INSERT and DELETE.
I am open to other technologies. Storing the whole thing bitwise in a file?
You may have more flexibility by using option #1, but both would work effectively. However, if speed is an issue, you might want to consider creating a HEAP table that is pre-populated on MySQL startup and maintained in situ by your other processes. Also, use INT and ENUM field types in the table. Since it'll all be held in memory, it should be lightning fast, and because there is not a lot of data stored in the table, 60 million records shouldn't be a huge burden, memory-wise. If I had to roughly estimate:
INT(8) (for growth, assuming you'll exceed 100 million records someday)
ENUM('0','1')
So let's round up to 10 bytes per record:
10 * 60,000,000 = 600,000,000
That's about 572 MB worth of data, plus the index and additional overhead, so let's roughly say.. a 600 MB table. If you have that kind of memory to spare on your server, then a HEAP table might be the way to go.
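A minimal sketch of such a table (MEMORY/HEAP engine, illustrative names); keep in mind its contents vanish on restart, so it would have to be repopulated from the authoritative copy at startup:
CREATE TABLE item_flags (
  item_id INT UNSIGNED NOT NULL PRIMARY KEY,
  indexed ENUM('0','1') NOT NULL DEFAULT '0'
) ENGINE=MEMORY;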
60 million rows with an ID and an on/off bit should be no problem at all for MySQL if you are using InnoDB.
I have an InnoDB table that tracks which forum topics users have read and which post they've read up to. It contains 250 million rows, is 14 bytes wide, and it is updated constantly... It's doing 50 updates a second right now, and it is midnight, so at peak time it could be 100-200.
The indexed columns themselves are not updated after insert. The primary key is (user_id, topic_id) and I add new last_read information by using INSERT ... ON DUPLICATE KEY UPDATE.
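That pattern is roughly the following (table and column names here are illustrative, not the real schema):
INSERT INTO topic_reads (user_id, topic_id, last_read_post_id)
VALUES (123, 456, 789)
ON DUPLICATE KEY UPDATE last_read_post_id = VALUES(last_read_post_id);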
I measure constantly and I don't see any contention or performance problems but I do cache reads a lot in memcached since deciding when to expire the cache is very straightforward. I've been considering sharding this table by user in order to keep growth in check but I may not even bother storing it in MySQL forever.
I am open to other technologies. Storing the whole thing bitwise in a file?
Redis would be a great alternative. In particular, its sets and sorted sets would work for this (sorted sets might be nice if you need to grab a range of values using something other than the item ID - like last update time)
Redis might be worth checking out if you haven't already - it can be a great addition to an application that relies on MySQL and you'll likely find other good uses for it that simplify your life.

Will a MySQL database result be slowed down in relation to the number of columns in a table?

Using PHP, I am building an application that is MySQL database resource heavy, but I also need its data to be very flexible. Currently there are a number of tables which have an array of different columns (including some text, longtext, int, etc), and in the future I would like to expand the number of columns of these tables whenever new data-groups are required.
My question is, if I have a table with, say, 10 columns, and I expand this to 40 columns in the future, would a SQL query (via PHP) be slowed down considerably?
As long as the initial, small query that is only looking up the initial 10 columns is not a SELECT-all (*) query, I would like to know if more resources or processing is used because the source table is now much larger.
Also, will the database in general run slower or be much larger due to many columns now constantly remaining as NULL values (eg, whenever a new entry that only requires the first 10 columns is inserted)?
MyISAM and InnoDB behave differently in this regard, for various reasons.
For instance, InnoDB will partition disk space for each column on disk regardless of whether it has data in it, while MyISAM will compress the tables on disk. In a case where there are large amounts of empty columns, InnoDB will be wasting a lot of space. On the other hand, InnoDB does row-level locking, which means that (with caveats) concurrent read / writes to the same table will perform better (MyISAM does a table-level lock on write).
Generally speaking, it's probably not a good idea to have many columns in one table, particularly for volatility reasons. For instance, in InnoDB (possibly MyISAM also?), re-arranging columns or changing types of columns (i.e. varchar 128 -> varchar 255) in the middle of a table requires that all data in columns to the right be moved around on disk to make (or remove) space for the altered column.
With respect to your overall database design, it's best to aim for as many columns as possible to be not null, which saves space (you don't need the null flag on the column, and you don't store empty data) and also increases query and index performance. If many records will have a particular column set to null, you should probably move it to a foreign key relationship and use a JOIN. That way disk space and index overhead is only incurred for records that are actually holding information.
Likely, the best solution would be to create a new table with the additional fields and JOIN the tables when necessary. The original table remains unchanged, keeping its speed, but you can still get to the extra fields.
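A rough sketch of that layout, with hypothetical table and column names (it assumes an existing item table with an INT UNSIGNED id primary key):
CREATE TABLE item_extra (
  item_id INT UNSIGNED NOT NULL PRIMARY KEY,
  long_notes TEXT,
  FOREIGN KEY (item_id) REFERENCES item (id)
) ENGINE=InnoDB;
-- fetch the extra fields only when needed
SELECT i.*, e.long_notes
FROM item i
LEFT JOIN item_extra e ON e.item_id = i.id
WHERE i.id = 42;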
Optimization is not a trivial question; nothing can be predicted.
In general the short answer is: yes, it will be slower (because the DBMS at least needs to read more data from disk and send it, obviously).
But how much slower it will be depends very much on each particular case. You might not even see the difference, or it could be 10x slower.
In all likelihood, no it won't be slowed down considerably.
However, a better question to ask is: Which method of adding more fields results in a more elegant, understandable, maintainable, cost effective solution?
Usually the answer is "It depends." It depends on how the data is accessed, how the requirements will change, how the data is updated, and how fast the tables grow.
You can divide one master table into multiple transaction tables, which will give you much faster results than you are getting now. Also make the primary key a UNIQUE KEY in all the transaction tables as well as the master tables; it really helps make your queries faster.
Thanks.

How to do monthly refresh of large DB tables without interrupting user access to them

I have four DB tables in an Oracle database that need to be rewritten/refreshed every week or every month. I am writing this script in PHP using the standard OCI functions; it will read new data in from XML and refresh these four tables. The four tables have the following properties:
TABLE A - up to 2mil rows, one primary key (One row might take max 2K data)
TABLE B - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE C - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE D - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 120 bytes of data)
So I need to repopulate these tables without damaging the user experience. I obviously can't delete the tables and just repopulate them as it is a somewhat lengthy process.
I've considered just a big transaction where I DELETE FROM all of the tables and just regenerate them. I get a little concerned about the length of the transaction (don't know yet but it could take an hour or so).
I wanted to create temp table replicas of all of the tables and populate those instead. Then I could DROP the main tables and rename the temp tables. However you can't do the DROP and ALTER TABLE statements within a transaction as they always do an auto commit. This should be able to be done quickly (four DROP and four ALTER TABLE statements), but it can't guarantee that a user won't get an error within that short period of time.
Now, a combination of the two ideas: I'm considering doing the temp tables, then doing a DELETE FROM on all four original tables and then an INSERT INTO from the temp tables to repopulate the main tables. Since there are no DDL statements here, this would all work within a transaction. Then, however, I am wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well).
I would think this would be a common scenario. Is there a standard or recommended way of doing this? Any tips would be appreciated. Thanks.
You could have a synonym for each of your big tables. Create new incarnations of your tables, populate them, drop and recreate the synonyms, and finally drop your old tables. This has the advantage of (1) only one actual set of DML (the inserts) avoiding redo generation for your deletes and (2) the synonym drop/recreate is very fast, minimizing the potential for a "bad user experience".
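A hedged sketch of that swap in Oracle (all object names are illustrative; indexes, grants and constraints would need to be recreated on the new incarnation):
-- build and load the new incarnation
CREATE TABLE table_a_new AS SELECT * FROM table_a_old WHERE 1 = 0;
-- ... bulk-load table_a_new from the XML feed ...
-- repoint readers, then retire the previous incarnation
CREATE OR REPLACE SYNONYM table_a FOR table_a_new;
DROP TABLE table_a_old;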
Reminds me of a minor peeve of mine about Oracle's synonyms: why isn't there an ALTER SYNONYM command?
I'm assuming your users don't actually modify the data in these tables since it is reloaded from another source every week, so it doesn't really matter if you lock the tables for a full hour. The users can still query the data; you just have to size your rollback segment appropriately. A simple DELETE+INSERT therefore should work fine.
Now if you want to speed things up AND if the new data has little difference from the previous data, you could load the new data into temporary tables and update the tables with the delta using a combination of MERGE+DELETE, like this:
Setup:
CREATE TABLE a (ID NUMBER PRIMARY KEY, a_data CHAR(200));
CREATE GLOBAL TEMPORARY TABLE temp_a (
ID NUMBER PRIMARY KEY, a_data CHAR(200)
) ON COMMIT PRESERVE ROWS;
-- Load A
INSERT INTO a
(SELECT ROWNUM, to_char(ROWNUM) FROM dual CONNECT BY LEVEL <= 10000);
-- Load TEMP_A with extra rows
INSERT INTO temp_a
(SELECT ROWNUM + 100, to_char(ROWNUM + 100)
FROM dual
CONNECT BY LEVEL <= 10000);
UPDATE temp_a SET a_data = 'x' WHERE mod(ID, 1000) = 0;
This MERGE statement will insert the new rows and update the old rows only if they are different:
MERGE INTO a
USING (SELECT temp_a.id, temp_a.a_data
       FROM temp_a
       LEFT JOIN a ON (temp_a.id = a.id)
       WHERE decode(a.a_data, temp_a.a_data, 1) IS NULL) temp_a
ON (a.id = temp_a.id)
WHEN MATCHED THEN
  UPDATE SET a.a_data = temp_a.a_data
WHEN NOT MATCHED THEN
  INSERT (id, a_data) VALUES (temp_a.id, temp_a.a_data);
Done
You will then need to delete the rows that aren't in the new set of data:
DELETE FROM a WHERE a.id NOT IN (SELECT temp_a.id FROM temp_a);
100 rows deleted
You would insert into A, then into the child tables, and delete in reverse order.
Am I the only one (except Vincent) who would first test the simplest possible solution, i.e. DELETE/INSERT, before trying to build something more advanced?
Then, however, I am wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well).
Oracle manages memory quite well, it hasn't been written by a bunch of Java novices (oops it just came out of my mouth!). So the real question is, do you have to worry about the performance penalties of thrashing REDO and UNDO log files... In other words, build a performance test case and run it on your server and see how long it takes. During the DELETE / INSERT the system will be not as responsive as usual but other sessions can still perform SELECTs without any fears of deadlocks, memory leaks or system crashes. Hint: DB servers are usually disk-bound, so getting a proper RAID array is usually a very good investment.
On the other hand, if the performance is critical, you can select one of the alternative approaches described in this thread:
partitioning if you have the license
table renaming if you don't, but be mindful that DDLs on the fly can cause some side effects such as object invalidation, ORA-06508...
In Oracle you can partition your tables and indexes based on a date or time column; that way, to remove a lot of data you can simply drop the partition instead of performing a delete command.
We used to use this to manage monthly archives of 100 Million+ records and not have downtime.
http://www.oracle.com/technology/oramag/oracle/06-sep/o56partition.html is a super handy page for learning about partitioning.
I assume that this refreshing activity is the only way of data changing in these tables, so that you don't need to worry about inconsistencies due to other writing processes during the load.
All that deleting and inserting will be costly in terms of undo usage; you also would exclude the option of using faster data loading techniques. For example, your inserts will go much, much faster if you insert into the tables with no indexes, then apply the indexes after the load is done. There are other strategies as well, but both of them preclude the "do it all in one transaction" technique.
Your second choice would be my choice - build the new tables, then rename the old ones to a dummy name, rename the temps to the new name, then drop the old tables. Since the renames are fast, you'd have a less than one second window when the tables were unavailable, and you'd then be free to drop the old tables at your leisure.
If that one second window is unacceptable, one method I've used in situations like this is to use an additional locking object - specifically, a table with a single row that users would be required to select from before they access the real tables, and that your load process could lock in exclusive mode before it does the rename operation.
Your PHP script would use two connections to the db - one where you do the lock, the other where you do the loading, renaming and dropping. This way the implicit commits in the work connection won't release the lock taken in the other connection.
So, in the script, you'd do something like:
Connection 1:
Create temp tables, load them, create new indexes
Connection 2:
LOCK TABLE Load_Locker IN SHARE ROW EXCLUSIVE MODE;
Connection 1:
Perform renaming swap of old & new tables
Connection 2:
Rollback;
Connection 1:
Drop old tables.
Meanwhile, your clients would issue the following command immediately after starting a transaction (or a series of selects):
LOCK TABLE Load_Locker IN SHARE MODE;
You can have as many clients locking the table this way - your process above will block behind them until they have all released the lock, at which point subsequent clients will block until you perform your operations. Since the only thing you're doing inside the context of the SHARE ROW EXCLUSIVE lock is renaming tables, your clients would only ever block for an instant. Additionally, this level of granularity allows you to control how long the clients have a read-consistent view of the old table; without it, if you had a client that did a series of reads that took some time, you might end up changing the tables mid-stream and wind up with weird results if the early queries pulled old data & the later queries pulled new data. Using SET TRANSACTION READ ONLY would be another way of addressing this issue if you weren't using my approach.
The only real downside to this approach is that if your client read transactions take some time, you run the risk of other clients being blocked for longer than an instant, since any locks in SHARE MODE that occur after your load process issues its SHARE ROW EXCLUSIVE lock will block until the load process finishes its task. For example:
10:00 user 1 issues SHARE lock
10:01 user 2 issues SHARE lock
10:03 load process issues SHARE ROW EXCLUSIVE lock (and is blocked)
10:04 user 3 issues SHARE lock (and is blocked by load's lock)
10:10 user 1 releases SHARE
10:11 user 2 releases SHARE (and unblocks loader)
10:11 loader renames tables & releases SHARE ROW EXCLUSIVE (and releases user 3)
10:11 user 3 commences queries, after being blocked for 7 minutes
However, this is really pretty kludgy. Kinlan's solution of partitioning is most likely the way to go. Add an extra column to your source tables that contains a version number, partition your data based on that version, then create views that look like your current tables that only show data that shows the current version (determined by the value of a row in a "CurrentVersion" table). Then just do your load into the table, update your CurrentVersion table, and drop the partition for the old data.
Why not add a version column? That way you can add the new rows with a different version number. Create a view against the table that specifies the current version. After the new rows are added recompile the view with the new version number. When that's done, go back and delete the old rows.
What we do in some cases is have two versions of the tables, say SalesTargets1 and SalesTargets2 (an active and an inactive one). Truncate the records from the inactive one and populate it. Since no one but you uses the inactive one, there should be no locking issues or impact on the users while it is populating. Then have a view that selects all the information from the active table (it should be named what the current table is now, say SalesTargets in my example). Then to switch to the refreshed data, all you have to do is redefine the view.
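A sketch of that switch (here written with CREATE OR REPLACE VIEW, since that is how the view gets redefined in Oracle; the staging source table is hypothetical):
CREATE OR REPLACE VIEW SalesTargets AS SELECT * FROM SalesTargets1;   -- currently active
TRUNCATE TABLE SalesTargets2;                                         -- refresh the inactive copy
INSERT INTO SalesTargets2 SELECT * FROM staging_sales_targets;
CREATE OR REPLACE VIEW SalesTargets AS SELECT * FROM SalesTargets2;   -- fast switch for readers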
Have you evaluated the size of the delta (of changes)?
If the number of rows that get updated (as opposed to inserted) every time you put up a new rowset is not too high, then I think you should consider importing the new set of data into a set of staging tables, doing an update-where-exists / insert-where-not-exists (UPSERT) pass, and just refreshing your indexes (ok ok, indices).
Treat it like ETL.
I'm going with an upsert method here.
I added an additional "delete" column to each of the tables.
When I begin processing the feed, I set the delete field for every record to '1'.
Then I go through a series of updates if the record exists, or inserts if it does not. For each of those inserts/updates, the delete field is then set to zero.
At the end of the process I delete all records that still have a delete value of '1'.
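In rough outline (table and column names illustrative), the feed processing looks like:
UPDATE table_a SET delete_flag = 1;          -- 1) mark every existing record as stale
-- 2) per feed record: UPDATE ..., delete_flag = 0 WHERE <pk> = :id,
--    or INSERT with delete_flag = 0 if the row does not exist yet
DELETE FROM table_a WHERE delete_flag = 1;   -- 3) sweep whatever the feed no longer contains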
Thanks everybody for your answers. I found it very interesting/educational.
