I have 35 large databases (1M+ rows, 35 columns each), and each one gets updated with per-row imports based on the primary key.
I am thinking about grouping these updates into blocks, disabling the keys and then re-enabling them.
Does anyone know when disabling the keys is recommended? For example, if I were going to update a single record it would be a terrible idea, but if I wanted to update every record it would presumably be a good one. Are there any mathematical formulae to follow for this, or should I just keep benchmarking?
I would disable my keys when I notice particular performance effects on inserts/updates, since these are the operations most prone to getting bogged down in foreign-key checks. Inserting a row into a fully keyed/indexed table with tens of millions of records can be a nightmare if there are a lot of columns and non-null attributes in the insert. I wouldn't worry about keys/indices in a small table; in smaller tables (let's say ~500,000 rows or less, with maybe 6 or 7 columns) the keys probably aren't going to kill you.
As hinted above, also consider disabling the real-time maintenance of indices when you do this: indices that the database maintains in real time will likewise slow down any operation that changes the table.
Regarding mathematical formulae: you can look at the trend in your insert/update speed with and without indices, as a function of table size. At some point (i.e., once your database reaches a certain size) you might find that the time for an insert starts increasing geometrically, or takes a steep "jump". If you can find these points in your system, you'll know when you are pushing it to the limit, and a good admin might even be able to tell you WHY performance drops at those points.
Ironically, sometimes keys/indices speed things up! Indices and keys can speed up some updates and inserts by making any subqueries or other lookup operations extremely fast (an index lookup is roughly logarithmic rather than a full scan). So if an operation is slow, ask yourself: "Is there some static data that I can index to speed the lookup up?"
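For reference, a minimal sketch of the usual disable/re-enable pattern, assuming a hypothetical table named big_table. Note that ALTER TABLE ... DISABLE KEYS only skips maintenance of non-unique indexes and applies to MyISAM; for InnoDB the common workaround is to relax the session checks shown below.

-- MyISAM: stop maintaining non-unique indexes during the block of updates
ALTER TABLE big_table DISABLE KEYS;
-- ... run the grouped per-row imports here ...
ALTER TABLE big_table ENABLE KEYS;  -- rebuilds the disabled indexes in one pass

-- InnoDB has no DISABLE KEYS; the usual approximation is to relax session checks
SET unique_checks = 0;
SET foreign_key_checks = 0;
-- ... grouped imports ...
SET unique_checks = 1;
SET foreign_key_checks = 1;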
On a daily basis, I get a source CSV file that has 250k rows and 40 columns. It's 290 MB. I will need to filter it because it has more rows and more columns than I need.
For each row that meets the filtering criteria, we want to update it in the destination system one record at a time using its PHP API.
What will be the best approach for everything up until the API call (the reading / filtering / loading) for the fastest performance?
1. Iterating through each row of the file, deciding if it's a row I want, grabbing only the columns I need, and then passing it to the API?
2. Loading ALL records into a temporary MySQL table using LOAD DATA INFILE, then querying the table for the rows and fields I want, and iterating through the result set, passing each record to the API?
Is there a better option?
Thanks!
I need to make an assumption first: the majority of the 250K rows will go to the database. If only a very small percentage will, then iterating over the file and sending the qualifying rows in a batch is definitely faster.
Different configurations could affect both approaches, but generally speaking, the 2nd approach performs better and takes less scripting effort.
Approach 1: the worst thing you can do is send each row to the server individually; that means more round trips and more small commits.
What you can improve here is to send rows in batches, maybe a few hundred at a time. You will see a much better result.
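A rough illustration of that batching on the MySQL side, assuming a hypothetical staging table staging(col_a, col_b, col_c): instead of one INSERT and one commit per row, group a few hundred rows into each statement and transaction.

-- one round trip and one commit for the whole batch instead of one per row
START TRANSACTION;
INSERT INTO staging (col_a, col_b, col_c) VALUES
  ('r1a', 'r1b', 'r1c'),
  ('r2a', 'r2b', 'r2c'),
  ('r3a', 'r3b', 'r3c');  -- ...a few hundred value tuples per statement
COMMIT;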
Approach 2: MyISAM will be faster than InnoDB because it avoids all the overhead and complexity of ACID. If MyISAM is acceptable to you, try it first.
For InnoDB, there is a better Approach 3 (which is actually a mix of approach 1 and approach 2).
Because InnoDB doesn't take a table-level lock, you can import multiple files concurrently, i.e., split the CSV into several files and run LOAD DATA on them from your scripts in parallel. Don't add an auto_increment key to the table at first, to avoid the auto-inc lock.
LOAD DATA, but assign columns you don't need to keep to throwaway user variables such as @dummy1, @dummy2, etc. That gets rid of the extra columns. Load into a temp table. (1 SQL statement.)
Do any cleansing of the data. (Some number of SQL statements, but no loop, if possible.)
Do one INSERT INTO real_table SELECT ... FROM tmp_table WHERE ... to both filter out unnecessary rows and copy the desired ones into the real table. (1 SQL statement; the full pipeline is sketched after the list below.)
You did not mention any need for step 2. Some things you might need:
Computing one column from other(s).
Normalization.
Parsing dates into the right format.
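A minimal sketch of that pipeline, with hypothetical table and column names (tmp_csv, real_table, kept_a, kept_b, a raw date column) and a made-up date format and filter value standing in for the real schema:

-- step 1: load, discarding unwanted columns into throwaway variables,
-- and parse the date on the way in (part of the optional cleansing step)
LOAD DATA INFILE '/path/to/daily.csv'
INTO TABLE tmp_csv
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(kept_a, @dummy1, kept_b, @raw_date, @dummy2)
SET clean_date = STR_TO_DATE(@raw_date, '%m/%d/%Y');

-- step 3: filter the rows you want and copy them into the real table
INSERT INTO real_table (kept_a, kept_b, clean_date)
SELECT kept_a, kept_b, clean_date
FROM tmp_csv
WHERE kept_a = 'some_filter_value';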
In one project I did:
1GB of data came in every hour.
Load into a temp table.
Normalize several columns (2 SQLs per column)
Some other processing.
Summarize the data into about 6 summary tables. Sample: INSERT INTO summary SELECT a, b, COUNT(*), SUM(c) FROM tmp GROUP BY a, b; Or an INSERT ... ON DUPLICATE KEY UPDATE to deal with rows already existing in summary (sketched just after this list).
Finally copy the normalized, cleaned, data into the 'real' table.
The entire process took 7 minutes.
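A minimal sketch of that upsert variant, assuming a hypothetical summary table keyed on (a, b) with counter columns cnt and total:

-- fold the hour's data into an existing summary table instead of
-- failing on duplicate (a, b) keys
INSERT INTO summary (a, b, cnt, total)
SELECT a, b, COUNT(*), SUM(c)
FROM tmp
GROUP BY a, b
ON DUPLICATE KEY UPDATE
  cnt   = cnt + VALUES(cnt),
  total = total + VALUES(total);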
Whether to use MyISAM, InnoDB, or MEMORY: you would need to benchmark your own case. No index is needed on the tmp table, since each processing step does a full table scan.
So, 0.3 GB every 24 hours should be no sweat.
I am having some issues with deleting data from InnoDB tables. From what I am reading, most people say the only way to free up space is to export the wanted data, create a new table, and import it back. This seems a very rubbish way of doing it, especially on a dataset which is nearly 3 TB.
The issue I am having is deleting data older than 3 months to try and free up disk space; once the data is deleted, the disk space does not seem to be freed. Is there a way to purge or permanently delete rows/data to free up disk space?
Is there a more reliable way to free up disk space without dropping the database and restarting the service?
Could somebody please advise me on the best approach to handling deletions on a large database?
I much appreciate your time in advance.
Thanks :)
One relatively efficient approach is using database partitions and dropping old data by deleting partitions. It certainly requires more complicated maintenance, but it does work.
First, enable innodb_file_per_table so that each table (and partition) goes to its own file instead of a single huge ibdata file.
Then, create a partitioned table, having one partition per range of time (day, month, week, you pick it), which results in files of some sensible size for your data set.
CREATE TABLE foo (
  tid INT(7) UNSIGNED NOT NULL,
  yearmonth INT(6) UNSIGNED NOT NULL,
  data VARBINARY(255) NOT NULL,
  PRIMARY KEY (tid, yearmonth)
) ENGINE=InnoDB
PARTITION BY RANGE (yearmonth) (
  PARTITION p201304 VALUES LESS THAN (201304),
  PARTITION p201305 VALUES LESS THAN (201305),
  PARTITION p201306 VALUES LESS THAN (201306)
);
Looking in the database data directory you'll find a file for each partition. In this example, partition 'p201304' will contain all rows having yearmonth < 201304, 'p201305' will have rows for 2013-04, 'p201306' will contain all rows for 2013-05.
In practice I have actually used an integer column containing a UNIX timestamp as the partitioning key; that way it's easier to adjust the size of the partitions as time goes by. The partition edges do not need to match any calendar boundaries: they can occur every 100,000 seconds, or whatever results in a sensible number of partitions (tens of them) while still keeping the files small enough for your data.
Then, set up a maintenance process which creates new partitions for new data: ALTER TABLE foo ADD PARTITION (PARTITION p201307 VALUES LESS THAN (201307)) and deletes old partitions: ALTER TABLE foo DROP PARTITION p201304. Deletion of a large partition is almost as fast as deleting the file, and it'll actually free up disk space. Also, it won't fragment the other partitions by leaving empty space scattered inside them.
If possible, make sure your frequent queries only access one or a few partitions by specifying the partition key (yearmonth in the example above), or a range of it, in the WHERE clause - that'll make them run much faster as the database won't need to look inside all the partitions to find your data.
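For example, a lookup constrained on the partitioning column only touches one partition, which you can verify with EXPLAIN (EXPLAIN PARTITIONS on MySQL 5.6 and earlier; plain EXPLAIN shows a partitions column on newer versions). The tid value below is just a placeholder.

-- only partition p201306 (the rows for 2013-05) needs to be read
SELECT data
FROM foo
WHERE tid = 12345
  AND yearmonth = 201305;

EXPLAIN PARTITIONS
SELECT data FROM foo WHERE tid = 12345 AND yearmonth = 201305;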
Even if you use the file_per_table option you will still have this issue. The only way to "fix" it is to rebuild individual tables:
OPTIMIZE TABLE bloated_table
Note that this will lock the table during the rebuild operation, and you must have enough free space to accommodate the new table. On some systems this is impractical.
If you're frequently deleting data, you probably need to rotate the entire table periodically. Dropping a table under InnoDB with file_per_table will liberate the disk space almost immediately. If you have one table per month, you can simply drop tables representing data from three months ago.
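A minimal sketch of that rotation, with hypothetical month-suffixed table names (events_YYYYMM) standing in for whatever naming scheme you use:

-- create next month's table with the same structure
CREATE TABLE events_201307 LIKE events_201306;

-- dropping the table from three months ago returns its .ibd file's space
-- to the filesystem almost immediately (with innodb_file_per_table)
DROP TABLE events_201304;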
Is it ugly to work with these? Yes. Is there an alternative? Not really. You can try going down the table partitioning rabbit hole, but that often ends up more trouble than it's worth.
Part of my project involves storing and retrieving loads of IPs in my database. I have estimated that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What will be the approximate speeds of the following queries:
SELECT * FROM table WHERE ip = '$ip' LIMIT 1
INSERT INTO table (ip, xxx, yyy) VALUES ('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables whose names correspond to the first two octets of all possible IPv4 addresses, where each table would then hold a maximum of 255^2 rows accommodating all possible second halves of the IP? For example, to query the IP address "216.27.61.137" it would be split into two parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then check whether there is a row for "p2", and if so pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip, your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will slow down, as MySQL will have to update the index on each INSERT.
If your table is not indexed then the second query will execute almost immediately as MySQL can just add the row at the end of the table. Your first query may become unusable as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed the first query but slow the second. Exactly what happens will depend on server hardware, which database engine you choose, configuration of MySQL, what else is going on at the time. If performance is likely to be critical, do some tests first.
Before doing anything of that sort, read this question (and, more importantly, its answers): How to store an IP in mySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
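A minimal sketch of that, assuming a hypothetical visits table: store the address as an unsigned 32-bit integer via INET_ATON()/INET_NTOA() and index it, rather than splitting data across hundreds of tables.

CREATE TABLE visits (
  ip  INT UNSIGNED NOT NULL,  -- IPv4 packed into 4 bytes
  xxx VARCHAR(64),
  yyy VARCHAR(64),
  KEY idx_ip (ip)
) ENGINE=InnoDB;

INSERT INTO visits (ip, xxx, yyy)
VALUES (INET_ATON('216.27.61.137'), 'some_xxx', 'some_yyy');

-- the equality lookup uses idx_ip, so it stays fast even at hundreds of millions of rows
SELECT INET_NTOA(ip) AS ip, xxx, yyy
FROM visits
WHERE ip = INET_ATON('216.27.61.137')
LIMIT 1;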
First and foremost, you can't predict how long a query will take, even if we knew everything about the database, the database server, the network performance, and a thousand other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.
I'm looking for the best, most scalable way of keeping track of a large number of on/offs. The on/offs apply to items, numbered from 1 to about 60 million. (In my case the on/off is whether a member's book has been indexed or not, a separate process.)
The on/offs must be searched rapidly by item number. They change constantly, so re-indexing costs can't be high. New items are added to the end of the table less often.
The ideal solution would, I think, be an index-only table, i.e. a table where every field is part of the primary key. I gather Oracle has this (index-organized tables), but no engine for MySQL has it.
If I use MySQL I think my choice is between:
a two-field table: the item and the "on/off" field. Changes would be handled with UPDATE.
a one-field table: the item. Being in the table means being "on." Changes are handled with INSERT and DELETE.
I am open to other technologies. Storing the whole thing bitwise in a file?
You may have more flexibility by using option #1, but both would work effectively. However, if speed is an issue, you might want to consider creating a HEAP table that is pre-populated on MySQL startup and maintained in situ by your other processes. Also, use INT and ENUM field types in the table. Since it'll all be held in memory it should be lightning fast, and because there is not a lot of data stored per row, 60 million records shouldn't be a huge burden, memory-wise. If I had to roughly estimate:
int(8) (for growth, assuming you'll exceed 100 million records someday)
enum('0','1')
So let's round up to 10 bytes per record:
10 * 60,000,000 = 600,000,000 bytes
That's about 572 MB worth of data, plus the index and additional overhead, so let's roughly say.. a 600 MB table. If you have that kind of memory to spare on your server, then a HEAP table might be the way to go.
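A minimal sketch of such a table, with hypothetical names (item_flags, item_id, is_indexed) standing in for the real schema; note that MEMORY/HEAP tables are emptied on restart, so the pre-population step on startup mentioned above is required.

CREATE TABLE item_flags (
  item_id    INT UNSIGNED NOT NULL,
  is_indexed ENUM('0','1') NOT NULL DEFAULT '0',
  PRIMARY KEY (item_id)
) ENGINE=MEMORY;

-- flip a flag and look it up; both are searched by primary key, so they stay fast
UPDATE item_flags SET is_indexed = '1' WHERE item_id = 123456;
SELECT is_indexed FROM item_flags WHERE item_id = 123456;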
60 million rows with an ID and an on/off bit should be no problem at all for MySQL if you are using InnoDB.
I have an InnoDB table that tracks which forum topics users have read and which post they've read up to. It contains 250 million rows, is 14 bytes wide, and it is updated constantly... It's doing 50 updates a second right now, and it is midnight, so peak time could be 100-200.
The indexed columns themselves are not updated after insert. The primary key is (user_id, topic_id) and I add new last_read information by using INSERT ... ON DUPLICATE KEY UPDATE.
I measure constantly and I don't see any contention or performance problems but I do cache reads a lot in memcached since deciding when to expire the cache is very straightforward. I've been considering sharding this table by user in order to keep growth in check but I may not even bother storing it in MySQL forever.
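A minimal sketch of that pattern, with hypothetical table/column names (topic_reads, last_read_post_id) standing in for the real schema:

-- one statement either inserts the first read marker for (user, topic)
-- or advances the existing one; the PRIMARY KEY (user_id, topic_id)
-- is what makes the duplicate-key path work
INSERT INTO topic_reads (user_id, topic_id, last_read_post_id)
VALUES (42, 1001, 555123)
ON DUPLICATE KEY UPDATE
  last_read_post_id = GREATEST(last_read_post_id, VALUES(last_read_post_id));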
"I am open to other technologies. Storing the whole thing bitwise in a file?"
Redis would be a great alternative. In particular, its sets and sorted sets would work for this (sorted sets might be nice if you need to grab a range of values using something other than the item ID - like last update time)
Redis might be worth checking out if you haven't already - it can be a great addition to an application that relies on MySQL and you'll likely find other good uses for it that simplify your life.
Using PHP, I am building an application that is MySQL database resource heavy, but I also need its data to be very flexible. Currently there are a number of tables with an array of different columns (including some TEXT, LONGTEXT, INT, etc.), and in the future I would like to expand the number of columns in these tables whenever new data groups are required.
My question is, if I have a table with, say, 10 columns, and I expand this to 40 columns in the future, would a SQL query (via PHP) be slowed down considerably?
Assuming the initial, small query that only looks up the original 10 columns is not a SELECT-all (*) query, I would like to know whether more resources or processing are used because the source table is now much larger.
Also, will the database in general run slower or be much larger due to many columns now constantly remaining as NULL values (eg, whenever a new entry that only requires the first 10 columns is inserted)?
MyISAM and InnoDB behave differently in this regard, for various reasons.
For instance, InnoDB will partition disk space for each column on disk regardless of whether it has data in it, while MyISAM will compress the tables on disk. In a case where there are large amounts of empty columns, InnoDB will be wasting a lot of space. On the other hand, InnoDB does row-level locking, which means that (with caveats) concurrent read / writes to the same table will perform better (MyISAM does a table-level lock on write).
Generally speaking, it's probably not a good idea to have many columns in one table, particularly for volatility reasons. For instance, in InnoDB (possibly MyISAM also?), re-arranging columns or changing types of columns (i.e. varchar 128 -> varchar 255) in the middle of a table requires that all data in columns to the right be moved around on disk to make (or remove) space for the altered column.
With respect to your overall database design, it's best to aim for as many columns as possible to be not null, which saves space (you don't need the null flag on the column, and you don't store empty data) and also increases query and index performance. If many records will have a particular column set to null, you should probably move it to a foreign key relationship and use a JOIN. That way disk space and index overhead is only incurred for records that are actually holding information.
Likely, the best solution would be to create a new table with the additional fields and JOIN the tables when necessary. The original table remains unchanged, keeping its speed, but you can still get at the extra fields.
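A minimal sketch of that idea, with hypothetical names (an items table with columns id and name, plus an items_extra side table joined on item_id) standing in for the real schema:

-- rarely-populated fields live in a side table instead of widening
-- the original table with mostly-NULL columns
CREATE TABLE items_extra (
  item_id    INT UNSIGNED NOT NULL,
  extra_text TEXT,
  extra_num  INT,
  PRIMARY KEY (item_id),
  CONSTRAINT fk_items_extra_item
    FOREIGN KEY (item_id) REFERENCES items (id)
) ENGINE=InnoDB;

-- pull the extra fields only when they are needed
SELECT i.id, i.name, e.extra_text, e.extra_num
FROM items AS i
LEFT JOIN items_extra AS e ON e.item_id = i.id
WHERE i.id = 123;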
Optimization is not a trivial question; nothing can be predicted with certainty.
In general, the short answer is: yes, it will be slower (because the DBMS at least needs to read more data from disk and send more of it, obviously).
But how much slower is very dependent on the particular case. You might not even see the difference, or it might be 10x slower.
In all likelihood, no it won't be slowed down considerably.
However, a better question to ask is: Which method of adding more fields results in a more elegant, understandable, maintainable, cost effective solution?
Usually the answer is "It depends." It depends on how the data is accessed, how the requirements will change, how the data is updated, and how fast the tables grow.
You can divide one master table into multiple transaction tables, which will give you much faster results than you are getting now. Also make the primary key a UNIQUE KEY in all the transaction tables as well as the master table; that really helps to make your queries faster.
Thanks.