So I'm trying to import some sales data into my MySQL database. The data is originally in the form of a raw CSV file, which my PHP application needs to first process, then save the processed sales data to the database.
Initially I was doing individual INSERT queries, which I realized was incredibly inefficient (~6000 queries taking almost 2 minutes). I then generated a single large query and INSERTed the data all at once. That gave us a 3400% increase in efficiency, and reduced the query time to just over 3 seconds.
But as I understand it, LOAD DATA INFILE is supposed to be even quicker than any sort of INSERT query. So now I'm thinking about writing the processed data to a text file and using LOAD DATA INFILE to import it into the database. Is this the optimal way to insert large amounts of data to a database? Or am I going about this entirely the wrong way?
I know a few thousand rows of mostly numeric data isn't a lot in the grand scheme of things, but I'm trying to make this intranet application as quick/responsive as possible. And I also want to make sure that this process scales up in case we decide to license the program to other companies.
UPDATE:
So I did go ahead and test LOAD DATA INFILE out as suggested, thinking it might give me only marginal speed increases (since I was now writing the same data to disk twice), but I was surprised when it cut the query time from over 3300ms down to ~240ms. The page still takes about ~1500ms to execute total, but it's still noticeably better than before.
From here I guess I'll check to see if I have any superfluous indexes in the database, and, since all but two of my tables are InnoDB, I will look into optimizing the InnoDB buffer pool to optimize the overall performance.
LOAD DATA INFILE is very fast, and is the right way to import text files into MySQL. It is one of the recommended methods for speeding up the insertion of data -up to 20 times faster, according to this:
https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
Assuming that writing the processed data back to a text file is faster than inserting it into the database, then this is a good way to go.
LOAD DATA or multiple inserts are going to be much better than single inserts; LOAD DATA saves you a tiny little bit you probably don't care about that much.
In any case, do quite a lot but not too much in one transaction - 10,000 rows per transaction generally feels about right (NB: this is not relevant to non-transactional engines). If your transactions are too small then it will spend all its time syncing the log to disc.
Most of the time doing a big insert is going to come from building indexes, which is an expensive and memory-intensive operation.
If you need performance,
Have as few indexes as possible
Make sure the table and all its indexes fit in your innodb buffer pool (Assuming innodb here)
Just add more ram until your table fits in memory, unless that becomes prohibitively expensive (64G is not too expensive nowadays)
If you must use MyISAM, there are a few dirty tricks there to make it better which I won't discuss further.
Guys, i had the same question, my needs might have been a little more specific than general, but i have written a post about my findings here.
http://www.mediabandit.co.uk/blog/215_mysql-bulk-insert-vs-load-data
For my needs load data was fast, but the need to save to a flat file on the fly meant the average load times took longer than a bulk insert. Moreover i wasnt required to do more than say 200 queries, where before i was doing this one at a time, i'm now bulking them up, the time savings are in the region of seconds.
Anyway, hopefully this will help you?
You should be fine with your approach. I'm not sure how much faster LOAD DATA INFILE is compared to bulk INSERT, but I've heard the same thing, that it's supposed to be faster.
Of course, you'll want to do some benchmarks to be sure, but I'd say it's worth writing some test code.
Related
I have trouble understanding specific things and I will share my experiments:
From what I have learned from my previous jobs where the directions from the CEO included: all mysql queries must include parameters I need and query them all at once to get the accurate data, am I right? So this will do a perfect job if I had an e-commerce store where I load 10 items for a query. My conclusion here is - this will work great for pages where I don't need to load a lot of data all at once. doesn't matter how big is my database. (considered: good practice)
I have a website where I have to return a (long, big, heavy, hell of a) CSV report. I wrote all the queries required with INNER JOINs -
BUT this time my database includes millions of rows, and my report loops 50,000 times (through each customer I have), and uses INNER JOINs to gather data out of 4~6 tables which include millions of rows each. and in some even does some calculations. All the system is, off-course, OOP, so each object of a single user (which is a query by itself).
So my code has a lot of small queries to the database requesting data for each users, and while looping, having around 4~6 big-INNER JOIN queries. This took a few minutes to run.
I thought it doesn't make sense so I decided to experiment with it.
I decided to separate everything, and not to use everything as an "object" but rather get all data at once from the table, without any joins, and manage it via PHP. So I got all users with the relevant data to 1 $users array. then got more data from table A. then from table B. organized table B. got data from table C. made calculations on table C. ect'. Looped again while matching my data to one final array, and outputting to a csv.
AND THIS TOOK LESS THAN 1 minute to run!
Instead of eating memory out of my database, it did affect the CPU. and after re-writing my code for more efficiency - it took less than 30 secs. and didn't have any CPU bumps.
If all my code is based on OOP and this way of "direct" scripting works faster: Is that ok to continue with it for specific big heavy outputs? (in terms of "good or bad" practice).
PS: I would have used summary tables but that's not what the CEO wants for now.
PS2: Tables are indexed properly.
Regarding your CEOs advice on parameterizing the queries, he’s correct that it will be faster. Adding params will let the server plan the query and cache the plan for future use, saving execution time. It also helps prevent sql injection attacks by properly handling any ill-intentioned user text. Doing everything in a single query is best because the overhead of connecting and issuing the query can add significant time to the query.
I surprises me that you got such a significant performance increase with the manual coding. You may want to analyze your DB structure to make sure there’s not a more efficient way to organize or query the data.
However, writing custom code to do the data construction is a valid approach as long as you are willing to take on the additional responsibility of maintaining it. I assume there was a significantly larger investment of time to write and debug that code than there was in writing the query. There will probably be an equal amount of time necessary to maintain it when requirements change.
((Opinion))
Usually it is possible to improve the query enough to make it faster than shoveling data back and forth between the database and the application.
Better indexes (especially 'composite' indexes)
Reformulating the SELECT
Normalizing, but not overnormalizing
Building and maintaining "Summary" tables - In some situations this speeds up the query 10-fold.
I am doing a booking site using PHP and MySql where i will get lots of data for insertion for a single insertion. Means if i get 1000 booking at a time i will be very slow. So what i am thinking to dump those data in MongoDb and run task to save in MySql. Also i am thing to use Redis for caching most viewed data.
Right now i am directly inserting in db.
Please suggest any one has any idea/suggestion about it.
In pure insert terms, it's REALLY hard to outrun MySQL... It's one of the fastest pure-append engines out there (that flushes consistently to disk).
1000 rows is nothing in MySQL insert performance. If you are falling at all behind, reduce the number of secondary indexes.
Here's a pretty useful benchmark: https://www.percona.com/blog/2012/05/16/benchmarking-single-row-insert-performance-on-amazon-ec2/, showing 10,000-25,000 inserts individual inserts per second.
Here is another comparing MySQL and MongoDB: DB with best inserts/sec performance?
My question really revolves around the repetitive use of a large amount of data.
I have about 50mb of data that I need to cross reference repetitively during a single php page execution. This task is most easily solved by using sql queries with table joins. The problem is the sheer volume of data that I need to process in an very short amount of time and the number of queries required to do it.
What I am currently doing is dumping the relevant part of each table (usually in excess of 30% or 10k rows) into an array and looping. The table joins are always on a single field, so I built a really basic 'index' of sorts to identify which rows are relevant.
The system works. It's been in my production environment for over a year, but now I'm trying to squeeze even more performance out of it. On one particular page I'm profiling, the second highest total time is attributed to the increment line that loops though these arrays. It's hit count is 1.3 million, for a total execution time of 30 seconds. This represents the work that would have been preformed by about 8200 sql queries it to achieve the same result.
What I'm looking for is anyone else that has run a situation like this. I really can't belive that I'm anywhere near the first person to have large amounts of data that needs to be processed in PHP.
Thanks!
Thank you very much to everyone that offered some advice here. It looks like there's isn't really a sliver bullet here like I was hoping. I think what I'm going to end up doing is using a mix of mysql memory tables and some version of a paged memcache.
This solution depends closely on what are you doing with the data, but I found that working unique-value columns inside array keys accelerate things a lot when you are trying to look up for a row given certain value on a column.
This is because php uses a hash table to store the keys for fast lookups. It's hundreds of times faster than iterating over the array, or using array_search.
But without seeing a code example is hard to say.
Added from comment:
The next step is use some memory database. You can use memory tables in mysql, or SQLite. Also depends on how much of your running environment you control, because those methods would need more memory than a shared hosting provider would usually allow. It would probably also simplify your code because of grouping, sorting, aggregate functions, etc.
Well, I'm looking at a similar situation in which I have a large amount of data to process, and a choice to try to do as much via MySQL queries, or off-loading it to PHP.
So far, my experience has been this:
PHP is a lot slower than using MySQL queries.
MySQL query speed is only acceptable if I cram the logic into a single call, as the latency between calls is severe.
I'm particularly shocked by how slow PHP is for looping over an even modest amount of data. I keep thinking/hoping I'm doing something wrong...
I have a PHP script that in every run, inserts a new row to a Mysql db (with a relative small amount of data..)
I have more than 20 requests per second, and this is causing my CPU to scream for help..
I'm using the sql INSERT DELAYED method with a MyISAM engine (although I just notice that INSERT DELAYED is not working with MyISAM).
My main concern is my CPU load and I started to look for ways to store this data with more CPU friendly solutions.
My first idea was to write this data to an hourly log files and once an hour to retrieve the data from the logs and insert it to the DB at once.
Maybe a better idea is to use NoSQL DB instead of log files and then once an hour to insert the data from the NoSQL to the Mysql..
I didn't test yet any of these ideas, so I don't really know if this will manage to decrease my CPU load or not. I wanted to ask if someone can help me find the right solution that will have the lowest affect over my CPU.
I recently had a very similar problem and my solution was to simply batch the requests. This sped things up about 50 times because of the reduced overhead of mysql connections and also the greatly decreased amount of reindexing. Storing them to a file then doing one larger (100-300 individual inserts) statement at once probably is a good idea. To speed things up even more turn off indexing for the duration of the insert with
ALTER TABLE tablename DISABLE KEYS
insert statement
ALTER TABLE tablename ENABLE KEYS
doing the batch insert will reduce the number of instances of the php script running, it will reduce the number of currently open mysql handles (large improvement) and it will decrease the amount of indexing.
Ok guys, I manage to lower the CPU load dramatically with APC-cache
I'm doing it like so:
storing the data in memory with APC-cache, with TTL of 70 seconds:
apc_store('prfx_SOME_UNIQUE_STRING', $data, 70);
once a minute I'm looping over all the records in the cache:
$apc_list=apc_cache_info('user');
foreach($apc_list['cache_list'] as $apc){
if((substr($apc['info'],0,5)=='prfx_') && ($val=apc_fetch($apc['info']))){
$values[]=$val;
apc_delete($apc['info']);
}
}
inserting the $values to the DB
and the CPU continues to smile..
enjoy
I would insert a sleep(1); function at the top of your PHP script, before every insert at the top of your loop where 1 = 1 second. This only allows the loop to cycle once per second.
This way it will regulate a bit just how much load the CPU is getting, this would be ideal assuming your only writing a small number of records in each run.
You can read more about the sleep function here : http://php.net/manual/en/function.sleep.php
It's hard to tell without profiling both methods, if you write to a log file first you could end up just making it worse as your turning your operation count from N to N*2. You gain a slight edge by writing it all to a file and doing a batch insert but bear in mind that as the log file fills up it's load/write time increases.
To reduce database load, look at using mem cache for database reads if your not already.
All in all though your probably best of just trying both and seeing what's faster.
Since you are trying INSERT DELAYED, I assume you don't need up to the second data. If you want to stick with MySQL, you can try using replication and the BLACKHOLE table type. By declaring a table as type BLACKHOLE on one server, then replicating it to a MyISAM or other table type on another server, you can smooth out CPU and io spikes. BLACKHOLE is really just a replication log file, so "inserts" into it are very fast and light on the system.
I do not know what is your table size or your server capabilities but I guess you need to make a lot of inserts per single table. In such a situation I would recommend checking for the construction of vertical partitions that will reduce the physical size of each partition and significantly reduce the insertion time to the table.
Lets assume the same environments for PHP5 working with MySQL5 and CSV files. MySQL is on the same host as hosted scripts.
Will MySQL always be faster than retriving/searching/changing/adding/deleting records to CSV?
Or is there some amount of data below which PHP+CSV performance is better than using database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answer to the questions listed previously. However, you can generally use the following as a guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySql, as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as #MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have 3rd option -- SQLite, which comes embedded in PHP. It gives you ease of use like regular files, but robustness of RDBMS.
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calcuations. If you need to update a spesific record in a csv you will have to first read the entire csv file, find the entry in memory would need to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write once, readd once type apps.
If you want to import swiftly like a thief in the night, use SQL format.
If you are working in production server, CSV is slow but it is the safest.
Just make sure the CSV file doesn't have a Primary Key which will override your existing data.