Large CSV file import to MySQL, best practice - PHP

Looking for insight on the best approach for large CSV file imports to MySQL and managing the dataset. This is for an ecommerce storefront "startup". All product data will be read from CSV files which are downloaded via curl (server to server).
Each CSV file represents a different supplier/warehouse with up to 100,000 products. In total there are roughly 1.2 million products spread over 90-100 suppliers. At least 75% of the row data (51 columns) is redundant garbage and will not be needed.
Would it be better to use mysqli with LOAD DATA LOCAL INFILE into a 'temp_products' table, make the needed data adjustments per row and then insert into the live 'products' table, or simply use fgetcsv() and go row by row? The import will be handled by a cron job using the site's php.ini with a memory limit of 128M.
Apache V2.2.29
PHP V5.4.43
MySQL V5.5.42-37.1-log
memory_limit 128M
I'm not looking for "How to's". I'm simply looking for the "best approach" from the community's perspective and experience.

I have direct experience of doing something virtually identical to what you describe -- lots of third party data sources in different formats all needing to go into a single master table.
I needed to take different approaches for different data sources, because some were in XML, some in CSV, some large, some small, etc. For the large CSV ones, I did indeed follow roughly your suggested route:
I used LOAD DATA INFILE to dump the raw contents into a temporary table.
I took the opportunity to transform or discard some of the data within this query; LOAD DATA INFILE allows some quite complex queries. This allowed me to use the same temp table for several of the import processes even though they had quite different CSV data, which made the next step easier.
I then used a set of secondary SQL queries to pull the temp data into the various main tables. All told, I had about seven steps to the process.
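As a rough illustration of those first steps (placeholder schema, not what I actually used; LOCAL INFILE may also need to be enabled for the connection and in the server config):

<?php
// Rough illustration only: temp_products, products and the columns are placeholders.
$db = new mysqli('localhost', 'user', 'pass', 'shop');

// Steps 1-2: bulk-load the raw CSV, discarding and transforming as it loads.
// @dummy swallows columns that are never needed.
$db->query("
    LOAD DATA LOCAL INFILE '/tmp/supplier.csv'
    INTO TABLE temp_products
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (@sku, @dummy, @price, @dummy, @stock)
    SET sku   = TRIM(@sku),
        price = CAST(@price AS DECIMAL(10,2)),
        stock = GREATEST(CAST(@stock AS SIGNED), 0)
");

// Step 3: a secondary query pulls the cleaned rows into the live table.
$db->query("
    INSERT INTO products (sku, price, stock)
    SELECT sku, price, stock FROM temp_products
    ON DUPLICATE KEY UPDATE price = VALUES(price), stock = VALUES(stock)
");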
I had a set of PHP classes to do the imports, which all implemented a common interface. This meant that I could have a common front-end program which could run any of the importers.
Since a lot of the importers did similar tasks, I put the commonly used code in traits so that the code could be shared.
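In outline, the structure was something like this (illustrative names only, not the real classes):

<?php
// Illustrative outline only.
interface FeedImporter
{
    public function fetch();            // download the feed, return a local file path
    public function import($localFile); // load it into the staging / main tables
}

trait LoadsCsvIntoTempTable
{
    // Shared helper used by several importers.
    protected function loadIntoTemp(mysqli $db, $path)
    {
        $db->query("LOAD DATA LOCAL INFILE '" . $db->real_escape_string($path) . "'
                    INTO TABLE temp_products
                    FIELDS TERMINATED BY ',' IGNORE 1 LINES");
    }
}

class SupplierAcmeImporter implements FeedImporter
{
    use LoadsCsvIntoTempTable;

    public function fetch()
    {
        // curl download would go here; return the downloaded file's path
        return '/tmp/acme.csv';
    }

    public function import($localFile)
    {
        $db = new mysqli('localhost', 'user', 'pass', 'shop');
        $this->loadIntoTemp($db, $localFile);
        // ...secondary queries into the main tables go here...
    }
}

// The common front-end can then run any importer the same way:
// foreach ($importers as $importer) { $importer->import($importer->fetch()); }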
Some thoughts based on the things you said in your question:
LOAD DATA INFILE will be orders of magnitude quicker than fgetcsv() with a PHP loop.
LOAD DATA INFILE queries can be very complex and achieve very good data mapping without ever having to run any other code, as long as the imported data is going into a single table.
Your memory limit is likely to need to be raised. However, using LOAD DATA INFILE means that it will be MySQL which will use the memory, not PHP, so the PHP limit won't come into play for that. 128M is still likely to be too low for you though.
If you struggle to import the whole thing in one go, try using some simple Linux shell commands to split the file into several smaller chunks. CSV data format should make that fairly simple.
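For example, something along these lines (assuming a Unix-like host where the split utility is available; chunk size and paths are arbitrary):

<?php
// Hypothetical helper: carve a large CSV into 50,000-line chunks and load them one at a time.
$src = '/tmp/supplier.csv';
exec('split -l 50000 ' . escapeshellarg($src) . ' /tmp/supplier_chunk_');

foreach (glob('/tmp/supplier_chunk_*') as $chunk) {
    // run LOAD DATA LOCAL INFILE against $chunk here
}

Keep in mind that only the first chunk will contain the header row, so IGNORE 1 LINES should only apply to that one.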

Related

how to handle large sets of data with PHP?

My web application lets users import an Excel file and writes the data from the file into the MySQL database.
The problem is, when the Excel file has lots of entries, even 1000 rows, I get an error saying PHP ran out of memory. This occurs while reading the file.
I have assigned 1024MB to PHP in the php.ini file.
My question is, how do I go about importing such large data in PHP?
I am using CodeIgniter.
For reading the Excel file, I am using this library.
SOLVED: I used CSV instead of XLS, and I could import 10,000 rows of data within seconds.
Thank you all for your help.
As others have said, 1000 records is not much. Make sure you process the records one at a time, or a few at a time, and that the variables you use for each iteration go out of scope after you're finished with that row or you're reusing the variables.
If you can avoid the necessity of processing Excel files by exporting them to CSV, that's even better, because then you wouldn't need such a library (which might or might not have its own memory issues).
Don't be afraid of increasing memory usage if you need to and that solves the problem; buying memory is sometimes the cheapest option. And don't let the 1 GB scare you. It is a lot for such a simple task, but if you have the memory and that's all you need to do, then it's good enough for the moment.
And as a plus, if you are using an old version of PHP, try updating to PHP 5.4 which handles memory much better than its predecessors.
Instead of inserting one row at a time in a loop, insert 100 rows at a time.
You can always run
INSERT INTO myTable (col1, col2, col3) VALUES
(val1, val2, val3), (val4, val5, val6) ......
This way the number of network round trips is reduced, which reduces resource usage.
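A rough sketch of how that batching could look in PHP with PDO (myTable and the columns are placeholders, and it assumes each CSV row has at least three values):

<?php
// Rough sketch: collect rows and flush them 100 at a time as one multi-row INSERT.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

function insertBatch(PDO $pdo, array $rows)
{
    // One placeholder group "(?, ?, ?)" per row, joined into a single statement.
    $placeholders = rtrim(str_repeat('(?, ?, ?),', count($rows)), ',');
    $stmt = $pdo->prepare("INSERT INTO myTable (col1, col2, col3) VALUES $placeholders");
    $params = array();
    foreach ($rows as $r) {
        $params = array_merge($params, array_slice($r, 0, 3));
    }
    $stmt->execute($params);
}

$batch = array();
$fh = fopen('/tmp/import.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $batch[] = $row;
    if (count($batch) >= 100) {
        insertBatch($pdo, $batch);
        $batch = array();
    }
}
if ($batch) {
    insertBatch($pdo, $batch); // flush the remainder
}
fclose($fh);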

inserting huge set of data [PHP, MySQL]

I have a big data set in MySQL (users, companies, contacts), about 1 million records.
Now I need to import new users, companies and contacts from an import file (CSV) with about 100,000 records. Each record from the file has info for all three entities (user, company, contact).
Moreover, on production I can't use LOAD DATA (I just don't have the rights :( ).
So there are three steps which should be applied to that data set:
- compare with existing DB data
- update it (if we find something in the previous step)
- and insert new records
I'm using PHP on the server for doing that. I can see two approaches:
reading ALL data from the file at once and then working with this BIG array and applying those steps,
or reading line by line from the file and passing each line through the steps.
Which approach is more efficient in terms of CPU, memory or time usage?
Can I use transactions? Or will that slow down the whole production system?
Thanks.
In CPU time or elapsed time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outweigh the time advantage - I would definitely process one line at a time.
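Since LOAD DATA is not available to you, a line-by-line pass could look roughly like this. It is only a sketch: it assumes a UNIQUE key on users.email, folds your "compare" step into ON DUPLICATE KEY UPDATE, and leaves the company/contact handling out:

<?php
// Sketch only: one row at a time keeps memory usage flat no matter how big the file is.
$pdo = new PDO('mysql:host=localhost;dbname=crm', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO users (email, name, company)
     VALUES (:email, :name, :company)
     ON DUPLICATE KEY UPDATE name = VALUES(name), company = VALUES(company)'
);

$fh = fopen('/path/to/import.csv', 'r');
fgetcsv($fh); // skip the header row
while (($row = fgetcsv($fh)) !== false) {
    list($email, $name, $company) = $row;
    $stmt->execute(array(':email' => $email, ':name' => $name, ':company' => $company));
}
fclose($fh);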
Did you know that phpMyAdmin has that nifty feature of "resumable import" for big SQL files ?
Just check "Allow interrupt of import" in the Partial Import section. And voila, phpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel".
I think the 2nd approach is more acceptable:
Create a change list (it would be a separate table)
Make updates line by line (and mark each line as updated using an "updflag" field, for example)
Perform this process in the background using transactions.
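One way to sketch the change-list idea (here done set-based in SQL rather than strictly line by line; staging, users and the updflag column are placeholder names):

<?php
// Illustrative only: a staging "change list" table plus set-based update/insert.
$pdo = new PDO('mysql:host=localhost;dbname=crm', 'user', 'pass');
$pdo->beginTransaction();

// Update existing rows from the change list and flag them as handled.
$pdo->exec('UPDATE staging s
            JOIN users u ON u.email = s.email
            SET u.name = s.name, s.updflag = 1');

// Whatever was not flagged is new and gets inserted.
$pdo->exec('INSERT INTO users (email, name)
            SELECT email, name FROM staging WHERE updflag = 0');

$pdo->commit();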

import bulk data into MySQL

So I'm trying to import some sales data into my MySQL database. The data is originally in the form of a raw CSV file, which my PHP application needs to first process, then save the processed sales data to the database.
Initially I was doing individual INSERT queries, which I realized was incredibly inefficient (~6000 queries taking almost 2 minutes). I then generated a single large query and INSERTed the data all at once. That gave us a 3400% increase in efficiency, and reduced the query time to just over 3 seconds.
But as I understand it, LOAD DATA INFILE is supposed to be even quicker than any sort of INSERT query. So now I'm thinking about writing the processed data to a text file and using LOAD DATA INFILE to import it into the database. Is this the optimal way to insert large amounts of data to a database? Or am I going about this entirely the wrong way?
I know a few thousand rows of mostly numeric data isn't a lot in the grand scheme of things, but I'm trying to make this intranet application as quick/responsive as possible. And I also want to make sure that this process scales up in case we decide to license the program to other companies.
UPDATE:
So I did go ahead and test LOAD DATA INFILE out as suggested, thinking it might give me only marginal speed increases (since I was now writing the same data to disk twice), but I was surprised when it cut the query time from over 3300ms down to ~240ms. The page still takes about ~1500ms to execute total, but it's still noticeably better than before.
From here I guess I'll check to see if I have any superfluous indexes in the database, and, since all but two of my tables are InnoDB, I will look into optimizing the InnoDB buffer pool to optimize the overall performance.
LOAD DATA INFILE is very fast, and is the right way to import text files into MySQL. It is one of the recommended methods for speeding up the insertion of data - up to 20 times faster, according to this:
https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
Assuming that writing the processed data back to a text file is faster than inserting it into the database, then this is a good way to go.
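A minimal sketch of that write-then-load round trip, assuming LOCAL INFILE is allowed on the connection; $processedRows stands in for the output of your processing step, and the table name is a placeholder:

<?php
// Sketch: write the processed rows to a temp file, then bulk-load that file.
$tmp = tempnam(sys_get_temp_dir(), 'sales_');
$fh  = fopen($tmp, 'w');
foreach ($processedRows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);

$db = new mysqli('localhost', 'user', 'pass', 'intranet');
$db->query("
    LOAD DATA LOCAL INFILE '" . $db->real_escape_string($tmp) . "'
    INTO TABLE sales
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
");
unlink($tmp);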
LOAD DATA or multi-row inserts are going to be much better than single-row inserts; beyond that, LOAD DATA saves you a tiny little bit you probably don't care about that much.
In any case, do quite a lot but not too much in one transaction - 10,000 rows per transaction generally feels about right (NB: this is not relevant to non-transactional engines). If your transactions are too small then it will spend all its time syncing the log to disc.
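A rough pattern for that chunked committing (10,000 is just the rule of thumb above; table, columns and file path are placeholders):

<?php
// Rough pattern: commit in chunks of ~10,000 rows so the log is not synced for every row.
$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO sales (sku, qty, price) VALUES (?, ?, ?)');

$pdo->beginTransaction();
$i  = 0;
$fh = fopen('/tmp/sales.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute($row);            // assumes each row is exactly (sku, qty, price)
    if (++$i % 10000 === 0) {
        $pdo->commit();              // flush this chunk...
        $pdo->beginTransaction();    // ...and start the next one
    }
}
fclose($fh);
$pdo->commit();                      // commit the final, partial chunk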
Most of the time doing a big insert is going to come from building indexes, which is an expensive and memory-intensive operation.
If you need performance,
Have as few indexes as possible
Make sure the table and all its indexes fit in your InnoDB buffer pool (assuming InnoDB here; a quick way to check is sketched below)
Just add more RAM until your table fits in memory, unless that becomes prohibitively expensive (64G is not too expensive nowadays)
If you must use MyISAM, there are a few dirty tricks there to make it better which I won't discuss further.
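To get a rough idea of whether a given table and its indexes fit in the buffer pool, a quick check against information_schema (sizes are in bytes; 'shop'/'products' are placeholders):

<?php
// Compare a table's data + index footprint against the configured InnoDB buffer pool.
$db = new mysqli('localhost', 'user', 'pass', 'shop');

$size = $db->query("
    SELECT data_length + index_length AS total_bytes
    FROM information_schema.TABLES
    WHERE table_schema = 'shop' AND table_name = 'products'
")->fetch_assoc();

$pool = $db->query('SELECT @@innodb_buffer_pool_size AS pool_bytes')->fetch_assoc();

printf("table + indexes: %d bytes, buffer pool: %d bytes\n",
       $size['total_bytes'], $pool['pool_bytes']);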
I had the same question; my needs might have been a little more specific than general, but I have written a post about my findings here:
http://www.mediabandit.co.uk/blog/215_mysql-bulk-insert-vs-load-data
For my needs LOAD DATA was fast, but the need to save to a flat file on the fly meant the average load times were longer than for a bulk insert. Moreover, I wasn't required to do more than, say, 200 queries. Where before I was doing them one at a time, I'm now bulking them up, and the time savings are in the region of seconds.
Anyway, hopefully this will help you.
You should be fine with your approach. I'm not sure how much faster LOAD DATA INFILE is compared to bulk INSERT, but I've heard the same thing, that it's supposed to be faster.
Of course, you'll want to do some benchmarks to be sure, but I'd say it's worth writing some test code.

Suggested ways to import various pipe-delimited files into a DB based around a buffer table using PHP/MySQL?

I am trying to import various pipe-delimited files into a MySQL database using PHP 5.2. I am importing various formats of piped data, and my end goal is to put the different data into a suitably normalised data structure, but I need to do some post-processing on the data to fit it into my model correctly.
I thought the best way to do this is to import into a table called "buffer" and map out the data, then import into the various tables. I am planning to create a table just called "buffer" with fields that represent each column (there will be up to 80 columns), then apply some data transforms/mapping to get it into the right tables.
My planned approach is to create a base class that generically reads the pipe data into the buffer table, then extend this class with a function that contains various prepared statements to do the SQL magic, allowing me the flexibility to check the format is the same by reading the headers on the first row and to change it for a given format.
My questions are:
What's the best way to do step one, reading the data from a locally saved file into the table? I'm not too sure if I should use MySQL's LOAD DATA (as suggested in Best Practice: Import CSV to MySQL Database using PHP 5.x) or just fopen() and then insert the data line by line.
Is this the best approach? How have other people approached this?
Is there anything in the Zend Framework that may help?
Additional: I am planning to do this as a scheduled task.
You don't need any PHP code to do that, IMO. Don't waste time on classes. The MySQL LOAD DATA INFILE clause allows a lot of ways to import data, covering 95% of your needs: whatever delimiters, whatever columns to skip/pick. Read the manual attentively; it's worth knowing what you CAN do with it. After importing, the data can already be in good shape if you write the query properly. The buffer table can be a temporary one. Then normalise or denormalise it and drop the initial table. Save the script in a file so you can reproduce the sequence of steps if there's a mistake.
The best way is to write an SQL script, test whether the final data is in proper shape, look for mistakes, modify, and re-run the script. If there's a lot of data, do tests on a smaller set of rows.
[added] Another reason for an SQL-mostly approach is that if you're not fluent in SQL but are going to work with a database, it's better to learn SQL earlier. You'll find a lot of uses for it later and will avoid the common pitfalls of programmers who know it only superficially.
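For the pipe-delimited case specifically, the kind of statement described above might look like this (file path, table and column names are all placeholders; the points of interest are the '|' delimiter, IGNORE 1 LINES for the header row, and @dummy for columns to skip):

<?php
$db = new mysqli('localhost', 'user', 'pass', 'warehouse');
$db->query("
    LOAD DATA LOCAL INFILE '/data/feed.txt'
    INTO TABLE buffer
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (col1, @dummy, col3, @raw_date, @dummy)
    SET imported_at = NOW(),
        order_date  = STR_TO_DATE(@raw_date, '%d/%m/%Y')
");

// The normalisation queries then run against the buffer table, which can be
// truncated or dropped once the data has been moved into the proper tables.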
I personally use the free ETL software Kettle by Pentaho (this bit of software is commonly referred to simply as Kettle). While this software is far from perfect, I've found that I can often import data in a fraction of the time I would have to spend writing a script for one specific file. You can select a text file input and specify the delimiters, fixed width, etc., and then simply export directly into your SQL server (they support MySQL, SQLite, Oracle, and many more).
There are dozens and dozens of ways. If you have local filesystem access to the MySQL instance, LOAD DATA. Otherwise you can just as easily transform each line into SQL (or a VALUES line) for periodic submittal to MySQL via PHP.
In the end I used LOAD DATA AND modified this http://codingpad.maryspad.com/2007/09/24/converting-csv-to-sql-using-php/ for different situations.

CSV vs MySQL performance

Let's assume the same environment for PHP5 working with MySQL5 and CSV files. MySQL is on the same host as the hosted scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as #MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have a 3rd option -- SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV you will have to first read the entire CSV file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest.
Just make sure the CSV file doesn't contain a primary key that would override your existing data.
