At my company, I use a program that reads CSV files and inserts the data into a database. I am having trouble with it because it needs to be able to insert a large amount of data (up to 10,000 rows) at a time. At first I had it looping through and inserting each record one at a time. That was slow because it called an insert function 10,000 times. Next I tried to group the rows so it inserted 50 at a time by concatenating the SQL calls. I have tried grouping the SQL calls into as many as 1,000 rows at a time, but it is still too slow.
Another thing I have to do is change the data. The client gives us a spreadsheet with data such as usernames and passwords, but sometimes the usernames are the same, so I change them by adding a number at the end, e.g. JoDoe, JoDoe1. Sometimes there is no password or username at all, so I have to generate one. The reason I bring this up is that I have read that LOAD DATA INFILE reads a file very fast and puts it into a table, but I need to edit the data before it goes into the table.
The script times out after 120 seconds, and whatever doesn't get finished in that time is inserted as all 0's. I need to speed it up so it doesn't take as long. I do NOT want to change the time limit because that is a company policy. What is an efficient way to insert many rows of a CSV file into a database?
LOAD DATA INFILE can perform numerous preprocessing operations as it loads the data. That might be enough. If not, run a PHP script to process from one CSV file to another, temporary, CSV file, editing as you go. Then use LOAD DATA INFILE on the newly created file.
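For illustration, here is a minimal sketch of doing the preprocessing inside LOAD DATA itself, using user variables and a SET clause. The table and column names (users, username, password, email), the file path, the header row, and the generation rules are assumptions, not your actual schema. The duplicate-username renaming (JoDoe, JoDoe1) is awkward to express this way, so that part may still need the PHP pre-pass or a follow-up UPDATE.

// Sketch only: load a CSV and fill in missing usernames/passwords on the way in.
// Table/column names and the file path are made up for this example.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'dbuser', 'dbpass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // required for LOCAL INFILE
]);

$pdo->exec("
LOAD DATA LOCAL INFILE '/tmp/users.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(@username, @password, email)
SET username = IF(@username = '', CONCAT('user_', UUID_SHORT()), @username),
    password = IF(@password = '', LEFT(MD5(RAND()), 10), @password)
");

One LOAD DATA call like this replaces thousands of individual INSERTs, which is usually where the 120 seconds go.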
Related
Please, could somebody give me some support?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file and it is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory, then a for loop reads it line by line, takes one value from each line and searches for that value in the table; if the value already exists in the table, it is not inserted, otherwise the values that were read are stored in the table.
This process takes days to finish.
Can somebody suggest a way to speed up the process?
Thanks a lot.
If you have many users and they may be using the site at the same time:
You could switch from SimpleXLSX to js-xlsx and do all of the parsing work in the web browser, so that the server only writes to the database.
If you have few users (which I think is your case):
"and search this value in the table"
This is what costs the most time: comparing one value at a time between memory and the database, then deciding whether or not to add each row.
Instead, read all of the relevant database info into memory (use a hash list for the comparison), do the whole comparison there, add the new rows to the in-memory structure and mark them as new, and at the end write the new rows back to the database in one go.
Because your database and the XLSX file have roughly the same number of rows, the per-row database lookups add almost nothing; just forget the database during the comparison and do it all in memory with a hash list, which is the fastest approach.
Of course, you can also make the final write to the database efficient if you use #Barmar's idea: don't insert rows one at a time, insert them in batches.
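Here is a minimal PHP sketch of that hash-list approach, assuming a table records with a unique column code plus two more columns; all names, and the exact SimpleXLSX calls (which differ between library versions), are illustrative only.

// Sketch only: hash-list lookup in memory plus batched inserts.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// 1. Read every existing key into memory once; array keys act as a hash set.
$existing = array_flip($pdo->query('SELECT code FROM records')->fetchAll(PDO::FETCH_COLUMN));

// Helper: flush a batch with one multi-row INSERT instead of one INSERT per row.
$flush = function (array $batch) use ($pdo): void {
    if (!$batch) { return; }
    $placeholders = implode(',', array_fill(0, count($batch), '(?, ?, ?)'));
    $params = array_merge(...$batch);
    $pdo->prepare("INSERT INTO records (code, name, value) VALUES $placeholders")->execute($params);
};

// 2. Walk the spreadsheet and keep only rows whose key is not already present.
$batch = [];
if ($xlsx = SimpleXLSX::parse('upload.xlsx')) {
    foreach ($xlsx->rows() as $row) {
        $code = $row[0];
        if (isset($existing[$code])) {
            continue;                     // already in the table, skip it
        }
        $existing[$code] = true;          // mark as seen so in-file duplicates are skipped too
        $batch[] = [$code, $row[1], $row[2]];
        if (count($batch) >= 1000) {      // 3. insert in batches, not row by row
            $flush($batch);
            $batch = [];
        }
    }
}
$flush($batch);

One full-table SELECT plus a few hundred batched INSERTs replaces the roughly 340,000 per-row lookup queries, which is where the days were going.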
Focus on the speed of getting the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Use the minimal XLS to get the XML into the database. Use some programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD DATA ... XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
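A rough sketch of that staging-table flow, using a plain CSV load for simplicity; every table and column name here is invented:

// Sketch only: bulk-load raw rows, clean them with set-based SQL, then move them.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

// 1. Staging table with no constraints, so the load is as fast as possible.
$pdo->exec('CREATE TEMPORARY TABLE staging (username VARCHAR(100), email VARCHAR(255))');
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/raw.csv'
            INTO TABLE staging
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");

// 2. Clean up with SQL that touches the whole table at once.
$pdo->exec('UPDATE staging SET username = TRIM(username), email = LOWER(TRIM(email))');

// 3. Move only rows that do not already exist in the target table.
$pdo->exec('INSERT INTO target (username, email)
            SELECT s.username, s.email
            FROM staging s
            WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.username = s.username)');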
If you want to discuss further, we need more details about the conversions involved.
I receive files in a streamed manner once every 30 seconds. The files may have up to 40 columns and 50,000 rows. The files are txt files and tab-separated. Right now, I save each file temporarily, load its contents with LOAD DATA INFILE into a temporary table in the database, and delete the file afterwards.
I would like to avoid the save and delete process and instead save the data directly to the database. The stream is the $output here:
protected function run(OutputInterface $output)
{
$this->readInventoryReport($this->interaction($output));
}
I've been googling around trying to find a "performance is a big issue"-proof answer to this, but I can't find a good way of doing it without saving the data to a file and using LOAD DATA INFILE. I need to have the contents available quickly and work with them after they are saved to a temporary table (updating other tables with the contents, ...).
Is there a good way of handling this, or will the file save and delete method together with load data infile be better than other solutions?
The server I'm running this on has SSDs and 32GB of RAM.
LOAD DATA INFILE is your fastest way to do low-latency ingestion of tonnage of data into MySQL.
You can write yourself a PHP program that will, using prepared statements and the like, do a pretty good job of inserting rows into your database. If you arrange to do a COMMIT every couple of hundred rows, and use prepared statements, and write your code carefully, it will be fairly fast, but not as fast as LOAD DATA INFILE. Why? Individual row operations have to be serialized onto the network wire, then deserialized, and processed one (or two or ten) at a time. LOAD DATA just slurps up your data locally.
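As a rough sketch of that prepared-statement approach (the table, columns, and report path are placeholders, and the COMMIT interval of 200 is just the "couple of hundred" mentioned above):

// Sketch only: parse the tab-separated report, then insert with a prepared
// statement, committing every couple of hundred rows.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rows = array_map(
    static fn(string $line): array => explode("\t", $line),
    file('/tmp/report.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
);

$stmt = $pdo->prepare('INSERT INTO inventory (sku, qty, price) VALUES (?, ?, ?)');

$pdo->beginTransaction();
foreach ($rows as $i => $row) {
    $stmt->execute([$row[0], $row[1], $row[2]]);   // only three of the 40 columns shown, for brevity
    if (($i + 1) % 200 === 0) {                    // commit every couple of hundred rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();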
It sounds like you have a nice MySQL server machine. But the serialization is still a bottleneck.
50K records every 30 seconds, eh? That's a lot! Is any of that data redundant? That is, do any of the rows in a later batch of data overwrite rows in an earlier batch? If so, you might be able to write a program that would skip rows that have become obsolete.
We built a link from our offline program to our website. In our offline program we have 50,000 records we want to push to our website. What we do now is the following:
In the offline program we build an XML file with 1,500 records and post it to a PHP file on our webserver. On the webserver we read the XML and push it to the MySQL database; before we do that we first check whether the record already exists, and then we either update it or insert it as a new record.
When that's done, we send a message back to our offline program that the batch is completed. The offline program then builds a new XML file with the next 1,500 records. This process repeats until it reaches the last 1,500 records.
The problem is that the webserver becomes very slow while pushing the records to the database. Probably that's because we first check whether the record already exists (that's one query) and then write it into the database (that's a second query), so for each batch we have to run 3,000 queries.
I hope you guys have some tips to speed up this process.
Thanks in advance!
Before starting the import, read all the data IDs you have; do not run a checking query for every item you insert, but check against an existing PHP array instead.
Fix the keys on your database tables.
Make all the inserts in one request, or use transactions.
There is no problem importing a lot of data this way; I have a lot of experience with it.
A good thing to do is write a single query composed of the concatenation of all of the insert statements separated by a semicolon:
INSERT INTO table_name
(a,b,c)
VALUES
(1,2,3)
ON DUPLICATE KEY
UPDATE a = 1, b = 2, c = 3;
INSERT INTO table_name
...
You could concatenate 100-500 insert statements and wrap them in a transaction.
Wrapping many statements in a transaction helps because the data is not committed to disk after each inserted row; the whole 100-500 row batch is kept in memory and written to disk once when it is finished, which means less intermittent disk I/O.
You need to find a good batch size. I used 100-500 as an example, but depending on your server configuration, the amount of data per statement, and the ratio of inserts to updates, you'll have to fine-tune it.
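For example, a batch could be built in PHP roughly like this (the table and columns follow the a/b/c example above; the chunk size of 500 is just a starting point):

// Sketch only: one multi-row INSERT ... ON DUPLICATE KEY UPDATE per chunk,
// each chunk wrapped in its own transaction.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

foreach (array_chunk($records, 500) as $chunk) {   // $records: rows parsed from the posted XML
    $placeholders = implode(',', array_fill(0, count($chunk), '(?, ?, ?)'));
    $params = [];
    foreach ($chunk as $r) {
        array_push($params, $r['a'], $r['b'], $r['c']);
    }

    $sql = "INSERT INTO table_name (a, b, c) VALUES $placeholders
            ON DUPLICATE KEY UPDATE b = VALUES(b), c = VALUES(c)";

    $pdo->beginTransaction();
    $pdo->prepare($sql)->execute($params);
    $pdo->commit();
}

For a 1,500-record batch this turns the 3,000 existence-check-plus-write queries into a handful of statements.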
Read some information about Mysql Unique Index Constraints. This should help:
Mysql Index Tutorial
I had the same problem 4 months ago, and I got better performance by coding in Java rather than PHP and by avoiding XML documents.
My tip: you can read the whole table (doing it once is faster than making many queries one by one) and keep it in memory (in a HashMap, for example). Then, before inserting a record, you can check whether it exists in your local structure, so you don't hit the DB at all for the lookups.
You can improve your performance this way.
I want to upload a large CSV file of approximately 10,000,000 records into a MySQL table that already contains the same number of records or more, including some duplicates.
I tried LOAD DATA LOCAL INFILE, but it is also taking a long time.
How can I resolve this without waiting for a long time?
If it can't be resolved, how can I do it with AJAX, sending and processing a few records at a time until the whole CSV is uploaded/processed?
LOAD DATA INFILE isn't going to be beat speed-wise. There are a few things you can do to speed it up:
Drop or disable some indexes (you'll have to wait for them to be rebuilt after the load, but this is often faster overall). If you're using MyISAM, you can ALTER TABLE foo DISABLE KEYS, but InnoDB doesn't support that, unfortunately; you'll have to drop the indexes instead. See the sketch after this list.
Optimize your my.cnf settings. In particular, you may be able to disable a lot of safety features (like fsync). Of course, if you have a crash, you'll have to restore a backup and start the load over again. Also, if you're running the default my.cnf, last I checked it's pretty sub-optimal for a database machine. Plenty of tuning guides are around.
Buy faster hardware. Or rent some (e.g., try a fast Amazon EC2 instance).
As #ZendDevel mentions, consider other data storage solutions, if you're not locked into MySQL. For example, if you're just storing a list of telephone numbers (and some data with them), a plain hash table is going to be many times faster.
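The sketch mentioned above, for the MyISAM disable-keys route (the table name and file path are placeholders):

// Sketch only: disable non-unique index maintenance, bulk-load, then rebuild.
// ALTER TABLE ... DISABLE KEYS is MyISAM-only; for InnoDB you would drop and
// re-create the secondary indexes instead.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$pdo->exec('ALTER TABLE big_table DISABLE KEYS');

$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/big.csv'
            INTO TABLE big_table
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");

$pdo->exec('ALTER TABLE big_table ENABLE KEYS');   // rebuilds the non-unique indexes in one pass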
If the problem is that it's killing database performance, you can split your CSV file into multiple CSV files and load them in chunks.
Try this:
load data local infile '/yourcsvfile.csv' into table yourtable fields terminated by ',' lines terminated by '\r\n'
Depending on your storage engine this can take a long time. I've noticed that with MyISAM it goes a bit faster. I've just tested with the exact same dataset and I finally went with PostgreSQL because it was more robust at loading the file. InnoDB was so slow I aborted it after two hours with the same size dataset, but it was 10,000,000 records by 128 columns full of data.
As this is a whitelist being updated on a daily basis, doesn't that mean that there are a very large number of duplicates (after the first day)? If so, it would make the upload a lot faster to use a simple script that checks whether the record already exists before inserting it.
Try this query:
$sql="LOAD DATA LOCAL INFILE '../upload/csvfile.csv'
INTO TABLE table_name FIELDS
TERMINATED BY ','
ENCLOSED BY ''
LINES TERMINATED BY '\n'";
I ran into the same problem and found a way out. You can check out the process for uploading a large CSV file using AJAX here:
How to use AJAX to upload large CSV file?
I have a set of approximately 1.1 million unique IDs and I need to determine which do not have a corresponding record in my application's database. The set of IDs comes from a database as well, but not the same one. I am using PHP and MySQL and have plenty of memory - PHP is running on a server with 15GB RAM and MySQL runs on its own server which has 7.5GB RAM.
Normally I'd simply load all the IDs in one query and then use them with the IN clause of a SELECT query to do the comparison in one shot.
So far my attempts have resulted in scripts that either take an unbearably long time or that spike the CPU to 100%.
What's the best way to load such a large data set and do this comparison?
Generate a dump of the IDs from the first database into a file, then re-load it into a temporary table on the second database, and do a join between that temporary table and the second database table to identify those ids that don't have a matching record. Once you've generated that list, you can drop the temporary table.
That way, you're not trying to work with large volumes of data in PHP itself, so you shouldn't have any memory issues.
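A minimal sketch of that flow, assuming the dump is a plain text file with one id per line and the application table is called app_records; all names are illustrative:

// Sketch only: load the dumped ids into a temporary table, then anti-join.
$pdo = new PDO('mysql:host=localhost;dbname=appdb', 'dbuser', 'dbpass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$pdo->exec('CREATE TEMPORARY TABLE tmp_ids (id BIGINT NOT NULL PRIMARY KEY)');
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/ids.txt' INTO TABLE tmp_ids (id)");

// Every id from the dump that has no matching record in the application table.
$missing = $pdo->query(
    'SELECT t.id
     FROM tmp_ids t
     LEFT JOIN app_records a ON a.id = t.id
     WHERE a.id IS NULL'
)->fetchAll(PDO::FETCH_COLUMN);

The temporary table disappears when the connection closes, so there is nothing to clean up by hand.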
Assuming you can't join the tables since they are not on the same DB server, and that your server can handle this, I would populate an array with all the IDs from one DB, then loop over the IDs from the other and use in_array to see if each one exists in the array.
BTW - according to this, you can make the in_array more efficient.
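One common way to make that lookup cheaper is to flip the array once so each membership check becomes a constant-time isset() on the keys instead of a linear in_array() scan; $appIds and $otherIds below are placeholder variable names:

// Sketch only: use array keys as a hash set for O(1) lookups.
$lookup = array_flip($appIds);     // ids already present in the application DB

$missing = [];
foreach ($otherIds as $id) {
    if (!isset($lookup[$id])) {
        $missing[] = $id;          // no corresponding record
    }
}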