I've got an application which needs to run a daily script; the daily script consists of downloading a CSV file with 1,000,000 rows and inserting those rows into a table.
I host my application on DreamHost. I created a while loop that goes through all the CSV's rows and performs an INSERT query for each one. The thing is that I get a "500 Internal Server Error". Even if I split it into 1,000 files with 1,000 rows each, I can't insert more than 40 or 50 thousand rows in the same loop.
Is there any way I could optimize the import? I'm also considering going with a dedicated server; what do you think?
Thanks!
Pedro
Most databases have an optimized bulk insertion process - MySQL's is the LOAD DATA INFILE syntax.
To load a CSV file, use:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Insert multiple rows per statement: instead of doing
insert into table values(1,2);
do
insert into table values (1,2),(2,3),(4,5);
Up to an appropriate number of rows at a time.
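For illustration, here is a minimal PHP sketch of that batching; the connection details, table t, columns a/b and the batch size are placeholders, not anything from the question:

<?php
// Hypothetical sketch: batch CSV rows into multi-row INSERT statements.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$rows = array_map('str_getcsv', file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
$batchSize = 500;   // an "appropriate number" of rows per statement

foreach (array_chunk($rows, $batchSize) as $batch) {
    $values = [];
    foreach ($batch as $r) {
        $a = $db->real_escape_string($r[0]);
        $b = $db->real_escape_string($r[1]);
        $values[] = "('$a', '$b')";
    }
    // One statement inserts the whole batch in a single round trip.
    $db->query('INSERT INTO t (a, b) VALUES ' . implode(',', $values));
}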
Or do bulk import, which is the most efficient way of loading data, see
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
Normally I would say just use LOAD DATA INFILE, but it seems you can't with your shared hosting environment.
I haven't used MySQL in a few years, but they have a very good document which describes how to speed up bulk insertions:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
A few ideas that can be gleaned from this:
Disable/enable keys around the insertions:
ALTER TABLE tbl_name DISABLE KEYS;
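-- ... run your bulk INSERT statements here ...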
ALTER TABLE tbl_name ENABLE KEYS;
Use many values in your insert statements.
I.e.: INSERT INTO table (col1, col2) VALUES (val1, val2),(.., ..), ...
If I recall correctly, you can have up to 4096 values per insertion statement.
Run a FLUSH TABLES command before you even start, to ensure that there are no pending disk writes that may hurt your insertion performance.
I think this will make things fast. I would suggest using LOCK TABLES, but I think disabling the keys makes that moot.
UPDATE
I realized after reading this that by disabling your keys you may remove consistency checks that are important for your file loading. You can fix this by:
Ensuring that your table has no data that "collides" with the new data being loaded (if you're starting from scratch, a TRUNCATE statement will be useful here).
Writing a script to clean your input data to ensure no duplicates locally. Checking for duplicates is probably costing you a lot of database time anyway.
If you do this, ENABLE KEYS should not fail.
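As a rough illustration of that second point, here is a hedged sketch that removes local duplicates from the CSV before loading; the key column index and file names are assumptions:

<?php
// Hypothetical sketch: strip duplicate rows from the CSV locally before loading,
// so ENABLE KEYS won't fail on a UNIQUE index. Column 0 is assumed to be the key.
$seen = [];
$in  = fopen('data.csv', 'r');
$out = fopen('data_clean.csv', 'w');

while (($row = fgetcsv($in)) !== false) {
    $key = $row[0];            // assumed unique-key column
    if (isset($seen[$key])) {
        continue;              // skip duplicate
    }
    $seen[$key] = true;
    fputcsv($out, $row);
}
fclose($in);
fclose($out);
// data_clean.csv can now be loaded (e.g. with LOAD DATA INFILE or batched INSERTs).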
You can create a cron job script which adds x records to the database on each run.
The cron job script checks whether the last import added all the needed rows; if it didn't, it takes the next x rows.
That way you can add as many rows as you need.
If you have your own dedicated server it's much easier: you just run the loop with all the insert queries.
Of course you can also try to set time_limit to 0 (if that works on DreamHost) or make it bigger.
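A rough sketch of that idea; the progress file, table and column names, and chunk size are all assumptions, not DreamHost specifics:

<?php
// Hypothetical cron job: insert the next chunk of the CSV on every run,
// remembering how far it got in a small progress file.
$db       = new mysqli('localhost', 'user', 'pass', 'mydb');
$progress = 'import_offset.txt';
$chunk    = 1000;

$offset = file_exists($progress) ? (int) file_get_contents($progress) : 0;
$rows   = array_map('str_getcsv', file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

$slice = array_slice($rows, $offset, $chunk);
if (!$slice) {
    exit("Import finished\n");
}

$values = [];
foreach ($slice as $r) {
    $a = $db->real_escape_string($r[0]);
    $b = $db->real_escape_string($r[1]);
    $values[] = "('$a', '$b')";
}
$db->query('INSERT INTO t (a, b) VALUES ' . implode(',', $values));

// Remember where to continue on the next cron run.
file_put_contents($progress, $offset + count($slice));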
Your PHP script is most likely being terminated because it exceeded the script time limit. Since you're on a shared host, you're pretty much out of luck.
If you do switch to a dedicated server and if you get shell access, the best way would be to use the mysql command-line tool to insert the data.
OMG Ponies' suggestion is great, but I've also 'manually' formatted data into the same format that mysqldump uses, then loaded it that way. Very fast.
Have you tried using transactions? Just send the command BEGIN to MySQL, do all your inserts, then do COMMIT. This would speed it up significantly, but like casablanca said, your script is probably timing out as well.
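A minimal sketch of that, assuming a mysqli connection and InnoDB tables (transactions have no effect on MyISAM); the table and columns are placeholders:

<?php
// Hypothetical sketch: wrap all the inserts in one transaction so rows are
// flushed to disk once at COMMIT instead of after every statement.
$db   = new mysqli('localhost', 'user', 'pass', 'mydb');
$rows = array_map('str_getcsv', file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

$db->begin_transaction();              // or simply: $db->query('BEGIN');
foreach ($rows as $r) {
    $a = $db->real_escape_string($r[0]);
    $b = $db->real_escape_string($r[1]);
    $db->query("INSERT INTO t (a, b) VALUES ('$a', '$b')");
}
$db->commit();                         // everything is written to disk here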
I've run into this problem myself before, and nos pretty much got it right on the head, but you'll need to do a bit more to get the best performance.
I found that in my situation I couldn't get MySQL to accept one large INSERT statement, but if I split it up into groups of about 10k rows at a time, as nos suggested, it does its job pretty quickly. One thing to note is that when doing multiple INSERTs like this you will most likely hit PHP's timeout limit, but this can be avoided by resetting the timeout with set_time_limit($seconds); I found that doing this after each successful INSERT worked really well.
You have to be careful about doing this, because you could accidentally end up in a loop with an unlimited timeout. To guard against that, I would suggest testing that each INSERT was successful, either by checking for errors reported by MySQL with mysql_errno() or mysql_error(), or by checking the number of rows affected by the INSERT with mysql_affected_rows(). You can then stop after the first error happens.
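Here's a rough sketch of that loop; it uses the mysqli equivalents of mysql_errno(), mysql_error() and mysql_affected_rows(), and the table/column names are placeholders:

<?php
// Hypothetical sketch: ~10k-row INSERT batches, resetting the timeout and
// checking for errors after each one.
$db   = new mysqli('localhost', 'user', 'pass', 'mydb');
$rows = array_map('str_getcsv', file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

foreach (array_chunk($rows, 10000) as $batch) {
    $values = [];
    foreach ($batch as $r) {
        $values[] = "('" . $db->real_escape_string($r[0]) . "','"
                         . $db->real_escape_string($r[1]) . "')";
    }
    $db->query('INSERT INTO t (a, b) VALUES ' . implode(',', $values));

    if ($db->errno) {                            // stop after the first error
        die('Import failed: ' . $db->error);
    }
    if ($db->affected_rows < count($batch)) {
        die('Fewer rows inserted than expected, aborting.');
    }
    set_time_limit(60);   // reset the timeout after each successful INSERT
}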
It would be better if you use SQL*Loader (Oracle's bulk loader; the MySQL equivalent is LOAD DATA INFILE).
You would need two things: first, a control file that specifies the actions SQL*Loader should perform, and second, the CSV file that you want to load.
The link below should help you out.
http://www.oracle-dba-online.com/sql_loader.htm
Go to phpMyAdmin and select the table you would like to insert into.
Under the "Operations" tab, in the "Table options" section, change the storage engine from InnoDB to MyISAM.
I once had a similar challenge and this helped.
Have a good time.
I have a Laravel application which must insert/update thousands of records per second in a for loop. My problem is that my database insert/update rate is 100-150 writes per second. I have increased the amount of RAM dedicated to my database, but had no luck.
Is there any way to increase the write rate for MySQL to thousands of records per second?
Please provide me with optimum configurations for performance tuning.
And PLEASE do not downvote the question. My code is correct. It's not a code problem, because I have no problem with MongoDB, but I have to use MySQL.
My storage engine is InnoDB.
Inserting rows one at a time, and autocommitting each statement, has two overheads.
Each transaction has overhead, probably more than one insert. So inserting multiple rows in one transaction is the trick. This requires a code change, not a configuration change.
Each INSERT statement has overhead. A single-row insert is about 90% overhead and 10% actual insert work.
The optimal is 100-1000 rows being inserted per transaction.
For rapid inserts:
Best is LOAD DATA -- if you are starting with a .csv file. If you must build the .csv file first, then it is debatable whether that overhead makes this approach lose.
Second best is multi-row INSERT statements: INSERT INTO t (a,b) VALUES (1,2), (2,3), (44,55), .... I recommend 1000 per statement, and COMMIT each statement. This is likely to get you past 1000 rows per second being inserted.
Another problem... Since each index is updated as the row is inserted, you may run into trouble with thrashing I/O to achieve this task. InnoDB automatically "delays" updates to non-unique secondary indexes (no need for INSERT DELAYED), but the work is eventually done. (So RAM size and innodb_buffer_pool_size come into play.)
If the "thousands" of rows/second is a one time task, then you can stop reading here. If you expect to do this continually 'forever', there are other issues to contend with. See High speed ingestion .
For insert, you might want to look into the INSERT DELAYED syntax. That will increase insert performance, but it won't help with update and the syntax will eventually be deprecated. This post offers an alternative for updates, but it involves custom replication.
One way my company succeeded in speeding up inserts is by writing the SQL to a file and then using a MySQL LOAD DATA INFILE command, but I believe we found that required the server's command line to have the mysql application installed.
I've also found that inserting and updating in a batch is often faster. So if you're calling INSERT 2k times, you might be better off running 10 inserts of 200 rows each. This would decrease the lock requirements and decrease information/number of calls sent over the wire.
We built a link from our offline program to our website. In our offline program we have 50,000 records we want to push to our website. What we do now is the following:
In the offline program we build an XML file with 1500 records and post it to a PHP file on our webserver. On the webserver we read the XML and push it to the MySQL database; before we do that, we first check whether the record already exists and then either update the record or insert it as a new one.
When that's done, we give a message back to our offline program that the batch is completed. The offline program then builds a new XML file with the next 1500 records. This process repeats until it reaches the last 1500 records.
The problem is that the webserver becomes very slow while pushing the records to the database. That's probably because we first check whether each record already exists (that's one query) and then write it to the database (that's a second query). So for each batch we have to run 3000 queries.
I hope you guys have some tips to speed up this process.
Thanks in advance!
Before starting the import, read all the data IDs you already have; do not make a checking query on every item insert, but check against an existing PHP array instead.
Fix the keys on your database tables.
Make all inserts in one request, or use transactions.
There is no problem importing a lot of data this way; I have a lot of experience with it.
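To make the first point concrete, here is a hedged sketch; the records table, its id/name columns and the way the batch is obtained are assumptions:

<?php
// Hypothetical sketch: load all existing IDs once, then decide insert vs. update
// in PHP instead of running a SELECT for every record in the batch.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$existing = [];
$result = $db->query('SELECT id FROM records');
while ($row = $result->fetch_row()) {
    $existing[$row[0]] = true;         // keyed array gives O(1) lookups
}

$batchRecords = [];                    // ... fill with the 1500 records parsed from the posted XML
foreach ($batchRecords as $rec) {
    $id   = (int) $rec['id'];
    $name = $db->real_escape_string($rec['name']);

    if (isset($existing[$id])) {
        $db->query("UPDATE records SET name = '$name' WHERE id = $id");
    } else {
        $db->query("INSERT INTO records (id, name) VALUES ($id, '$name')");
        $existing[$id] = true;
    }
}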
A good thing to do is write a single query composed of all of the insert statements concatenated and separated by semicolons:
INSERT INTO table_name
(a,b,c)
VALUES
(1,2,3)
ON DUPLICATE KEY
UPDATE a = 1, b = 2, c = 3;
INSERT INTO table_name
...
You could concatenate 100-500 insert statements and wrap them in a transaction.
Wrapping many statements in a transaction helps because the data isn't committed to disk after each inserted row; the whole 100-500 batch is kept in memory, and only when they are all finished is everything written to disk, which means less intermittent disk I/O.
You need to find a good batch size. I used 100-500 as an example, but depending on your server configuration, on the amount of data per statement and on the ratio of inserts to updates, you'll have to fine-tune it.
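Here's a hedged sketch of that pattern, using mysqli::multi_query to send each concatenated batch inside a transaction; the table name, columns and batch size are just placeholders:

<?php
// Hypothetical sketch: concatenated INSERT ... ON DUPLICATE KEY UPDATE statements,
// one transaction per batch.
$db   = new mysqli('localhost', 'user', 'pass', 'mydb');
$rows = [];                                   // ... the records parsed from the incoming XML, e.g. ['a' => 1, 'b' => 'foo']

foreach (array_chunk($rows, 300) as $batch) { // somewhere in the 100-500 range
    $sql = '';
    foreach ($batch as $r) {
        $a = (int) $r['a'];
        $b = $db->real_escape_string($r['b']);
        $sql .= "INSERT INTO table_name (a, b) VALUES ($a, '$b') "
              . "ON DUPLICATE KEY UPDATE b = '$b';";
    }

    $db->begin_transaction();
    if ($db->multi_query($sql)) {
        // Drain each statement's (empty) result so the connection stays usable.
        while ($db->more_results() && $db->next_result()) {
        }
    }
    $db->errno ? $db->rollback() : $db->commit();
}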
Read some information about MySQL unique index constraints. This should help:
Mysql Index Tutorial
I had the same problem 4 months ago, and I got more performance by coding in Java rather than PHP and by avoiding XML documents.
My tip: you can read the whole table once (doing it once is faster than making many queries one by one) and keep that table in memory (in a HashMap, for example). Before inserting a record, you can check whether it exists in your local structure (so you don't bother the DB).
You can improve your performance this way.
I want to upload a large CSV file of approximately 10,000,000 records into a MySQL table which already contains the same number of records or more, and also has some duplicate records.
I tried LOAD DATA LOCAL INFILE, but it is also taking a long time.
How can I resolve this without waiting for a long time?
If it can't be resolved, then how can I do it with AJAX, sending some records at a time and processing them until the whole CSV gets uploaded/processed?
LOAD DATA INFILE isn't going to be beaten speed-wise. There are a few things you can do to speed it up:
Drop or disable some indexes (but of course, you'll have to wait for them to be rebuilt after the load; this is still often faster overall). If you're using MyISAM, you can ALTER TABLE foo DISABLE KEYS, but InnoDB doesn't support that, unfortunately. You'll have to drop the indexes instead.
Optimize your my.cnf settings. In particular, you may be able to disable a lot of safety features (like fsync). Of course, if you take a crash, you'll have to restore a backup and start the load over again. Also, if you're running the default my.cnf, last I checked it's pretty sub-optimal for a database machine. Plenty of tuning guides are around.
Buy faster hardware. Or rent some (e.g., try a fast Amazon EC2 instance).
As #ZendDevel mentions, consider other data storage solutions if you're not locked into MySQL. For example, if you're just storing a list of telephone numbers (and some data with them), a plain hash table is going to be many times faster.
If the problem is that it's killing database performance, you can split your CSV file into multiple CSV files and load them in chunks.
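If you go the splitting route, here's a rough sketch of chopping the CSV into smaller files; the file names and chunk size are arbitrary:

<?php
// Hypothetical sketch: split a huge CSV into smaller files that can each be
// loaded with LOAD DATA INFILE without monopolising the server.
$linesPerChunk = 500000;
$in    = fopen('huge.csv', 'r');
$part  = 0;
$count = 0;
$out   = fopen(sprintf('chunk_%03d.csv', $part), 'w');

while (($line = fgets($in)) !== false) {
    if ($count > 0 && $count % $linesPerChunk === 0) {
        fclose($out);
        $part++;
        $out = fopen(sprintf('chunk_%03d.csv', $part), 'w');
    }
    fwrite($out, $line);
    $count++;
}
fclose($out);
fclose($in);
// Each chunk_NNN.csv can now be loaded separately (handle any header line once).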
Try this:
load data local infile '/yourcsvfile.csv' into table yourtable fields terminated by ',' lines terminated by '\r\n'
Depending on your storage engine this can take a long time. I've noticed that with MyISAM it goes a bit faster. I've just tested with the exact same dataset and I finally went with PostgreSQL because it was more robust at loading the file. InnoDB was so slow I aborted it after two hours with the same-size dataset, but it was 10,000,000 records by 128 columns full of data.
As this is a whitelist being updated on a daily basis, doesn't that mean there will be a very large number of duplicates (after the first day)? If so, it would make the upload a lot faster to write a simple script which checks whether the record already exists before inserting it.
Try this query:
$sql="LOAD DATA LOCAL INFILE '../upload/csvfile.csv'
INTO TABLE table_name FIELDS
TERMINATED BY ','
ENCLOSED BY ''
LINES TERMINATED BY '\n' "
I ran into the same problem and found a way out. You can check the process for uploading a large CSV file using AJAX here:
How to use AJAX to upload large CSV file?
I am going to change my application so that it does bulk inserts instead of individual ones, to ease the load on my server. I am not sure of the best way to go about this. Thoughts so far are:
Use a text file and write all the insert/update statements to this file and process it every 5 minutes. I am not sure of the best way to handle this one. Would reading from one process (to create the bulk insert) cause issues when the main process is still trying to add more statements to it? Would I need to create a new file every 5 minutes and delete it when it's processed?
Store the inserts in a session and then just process them. Would this cause any problems with memory etc.?
I am using PHP and MySQL with MyISAM tables. I am open to all ideas on the best way to handle this, I just know I need to stop doing single inserts / updates.
Thanks.
The fastest way to get data into the database is to use load data infile on a text file.
See: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
You can also use bulk inserts, of course. If you want them to queue behind selects, use a syntax like:
INSERT LOW_PRIORITY INTO table1 (field1, field2) VALUES (1,1),(2,2),(3,3),...
or
INSERT DELAYED INTO .....
Note that DELAYED does not work with InnoDB.
Also note that LOW_PRIORITY is not recommended when using MyISAM.
See:
http://dev.mysql.com/doc/refman/5.5/en/insert.html
http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html
I think you should create a new file for each 5-minute window, for inserts and updates separately, and remove each file after it has been processed.
For bulk inserts:
You can use LOAD DATA INFILE with keys disabled on the table.
If you use InnoDB, you should run all inserts in one transaction to prevent the indexes being flushed on each query, and use the form with multiple VALUES (...),(...),(...).
If you use MyISAM, you should insert with the DELAYED option. Also, if you don't remove rows from the table, concurrent reads and writes are possible.
For bulk updates you should use a transaction as well, because you will get the same effect.
I have a batch process where I need to update my DB table, around 100,000-500,000 rows, from an uploaded CSV file. Normally it takes 20-30 minutes, sometimes longer.
What is the best way to do this? Are there any good practices for it? Any suggestion would be appreciated.
Thanks.
It takes 30 minutes to import 500,000 rows from a CSV?
Have you considered letting MySQL do the hard work? There is LOAD DATA INFILE, which supports dealing with CSV files:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
If the file is not quite in the right shape to be imported right to the target table, you can either use PHP to transform it beforehand, or LOAD it into a "staging" table and let MySQL handle the necessary transformation — whichever is faster and more convenient.
As an additional option, there seems to be a possibility to run MySQL queries asynchronously through the MySQL Native Driver for PHP (MYSQLND). Maybe you can explore that option as well. It would enable you to retain snappy UI performance.
If you're doing a lot of inserts, are you doing bulk inserts? i.e. like this:
INSERT INTO table (col1, col2) VALUES (val1a, val2a), (val1b, val2b), (....
That will dramatically speed up inserts.
Another thing you can do is disable indexing while you make the changes, then let it rebuild the indexes in one go when you're finished.
A bit more detail about what you're doing and you might get more ideas
PEAR has a package called Benchmark, which has a Benchmark_Profiler class that can help you find the slowest sections of your code so you can optimize them.
We had a feature like that in a big application. We had the issue of inserting millions of rows from a CSV into a table with 9 indexes. After lots of refactoring we found the ideal way to insert the data was to load it into a [temporary] table with the MySQL LOAD DATA INFILE command, do the transformations there, and copy the result into the actual table with multiple insert queries (INSERT INTO ... SELECT FROM), processing only 50k lines or so with each query (which performed better than issuing a single insert, but YMMV).
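As a rough sketch of that pipeline (the staging/target table names, columns, key column and CSV path are made up, and LOAD DATA LOCAL requires local_infile to be enabled):

<?php
// Hypothetical sketch: bulk-load into a staging table, then copy into the
// indexed target table in ~50k-row chunks.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

// 1. Bulk-load the raw CSV into an unindexed staging table.
$db->query("TRUNCATE TABLE staging");
$db->query("
    LOAD DATA LOCAL INFILE '/tmp/import.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES");

// 2. Copy into the real (indexed) table in chunks.
$chunk = 50000;
$total = (int) $db->query("SELECT COUNT(*) FROM staging")->fetch_row()[0];

for ($offset = 0; $offset < $total; $offset += $chunk) {
    $db->query("
        INSERT INTO target (a, b, c)
        SELECT a, b, c FROM staging
        ORDER BY id                -- assumed key column, keeps chunks disjoint
        LIMIT $offset, $chunk");
}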
I can't do it with cron, because this is under user control. A user clicks the process button and later on can check logs to see the process status.
When the user presses said button, set a flag in a table in the database. Then have your cron job check for this flag. If it's there, start processing; otherwise don't. If applicable, you could use the same table to post some kind of status update (e.g. xx% done), so the user has some feedback about the progress.
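A rough sketch of that flag-and-progress idea; the import_jobs table, its columns and the CSV path are made up for illustration:

<?php
// Hypothetical cron script: only run the heavy import when the user has
// requested it via a flag row, and write progress back for the UI to display.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$job = $db->query("SELECT id FROM import_jobs WHERE status = 'requested' LIMIT 1")
          ->fetch_assoc();
if (!$job) {
    exit;                          // no flag set, nothing to do this run
}
$jobId = (int) $job['id'];
$db->query("UPDATE import_jobs SET status = 'running' WHERE id = $jobId");

$rows  = array_map('str_getcsv', file('upload.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
$total = count($rows);
$done  = 0;

foreach (array_chunk($rows, 1000) as $batch) {
    // ... insert/update the batch here (see the answers above) ...
    $done += count($batch);
    $pct   = (int) round(100 * $done / $total);
    $db->query("UPDATE import_jobs SET progress = $pct WHERE id = $jobId");
}
$db->query("UPDATE import_jobs SET status = 'done' WHERE id = $jobId");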