I have a batch process where I need to update my DB table with around 100,000-500,000 rows from an uploaded CSV file. Normally it takes 20-30 minutes, sometimes longer.
What is the best way to do this? Are there any good practices for it? Any suggestion would be appreciated.
Thanks.
It takes 30 minutes to import 500,000 rows from a CSV?
Have you considered letting MySQL do the hard work? There is LOAD DATA INFILE, which supports dealing with CSV files:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
If the file is not quite in the right shape to be imported right to the target table, you can either use PHP to transform it beforehand, or LOAD it into a "staging" table and let MySQL handle the necessary transformation — whichever is faster and more convenient.
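If it helps, here is a rough sketch of the staging-table route using PDO. The file path, the staging_orders/orders tables and the transformation shown are invented for the example, and LOCAL requires local_infile to be enabled on both client and server:

<?php
// Assumes a PDO connection with LOCAL INFILE allowed (hypothetical credentials).
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

// 1) Bulk-load the raw CSV into an index-free staging table.
$load = <<<'SQL'
LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE staging_orders
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
SQL;
$pdo->exec($load);

// 2) Let MySQL transform and copy everything in one set-based statement.
$pdo->exec("INSERT INTO orders (customer_id, amount, created_at)
            SELECT customer_id, amount, STR_TO_DATE(order_date, '%d.%m.%Y')
            FROM staging_orders");

The point of the staging table is that the expensive part (getting half a million rows into MySQL) is done by LOAD DATA, and the clean-up happens as a single set-based query instead of row-by-row PHP work.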
As an additional option, there seems to be a possibility to run MySQL queries asynchronously through the MySQL Native Driver for PHP (MYSQLND). Maybe you can explore that option as well. It would enable you to retain snappy UI performance.
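For what it's worth, a minimal sketch of what that looks like with mysqli and MYSQLI_ASYNC (only available with mysqlnd); the query and table names are placeholders:

<?php
// Kick off a long-running statement without blocking the rest of the PHP script.
$mysqli = new mysqli('localhost', 'user', 'pass', 'test');
$mysqli->query('INSERT INTO orders SELECT * FROM staging_orders', MYSQLI_ASYNC);

// ... do other work here (update a progress row, render output, etc.) ...

// Check whether MySQL has finished, then collect the result.
$links = $errors = $reject = [$mysqli];
if (mysqli::poll($links, $errors, $reject, 1) > 0) {
    $ok = $mysqli->reap_async_query();   // true/false (or a result set for SELECTs)
}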
If you're doing a lot of inserts, are you doing bulk inserts? i.e. like this:
INSERT INTO table (col1, col2) VALUES (val1a, val2a), (val1b, val2b), ...
That will dramatically speed up inserts.
Another thing you can do is disable indexing while you make the changes, then let it rebuild the indexes in one go when you're finished.
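For illustration only, here is a rough sketch combining both ideas: batched multi-row INSERTs built from the CSV, wrapped in DISABLE/ENABLE KEYS (which applies to MyISAM). The table, columns, file path and batch size are placeholders:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('ALTER TABLE my_table DISABLE KEYS');   // skip index maintenance while loading

$flush = function (array $rows) use ($pdo) {
    if (!$rows) return;
    // one placeholder group per row: (?, ?),(?, ?),...
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO my_table (col1, col2) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));
};

$fh = fopen('/tmp/upload.csv', 'r');
$batch = [];
while (($row = fgetcsv($fh)) !== false) {
    $batch[] = [$row[0], $row[1]];
    if (count($batch) >= 1000) {                   // 1000 rows per INSERT, adjust to taste
        $flush($batch);
        $batch = [];
    }
}
$flush($batch);                                    // remaining rows
fclose($fh);

$pdo->exec('ALTER TABLE my_table ENABLE KEYS');    // rebuild the indexes in one pass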
If you give a bit more detail about what you're doing, you might get more ideas.
PEAR has a package called Benchmark, which includes a Benchmark_Profiler class that can help you find the slowest sections of your code so you can optimize them.
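From memory of the PEAR docs, usage looks roughly like this; the enterSection()/leaveSection() method names and the auto-display constructor flag may differ slightly in your version, so treat it as a sketch:

<?php
require_once 'Benchmark/Profiler.php';

$profiler = new Benchmark_Profiler(true);   // true: print the report automatically at shutdown

$profiler->enterSection('parse_csv');
// ... parse the uploaded CSV ...
$profiler->leaveSection('parse_csv');

$profiler->enterSection('insert_rows');
// ... run the inserts ...
$profiler->leaveSection('insert_rows');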
We had a feature like that in a big application, where we had to insert millions of rows from a CSV into a table with 9 indexes. After lots of refactoring we found that the ideal way to insert the data was to load it into a temporary table with the MySQL LOAD DATA INFILE command, do the transformations there, and copy the result into the actual table with multiple INSERT INTO ... SELECT FROM queries, processing only about 50k lines per query (which performed better than issuing a single insert, but YMMV).
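A rough sketch of that chunked copy, assuming the staging table has an AUTO_INCREMENT id column (all names here are placeholders):

<?php
// Copy staging -> real table in chunks of about 50k rows, keyed on the
// staging table's AUTO_INCREMENT id.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$chunk = 50000;
$maxId = (int) $pdo->query('SELECT COALESCE(MAX(id), 0) FROM staging')->fetchColumn();

$stmt = $pdo->prepare('INSERT INTO real_table (col1, col2)
                       SELECT col1, col2 FROM staging
                       WHERE id BETWEEN :from AND :to');

for ($start = 1; $start <= $maxId; $start += $chunk) {
    $stmt->execute([':from' => $start, ':to' => $start + $chunk - 1]);
}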
I can't do it with cron, because this is under user control. A user clicks the process button and can later check the logs to see the process status.
When the user presses said button, set a flag in a table in the database. Then have your cron job check for this flag. If it's there, start processing; otherwise don't. If applicable, you could use the same table to post some kind of status update (e.g. xx% done), so the user has some feedback about the progress.
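As a rough sketch of that flag-plus-progress idea (the import_jobs table and its columns are invented for the example):

<?php
// Cron-side worker: import_jobs(id, status, progress, file) is hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$job = $pdo->query("SELECT id, file FROM import_jobs WHERE status = 'pending' LIMIT 1")
           ->fetch(PDO::FETCH_ASSOC);
if (!$job) {
    exit;   // nothing flagged for processing on this cron run
}
$pdo->prepare("UPDATE import_jobs SET status = 'running' WHERE id = ?")->execute([$job['id']]);

$lines = file($job['file']);                       // fine for a sketch; stream for huge files
$total = max(1, count($lines));                    // avoid division by zero on an empty file
$progress = $pdo->prepare('UPDATE import_jobs SET progress = ? WHERE id = ?');

foreach (array_chunk($lines, 1000) as $i => $batch) {
    // ... insert $batch with a multi-row INSERT here ...
    $done = min(($i + 1) * 1000, $total);
    $progress->execute([(int) round(100 * $done / $total), $job['id']]);
}

$pdo->prepare("UPDATE import_jobs SET status = 'done' WHERE id = ?")->execute([$job['id']]);

The UI can then simply poll the same import_jobs row to show "xx% done" to the user.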
Please, can somebody help me out?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file, which is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory; then, in a for loop, the script reads it line by line, takes one value from each line and searches for that value in the table. If the value already exists in the table, it is not inserted; otherwise, the values read are stored in the table.
This process takes days to finish.
Can somebody suggest something to speed up the process?
Thanks a lot.
If you have many users who may use the web app at the same time:
You should switch from SimpleXLSX to js-xlsx, so the browser does all the parsing work and the server only writes to the database.
If you have few users (which I think is your case):
"and search this value in the table"
This is what costs the most time: you compare one value at a time between memory and the database, then decide whether or not to add it.
Instead, read all the relevant database values into memory (use a hash map for the comparison), compare everything there, add the new values to memory and mark them as new.
At the end, write the new values from memory to the database.
Because the database and the XLSX contain mostly the same data, querying the database row by row adds almost no value; doing the comparison entirely in memory with a hash map is fastest.
Of course, for the final write you can still use @Barmar's idea: don't insert rows one by one, insert them in batches.
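To make that concrete, a rough sketch under the assumption that one column (lookup_value, a placeholder name) is what you search on; $xlsxRows stands for the rows already parsed by SimpleXLSX:

<?php
// Pull the existing keys once and keep them as array keys
// (PHP arrays are hash tables), then only insert values not yet seen.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$existing = array_flip(
    $pdo->query('SELECT lookup_value FROM big_table')->fetchAll(PDO::FETCH_COLUMN)
);

$toInsert = [];
foreach ($xlsxRows as $row) {
    $value = $row[0];
    if (!isset($existing[$value])) {
        $existing[$value] = true;          // also skips duplicates inside the file itself
        $toInsert[] = $value;
    }
}

// Batch-insert the new values, 1000 per statement, as suggested elsewhere on this page.
foreach (array_chunk($toInsert, 1000) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '(?)'));
    $pdo->prepare("INSERT INTO big_table (lookup_value) VALUES $placeholders")->execute($chunk);
}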
Focus on the speed of throwing the data into the database. Do not try to do all the work during the INSERT; use SQL queries afterwards to further clean up the data.
Use the minimal amount of XLS handling needed to get the XML into the database. Use a programming language if you need to massage the data a lot; neither XLS nor SQL is the right place for complex string manipulation.
If practical, use LOAD XML INFILE to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
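For reference, a minimal sketch of the LOAD XML route; the file path, the staging table and the <row> element name are assumptions about what your export looks like:

<?php
// Assumes each record in the file looks like
// <row><name>...</name><address>...</address></row> and the columns match the table.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$sql = <<<'SQL'
LOAD XML LOCAL INFILE '/tmp/export.xml'
INTO TABLE staging_customers
ROWS IDENTIFIED BY '<row>'
SQL;
$pdo->exec($sql);

// Then clean up / convert with set-based SQL and move the rows into the real table.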
A CRM is hitting my server via webhooks at least 1,000 times and I cannot process all the requests at once.
So I am thinking about saving them (in MySQL or a CSV file) and then processing one record at a time.
Which method is faster if there are approximately 100,000 records and I have to process one record at a time?
Different methods are available to perform such an operation:
You can store the data in MySQL and write a PHP script which will fetch requests from the MySQL database and process them one by one. You can run that script automatically using crontab or a scheduler at a specific interval.
You can implement custom queue functionality using PHP + MySQL.
It sounds like you need the following:
1) An incoming queue table where all the new rows get inserted without any processing. An appropriately configured InnoDB table should be able to handle 1,000 INSERTs/second unless you are running on a Raspberry Pi or something similarly underspecified. You should probably have this table partitioned so that, instead of deleting the records after processing, you can drop whole partitions (ALTER TABLE ... DROP PARTITION is much, much cheaper than a large DELETE operation).
2) A scheduled event that processes the data in the background, possibly in batches, and cleans up the original queue table.
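To sketch what that might look like (the webhook_queue layout, the queue_state bookkeeping table and the batch size are all hypothetical):

<?php
// Worker run from cron (or a MySQL EVENT calling a stored procedure). The
// webhook endpoint only INSERTs into webhook_queue; this script processes a
// batch at a time. Hypothetical layout:
//   webhook_queue(id BIGINT AUTO_INCREMENT, received_day INT, payload JSON,
//                 PRIMARY KEY (id, received_day)) ENGINE=InnoDB
//                 PARTITION BY RANGE (received_day) (...)  -- one partition per day
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$last = (int) $pdo->query('SELECT last_processed_id FROM queue_state')->fetchColumn();
$rows = $pdo->prepare('SELECT id, payload FROM webhook_queue WHERE id > ? ORDER BY id LIMIT 1000');
$rows->execute([$last]);

foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // ... do the real per-record work with $row['payload'] here ...
    $last = (int) $row['id'];
}

$pdo->prepare('UPDATE queue_state SET last_processed_id = ?')->execute([$last]);
// Old rows are never DELETEd one by one; whole days are removed later with
// ALTER TABLE webhook_queue ... DROP PARTITION.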
As you definitely know, CSV won't let you create indexes for fast searching. Setting indexes on your table columns speeds up searching to a really great degree, and you cannot ignore this fact.
If you need all your data from a single table (for instance, app config), CSV is faster; otherwise not. Hence, for simple inserting and table-scan (non-index based) searches, CSV is faster. Also consider that updating or deleting from a CSV is nontrivial. If you use CSV, you need to be really careful to handle multiple threads/processes correctly, otherwise you'll get bad data or corrupt your file.
MySQL offers a lot of capabilities, such as SQL queries, transactions, data manipulation and concurrent access, and CSV is certainly not built for these things. MySQL, as mentioned by Simone Rossaini, is also safe; you should not overlook this fact either.
SUMMARY
If you are going for simple inserting and table-scan (non-index based) searches, CSV is definitely faster. Yet it has many shortcomings when you compare it with the countless capabilities of MySQL.
I am going to change my application so that it does bulk inserts instead of individual ones, to ease the load on my server. I am not sure of the best way to go about this. My thoughts so far are:
Use a text file: write all the insert/update statements to this file and process it every 5 minutes. I am not sure of the best way to handle this one: would reading from one process (to create the bulk insert) cause issues while the main process is still trying to add more statements to it? Would I need to create a new file every 5 minutes and delete it once it's processed?
Store the inserts in a session and then just process them. Would this cause any problems with memory etc.?
I am using PHP and MySQL with MyISAM tables. I am open to all ideas on the best way to handle this; I just know I need to stop doing single inserts/updates.
Thanks.
The fastest way to get data into the database is to use load data infile on a text file.
See: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
You can also use bulk inserts, of course. If you want them to queue behind SELECTs, use syntax like:
INSERT LOW_PRIORITY INTO table1 (field1, field2) VALUES (1,1),(2,2),(3,3),...
or
INSERT DELAYED INTO .....
Note that delayed does not work with InnoDB.
Also note that low priority is not recommended when using MyISAM.
See:
http://dev.mysql.com/doc/refman/5.5/en/insert.html
http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html
I think you should create a new file every 5 minutes, one for inserts and one for updates, and remove each file after it has been processed.
For bulk inserts:
You can use LOAD DATA INFILE with keys disabled on the table.
If you use InnoDB, you should run all the inserts in a transaction to avoid flushing the indexes on each query, and use the form with multiple VALUES (),(),().
If you use MyISAM, you can insert with the DELAYED option. Also, if you don't remove rows from the table, concurrent reads and writes are possible.
For bulk updates you should use a transaction as well, because you get the same effect.
I've got an application which needs to run a daily script; the daily script consists of downloading a CSV file with 1,000,000 rows and inserting those rows into a table.
I host my application on Dreamhost. I created a while loop that goes through all the CSV's rows and performs an INSERT query for each one. The thing is that I get a "500 Internal Server Error". Even if I chop it into 1,000 files with 1,000 rows each, I can't insert more than 40 or 50 thousand rows in the same loop.
Is there any way to optimize the input? I'm also considering going with a dedicated server; what do you think?
Thanks!
Pedro
Most databases have an optimized bulk insertion process - MySQL's is the LOAD DATA INFILE syntax.
To load a CSV file, use:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Insert multiple values at a time: instead of doing
insert into table values(1,2);
do
insert into table values (1,2),(2,3),(4,5);
Up to an appropriate number of rows at a time.
Or do bulk import, which is the most efficient way of loading data, see
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
Normally I would say just use LOAD DATA INFILE, but it seems you can't with your shared hosting environment.
I haven't used MySQL in a few years, but they have a very good document which describes how to speed up bulk insertions:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
A few ideas that can be gleaned from this:
Disable/enable keys around the insertions:
ALTER TABLE tbl_name DISABLE KEYS;
ALTER TABLE tbl_name ENABLE KEYS;
Use many values in your insert statements.
I.e.: INSERT INTO table (col1, col2) VALUES (val1, val2),(.., ..), ...
If I recall correctly, you can have up to 4096 values per insertion statement.
Run a FLUSH TABLES command before you even start, to ensure that there are no pending disk writes that may hurt your insertion performance.
I think this will make things fast. I would suggest using LOCK TABLES, but I think disabling the keys makes that moot.
UPDATE
I realized after reading this that by disabling your keys you may remove consistency checks that are important for your file loading. You can fix this by:
Ensuring that your table has no data that "collides" with the new data being loaded (if you're starting from scratch, a TRUNCATE statement will be useful here).
Writing a script to clean your input data to ensure no duplicates locally. Checking for duplicates is probably costing you a lot of database time anyway.
If you do this, ENABLE KEYS should not fail.
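A rough sketch of such a local clean-up pass, assuming the first CSV column is the unique key (adjust the key to whatever your UNIQUE index covers):

<?php
// De-duplicate the CSV locally before loading it into MySQL.
$in   = fopen('input.csv', 'r');
$out  = fopen('input.clean.csv', 'w');
$seen = [];

while (($row = fgetcsv($in)) !== false) {
    $key = $row[0];
    if (!isset($seen[$key])) {
        $seen[$key] = true;    // PHP arrays are hash tables, so the lookup stays cheap
        fputcsv($out, $row);
    }
}
fclose($in);
fclose($out);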
You can create a cron job script which adds x records to the database per run.
The cron job script will check whether the last run imported all the needed rows; if not, it takes another x rows.
That way you can add as many rows as you need.
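A rough sketch of that idea; import_state is a hypothetical one-row bookkeeping table, and the chunk size, file path and columns are placeholders (you could combine this with the multi-row INSERTs suggested elsewhere on this page):

<?php
// Resumable cron import: remember how many lines have been processed so far
// and handle the next chunk on each run.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$chunk  = 1000;
$offset = (int) $pdo->query('SELECT lines_done FROM import_state')->fetchColumn();

$file = new SplFileObject('/home/user/data.csv');
$file->seek($offset);                              // jump to the first unprocessed line

$insert = $pdo->prepare('INSERT INTO my_table (col1, col2) VALUES (?, ?)');
$n = 0;
while (!$file->eof() && $n < $chunk) {
    $row = $file->fgetcsv();
    $n++;                                          // count every consumed line
    if (is_array($row) && $row[0] !== null) {
        $insert->execute([$row[0], $row[1]]);
    }
}

$pdo->prepare('UPDATE import_state SET lines_done = lines_done + ?')->execute([$n]);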
If you have your own dedicated server it's much easier: you can just run a loop with all the insert queries.
Of course you can try to set the time limit to 0 (if that works on Dreamhost) or increase it.
Your PHP script is most likely being terminated because it exceeded the script time limit. Since you're on a shared host, you're pretty much out of luck.
If you do switch to a dedicated server and if you get shell access, the best way would be to use the mysql command-line tool to insert the data.
OMG Ponies' suggestion is great, but I've also 'manually' formatted data into the same format that mysqldump uses, then loaded it that way. Very fast.
Have you tried using transactions? Just send the command BEGIN to MySQL, do all your inserts, then do COMMIT. This would speed it up significantly, but as casablanca said, your script is probably timing out as well.
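A rough sketch of that pattern with PDO, assuming InnoDB tables and that the rows are already parsed into $rows (all names are placeholders):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO my_table (col1, col2) VALUES (?, ?)');

$pdo->beginTransaction();                // one transaction instead of one per row
try {
    foreach ($rows as $row) {
        $stmt->execute([$row[0], $row[1]]);
    }
    $pdo->commit();                      // a single flush at the end
} catch (Exception $e) {
    $pdo->rollBack();                    // nothing half-imported on failure
    throw $e;
}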
I've run into this problem myself before, and nos pretty much got it right on the head, but you'll need to do a bit more to get the best performance.
I found that in my situation I couldn't get MySQL to accept one large INSERT statement, but if I split it up into groups of about 10k INSERTs at a time, as nos suggested, it does its job pretty quickly. One thing to note is that when doing multiple INSERTs like this you will most likely hit PHP's timeout limit, but this can be avoided by resetting the timeout with set_time_limit($seconds); I found that doing this after each successful INSERT worked really well.
You have to be careful about doing this, because you could accidentally end up in a loop with an unlimited timeout, so I would suggest testing that each INSERT was successful, either by checking for errors reported by MySQL with mysql_errno() or mysql_error(), or by checking the number of rows affected by the INSERT with mysql_affected_rows(). You could then stop after the first error happens.
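A rough sketch of that pattern; the answer above uses the old mysql_* functions, which are gone in PHP 7+, so this uses the mysqli equivalents instead, and $batches stands for pre-built multi-row INSERT strings:

<?php
// Batched INSERTs with the script timeout pushed back after each successful
// batch, stopping at the first error.
$db = new mysqli('localhost', 'user', 'pass', 'test');

foreach ($batches as $sql) {
    if ($db->query($sql) === false) {
        error_log('Import stopped: ' . $db->error);
        break;                             // stop after the first error
    }
    if ($db->affected_rows <= 0) {
        error_log('Batch inserted no rows, stopping.');
        break;
    }
    set_time_limit(30);                    // reset the timeout after each batch
}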
It would be better if you used SQL*Loader.
You would need two things: first, a control file that specifies the actions SQL*Loader should perform, and second, the CSV file that you want to load.
Here is a link that should help you out:
http://www.oracle-dba-online.com/sql_loader.htm
Go to phpMyAdmin and select the table you would like to insert into.
Under the "Operations" tab, in the "Table options" section, change the storage engine from InnoDB to MyISAM.
I once had a similar challenge.
Have a good time.
We have 2 servers, one of which is the customer's. Our customer provides us URLs of XML/JSON exports of his clients' information from his CMS, and our task is to write import scripts to bring that data into the web app we're developing.
I've always been doing that like this:
INSERT INTO customers (name,address) VALUES ('John Doe', 'NY') ON DUPLICATE KEY UPDATE name='John Doe', address='NY'
This solution is the best in terms of performance, as far as I know...
But this solution does NOT solve the problem of deleting records. What if a client is deleted from the database and is no longer in the export? How should I handle that?
Should I first TRUNCATE the whole table and then fill it again?
Or should I fill an array in PHP with all the records, then walk through it again and delete the records which aren't in the XML/JSON?
I think there must be a better solution.
I'm interested in the best solution in terms of performance, because we have to import many thousands of records and the whole import process may take a lot of time.
I'm interested in the best solution in terms of performance
If it's MySQL at the client end, use MySQL replication - the client as the master and your end as the slave. You can either use a direct feed (you'd probably want to run this across a VPN) or disconnected mode (they send you the binlogs to roll forward).
Our customer provides us URLs of XML/JSON exports of his clients' information from his CMS
This is a really dumb idea - it sounds like you're trying to make the solution fit the problem (which it doesn't). HTTP is not the medium for transferring large data files across the internet. It also means that the remote server must do rather a lot of work just to make the data available (assuming it can even identify what data needs to be replicated - and as you point out, that is currently failing to work for deleted records). The latter point is true regardless of the network protocol.
You certainly can't copy large amounts of data directly across at a lower level in the stack than the database (e.g. by trying to use rsync to replicate data files), because the local mirror will nearly always be inconsistent.
C.
Assuming you are using MySQL, the only SQL I know anything about:
Is it true that the export of your customer's CMS always contains all of his current customer data? If so, then yes, IMO it is best to drop or truncate the 'customers' table; that is, to just throw away yesterday's customers table and reconstruct it today from scratch.
But you can't use plain INSERT: it would take ~28 hours per day to insert thousands of customer rows. So forget about INSERT.
Instead, add rows to 'customers' with LOAD DATA LOCAL INFILE: first write a temporary disk file 'cust_data.txt' with all the customer data, with the column data separated somehow (perhaps by commas), and then say something like:
load data local infile 'cust_data.txt' replace into table customers fields terminated by ',' lines terminated by '\n';
Can you structure the query such that you can use your client's output file directly, without first staging it into 'cust_data.txt'? That would be the answer to a maiden's prayer.
It should be fast enough for you: you will be amazed!
ref: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
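If it helps, a rough sketch of that flow; the export URL, the decoded field names and the customers columns are assumptions:

<?php
// Fetch the JSON export, write it to a temp CSV, then bulk-load with REPLACE
// so existing rows (matched on the primary/unique key) are overwritten.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$clients = json_decode(file_get_contents('https://customer.example/export.json'), true);

$fh = fopen('/tmp/cust_data.txt', 'w');
foreach ($clients as $c) {
    fputcsv($fh, [$c['id'], $c['name'], $c['address']]);   // assumed field names
}
fclose($fh);

$sql = <<<'SQL'
LOAD DATA LOCAL INFILE '/tmp/cust_data.txt'
REPLACE INTO TABLE customers
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id, name, address)
SQL;
$pdo->exec($sql);

Note that REPLACE only covers new and changed rows; if you go the truncate-and-reload route described above, deleted clients disappear automatically.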
If your customer can export the data as a CSV file, you can use SQL Data Examiner
(http://www.sqlaccessories.com/SQL_Data_Examiner) to update records in the target database (insert/update/delete) using the CSV file as the source.