We build a link for our offline program to our website. In our offline program we have 50.000 records we want to push to our website. What we do now is the following:
In the offline program we build an xml file with 1500 records and post it to a php file on our webserver. On the webserver we read the xml and push it to the mysql database, before we do that we first check if the record already exist and then we update the record or insert it as a new record.
When thats done, we give back a message to our offline program that the batch is completed . The offline program builds a new xml file with the next 1500 records. This process repeats till it reached the last 1500 records.
The problems is that the webserver become very slow while pushing the records to the database. Probably thats because we first check the records that already exist (that's one query) and then write it into the database (that's the second query). So for each batch we have to run 3000 queries.
I hope you guys have some tips to speed up this process.
Thanks in advance!
Before starting the import, read all the data ids you have, do not make checking queries on every item insert, but check it in existed php array.
Fix keys on your database tables.
Make all inserts on one request, or use Transactions.
there is no problems to import a lot of data such way, i had a lot of experience with it.
A good thing to do is write a single query composed of the concatenation of all of the insert statements separated by a semicolon:
INSERT INTO table_name
(a,b,c)
VALUES
(1,2,3)
ON DUPLICATE KEY
UPDATE a = 1, b = 2, c = 3;
INSERT INTO table_name
...
You could do concatenate 100-500 insert statements and wrap them in a transaction.
Wrapping many statements into a transaction can help by the fact that it doesn't immediately commit the data to disk after each row inserted, it keeps the whole 100-500 batch in memory and when they are all finished it writes them all to disk - which means less intermittent disk-IO.
You need to find a good batch size, I exemplified 100-500 but depending on your server configurations, on the amount of data per statement and on the number of inserts vs. updates you'll have to fine tune it.
Read some information about Mysql Unique Index Constraints. This should help:
Mysql Index Tutorial
I had the same problem 4 months ago and I got more performance coding in java rather than php and avoiding xml documents.
My tip: you can read the whole table (if you do it once is faster than make many queries 1 by 1) and keep this table in memory (in a HashMap for example). And before inserting a record, you can check if it exists in your structure localy (you do not bother the DB).
You can improve your performance this way.
Related
Please, if somebody can give me support.
My problem is:
I have a table with 8 fields and about 510 000 records. In a web form, the user select an Excel file and it's read it with SimpleXLSX. The file has about 340 000 lines. With PHP and SimpleXLSX library this file is loaded in memory, then with a for cicle the script read line by line, taken one data of ecah line and search this value in the table, if the data exists in the table, then does not insert the value, other wise, the values read it are stored in the table.
This process takes days to finish.
Can somebody suggest me some operation to speed up the process?
Thanks a lot.
if you have many users, and they maybe use the web at the same time:
you must change SimpleXLSX to js-xlsx, in webbrowser do all work but only write database in server
if you have few users (i think you in this case)
and search this value in the table
this is cost the must time, if your single-to-single compare memory and database, then add/not-add to database.
so you can read all database info in memory, (must use hash-list for compare),then compare all
and add it to memory and mark newable
at last
add memory info to database
because you database and xls have most same count, so...database become almost valueless
just forget database, this is most fast in memory
in memory use hash-list for compare
of course, you can let above run in database if you can use #Barmar's idea.. don't insert single, but batch
Focus on speed on throwing the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Use the minimal XLS to get the XML into the database. Use some programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD DATA ... XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
I have a laravel application which must insert/update thousands of records per second in a for loop. my problem is that my Database insert/update rate is 100-150 writes per second . I have increased the amount of RAM dedicated to my database but got no luck.
is there any way to increase the write rate for mysql to thousands of records per second ?
please provide me optimum configurations for performance tuning
and PLEASE do not down mark the question . my code is correct . Its not a code problem because I have no problem with MONGODB . but I have to use mysql .
My Storage Engine is InnoDB
Inserting rows one at a time, and autocommitting each statement, has two overheads.
Each transaction has overhead, probably more than one insert. So inserting multiple rows in one transaction is the trick. This requires a code change, not a configuration change.
Each INSERT statement has overhead. One insert has about 90% over head and 10% actual insert.
The optimal is 100-1000 rows being inserted per transaction.
For rapid inserts:
Best is LOAD DATA -- if you are starting with a .csv file. If you must build the .csv file first, then it is debatable whether that overhead makes this approach lose.
Second best is multi-row INSERT statements: INSERT INTO t (a,b) VALUES (1,2), (2,3), (44,55), .... I recommend 1000 per statement, and COMMIT each statement. This is likely to get you past 1000 rows per second being inserted.
Another problem... Since each index is updated as the row is inserted, you may run into trouble with thrashing I/O to achieve this task. InnoDB automatically "delays" updates to non-unique secondary indexes (no need for INSERT DELAYED), but the work is eventually done. (So RAM size and innodb_buffer_pool_size come into play.)
If the "thousands" of rows/second is a one time task, then you can stop reading here. If you expect to do this continually 'forever', there are other issues to contend with. See High speed ingestion .
For insert, you might want to look into the INSERT DELAYED syntax. That will increase insert performance, but it won't help with update and the syntax will eventually be deprecated. This post offers an alternative for updates, but it involves custom replication.
One way my company's succeeded in speeding up inserts is by writing the SQL to a file, and then doing using a MySQL LOAD DATA INFILE command, but I believe we found that required the server's command line to have the mysql application installed.
I've also found that inserting and updating in a batch is often faster. So if you're calling INSERT 2k times, you might be better off running 10 inserts of 200 rows each. This would decrease the lock requirements and decrease information/number of calls sent over the wire.
I've a Cronjob script, written in PHP with following requirements:
Step 1 (DB server 1): Get some data from multiple tables (We have lot of data here)
Step 2 (Application server): Perform some calculation
Step 3 (DB Server 2): After calculation, insert that data in another database(MySQL)/table(InnoDB) for reporting purpose. This table contains 97 columns, actually different rates, which can not be normalized further. This is different physical DB server and have only one DB.
Script worked fine during development but on production, Step 1 returned approx 50 million records. Result, as obvious, script run for around 4 days and then failed. (Rough estimation, with current rate, it would have taken approx 171 days to finish)
Just for note, We were using prepared statements and Step 1 is getting data in bunch of 1000 records at a time.
What we did till now
Optimization Step 1: Multiple values in insert & drop all indexes
Some tests showed insert (Step 3 above) is taking maximum time (More then 95% time). To optimize, after some googling, we dropped all indexes from table, and instead of one insert query/row, we are not having one insert query/100 rows. This gave us a bit faster insert but still, as per rough estimate, it will take 90 days to run cron once, and we need to run it once every month as new data will be available every month.
Optimization step 2, instead of writing to DB, write to csv file and then import in mysql using linux command.
This step seems not working. Writing 30000 rows in CSV file took 16 minutes and we still need to import that CSV file in MySQL. We have single file handler for all write operations.
Current state
It seems I'm now clueless on what else can be done. Some key requirements:
Script need to insert approx 50,000,000 records (will increase with time)
There are 97 columns for each records, we can skip some but 85 columns at the minimum.
Based on input, we can break script into three different cron to run on three different server but insert had to be done on one DB server (master) so not sure if it will help.
However:
We are open to change database/storage engine (including NoSQL)
On production, we could have multiple database servers but insert had to be done on master only. All read operations can be directed to slave, which are minimal and occasional (Just to generate reports)
Question
I don't need any descriptive answer but can someone in short suggest what could be possible solution. I just need some optimization hint and I'll do remaining R&D.
We are open for everything, change database/storage engine, Server optimization/ multiple servers (Both DB and application), change programming language or whatever is best configuration for above requirements.
Final expectation, cron must finish in maximum 24 hours.
Edit in optimization step 2
To further understand why generating csv is taking time, I've created a replica of my code, with only necessary code. That code is present on git https://github.com/kapilsharma/xz
Output file of experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check above file, I'm inserting 500000 records and getting 5000 records form database at a time, making loop running 100 times. Time taken in first loop was 0.25982284545898 seconds but in 100th loop was 3.9140808582306. I assume its because of system resource and/or file size of csv file. In that case, it becomes more of programming question then DB optimization. Still, can someone suggest why it is taking more time in next loops?
If needed, whole code is committed except csv files and sql file generated to create dummy DB as these files are very big. However they can be easily generated with code.
Using OFFSET and LIMIT to walk through a table is O(N*N), that is much slower than you want or expected.
Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for such. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
I had a mailer cron job on CakePHP, which failed merely on 600 rows fetch and send email to the registered users. It couldn't even perform the job in batch operations. We finally opted for mandrill and since then it all went well.
I'd suggest (considering it a bad idea to touch the legacy system in production) :
Schedule a mirco solution in golang or node.js considering
performance benchmarks, as database interaction is involved -
you'll be fine with any of these. Have this micro solution perform
the cron job. (Fetch + Calculate)
Reporting from NoSQL will be
challenging, so you should try out using available services like
Google Big Query. Have the cron job store data to google big
query and you should get a huge performance improvement even in
generating reports.
or
With each row inserted into your original db server 1, set up a messaging mechanism which performs the operations of cron job everytime an insert is made (sort of trigger) and store it into your reporting server. Possible services you can use are : Google PubSub or Pusher. I think per insert time consumption will be pretty less. (You can also use a async service setup which does the task of storing into the reporting database).
Hope this helps.
I have to update a big table (products) in a MySQL database, every 10 minutes with PHP. I have to run the PHP script with cron job, and I get the most up to date products from a CSV file. The table has currently ~18000 rows, and unfortunately I can not tell how much it will change in a 10 min period. The most important thing is of course I do not want the users to notice the update in the background.
These are my ideas and fears:
Idea1: I know that there is a way to load a csv file into a table with MySQL, so maybe I can use a transaction to truncate the table, and import the CSV. But even if I use transactions, as long as the table is large, I'm afraid that there will be a little chance for some users to see the empty database.
Idea2: I could compare the old and the new csv file with a library and only update/add/remove the changed rows. This way I think there it's not possible for a user to see an empty database, but I'm afraid this method will cost a lot of RAM and CPU, and I'm on a shared hosting.
So basically I would like to know which method is the most secure to update a table completely without the users noticing it.
Assuming InnoDB and default isolation level, you can start a transaction, delete all rows, insert your new rows, then commit. Before the commit completes, users will see the previous state.
While the transaction is open (after the deletes), updates will block, but SELECTs will not. Since it's a read only table for the user, it won't be an issue. They'll still be able to SELECT while the transaction is open.
You can learn the details by reading about MVCC. The gist of it is that any time someone performs a SELECT, MySQL uses the data in the database plus the rollback segment to fetch the previous state until the transaction is committed or rolled back.
From MySQL docs:
InnoDB uses the information in the rollback segment to perform the
undo operations needed in a transaction rollback. It also uses the
information to build earlier versions of a row for a consistent read.
Only after the commit completes will the users see the new data instead of the old data, and they won't see the new data until their current transaction is over.
I've got an application which needs to run a daily script; the daily script consists in downloading a CSV file with 1,000,000 rows, and inserting those rows into a table.
I host my application in Dreamhost. I created a while loop that goes through all the CSV's rows and performs an INSERT query for each one. The thing is that I get a "500 Internal Server Error". Even if I chop it out in 1000 files with 1000 rows each, I can't insert more than 40 or 50 thousand rows in the same loop.
Is there any way that I could optimize the input? I'm also considering going with a dedicated server; what do you think?
Thanks!
Pedro
Most databases have an optimized bulk insertion process - MySQL's is the LOAD DATA FILE syntax.
To load a CSV file, use:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Insert multiple values, instead of doing
insert into table values(1,2);
do
insert into table values (1,2),(2,3),(4,5);
Up to an appropriate number of rows at a time.
Or do bulk import, which is the most efficient way of loading data, see
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
Normally I would say just use LOAD DATA INFILE, but it seems you can't with your shared hosting environment.
I haven't used MySQL in a few years, but they have a very good document which describes how to speed up insertions for bulk insertions:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
A few ideas that can be gleaned from this:
Disable/enable keys around the insertions:
ALTER TABLE tbl_name DISABLE KEYS;
ALTER TABLE tbl_name ENABLE KEYS;
Use many values in your insert statements.
I.e.: INSERT INTO table (col1, col2) VALUES (val1, val2),(.., ..), ...
If I recall correctly, you can have up to 4096 values per insertion statement.
Run a FLUSH TABLES command before you even start, to ensure that there are no pending disk writes that may hurt your insertion performance.
I think this will make things fast. I would suggest using LOCK TABLES, but I think disabling the keys makes that moot.
UPDATE
I realized after reading this that by disabling your keys you may remove consistency checks that are important for your file loading. You can fix this by:
Ensuring that your table has no data that "collides" with the new data being loaded (if you're starting from scratch, a TRUNCATE statement will be useful here).
Writing a script to clean your input data to ensure no duplicates locally. Checking for duplicates is probably costing you a lot of database time anyway.
If you do this, ENABLE KEYS should not fail.
You can create cronjob script which adds x records to the database at one request.
Cronjob script will check if last import have not addded all needed rows he takes another x rows.
So you can add as many you need rows.
If you have your dedicated server it's more easier. You just run loop with all insert queries.
Of course you can try to set time_limit to 0 (if it's working on dreamhost) or make it bigger.
Your PHP script is most likely being terminated because it exceeded the script time limit. Since you're on a shared host, you're pretty much out of luck.
If you do switch to a dedicated server and if you get shell access, the best way would be to use the mysql command-line tool to insert the data.
OMG Ponies suggestion is great, but I've also 'manually' formatted data into the same format that mysqldump uses, then loaded it that way. Very fast.
Have you tried doing transactions? Just send the command BEGIN to MySQL, do all your inserts then do COMMIT. This would speed it up significantly,but like casablanca said, your script is probably timing out as well.
I've ran into this problem myself before and nos pretty much got it right on the head, but you'll need to do a bit more to get it to perform the best.
I found that in my situation that I couldn't MySQL to accept one large INSERT statement, but found that if I split it up into groups of about 10k INSERTS at a time like how nos suggested then it'll do it's job pretty quickly. One thing to note is that when doing multiple INSERTs like this that you will most likely hit PHP's timeout limit, but this can be avoided by resetting the timout with set_time_limit($seconds), I found that doing this after each successful INSERT worked really well.
You have to be careful about doing this, because you could find yourself in a loop on accident with an unlimited timout and for that I would suggest testing to make sure that each INSERT was successful by either checking for errors reported by MySQL with mysql_errno() or mysql_error(). You could also catch errors by checking the number of rows affected by the INSERT with mysql_affected_rows(). You could then stop after the first error happens.
It would be better if you use sqlloader.
You would need two things first control file that specifies the actions which SQL Loader should do and second csv file that you want to be loaded
Here is the below link that would help you out.
http://www.oracle-dba-online.com/sql_loader.htm
Go to phpmyadmin and select the table you would like to insert into.
Under the "operations" tab, and then the ' table options' option /section , change the storage engine from InnoDB to MyISAM.
I once had a similar challenge.
Have a good time.