I have a rapidly growing, write-heavy PHP/MySQL application that inserts new rows at a rate of a dozen or so per second into an InnoDB table of several million rows.
I started out using realtime INSERT statements and then moved to PHP's file_put_contents to write entries to a file and LOAD DATA INFILE to get the data into the database. Which is the better approach?
Are there any alternatives I should consider? How can I expect the two methods to handle collisions and increased load in the future?
Thanks!
Think of LOAD DATA INFILE as a batch method of inserting data. It eliminates the overhead of firing up a separate INSERT query for every row and is therefore much faster. However, you lose some control when handling errors: it's much easier to handle an error on a single INSERT query than on one row in the middle of a file.
If you can afford for the data inserted by PHP not to be instantly available in the table, INSERT DELAYED might be an option.
MySQL accepts the data, puts it into a queue, and deals with the insertion later on, so your PHP application is not blocked while the rows wait to be written.
As it says in the manual:
Another major benefit of using INSERT DELAYED is that inserts from many clients are bundled together and written in one block. This is much faster than performing many separate inserts.
I have used this for logging data where data loss is not fatal, but if you want to be protected against server crashes while rows queued by INSERT DELAYED have not yet been inserted, you could look into replicating the changes to a dedicated slave machine.
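For illustration, a minimal sketch of the syntax (table and columns are hypothetical). Note that INSERT DELAYED only applies to MyISAM, MEMORY and ARCHIVE tables, and that it was deprecated in MySQL 5.6 and is ignored in later versions:

INSERT DELAYED INTO access_log (user_id, url, created_at)
VALUES (42, '/index.php', NOW());

The statement returns as soon as the row is queued by the server, not when it is actually written to the table.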
The way we deal with our inserts is to have them sent to a message queue system like ActiveMQ. From there, a separate application loads the inserts using LOAD DATA INFILE in batches of about 5,000. Error handling can still take place with the infile approach, and it processes the inserts much faster. If setting up a message queue is outside the scope of your application, there is no reason that file_put_contents would not be an acceptable option, especially if it's already implemented and working fine.
Additionally you may want to test disabling indexes during writes to see if that improves performance.
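If you go the file route, a minimal sketch of the pattern could look like this (the buffer path, table name and columns are made up for illustration; LOCAL INFILE must also be allowed on the server):

// Writer side: append one tab-delimited line per event ($userId and $amount are assumed variables).
$line = implode("\t", [$userId, $amount, date('Y-m-d H:i:s')]) . "\n";
file_put_contents('/var/spool/myapp/sales.buf', $line, FILE_APPEND | LOCK_EX);

// Loader side (run periodically): rotate the buffer so writers start a fresh file,
// then bulk-load the rotated file in one statement.
rename('/var/spool/myapp/sales.buf', '/var/spool/myapp/sales.loading');
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass',
               [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
$pdo->exec("LOAD DATA LOCAL INFILE '/var/spool/myapp/sales.loading'
            INTO TABLE sales_log
            FIELDS TERMINATED BY '\\t'
            LINES TERMINATED BY '\\n'");
unlink('/var/spool/myapp/sales.loading');

Appending with LOCK_EX keeps concurrent PHP requests from interleaving partial lines, and rotating the file before loading avoids losing rows written while the load runs.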
It doesn't sound like you should be using InnoDB. Regardless, a dozen inserts per second should not be problematic even on crappy hardware, unless, possibly, your data model is very complex. Even then, LOAD DATA INFILE is very good because, among other things, it rebuilds the indexes only once, as opposed to on every insert. So using files is a decent approach, but do make sure you open them in append-only mode.
In the long run (1k+ writes/s), look at other databases, particularly Cassandra for write-heavy applications.
If you do go the SQL INSERT route, wrap the PDO execute statements in a transaction. Doing so will greatly speed up the process.
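A minimal sketch of what that can look like (table and columns are made up; $rows stands for whatever data you have buffered):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO events (type, payload) VALUES (?, ?)');

$pdo->beginTransaction();
try {
    foreach ($rows as $row) {
        $stmt->execute([$row['type'], $row['payload']]);
    }
    $pdo->commit();   // one log flush for the whole batch instead of one per row
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}

The speed-up comes from committing once per batch, so InnoDB only has to flush its log to disk once instead of after every single insert.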
LOAD DATA is disabled on some servers for security reasons:
http://dev.mysql.com/doc/mysql-security-excerpt/5.0/en/load-data-local.html
Also I don't enjoy writing my applications upside down to maintain database integrity.
A CRM is hitting my server via webhooks at least 1,000 times and I cannot process all the requests at once.
So I am thinking about saving them (in MySQL or a CSV file) and then processing one record at a time.
Which method is faster if there are approximately 100,000 records and I have to process one record at a time?
Different methods are available to perform such an operation:
You can store the data in MySQL and write a PHP script that fetches requests from the database and processes them one by one. You can run that script automatically via crontab or a scheduler at a specific interval (a minimal sketch follows below).
You can implement custom queue functionality using PHP + MySQL.
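For the first option (the cron-driven script), a minimal sketch could look like this; the table name, columns and the handle_webhook() function are hypothetical:

// Run from crontab, e.g. every minute: * * * * * php /path/to/process_queue.php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$rows = $pdo->query('SELECT id, payload FROM webhook_queue
                     WHERE processed = 0 ORDER BY id LIMIT 500');
$done = $pdo->prepare('UPDATE webhook_queue SET processed = 1 WHERE id = ?');

foreach ($rows as $row) {
    handle_webhook($row['payload']);   // your own processing logic
    $done->execute([$row['id']]);
}

Marking each row as processed right after it is handled means an interrupted run can simply be resumed by the next cron invocation.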
It sounds like you need the following:
1) An incoming queue table where all the new rows get inserted without processing. An appropriately configured InnoDB table should be able to handle 1000 INSERTs/second unless you are running on a Raspberry Pi or something similarly underspecified. You should probably have this table partitioned so that, instead of deleting the records after processing, you can drop partitions (ALTER TABLE ... DROP PARTITION is much, much cheaper than a large DELETE operation).
2) A scheduled event that processes the data in the background, possibly in batches, and cleans up the original queue table.
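A hedged sketch of what such a queue table could look like, partitioned by day so that processed data can be dropped cheaply (all names and dates are illustrative):

CREATE TABLE incoming_queue (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_at DATETIME NOT NULL,
    payload    TEXT NOT NULL,
    PRIMARY KEY (id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
    PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Once a day's rows have been processed:
ALTER TABLE incoming_queue DROP PARTITION p20240101;

Note that the partitioning column has to be part of every unique key, which is why created_at is included in the primary key here.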
As you definitely know, CSV won't let you create indexes for fast searching. Setting indexes on your table columns speeds up searching to a great degree, and you cannot ignore this fact.
If you need all your data from a single table (for instance, app config), CSV is faster; otherwise it is not. Hence, for simple inserting and table-scan (non-index-based) searches, CSV is faster. Also, consider that updating or deleting from a CSV file is nontrivial. If you use CSV, you need to be really careful to handle multiple threads/processes correctly, otherwise you'll get bad data or corrupt your file.
MySQL offers a lot of capabilities such as SQL queries, transactions, data manipulation, and concurrent access, which CSV certainly does not. MySQL, as mentioned by Simone Rossaini, is also safe; you should not overlook this fact either.
SUMMARY
If you are going for simple inserting and table-scan (non-index-based) searches, CSV is definitely faster. Yet it has many shortcomings when you compare it with the countless capabilities of MySQL.
This seems like a pretty basic question but one I don't know the answer to.
I wrote a script in PHP that loops through some data and then performs an UPDATE to records in our database. There are roughly 150,000 records, so the script certainly takes a while to complete.
Could I potentially harm or interfere with the data insertion if I run a basic SELECT statement?
Say I want to ensure that the script is working properly, so I run a basic SELECT COUNT() to see if the count is increasing in real time as the script runs. Is this possible, or would it screw something up?
Thank you!
Generally a SELECT call is incapable of "causing harm" provided you're not talking about SQL injection problems.
The InnoDB engine, which you should be using, has what's called Multi-Version Concurrency Control, or MVCC for short. It means that until your UPDATE statement (or the transaction it is part of) finishes, the SELECT will be run against the last consistent database state.
If you're using MyISAM, which is a very bad idea in most production environments due to the limitations of that engine and the way the data is stored without a rollback journal, the SELECT call will probably block until the UPDATE is applied since it does not support MVCC.
So I'm trying to import some sales data into my MySQL database. The data is originally in the form of a raw CSV file, which my PHP application needs to first process, then save the processed sales data to the database.
Initially I was doing individual INSERT queries, which I realized was incredibly inefficient (~6000 queries taking almost 2 minutes). I then generated a single large query and INSERTed the data all at once. That gave us a 3400% increase in efficiency, and reduced the query time to just over 3 seconds.
But as I understand it, LOAD DATA INFILE is supposed to be even quicker than any sort of INSERT query. So now I'm thinking about writing the processed data to a text file and using LOAD DATA INFILE to import it into the database. Is this the optimal way to insert large amounts of data to a database? Or am I going about this entirely the wrong way?
I know a few thousand rows of mostly numeric data isn't a lot in the grand scheme of things, but I'm trying to make this intranet application as quick/responsive as possible. And I also want to make sure that this process scales up in case we decide to license the program to other companies.
UPDATE:
So I did go ahead and test LOAD DATA INFILE out as suggested, thinking it might give me only marginal speed increases (since I was now writing the same data to disk twice), but I was surprised when it cut the query time from over 3300ms down to ~240ms. The page still takes about ~1500ms to execute total, but it's still noticeably better than before.
From here I guess I'll check whether I have any superfluous indexes in the database, and, since all but two of my tables are InnoDB, I will look into tuning the InnoDB buffer pool to improve overall performance.
LOAD DATA INFILE is very fast, and is the right way to import text files into MySQL. It is one of the recommended methods for speeding up the insertion of data: up to 20 times faster, according to this:
https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
Assuming that writing the processed data back to a text file and bulk-loading it is faster overall than inserting it directly into the database, this is a good way to go.
LOAD DATA or multiple inserts are going to be much better than single inserts; LOAD DATA saves you a tiny little bit you probably don't care about that much.
In any case, do quite a lot but not too much in one transaction: 10,000 rows per transaction generally feels about right (NB: this is not relevant to non-transactional engines). If your transactions are too small, the server will spend all its time syncing the log to disk.
Most of the time doing a big insert is going to come from building indexes, which is an expensive and memory-intensive operation.
If you need performance:
Have as few indexes as possible.
Make sure the table and all its indexes fit in your InnoDB buffer pool (assuming InnoDB here; a quick size check is sketched below).
Just add more RAM until your table fits in memory, unless that becomes prohibitively expensive (64 GB is not too expensive nowadays).
If you must use MyISAM, there are a few dirty tricks there to make it better which I won't discuss further.
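A quick way to check the buffer-pool point above, assuming an InnoDB table (schema and table name are placeholders):

-- Approximate size of the table plus its indexes, in MB:
SELECT ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'app' AND table_name = 'sales_log';

-- Compare against the configured buffer pool (value is in bytes):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

If the first number is larger than the second, inserts that touch the indexes will keep hitting disk.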
Guys, I had the same question; my needs might have been a little more specific than general, but I have written a post about my findings here:
http://www.mediabandit.co.uk/blog/215_mysql-bulk-insert-vs-load-data
For my needs, LOAD DATA was fast, but the need to save to a flat file on the fly meant the average load times took longer than a bulk insert. Moreover, I wasn't required to do more than, say, 200 queries. Where before I was doing them one at a time, I'm now bulking them up, and the time savings are in the region of seconds.
Anyway, hopefully this will help you.
You should be fine with your approach. I'm not sure how much faster LOAD DATA INFILE is compared to bulk INSERT, but I've heard the same thing, that it's supposed to be faster.
Of course, you'll want to do some benchmarks to be sure, but I'd say it's worth writing some test code.
I've got an application which needs to run a daily script; the daily script consists of downloading a CSV file with 1,000,000 rows and inserting those rows into a table.
I host my application on DreamHost. I created a while loop that goes through all the CSV's rows and performs an INSERT query for each one. The thing is that I get a "500 Internal Server Error". Even if I split it into 1,000 files with 1,000 rows each, I can't insert more than 40 or 50 thousand rows in the same loop.
Is there any way that I could optimize the input? I'm also considering going with a dedicated server; what do you think?
Thanks!
Pedro
Most databases have an optimized bulk insertion process; MySQL's is the LOAD DATA INFILE syntax.
To load a CSV file, use:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Insert multiple rows per statement: instead of doing
insert into table values(1,2);
do
insert into table values (1,2),(2,3),(4,5);
Up to an appropriate number of rows at a time.
Or do a bulk import, which is the most efficient way of loading data; see
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
Normally I would say just use LOAD DATA INFILE, but it seems you can't with your shared hosting environment.
I haven't used MySQL in a few years, but they have a very good document that describes how to speed up bulk insertions:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
A few ideas that can be gleaned from this:
Disable/enable keys around the insertions:
ALTER TABLE tbl_name DISABLE KEYS;
ALTER TABLE tbl_name ENABLE KEYS;
Use many values in your insert statements.
I.e.: INSERT INTO table (col1, col2) VALUES (val1, val2),(.., ..), ...
If I recall correctly, you can have up to 4096 values per insertion statement.
Run a FLUSH TABLES command before you even start, to ensure that there are no pending disk writes that may hurt your insertion performance.
I think this will make things fast. I would suggest using LOCK TABLES, but I think disabling the keys makes that moot.
UPDATE
I realized after reading this that by disabling your keys you may remove consistency checks that are important for your file loading. You can fix this by:
Ensuring that your table has no data that "collides" with the new data being loaded (if you're starting from scratch, a TRUNCATE statement will be useful here).
Writing a script to clean your input data to ensure no duplicates locally. Checking for duplicates is probably costing you a lot of database time anyway.
If you do this, ENABLE KEYS should not fail.
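Putting the pieces together, a sketch of the full sequence (note that DISABLE KEYS only affects non-unique indexes, and only on MyISAM tables; table and file names follow the earlier LOAD DATA example):

ALTER TABLE tbl_name DISABLE KEYS;

LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

ALTER TABLE tbl_name ENABLE KEYS;   -- rebuilds the disabled indexes in one pass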
You can create a cron job script which adds x records to the database per run.
The script checks whether the last import added all the needed rows; if not, it takes another x rows. That way you can add as many rows as you need (a minimal sketch follows below).
If you have your own dedicated server it's easier: you just run a loop with all the insert queries.
Of course, you can try to set time_limit to 0 (if that works on DreamHost) or make it bigger.
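A minimal sketch of the "x rows per run" idea, tracking progress in a small offset file (paths, table and columns are made up; it assumes the CSV columns match the INSERT):

// Run every few minutes from cron; imports the next 1000 rows of the CSV.
$offsetFile = '/tmp/import.offset';
$offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO sales (sku, qty, price) VALUES (?, ?, ?)');

$fh = fopen('/home/user/data.csv', 'r');
for ($i = 0; $i < $offset; $i++) {
    fgetcsv($fh);                      // skip rows imported by earlier runs
}

$pdo->beginTransaction();
for ($i = 0; $i < 1000 && ($row = fgetcsv($fh)) !== false; $i++) {
    $stmt->execute($row);
}
$pdo->commit();

file_put_contents($offsetFile, $offset + $i);
fclose($fh);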
Your PHP script is most likely being terminated because it exceeded the script time limit. Since you're on a shared host, you're pretty much out of luck.
If you do switch to a dedicated server and if you get shell access, the best way would be to use the mysql command-line tool to insert the data.
OMG Ponies' suggestion is great, but I've also 'manually' formatted data into the same format that mysqldump uses, then loaded it that way. Very fast.
Have you tried using transactions? Just send the command BEGIN to MySQL, do all your inserts, then do COMMIT. This would speed it up significantly, but like casablanca said, your script is probably timing out as well.
I've run into this problem myself before, and nos pretty much got it right, but you'll need to do a bit more to get the best performance.
I found that in my situation I couldn't get MySQL to accept one large INSERT statement, but if I split it up into groups of about 10k INSERTs at a time, as nos suggested, it does its job pretty quickly. One thing to note is that when doing multiple INSERTs like this you will most likely hit PHP's timeout limit, but this can be avoided by resetting the timeout with set_time_limit($seconds); I found that doing this after each successful INSERT worked really well.
You have to be careful about doing this, because you could accidentally end up in a loop with an unlimited timeout. To guard against that, I would suggest testing that each INSERT was successful by checking for errors reported by MySQL with mysql_errno() or mysql_error(). You could also catch errors by checking the number of rows affected by the INSERT with mysql_affected_rows(), and then stop after the first error happens.
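A rough sketch of that batching pattern using mysqli (the procedural equivalents of the mysql_* functions mentioned above); the table, columns and $rows are illustrative:

// $rows is an array of [col1, col2] pairs to insert.
$db = mysqli_connect('localhost', 'user', 'pass', 'app');

foreach (array_chunk($rows, 10000) as $chunk) {
    $values = [];
    foreach ($chunk as $r) {
        $values[] = "('" . mysqli_real_escape_string($db, $r[0]) . "','"
                         . mysqli_real_escape_string($db, $r[1]) . "')";
    }
    $sql = 'INSERT INTO import_table (col1, col2) VALUES ' . implode(',', $values);

    if (!mysqli_query($db, $sql)) {
        die('Import failed: ' . mysqli_error($db));   // stop at the first error
    }
    set_time_limit(30);   // reset PHP's timeout after each successful batch
}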
It would be better if you use SQL*Loader.
You would need two things: first, a control file that specifies the actions SQL*Loader should perform, and second, the CSV file that you want to load.
Here is a link that will help you out:
http://www.oracle-dba-online.com/sql_loader.htm
Go to phpMyAdmin and select the table you would like to insert into.
Under the "Operations" tab, in the "Table options" section, change the storage engine from InnoDB to MyISAM.
I once had a similar challenge.
Have a good time.
I have a particular PHP page that, for various reasons, needs to save ~200 fields to a database. These are 200 separate insert and/or update statements. Now the obvious thing to do is reduce this number but, like I said, for reasons I won't bother going into I can't do this.
I wasn't expecting this problem. Selects seem reasonably performant in MySQL but inserts/updates aren't (it takes about 15-20 seconds to do this update, which is naturally unacceptable). I've written Java/Oracle systems that can happily do thousands of inserts/updates in the same time (in both cases running local databases; MySQL 5 vs OracleXE).
Now in something like Java or .Net I could quite easily do one of the following:
Write the data to an in-memory write-behind cache (i.e. it would know how to persist to the database and could do so asynchronously);
Write the data to an in-memory cache and use the PaaS (Persistence as a Service) model, i.e. a listener to the cache would persist the fields; or
Simply start a background process that could persist the data.
The minimal solution is to have a cache that I can simply update, which will separately go and update the database in its own time (i.e. it'll return immediately after updating the in-memory cache). This can either be a global cache or a session cache (although a global shared cache does appeal in other ways).
Any other solutions to this kind of problem?
mysql_query('INSERT INTO tableName VALUES(...),(...),(...),(...)')
The query statement given above is better, but there is another way to improve the performance of the insert further.
Follow these steps:
1. Create a CSV (comma-separated) file or a simple text file and write out all the data that you want to insert, using a file-writing mechanism (like the FileOutputStream class in Java).
2. Use this command:
LOAD DATA INFILE 'data.txt' INTO TABLE table2
FIELDS TERMINATED BY '\t';
3. If you are not clear about this command, see the MySQL LOAD DATA documentation: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
You should be able to do 200 inserts relatively quickly, but it will depend on lots of factors. If you are using a transactional engine and doing each one in its own transaction, don't - that creates way too much I/O.
If you are using a non-transactional engine, it's a bit trickier. Using a single multi-row insert is likely to be better as the flushing policy of MySQL means that it won't need to flush its changes after each row.
You really want to be able to reproduce this on your production-spec development box and analyse exactly why it's happening. It should not be difficult to stop.
Of course, another possibility is that your inserts are slow because of extreme sized tables or large numbers of indexes - in which case you should scale your database server appropriately. Inserting lots of rows into a table whose indexes don't fit into RAM (or doesn't have RAM correctly configured to be used for caching those indexes) generally gets pretty smelly.
But don't look for a way of complicating your application when there is a way of easily tuning it instead, keeping the current algorithm.
One more solution you could use (instead of tuning MySQL :) ) is a JMS server with a STOMP connection driver for PHP, to write data to the database server asynchronously. ActiveMQ has built-in support for the STOMP protocol, and there is the StompConnect project, which is a STOMP proxy for any JMS-compliant server (OpenMQ, JBossMQ, etc.).
You can update your local cache (hopefully memcached) and then push the write requests through beanstalkd.
I would suspect a problem with your SQL inserts - it really shouldn't take that long. Would prepared queries help? Does your MySQL server need some more memory dedicated to the keyspace? I think some more questions need to be asked.
How are you doing the inserts? Are you doing one insert per record:
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
or are you using a single query
mysql_query('INSERT INTO tableName VALUES(...),(...),(...),(...)');
The latter of the two options is substantially faster; from experience, the first option takes much longer because PHP must wait for each query to finish before moving on to the next.
Look at the statistics for your database while you do the inserts. I'm guessing that one of your updates locks the table, therefore all your statements are queued up and you experience this delay. Another thing to look into is your index creation/updating, because the more indexes you have on a table, the slower all UPDATE and INSERT statements get.
Another thing is that I think you are using MyISAM (table engine), which locks the entire table on UPDATE. I suggest you use InnoDB instead. InnoDB is slower on SELECT queries, but faster on INSERT and UPDATE because it only locks the row it's working on and not the entire table.
consider this:
mysql_query('start transaction');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('INSERT INTO tableName VALUES(...)');
mysql_query('COMMIT');
Note that if your table is INSERT-ONLY (no deletes, and no updates on variable-length columns), then inserts will not lock or block reads when using MyISAM.
This may or may not improve insert performance, but it could help if you are having concurrent insert/read issues.
I'm using this, and only purging old records daily, followed by 'optimize table'.
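The daily maintenance mentioned above could be as simple as this (table name and retention period are placeholders):

DELETE FROM access_log WHERE created_at < NOW() - INTERVAL 30 DAY;
OPTIMIZE TABLE access_log;   -- defragments, so the data file is hole-free again

A hole-free data file is what allows MyISAM's concurrent inserts to keep working after the purge.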
You can use cURL with PHP to do asynchronous database manipulations.
One possible solution is to fork each query into a separate thread, but PHP does not support threads. We can use the PCNTL functions, but they are a bit tricky to use. I prefer another solution for creating a fork and performing asynchronous operations.
Refer to this:
http://gonzalo123.wordpress.com/2010/10/11/speed-up-php-scripts-with-asynchronous-database-queries/