I have a simple importer that goes through each line of a rather big CSV file and imports it into the database.
My question is: should I call a separate method to insert each object (generating a DO and telling its mapper to insert), or should I hardcode the insert process in the import method, duplicating the code?
I know the elegant thing to do is to call the separate method, but I keep hearing in my head that function calls are expensive.
What do you think?
Many RDBMS brands support a special command to do bulk imports. For example:
MySQL: LOAD DATA INFILE
PostgreSQL: COPY
Microsoft SQL Server: BULK INSERT
Oracle: SQL*Loader
Using these commands is preferred over inserting one row at a time from a CSV data source because the bulk-loading command usually runs at least an order of magnitude faster.
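For instance, calling MySQL's loader from PHP might look roughly like this - a sketch only: the connection details, table, file path and column list are invented, and LOCAL loading has to be enabled on both the client and the server:

<?php
// Sketch: connection details, table and columns are placeholders.
$pdo = new PDO(
    'mysql:host=localhost;dbname=imports',
    'user',
    'secret',
    array(PDO::MYSQL_ATTR_LOCAL_INFILE => true) // needed for LOCAL loads
);

$pdo->exec("
    LOAD DATA LOCAL INFILE '/tmp/sales.csv'
    INTO TABLE sales
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (sale_date, product_id, quantity, price)
");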
I don't think the function-call overhead matters much. Consider a bulk insert instead. At the very least, make sure you're using a transaction, and consider disabling indexes before inserting.
It shouldn't matter, as the insertion will probably take orders of magnitude longer than the PHP code.
As others have stated, a bulk insert will give you a much bigger benefit.
Line-level micro-optimizations like that only blind you to the good higher-level optimizations.
If you are unsure, just time both approaches; it shouldn't take more than a couple of minutes to find out.
Consider combining both approaches and doing batch inserts if inserting everything at once hits some memory/time/.... limits.
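A rough sketch of such batching in PHP (the connection details, file path, table, columns and batch size are all placeholders, not anything from the question):

<?php
// Rough sketch of batched inserts; connection details, file path, table,
// columns and batch size are all placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'secret');
$batchSize = 500;
$batch = array();

$fh = fopen('/tmp/big.csv', 'r');
$pdo->beginTransaction();
while (($row = fgetcsv($fh)) !== false) {
    $batch[] = $row;
    if (count($batch) === $batchSize) {
        insert_batch($pdo, $batch);
        $batch = array();
    }
}
if ($batch) {
    insert_batch($pdo, $batch); // flush the remaining rows
}
$pdo->commit();
fclose($fh);

// Builds one multi-row INSERT for the whole batch.
function insert_batch(PDO $pdo, array $rows)
{
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO items (sku, qty, price) VALUES $placeholders");
    $params = array();
    foreach ($rows as $r) {
        $params = array_merge($params, array_slice($r, 0, 3));
    }
    $stmt->execute($params);
}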
Related
I have a rapidly growing, write-heavy PHP/MySQL application that inserts new rows at a rate of a dozen or so per second into an InnoDB table of several million rows.
I started out using realtime INSERT statements and then moved to PHP's file_put_contents to write entries to a file and LOAD DATA INFILE to get the data into the database. Which is the better approach?
Are there any alternatives I should consider? How can I expect the two methods to handle collisions and increased load in the future?
Thanks!
Think of LOAD DATA INFILE as a batch method of inserting data. It eliminates the overhead of firing up an insert query for every row and is therefore much faster. However, you lose some control over error handling: it's much easier to handle an error on a single insert query than on one row in the middle of a file.
If you can afford for the data inserted by PHP not to be instantly available in the table, INSERT DELAYED might be an option.
MySQL will accept the data and handle the insertion later, putting it into a queue, so it won't block your PHP application while the rows wait to be written.
As it says in the manual:
Another major benefit of using INSERT DELAYED is that inserts from many clients are bundled together and written in one block. This is much faster than performing many separate inserts.
I have used this for logging data where losing some data is not fatal. If you want to be protected against server crashes while rows queued by INSERT DELAYED have not yet been inserted, you could look into replicating the changes to a dedicated slave machine.
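For illustration only - the log table, columns and variables below are made up, and note that INSERT DELAYED only applies to MyISAM, MEMORY and ARCHIVE tables and is deprecated in newer MySQL versions - a delayed insert is just the normal statement with the DELAYED keyword:

// Sketch: $pdo is an assumed existing PDO connection; table and columns are placeholders.
$stmt = $pdo->prepare(
    'INSERT DELAYED INTO access_log (user_id, url, hit_at) VALUES (?, ?, NOW())'
);
$stmt->execute(array($userId, $url));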
The way we deal with our inserts is to have them sent to a message queue system like ActiveMQ. From there, a separate application loads the inserts using LOAD DATA INFILE in batches of about 5000. Error handling can still take place with the infile; however, it processes the inserts much faster. If setting up a message queue is outside the scope of your application, there is no reason file_put_contents would not be an acceptable option - especially if it's already implemented and working fine.
Additionally you may want to test disabling indexes during writes to see if that improves performance.
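If you stay with the file_put_contents route, a minimal sketch (the file path and fields are invented) is to append with an exclusive lock and let a separate job run LOAD DATA over the accumulated file:

// Sketch: append one CSV line per event; LOCK_EX keeps concurrent
// requests from interleaving their writes.
$line = implode(',', array($userId, $score, date('Y-m-d H:i:s'))) . "\n";
file_put_contents('/var/spool/myapp/pending.csv', $line, FILE_APPEND | LOCK_EX);

// A cron job or worker then periodically runs something like:
// LOAD DATA INFILE '/var/spool/myapp/pending.csv' INTO TABLE scores ...
// and truncates or rotates the file afterwards.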
It doesn't sound like you should be using InnoDB. Regardless, a dozen inserts per second should not be problematic even on crappy hardware - unless, possibly, your data model is very complex. Even then, LOAD DATA INFILE is very good because, among other things, it rebuilds the indexes only once, as opposed to on every insert. So using files is a decent approach, but do make sure you open them in append-only mode.
In the long run (1k+ writes/s), look at other databases - particularly Cassandra for write-heavy applications.
If you do go the SQL insert route, wrap the PDO execute statements in a transaction. Doing so will greatly speed up the process.
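A minimal sketch of that transaction wrapping (assuming $pdo is an existing PDO connection; the table, columns and $rows are placeholders):

$pdo->beginTransaction();
try {
    $stmt = $pdo->prepare('INSERT INTO hits (user_id, page, hit_at) VALUES (?, ?, NOW())');
    foreach ($rows as $row) {
        $stmt->execute($row);       // one prepared statement, many executes
    }
    $pdo->commit();                 // everything is flushed to disk once
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}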
LOAD DATA LOCAL is disabled on some servers for security reasons:
http://dev.mysql.com/doc/mysql-security-excerpt/5.0/en/load-data-local.html
Also I don't enjoy writing my applications upside down to maintain database integrity.
So I'm trying to import some sales data into my MySQL database. The data is originally in the form of a raw CSV file, which my PHP application needs to first process, then save the processed sales data to the database.
Initially I was doing individual INSERT queries, which I realized was incredibly inefficient (~6000 queries taking almost 2 minutes). I then generated a single large query and INSERTed the data all at once. That gave us a 3400% increase in efficiency, and reduced the query time to just over 3 seconds.
But as I understand it, LOAD DATA INFILE is supposed to be even quicker than any sort of INSERT query. So now I'm thinking about writing the processed data to a text file and using LOAD DATA INFILE to import it into the database. Is this the optimal way to insert large amounts of data to a database? Or am I going about this entirely the wrong way?
I know a few thousand rows of mostly numeric data isn't a lot in the grand scheme of things, but I'm trying to make this intranet application as quick/responsive as possible. And I also want to make sure that this process scales up in case we decide to license the program to other companies.
UPDATE:
So I did go ahead and test LOAD DATA INFILE out as suggested, thinking it might give me only marginal speed increases (since I was now writing the same data to disk twice), but I was surprised when it cut the query time from over 3300ms down to ~240ms. The page still takes about ~1500ms to execute total, but it's still noticeably better than before.
From here I guess I'll check to see if I have any superfluous indexes in the database, and, since all but two of my tables are InnoDB, I'll look into tuning the InnoDB buffer pool to improve overall performance.
LOAD DATA INFILE is very fast, and is the right way to import text files into MySQL. It is one of the recommended methods for speeding up the insertion of data - up to 20 times faster, according to this:
https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
Assuming that writing the processed data back to a text file is faster than inserting it into the database row by row, this is a good way to go.
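A sketch of that write-then-load flow, assuming $pdo is an existing PDO connection with LOCAL loading enabled; the temp file prefix, table and processing step are placeholders:

// Sketch: $processedRows is assumed to hold the rows after PHP processing.
$tmp = tempnam(sys_get_temp_dir(), 'sales_');
$out = fopen($tmp, 'w');
foreach ($processedRows as $row) {
    fputcsv($out, $row);
}
fclose($out);

$pdo->exec("
    LOAD DATA LOCAL INFILE " . $pdo->quote($tmp) . "
    INTO TABLE sales
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
");
unlink($tmp);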
LOAD DATA or multi-row inserts are going to be much better than single-row inserts; beyond that, LOAD DATA only saves you a tiny bit more that you probably don't care about that much.
In any case, do quite a lot but not too much in one transaction - 10,000 rows per transaction generally feels about right (NB: this is not relevant to non-transactional engines). If your transactions are too small, the server will spend all its time syncing the log to disk.
Most of the time in a big insert goes into building indexes, which is an expensive and memory-intensive operation.
If you need performance:
Have as few indexes as possible
Make sure the table and all its indexes fit in your InnoDB buffer pool (assuming InnoDB here)
Just add more RAM until your table fits in memory, unless that becomes prohibitively expensive (64 GB is not too expensive nowadays)
If you must use MyISAM, there are a few dirty tricks there to make it better which I won't discuss further.
Guys, I had the same question. My needs might have been a little more specific than general, but I have written a post about my findings here:
http://www.mediabandit.co.uk/blog/215_mysql-bulk-insert-vs-load-data
For my needs, LOAD DATA was fast, but having to save to a flat file on the fly meant the average load time was longer than with a bulk insert. Moreover, I wasn't required to do more than, say, 200 queries. Where before I was doing them one at a time, I'm now bulking them up, and the time savings are in the region of seconds.
Anyway, hopefully this will help you?
You should be fine with your approach. I'm not sure how much faster LOAD DATA INFILE is compared to bulk INSERT, but I've heard the same thing, that it's supposed to be faster.
Of course, you'll want to do some benchmarks to be sure, but I'd say it's worth writing some test code.
Let's assume the same environment: PHP 5 working with MySQL 5 and CSV files, with MySQL on the same host as the scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV file?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
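If you do append to a CSV from PHP, a minimal sketch of doing it safely (the file name and fields are placeholders) is to take an exclusive lock around the write:

$fp = fopen('/var/data/records.csv', 'a');   // append mode
if (flock($fp, LOCK_EX)) {                   // block concurrent writers
    fputcsv($fp, array($id, $name, $amount));
    fflush($fp);
    flock($fp, LOCK_UN);
}
fclose($fp);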
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as @MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP 5, you have a third option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files with the robustness of an RDBMS.
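A quick sketch of the SQLite option through PDO - the file path and schema here are made up purely for illustration:

$db = new PDO('sqlite:/var/data/app.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS settings (name TEXT PRIMARY KEY, value TEXT)');
$stmt = $db->prepare('INSERT OR REPLACE INTO settings (name, value) VALUES (?, ?)');
$stmt->execute(array('theme', 'dark'));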
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV, you have to read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest option.
Just make sure the CSV file doesn't contain primary key values that will overwrite your existing data.
I have a PHP script that calls an API method that can easily return 6k+ results.
I use PEAR DB_DataObject to write each row in a foreach loop to the DB.
The above script is batch-processing 20 users at a time, and although some will only have a few results from the API, others will have more. The worst case is that all have thousands of results.
The loop to call the API seems to be OK - batches of 20 every 5 minutes work fine. My only concern is thousands of MySQL INSERTs for each user (with a long pause between users for fresh API calls).
Is there a good way to do this? Or am I doing it a good way?!
Well, the fastest way to do it would be to do one insert statement with lots of values, like this:
INSERT INTO mytable (col1, col2) VALUES (?,?), (?,?), (?,?), ...
But that would probably require ditching the DB_DataObject method you are using now. You'll just have to weigh the performance benefits of doing it that way vs. the "ease of use" benefits of using DB_DataObject.
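If you did drop down from DB_DataObject for this one hot path, building that statement with PDO might look roughly like this ($pdo, the table, the columns and $apiRows are all assumptions standing in for whatever the API returned):

// Sketch: one placeholder group per row, then a flat parameter list.
$placeholders = implode(',', array_fill(0, count($apiRows), '(?, ?)'));
$stmt = $pdo->prepare("INSERT INTO results (user_id, value) VALUES $placeholders");

$params = array();
foreach ($apiRows as $row) {
    $params[] = $row['user_id'];
    $params[] = $row['value'];
}
$stmt->execute($params);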
Like Kalium said, check where the bottleneck is.
If it is really the database, you could try the bulk import feature some DBMS offer.
In DB2, for example, it is called LOAD.
It works without SQL, but reads directly from a named pipe.
It is especially designed to be fast when you need to bring a large number of new rows into the database.
It can be configured to skip checks and index building, making it even faster.
Well, is your method producing more load than you can handle? If it's working, then I don't see any reason to change it offhand.
Database abstraction layers usually add a pretty decent amount of overhead. I've found that, in PHP at least, it's much easier to use a plain mysql_query for the sake of speed than it is to optimize your library of choice.
Like Eric P and weinzierl.name have said, using a multi-row insert or LOAD will give you the best direct performance.
I have a few ideas, but you will have to verify them with testing.
If the table you are inserting to has indexes, try to make sure they are optimized for inserts.
Check out optimization options here:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Consider mysqli directly, or Pear::MDB2 or PDO. I understand that Pear::DB is fairly slow, though I don't use PEAR myself, so can't verify.
MySQL's LOAD DATA INFILE feature is probably the fastest way to do what you want.
You can take a look at the chapter Speed of INSERT Statements in the MySQL documentation.
It covers a lot of ways to improve INSERT performance in MySQL.
I don't think a few thousand records should put any strain on your database; even my laptop should handle it nicely. Your biggest concern might be (or become) gigantic tables if you don't do any cleanup or partitioning. Avoid premature optimization on that part.
As for your method, make sure you do each user (or batch) in a separate transaction. If you're on MySQL, make sure you're using InnoDB to avoid unnecessary locking. If you're already using InnoDB, Postgres or another database that supports transactions, you might see a significant performance increase.
Consider using COPY (at least on Postgres - unsure about MySQL).
Make sure your table is properly indexed (including removing unused ones). Indexes hurt insert speed.
Remember to optimize/vacuum regularly.
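On the MySQL side, one thing worth testing is temporarily relaxing index and constraint work around the bulk load. This is a hedged sketch (the table name is a placeholder, $pdo is an assumed connection, and it is only safe if you trust the incoming data):

$pdo->exec('SET unique_checks = 0');
$pdo->exec('SET foreign_key_checks = 0');
$pdo->exec('ALTER TABLE big_table DISABLE KEYS');  // non-unique MyISAM indexes only

// ... run the bulk insert / LOAD DATA here ...

$pdo->exec('ALTER TABLE big_table ENABLE KEYS');
$pdo->exec('SET foreign_key_checks = 1');
$pdo->exec('SET unique_checks = 1');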
Is it possible to do a simple COUNT(*) query in a PHP script while another PHP script is doing an INSERT...SELECT... query?
The situation is that I need to create a table with ~1M or more rows from another table, and while inserting I do not want the user to feel the page is freezing, so I am trying to keep updating a count. But by using a SELECT COUNT(*) FROM table while the background insert is running, I only get 0 until the insert is completed.
So is there any way to ask MySQL to return a partial result first? Or is there a fast way to do a series of inserts with data fetched from a previous SELECT query while keeping about the same performance as an INSERT...SELECT... query?
The environment is PHP 4.3 and MySQL 4.1.
Without reducing performance? Not likely. With a little performance loss, maybe...
But why are you regularly creating tables and inserting millions of rows? If you do this only very seldom, can't you just warn the admin (presumably the only one allowed to do such a thing) that it takes a long time? If you're doing this all the time, are you really sure you're not doing it wrong?
I agree with Stein's comment that this is a red flag if you're copying 1 million rows at a time during a PHP request.
I believe that in a majority of cases where people are trying to micro-optimize SQL, they could get much greater performance and throughput by approaching the problem in a different way. SQL shouldn't be your bottleneck.
If you're doing a single INSERT...SELECT, then no, you won't be able to get intermediate results. In fact this would be a Bad Thing, as users should never see a database in an intermediate state showing only a partial result of a statement or transaction. For more information, read up on ACID compliance.
That said, the MyISAM engine may play fast and loose with this. I'm pretty sure I've seen MyISAM commit some but not all of the rows from an INSERT...SELECT when I've aborted it part of the way through. You haven't said which engine your table is using, though.
The other users can't see the insertion until it's committed. That's normally a good thing, since it makes sure they can't see half-done data. However, if you want them to see intermediate data, you could throw in an occasional call to "commit" while you're inserting.
By the way - don't let anybody tell you to turn autocommit on. That's a HUGE time waster. I have a "delete and re-insert" job on my database that takes 1/3rd as long when I turn off autocommit.
Just to be clear, MySQL 4 isn't configured by default to use transactions. It uses the MyISAM table type which locks the entire table for each insert, if I remember correctly.
Your best bet would be to use one of the MySQL bulk insertion functions, such as LOAD DATA INFILE, as these are dramatically faster at inserting large amounts of data. As for the counting, well, you could break the inserts into N groups of 1000 (or Y) then divide your progress meter into N sections and just update it on each group's request.
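A sketch of that chunked-progress approach, using the old mysql_* API to match the PHP 4 environment; the source/target tables and the progress table are invented for illustration:

$chunk = 1000;
$res   = mysql_query('SELECT COUNT(*) FROM source_table');
$total = (int) mysql_result($res, 0);

for ($offset = 0; $offset < $total; $offset += $chunk) {
    mysql_query("INSERT INTO target_table (a, b)
                 SELECT a, b FROM source_table
                 ORDER BY id LIMIT $offset, $chunk");

    // Another request can read this row to draw the progress bar.
    $done = min($offset + $chunk, $total);
    mysql_query("REPLACE INTO import_progress (job, done, total)
                 VALUES ('copy', $done, $total)");
}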
Edit: Another thing to consider is, if this is static data for a template, then you could use a "select into" to create a new table with the same data. Not sure what your application is, or the intended functionality, but that could work as well.
If you can get to the console, you can ask various status questions that will give you the information you are looking for. There's a command that goes something like "SHOW processlist".