Here is the plan. I have a large CSV file extracted from a DB with 10,000 entries. Each entry looks like:
firstname
lastname
tel
fax
mobile
address
jav-2012-selltotal
fev-2012-selltotal
etc., etc.
So I have read about loading that CSV data into a MySQL database and querying it to find out who sold the most in Feb 2012, what John's sales total is, or whatever else I need to ask.
But for optimization purposes, caching and indexing the queries is a must, which leads me to this question. Since I know the 2-3 queries I will run against the DB ALL THE TIME, is it faster to take the CSV file, run the query in PHP, and write a result file to disk, so every later call is just read-the-file, load it, display it?
To word the question another way: is querying the DB faster or slower than reading a file from disk? If the DB has 10,000 records and Paul's sales come to 100 lines, the result file will ONLY contain 100 lines, so it will be small, while the query will always take about the same time.
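To make the idea concrete, here is roughly what I mean by precomputing a result file; just a sketch, with made-up table, column, and file names:

```php
<?php
// Rough sketch of the "precompute once, serve a flat file" idea.
// Table, column, and file names are invented; assumes a writable cache/ dir.
$pdo = new PDO('mysql:host=localhost;dbname=sales', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT firstname, lastname, feb_2012_selltotal
       FROM sellers
      WHERE firstname = ?'
);
$stmt->execute(['paul']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Write the small result set to disk once, after each import/refresh.
file_put_contents('cache/paul_feb_2012.json', json_encode($rows));

// Every later page view then only does:
// echo file_get_contents('cache/paul_feb_2012.json');
```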
Please help; I don't want to code it myself just to discover something that is evident to you.
Thanks in advance.
If you stick to database normalization rules and keep everything in the database, you are fine. 10k records is not much, and you should not have to worry about performance.
Database queries are faster because the data gets (at least partially) cached in memory, whereas a plain file has to be read from disk unless the OS happens to have it fully cached in RAM.
A handful of plain text files might look faster at first glance, but once you have 100k files versus 100k rows in the DB, the database is far better: you don't have unlimited (parallel) inode access, and you end up slowing down and wearing out your hard drive/SSD. The more files you have, the slower everything gets.
You'd also have to hand-code a locking scheme for read/write access, which MySQL already has built in (row and table locking).
Consider that in a few months you may want to extend everything: how would you implement JOINs across text files? MySQL already has all the aggregation functionality built in (GROUP BY, ORDER BY, ...).
MySQL also has profiling tools (prefix a statement with EXPLAIN to see its query plan) and can handle much bigger datasets well.
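For the kinds of questions in the original post ("who sold the most in Feb 2012", "what is John's total"), a single aggregate query does the work. A rough sketch, assuming the CSV has been imported into a normalized table; the table and column names here are invented:

```php
<?php
// Sketch: "who sold the most in a given month" with GROUP BY / ORDER BY.
// Assumes a normalized schema with one row per seller per month (invented names).
$pdo = new PDO('mysql:host=localhost;dbname=sales', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT seller_id, SUM(amount) AS total
       FROM monthly_sales
      WHERE month = ?
   GROUP BY seller_id
   ORDER BY total DESC
      LIMIT 10'
);
$stmt->execute(['2012-02']);

foreach ($stmt as $row) {
    printf("%s sold %.2f\n", $row['seller_id'], $row['total']);
}

// Prefix the same SELECT with EXPLAIN to check whether an index
// on (month, seller_id) is actually being used.
```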
When I went to school I told my teacher: 'Plain files are much faster than your MySQL.' I built a site with a directory for each user and stored each attribute in its own text file inside that folder, like /menardmam/username.txt, /menardmam/password.txt, /DanFromgermany/username.txt, .... I benchmarked it, and yes, the text files were faster, but only because there were just 1000 of them. When it comes to real business, with a billion records combined and cross-joined, there is no way to do that with text files, and when applying for a job it is much better to present work you did with MySQL than something you built on text files.
I'm currently working on a project that is like an e-commerce site. There are hundreds of thousands of records in the database tables, and I have to use JOINs on them to get data, because the project has a query builder for selecting data by criteria. Fetching the data takes too much time, so I'm using LIMIT to show some number of records (e.g. 10) per page. I have now come across the concept of memcached, and I thought of using it for my project so the expensive work only has to be done once. But I still have some doubts.
Question 1: Will having too many cache files be a problem? There will be one cache file for each page of each module, so the count will reach roughly 10,000 cache files.
Question 2: Let's assume the number of files is not a problem. But what about updating the files with replace() when a row is added to or deleted from the middle of a table? Here, the tables are updated roughly every week.
So I'm in a dilemma: should I go for memcached or not? If anyone can advise and answer with an explanation, it will be appreciated.
If your website executes many of the same MySQL queries that frequently return the same data, then yes, there is probably some benefit to running memcached.
Problem:
"There are hundreds of thousands of records...It takes too much time to fetch the data".
This probably indicates a problem with your schema. Properly indexed, even when using JOINs, the queries should be able to execute quickly (< 0.1 seconds). Run an EXPLAIN query on the queries that are taking a long time to run and see if they can be improved.
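One minimal way to do that from PHP might look like this (the query, table, and column names are placeholders for your own slow query):

```php
<?php
// Run EXPLAIN on a slow query to see which indexes (if any) are being used.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$sql = 'SELECT o.id, o.total, c.name
          FROM orders o
          JOIN customers c ON c.id = o.customer_id
         WHERE o.created_at >= ?';

$stmt = $pdo->prepare('EXPLAIN ' . $sql);
$stmt->execute(['2015-01-01']);
print_r($stmt->fetchAll(PDO::FETCH_ASSOC));

// Look at the "type", "key" and "rows" columns: type "ALL" with a large
// row estimate usually means a missing index.
```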
Answer to Question 1
There won't be an issue with too many cache files. Memcached stores all cached information in memory (hence the name), so no disk files are used. Cached objects are stored in RAM and accessed directly from RAM.
Answer to Question 2
Not exactly sure what you are asking here, but if your application updates or deletes information in the database, then it is critical that the cache items affected by those updates and deletes are removed. If the application doesn't remove the affected cached items, then the next time the data is queried, cached results that are no longer valid may be returned. Make sure any cached data either has an appropriate expiration time set, or that the application removes it from the cache when the data in the database changes.
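As a rough sketch of that pattern with PHP's Memcached extension (the key name, query, and 5-minute expiration are placeholders, not a prescription):

```php
<?php
// Cache a query result with an expiration, and drop it when the data changes.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$key  = 'products:page:10';                // one key per cached page/result set
$rows = $mc->get($key);

if ($rows === false) {                     // cache miss: hit the database
    $stmt = $pdo->query('SELECT id, name, price FROM products LIMIT 10');
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mc->set($key, $rows, 300);            // expire after 5 minutes as a safety net
}

// Whenever a product row is inserted, updated or deleted, remove every
// cache key the change affects so stale results are never served:
$mc->delete('products:page:10');
```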
Hope that helps.
I would start not with memcached but with figuring out what the bottleneck is. Your tables have roughly one million rows. I don't know the size of a row, but my educated guess is that it is less than 1 KB, based on the fact that a browser window can show the information from one record.
So there is probably about 1 GB of information in your database; correct me if I'm wrong. If that's true, then the whole database should be automatically cached in RAM by MySQL.
Once your database is entirely in RAM, then with properly organized indexes the cost of a query should be linear in the size of the result set, which is only a few kilobytes since it fits in a browser window.
So my advice is to determine the size of the database and to check the output of the "top" command to see how much memory MySQL is consuming. Once you are sure the database sits entirely in memory, run EXPLAIN against your most popular queries and add indexes to the database according to the results. Even if your database is bigger than the available RAM, I still recommend looking at the EXPLAIN output, because it really helps a lot.
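If it helps, this is one quick way to estimate per-table size from PHP (a sketch against information_schema; adjust the connection details to your setup):

```php
<?php
// Rough size of each table (data + indexes) in MB, from information_schema.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$stmt = $pdo->query(
    'SELECT TABLE_NAME AS tbl,
            ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 1) AS size_mb
       FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = DATABASE()
   ORDER BY size_mb DESC'
);
foreach ($stmt as $row) {
    echo $row['tbl'], ': ', $row['size_mb'], " MB\n";
}
// Compare the total with innodb_buffer_pool_size and with what "top" reports.
```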
I have a cron job script, written in PHP, with the following requirements:
Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here)
Step 2 (Application server): Perform some calculations
Step 3 (DB server 2): After the calculations, insert that data into another database (MySQL) / table (InnoDB) for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. This is a different physical DB server with only one DB.
The script worked fine during development, but in production Step 1 returned approximately 50 million records. As a result, the script ran for around 4 days and then failed. (Rough estimate: at the current rate, it would have taken about 171 days to finish.)
Just to note, we were using prepared statements, and Step 1 fetches data in batches of 1,000 records at a time.
What we have tried so far
Optimization Step 1: Multiple values in insert & drop all indexes
Some tests showed that the insert (Step 3 above) takes most of the time (more than 95%). To optimize, after some googling, we dropped all indexes from the table and, instead of one INSERT query per row, we now issue one INSERT query per 100 rows. This gave us somewhat faster inserts, but by a rough estimate it will still take 90 days to run the cron job once, and we need to run it once a month because new data becomes available every month.
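For reference, a multi-row insert of that kind looks roughly like this (a sketch only; the table, columns, and sample values are placeholders, and the real table has 97 columns):

```php
<?php
// One INSERT with many VALUES groups instead of one INSERT per row.
$pdo = new PDO('mysql:host=db2;dbname=reports', 'user', 'pass');

$batch = [
    ['2016-01', 0.12, 0.34],
    ['2016-01', 0.56, 0.78],
];  // in the real job this holds ~100 calculated rows per query

$rowPlaceholders = '(' . implode(',', array_fill(0, count($batch[0]), '?')) . ')';
$sql = 'INSERT INTO report_rates (period, rate_a, rate_b) VALUES '
     . implode(',', array_fill(0, count($batch), $rowPlaceholders));

$stmt = $pdo->prepare($sql);
$stmt->execute(array_merge(...$batch));   // flatten [[...], [...]] into one list
```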
Optimization Step 2: instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to be working. Writing 30,000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We have a single file handle for all write operations.
Current state
It seems I'm now out of ideas about what else can be done. Some key requirements:
The script needs to insert approx. 50,000,000 records (this will increase over time)
There are 97 columns for each record; we can skip some, but 85 columns at a minimum.
Based on input, we can break the script into three different cron jobs running on three different servers, but the inserts have to be done on one DB server (the master), so I'm not sure that will help.
However:
We are open to changing the database/storage engine (including NoSQL)
In production, we could have multiple database servers, but inserts have to be done on the master only. All read operations can be directed to slaves; these are minimal and occasional (just to generate reports).
Question
I don't need a long descriptive answer, but can someone briefly suggest a possible solution? I just need an optimization hint and I'll do the remaining R&D.
We are open to everything: changing the database/storage engine, server optimization / multiple servers (both DB and application), changing the programming language, or whatever configuration best fits the above requirements.
The final expectation: the cron job must finish in at most 24 hours.
Edit regarding Optimization Step 2
To better understand why generating the CSV takes so long, I've created a replica of my code with only the necessary parts. That code is on git: https://github.com/kapilsharma/xz
The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the above file: I'm inserting 500,000 records and fetching 5,000 records from the database at a time, so the loop runs 100 times. The first loop took 0.25982284545898 seconds, but the 100th loop took 3.9140808582306. I assume it's because of system resources and/or the size of the CSV file. In that case, it becomes more of a programming question than a DB optimization one. Still, can someone suggest why later loops take more time?
If needed, the whole code is committed except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.
Using OFFSET and LIMIT to walk through a table is O(N*N), which is much slower than you want or expect.
Instead, walk through the table by "remembering where you left off". It is best to use the PRIMARY KEY for this. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses this (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
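A sketch of that "remember where you left off" loop in PHP, assuming id is the AUTO_INCREMENT primary key (connection details and the per-chunk work are placeholders):

```php
<?php
// Walk the table in chunks by primary key instead of OFFSET/LIMIT.
$pdo = new PDO('mysql:host=db1;dbname=source', 'user', 'pass');

$lastId = 0;
$chunk  = 5000;

$stmt = $pdo->prepare(
    'SELECT * FROM big_table WHERE id > ? ORDER BY id LIMIT ' . $chunk
);

while (true) {
    $stmt->execute([$lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break;                          // no more rows
    }
    // ... calculate and write this chunk (placeholder for the real work) ...
    $lastId = end($rows)['id'];         // remember where we left off
}
```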
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
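A hedged sketch of that delta idea, assuming the source rows carry (or can be given) an indexed updated_at timestamp; if you cannot touch the source schema, a separate audit table filled by triggers serves the same role:

```php
<?php
// Only pull rows changed since the last successful run (delta processing).
// Assumes big_table has an indexed updated_at column; names are placeholders.
$pdo = new PDO('mysql:host=db1;dbname=source', 'user', 'pass');

$lastRun = file_exists('last_run.txt')
    ? trim(file_get_contents('last_run.txt'))
    : '1970-01-01 00:00:00';

$stmt = $pdo->prepare(
    'SELECT * FROM big_table WHERE updated_at > ? ORDER BY updated_at'
);
$stmt->execute([$lastRun]);

// ... calculate and insert only these changed rows into the reporting DB ...

file_put_contents('last_run.txt', date('Y-m-d H:i:s'));
```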
I had a mailer cron job on CakePHP which failed at merely fetching 600 rows and sending email to the registered users; it couldn't even do the job in batch operations. We finally opted for Mandrill, and since then it has all gone well.
I'd suggest (considering it a bad idea to touch the legacy system in production):
Schedule a micro solution in Go or Node.js; given their performance benchmarks, and since database interaction is involved, you'll be fine with either. Have this micro solution perform the cron job (fetch + calculate).
Reporting from NoSQL will be challenging, so you should try out available services like Google BigQuery. Have the cron job store the data in BigQuery and you should get a huge performance improvement, even in generating reports.
or
With each row inserted into your original DB server 1, set up a messaging mechanism that performs the cron job's operations every time an insert is made (a sort of trigger) and stores the result on your reporting server. Possible services you can use are Google Pub/Sub or Pusher. I think the time consumed per insert will be quite small. (You can also use an async service setup that handles storing into the reporting database.)
Hope this helps.
So I'm trying to import some sales data into my MySQL database. The data is originally in the form of a raw CSV file, which my PHP application needs to first process, then save the processed sales data to the database.
Initially I was doing individual INSERT queries, which I realized was incredibly inefficient (~6000 queries taking almost 2 minutes). I then generated a single large query and INSERTed the data all at once. That gave us a 3400% increase in efficiency, and reduced the query time to just over 3 seconds.
But as I understand it, LOAD DATA INFILE is supposed to be even quicker than any sort of INSERT query. So now I'm thinking about writing the processed data to a text file and using LOAD DATA INFILE to import it into the database. Is this the optimal way to insert large amounts of data to a database? Or am I going about this entirely the wrong way?
I know a few thousand rows of mostly numeric data isn't a lot in the grand scheme of things, but I'm trying to make this intranet application as quick/responsive as possible. And I also want to make sure that this process scales up in case we decide to license the program to other companies.
UPDATE:
So I did go ahead and test LOAD DATA INFILE out as suggested, thinking it might give me only marginal speed increases (since I was now writing the same data to disk twice), but I was surprised when it cut the query time from over 3300ms down to ~240ms. The page still takes about ~1500ms to execute total, but it's still noticeably better than before.
From here I guess I'll check whether I have any superfluous indexes in the database and, since all but two of my tables are InnoDB, look into tuning the InnoDB buffer pool to improve overall performance.
LOAD DATA INFILE is very fast and is the right way to import text files into MySQL. It is one of the recommended methods for speeding up data insertion, up to 20 times faster according to this:
https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
Assuming that writing the processed data back to a text file is faster than inserting it into the database, this is a good way to go.
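For reference, the write-then-load round trip looks roughly like this (a sketch only: the file path, table, and column layout are placeholders, and LOCAL requires the client option shown):

```php
<?php
// Write the processed rows to a CSV, then bulk-load the file in one statement.
$processedRows = [
    ['WIDGET-1', 3, 29.97],
    ['WIDGET-2', 1, 9.99],
];  // placeholder for the data your application has already processed

$path = '/tmp/processed_sales.csv';
$fh = fopen($path, 'w');
foreach ($processedRows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);

$pdo = new PDO('mysql:host=localhost;dbname=sales', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,   // needed for LOAD DATA LOCAL
]);

$sql = <<<'SQL'
LOAD DATA LOCAL INFILE '/tmp/processed_sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(sku, qty, total)
SQL;
$pdo->exec($sql);
```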
LOAD DATA or multi-row inserts are going to be much better than single-row inserts; beyond that, LOAD DATA only saves you a tiny bit more, which you probably don't care about that much.
In any case, do quite a lot, but not too much, in one transaction; 10,000 rows per transaction generally feels about right (NB: this is not relevant to non-transactional engines). If your transactions are too small, the server will spend all its time syncing the log to disk.
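A sketch of committing in chunks of roughly 10,000 rows (the table, columns, and sample data are illustrative only):

```php
<?php
// Commit every N rows so each transaction does a meaningful amount of work
// without growing unbounded.
$pdo = new PDO('mysql:host=localhost;dbname=sales', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO sales (sku, qty, total) VALUES (?, ?, ?)');

$rowsToInsert = [
    ['WIDGET-1', 3, 29.97],
    ['WIDGET-2', 1, 9.99],
    // ... many more rows ...
];

$batchSize = 10000;
$i = 0;

$pdo->beginTransaction();
foreach ($rowsToInsert as $row) {
    $stmt->execute($row);
    if (++$i % $batchSize === 0) {
        $pdo->commit();                // flush one chunk of the log to disk
        $pdo->beginTransaction();
    }
}
$pdo->commit();                        // commit the final partial chunk
```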
Most of the time spent on a big insert goes into building indexes, which is an expensive and memory-intensive operation.
If you need performance,
Have as few indexes as possible
Make sure the table and all its indexes fit in your InnoDB buffer pool (assuming InnoDB here)
Just add more RAM until your table fits in memory, unless that becomes prohibitively expensive (64 GB is not too expensive nowadays)
If you must use MyISAM, there are a few dirty tricks there to make it better which I won't discuss further.
I had the same question; my needs may have been a little more specific than the general case, but I have written a post about my findings here:
http://www.mediabandit.co.uk/blog/215_mysql-bulk-insert-vs-load-data
For my needs, LOAD DATA was fast, but the need to save to a flat file on the fly meant the average load time was longer than a bulk insert. Moreover, I wasn't required to do more than, say, 200 queries; where before I was doing them one at a time, I'm now bulking them up, and the time savings are in the region of seconds.
Anyway, hopefully this will help you.
You should be fine with your approach. I'm not sure how much faster LOAD DATA INFILE is compared to bulk INSERT, but I've heard the same thing, that it's supposed to be faster.
Of course, you'll want to do some benchmarks to be sure, but I'd say it's worth writing some test code.
I've got a gaming-oriented website with 200+ users. The site has a large database tracking user plays, and one of the motivations for continued participation is the extensive statistics and rankings (S&R) with which the site provides the user.
As the list of S&Rs tracked has grown, some of the more intricate calculations have been moved into tables within the database, rather than being generated on the fly, in order to improve page-loading speed.
However, I plan to move from extensive to exhaustive S&Rs by the end of the year, increasing the overall number of datapoints available to the user by a factor of 10. I've already decided to stop doing on-the-fly queries and to move all the calculations to a cron job, but I'm unsure where to store the data.
Given a user base <1000, would it make more sense to place this data within the database or read/write a text file for each user's stats?
These are the main pros and cons in my mind:
Storing S&Rs in the Database
+ cross-user comparisons are easy and fast
+ faster cron jobs because there's no need to write to many, many files
- database table count will jump from ~50 to 200+ (at least)
- one point of failure (database corruption) for all site data
- modifying S&R structure requires modifying database as well
Storing S&Rs in Text Files
+ neatly organized and distributes data corruption risk
+ database is easier to navigate
+ redesigning the S&R structure is done by simply modifying the script and overwriting all text files, rather than adjusting database tables
- cron job will have to read/update XXX files each time
- cross-user comparisons are difficult and time-consuming
But I've never done something of this magnitude before, so I'm not really sure whether (for example) a 200+ table MySQL database is even really a problem.
I'd appreciate any suggestions you can provide! :-)
Any popular database software should be able to handle millions of entries, having 200+ tables is not an issue on that end.
Corruption is unlikely, but on a site of that nature you should be doing backups fairly frequently, and preferably storing a copy off the server. Using individual files distributes the risk and decreases the likelihood of a general failure, but there's still a small chance of problems occurring.
Database software excels at performing tasks on its data. Using flat files would probably force you to write your own methods to process them, which could easily prove to be a major task, with the extra cost of losing speed compared to using a database (I'm just assuming this; I might be very wrong).
Let's assume the same environment: PHP 5 working with MySQL 5 and CSV files, with MySQL on the same host as the scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
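If you do go the CSV route, at minimum wrap writes in an exclusive lock. A minimal sketch (the file name and fields are placeholders):

```php
<?php
// Append one record to a CSV under an exclusive lock so concurrent
// PHP processes don't interleave partial lines.
function appendCsvRow(string $file, array $row): bool
{
    $fh = fopen($file, 'a');
    if ($fh === false) {
        return false;
    }
    $ok = false;
    if (flock($fh, LOCK_EX)) {          // block until we own the file
        $ok = fputcsv($fh, $row) !== false;
        fflush($fh);
        flock($fh, LOCK_UN);
    }
    fclose($fh);
    return $ok;
}

appendCsvRow('users.csv', ['john', 'doe', '555-0100']);
```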
However, the database has other advantages too. Care to work out how you would do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed above. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as #MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
It depends on the use. For example, for configuration or language files, CSV might do better.
Anyway, if you're using PHP 5, you have a third option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
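Getting started with it only takes a few lines; a minimal sketch via PDO (assuming the pdo_sqlite extension is enabled; the file and table names are made up):

```php
<?php
// SQLite through PDO: a single file on disk, but with SQL, indexes and transactions.
$db = new PDO('sqlite:' . __DIR__ . '/app_data.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS settings (name TEXT PRIMARY KEY, value TEXT)');

$stmt = $db->prepare('INSERT OR REPLACE INTO settings (name, value) VALUES (?, ?)');
$stmt->execute(['theme', 'dark']);

foreach ($db->query('SELECT name, value FROM settings') as $row) {
    echo $row['name'], ' = ', $row['value'], "\n";
}
```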
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV, you have to read the entire file, find the entry in memory that needs to change, and then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest option.
Just make sure the CSV file doesn't contain primary keys that would override your existing data.