Let's assume the same environment: PHP5 working with MySQL5 and CSV files, with MySQL on the same host as the scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV file?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
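For the settings case just mentioned, loading an entire small CSV into an array is trivial. A minimal sketch (the file name, function name and two-column key/value layout are all assumptions for illustration):

```php
<?php
// Sketch: load an entire small CSV of key/value settings into an array.
// The two-column layout (name, value) is an assumption.
function loadSettings(string $path): array
{
    $settings = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $settings;
    }
    while (($row = fgetcsv($handle)) !== false) {
        if (count($row) >= 2) {
            $settings[$row[0]] = $row[1];
        }
    }
    fclose($handle);
    return $settings;
}
```

One read, no connection overhead, no query parsing: that is why the whole-table case favours CSV.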
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
MySQL has other advantages too, though. Care to work out how you would do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
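A sketch of the append case, with the locking that multi-process access makes necessary (the function name is invented):

```php
<?php
// Sketch: append one record to a CSV under an exclusive lock, so that
// concurrent writers can't interleave partial lines.
function appendCsvRow(string $path, array $row): bool
{
    $handle = fopen($path, 'a');
    if ($handle === false || !flock($handle, LOCK_EX)) {
        return false;
    }
    fputcsv($handle, $row); // appending is cheap: seek to EOF, one write
    fflush($handle);
    flock($handle, LOCK_UN);
    fclose($handle);
    return true;
}
```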
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as @MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have a third option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
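A minimal sketch of that third option via PDO (the table and column names are invented; an in-memory database stands in for a file-backed one, which you would get by putting a file path in the DSN):

```php
<?php
// Sketch: SQLite through PDO — file-like portability, real SQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)');
$stmt = $db->prepare('INSERT INTO settings (name, value) VALUES (?, ?)');
$stmt->execute(['site_name', 'Example']);

$value = $db->query("SELECT value FROM settings WHERE name = 'site_name'")
            ->fetchColumn();
```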
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV you will have to first read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
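A sketch of that read-everything-then-rewrite update, assuming column 0 holds the record's id (the function name is invented):

```php
<?php
// Sketch: updating one record in a CSV means reading every row and
// rewriting the whole file. Assumes column 0 is the record's id.
function updateCsvRecord(string $path, string $id, array $newRow): void
{
    $rows = array_map('str_getcsv',
        file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    $out = fopen($path, 'w'); // truncate, then write everything back
    foreach ($rows as $row) {
        fputcsv($out, $row[0] === $id ? $newRow : $row);
    }
    fclose($out);
}
```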
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest.
Just make sure the CSV file doesn't have a primary key that will overwrite your existing data.
Related
I just have a question about which way gives me more performance and would be easier to get done. We have over 120,000 data rows stored in a database. This data is currently exported as a CSV file to an FTP location.
Now from this CSV file a web form should be created to filter the data sets. What would you recommend regarding performance and work to do? Should I parse the CSV file and get the information out to the web page, or should I re-import the CSV file into a DB (MySQL) and use SQL queries to filter the data? (Note: the original DB and export are on a different server than the web page/web form.)
A direct connection to the DB on the original server is not possible.
I prefer re-uploading it to a DB, because it makes the development easier: I simply need to create the SQL query against the filter criteria entered in the web form and run it.
Any ideas?
Thanks...
WorldSignia
The database is undoubtedly the best answer. Since you are looking to use a web form to analyze the results and perform complex queries, the other alternative may prove VERY expensive in terms of server processing time, and much more difficult to implement. After all, on the one hand you have SQL that handles all filtering details for you, and on the other you would have to implement something yourself.
I would advise, performance-wise, that you create indices for all fields that you know you will be using as criteria, and that you display results partially, say 50 per page, to minimize load times.
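A rough sketch of both suggestions, with SQLite standing in for MySQL and all table/column names invented:

```php
<?php
// Sketch: index the filter columns, then page through results 50 at a
// time rather than loading everything at once.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE datasets (id INTEGER PRIMARY KEY, city TEXT, price INTEGER)');
$db->exec('CREATE INDEX idx_datasets_city ON datasets (city)');

$ins = $db->prepare('INSERT INTO datasets (city, price) VALUES (?, ?)');
for ($i = 0; $i < 200; $i++) {
    $ins->execute([$i % 2 ? 'Berlin' : 'Hamburg', $i]);
}

// Page 2 of the filtered results, 50 rows per page.
$stmt = $db->prepare(
    'SELECT id, city, price FROM datasets
     WHERE city = ? ORDER BY id LIMIT 50 OFFSET 50');
$stmt->execute(['Berlin']);
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);
```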
This data is currently exported as a CSV file to an FTP location.
There are so many things wrong in that one sentence.
Should I parse the csv file and get the information out to the webpage
Definitely not.
While it is technically possible, and will probably be faster given the number of rows if you use the right tools, this is a high-risk approach which gives a lot less clarity of code. And while it may meet your immediate requirement, it is rather inflexible.
Since the only sensible option is to transfer the data to another database, perhaps you should think about how you can do this:
without using FTP
without using CSV
What happens to the data after it has been filtered?
I think the DB with indexes may be a better solution in case you need to filter the data. Actually this is the idea of a DB: to optimize your work with data. But you could profile your work and measure the performance. Then you just choose.
Hmm, good question.
I would think the analysis with a DB is faster. You can set indexes and optimize the analysis.
But it could take some time to load the CSV into the database.
Analysing the CSV without a DB could also take some time. You would have to create a concrete algorithm and this may be a lot of work :)
So I think you have to try both and take the best performance... evaluate them ;-)
I am currently developing ecommerce software using PHP/MySQL for a big company. There are two options for me to get some specific data:
DB (for getting huge data, such as PRODUCTS, CATEGORIES, ORDERS, etc.)
TXT (using YAML -for getting analytical data and some options)
For instance, when a user goes to the product details page I need to get these TXT files:
Product summary file (product_hit, quantity_sold, etc.) -approximately max. 90KB
Language and settings file (such as company_name, translations for the template) -approximately max. 300KB
Maybe one more file (I don't know right now) -assume 100KB.
I want to use this approach because the data is easily readable by humans and portable between programming languages. In addition, if I used a DB, I would need to join a couple of tables. But these files keep it all together.
My txt file looks like (YAML):
product_id: 1281
quantity_sold: 12 #item(s)
hit: 1105
hit_avarage: 92 #quantity_sold/hit
vote: 2
...
But I am still not sure about speed and performance. Is using TXT files a good idea? Should I really use this approach instead of a DB?
As you can't partially include and parse a YAML file, you'll have to parse the file as a whole, which means you'll take an incredible performance hit. You can compare this to selecting all rows from a database and then looping over them to find the one that you're looking for, instead of just typing a WHERE condition. So yes, a database is much faster at accomplishing what you ask.
Please do take a look at Document Based Databases though, you don't necessarily have to use a relational database. In fact, when looking at the example of the YAML file, I think using a "no SQL" database would be a better alternative.
Cheers.
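To illustrate the scan-versus-index difference without any YAML dependency, plain PHP arrays can stand in for the parsed file and for a database index (all data and names here are invented):

```php
<?php
// Sketch: a parsed YAML file is just a list you must scan, while a
// database (or any keyed index) jumps straight to the row.
$records = [];
for ($id = 1; $id <= 5000; $id++) {
    $records[] = ['product_id' => $id, 'hit' => $id * 3];
}

// "Flat file" way: linear scan over everything that was parsed.
function scanFor(array $records, int $id): ?array
{
    foreach ($records as $row) {
        if ($row['product_id'] === $id) {
            return $row;
        }
    }
    return null;
}

// "Indexed" way: build the key once, then every lookup is O(1) —
// roughly what a WHERE clause on an indexed column gives you.
$byId = array_column($records, null, 'product_id');
$row  = $byId[1281] ?? null;
```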
I love YAML and think it's great for smaller amounts of data, but the dimensions you mention are better dealt with using a database. It's faster, and data can be indexed - in a file based scenario, you would have to walk through the whole file to find something.
Use the YAML approach. The data structure suggests that they are tantamount to fixed data / configuration settings. And if you cannot reasonably do the calculations within the database, then don't attempt to.
You could however convert your fixed data from YAML to CSV, and import them within the database into a temporary table. If and only if calculating everything there is feasible.
Cannot say anything about performance. Technically reading file data is as slow as having the database read disk sectors, and the difference between YAML parsing and column splitting might not be significant. You'll have to test that.
YAML is 'human-readable data serialization format'.
Serialization is a process of converting in-memory structures into format that can be written, possibly transmitted and read into the in-memory structures.
Database management systems are programs that help control data management from creation through processing, including
security
scalability
concurrency
data integrity (atomicity, consistency, isolation and durability)
performance
availability
YAML does not provide tools or an integrated environment that take care of the above. If you want to use it as a principal data store, you will either need to isolate all of the above challenges away from the scenario that uses YAML as the principal data management system, or reinvent those wheels to some extent, sooner or later.
I would imagine that no "e-commerce system for a big company" would want to sacrifice any of the above listed features for human readability.
Hi,
I have a doubt: I have seen that reading MySQL data is slower with large tables... I have done lots of optimization but can't get through.
What I am thinking is: would it give better speed if I stored the data in files?
Of course each record would be a separate file, so millions of records = millions of files. I agree it will consume disk space... but what about the reading process? Is it faster?
I am using PHP to read the files...
Reading one file = fast.
Reading many / big files = slow.
Reading singular small entries from database = waste of I/O.
Combining many entries within the database = faster than file accesses.
As long as your tables are properly indexed and as long as your queries actually use those indices, using a relational DB (like MySQL) is going to be much faster, more robust, flexible (insert many buzzwords here), etc.
To examine why your queries' performance does not match your expectations, you can use the EXPLAIN clause with your SELECTs (http://dev.mysql.com/doc/refman/5.1/en/explain.html).
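A runnable sketch of the idea, using SQLite's EXPLAIN QUERY PLAN as a stand-in so the example is self-contained (MySQL's EXPLAIN prints different columns, but the question is the same: is an index being used?). Table and index names are invented:

```php
<?php
// Sketch: ask the engine how it will execute a query before blaming it.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)');
$db->exec('CREATE INDEX idx_orders_customer ON orders (customer_id)');

$plan = $db->query(
    'EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42'
)->fetchAll(PDO::FETCH_ASSOC);
// Each plan row's "detail" column should mention idx_orders_customer;
// a full table scan instead is the sign of a missing or unused index.
```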
To answer the topic, yes.
By which I mean that there are so many (unmentioned) factors that it's impossible to unequivocally state that one will be faster than the other every time.
It depends on what kind of data you're storing. Structured data is usually much faster and more flexible/powerful to read using SQL, since that's exactly what it's made for. If you want to search, filter, sort or group by a certain attribute, the index structures and optimizations of a DBMS are appropriate.
However, when using a DB for storing large files (BLOBs), which contain unstructured data in the sense that you are not going to search, filter, sort or group by any part of the files, those files just blow up the database size and make it slow. There is an interesting study by Microsoft on this topic (I still have to find the link). This study is the reason why Microsoft introduced external BLOB storage in their SQL Server, which basically means what you asked: the BLOBs are saved in files outside the database, because they measured that access is much faster that way.
When storing files (e.g., pictures, videos, documents...) you often have some metadata on the file which you want to be able to use with a structured query language like SQL, while the actual files don't necessarily need to be saved in the database.
Reading from a DBMS (MySQL is one) is faster in most cases, because it has a built-in cache that will keep the data in memory, so the next time you try to read the same data, you will not have to wait on the incredibly slow hard drive.
A DBMS is essentially your hard drive + a cache to speed things up (+ some data-sorting algorithms). Remember, your database is stored on your hard drive :)
It depends on a lot of factors, not least of which is what kind of file system you're using. MySQL uses files for storage anyway, so read speed isn't the issue -- the biggest factor will be how fast MySQL can find your data, compared to how fast it can be looked up in your filesystem.
Generally, though, MySQL is quite good about finding data quickly -- after all, that's its purpose in life. So unless you have a really good reason why the FS should be much faster, stick with the DB and check your indexes and such.
By choosing a custom file storage system you will lose the benefits of using a relational database. Also, your code might not be easily maintainable.
Nonetheless, there are many who believe that relational databases offer too much complexity at the cost of speed. Have a look at the NoSQL entry in wikipedia and read about possible alternatives.
What's better? I want to share a script that stores some data: 4 ints (between 0 and 2000) and a string (length up to 200).
Should I store them in files or in a MySQL database?
I normally use databases, but in this case files are also not that bad (to handle).
The problem is that on some days there are over 100,000 inserts.
That is several million in a few days.
Could MySQL handle such huge amounts of data in under 1 second?
Or is it better to create a separate file for each day?
PS: I want to have a big user base who could use it, so are files probably better?
You need a database for this type of thing. Databases handle concurrency much better than files. MySQL can handle 100k inserts a day, no problem. You will probably want to aggregate the data and move it to another table for reporting. Since indexes slow down inserts, your table will need to be carefully designed and cleaned up on a regular basis.
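A sketch of the aggregate-and-move idea, with SQLite standing in for MySQL and all table and column names invented:

```php
<?php
// Sketch: roll raw rows up into an aggregate table for reporting, so
// the hot insert table stays small and lightly indexed.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT, value INTEGER)');
$db->exec('CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total INTEGER, cnt INTEGER)');

$ins = $db->prepare('INSERT INTO events (day, value) VALUES (?, ?)');
foreach ([['2024-01-01', 5], ['2024-01-01', 7], ['2024-01-02', 3]] as $e) {
    $ins->execute($e);
}

// Periodic roll-up: aggregate into the reporting table, then clear
// the raw table so inserts stay fast.
$db->exec('INSERT INTO daily_totals
           SELECT day, SUM(value), COUNT(*) FROM events GROUP BY day');
$db->exec('DELETE FROM events');
```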
Have you considered SQLite as a good in-between?
It has all the database functionality you would probably need, but it has all the portability of flat files, and it's easy to create a new one and archive the old ones.
Judging by the sound of your project it might be a perfect fit.
I agree with Byron - you need a database. Unless you only have one simultaneous user, a DB is generally better than lots of bugs ;-)
Without a better understanding of the use case it's hard to propose the correct solution, but maybe you could generate a new MySQL table for each day or week? As long as you don't ever need to query the data as a whole, that'll work. And you can easily zip up the directory and push it somewhere else for archiving purposes.
I am starting a new project. In my project I will need to use local provinces and local city names. I do not want to have many MySQL tables unless I have to, or unless CSV is fast. For the province-city case I am not sure which one to use.
I have job announcements related to cities and provinces. In the CSV case I would keep the name of the city in the announcements table, so when I search I send the selected city name to the DB in the query.
Can anyone give me a better idea of how to do this? CSV or MySQL? Why?
Thanks in advance.
Database Pros
Relating cities to provinces and job announcements will mean less redundant data, and consistently formatted data
The ability to search/report data is much simpler, being [relatively] standardized by the use of SQL
More scalable, accommodating GBs of data if necessary
Infrastructure is already in place, well documented in online resources
Flat File (CSV) Pros
I'm trying, but I can't think of any. Reading from a CSV means loading the contents into memory, whether the contents will be used or not. As astander mentioned, changes while the application is in use would be a nightmare. Then there's the infrastructure needed to pull data out, search, etc.
Conclusion
Use a database, be it MySQL or the free versions of Oracle or SQL Server. Basing things off a csv is coding yourself into a corner, with no long term benefits.
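A sketch of the relational layout described under "Database Pros", with SQLite standing in for MySQL and all names and data invented:

```php
<?php
// Sketch: provinces, cities and announcements reference each other by
// id instead of repeating city names as free text.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE provinces (id INTEGER PRIMARY KEY, name TEXT)');
$db->exec('CREATE TABLE cities (id INTEGER PRIMARY KEY,
           province_id INTEGER REFERENCES provinces(id), name TEXT)');
$db->exec('CREATE TABLE announcements (id INTEGER PRIMARY KEY,
           city_id INTEGER REFERENCES cities(id), title TEXT)');

$db->exec("INSERT INTO provinces VALUES (1, 'Ontario')");
$db->exec("INSERT INTO cities VALUES (1, 1, 'Toronto')");
$db->exec("INSERT INTO announcements VALUES (1, 1, 'PHP developer wanted')");

// One join answers "all announcements in a given province".
$stmt = $db->prepare(
    'SELECT a.title FROM announcements a
     JOIN cities c ON c.id = a.city_id
     JOIN provinces p ON p.id = c.province_id
     WHERE p.name = ?');
$stmt->execute(['Ontario']);
$titles = $stmt->fetchAll(PDO::FETCH_COLUMN);
```

Renaming a city is then a one-row UPDATE, instead of a sweep through every announcement.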
If you use CSV you will run into problems eventually if you are planning on a lot of traffic. If you are just going to use this personally on your machine or with a couple people in an office then CSV is probably sufficient.
I would recommend keeping it in the DB. If you store the names in the announcements table, any changes to the CSV will not be updated in the queries.
DBs are meant to handle these issues.
If you don't want to use a database table, use a hardcoded array directly in PHP: if performance is so critical, I don't know any way faster than this one (and I don't see a single advantage in using CSV either).
Apart from that, I think this is clearly premature optimization. You should make your application extensible, especially at the planning stage. Not using a table will make the overall structure rigid.
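A sketch of the hardcoded-array option (all data and names here are invented):

```php
<?php
// Sketch: provinces mapped to their cities directly in PHP source.
$provinces = [
    'Ontario' => ['Toronto', 'Ottawa', 'Hamilton'],
    'Quebec'  => ['Montreal', 'Quebec City'],
];

// Lookup is a plain array access: no file to parse, no query round-trip.
function citiesOf(array $provinces, string $name): array
{
    return $provinces[$name] ?? [];
}
```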
While people often get worried about the proliferation of tables inside a database, those tables are under management. Management by the DBMS. This means that you can control data-control tasks like updating, and it also takes you down the route of organising the data properly, i.e. normalisation.
Large collections of CSV or XML files can get extremely unwieldy unless you are prepared to write management systems around them (systems that already come, as it were, for free with the DBMS).
There can be good reasons for not using a DBMS, but I have not found many, and certainly not in mainstream development.