How to import data to analyse it the fastest way

How to import data to analyse it the fastest way - php

I just have a question which way gives me more performance and would be easier to get done. We have a DB with over 120000 datarows which is stored in a database. These data is currently exported as CSV file to an ftp location.
Now from this csv file there should be a webform created to filter the datasets. What would you recommend regarding performance and work todo. Should I parse the csv file and get the information out to the webpage or should I reimport the csv file to a DB (MySQL) and use SQL queries to filter the data (Note: The original DB and export is on a different server than the webpage/webform.)
A direct connection to the DB on the original server is not possible.
I prefer reuploading it to a DB, because it makes the development easier, I just simply need to create the SQL query against the filter criteria entered in the webform and run it.
Any ideas?
Thanks...
WorldSignia

The database is undoubtedly the best answer. Since you are looking to use a web form to analyze the results and perform complex queries, the other alternative may prove VERY expensive in terms of server processing time, and quite more difficult to implement. After all, on the one hand you have SQL that handles all filtering details for you, and on the other you will have to implement something yourself.
I would advise, performance - wise, that you create indices for all fields that you know you will be using as criteria, and to display results partially, say 50 per page to minimize load times.

These data is currently exported as CSV file to an ftp location.
There are so many things wrong in that one sentence.
Should I parse the csv file and get the information out to the webpage
Definitely not.
While it is technically possible, and will probably be faster given the number of rows if you use the right tools this is a high risk approach which gives a lot less clarity of code. And while it may meet your immediate requirement is it rather inflexible.
Since the only sensible option is to transfer to another database, perhaps you should think about how you can do this
without using FTP
without using CSV
What happens to the data after it has been filtered?

I think the DB with indexes may be a better solution in case you need to filter the data. Actually this is the idea of DB to optimize your work with data. But you could profile you work and measure the performance. Then you just choose..

hmm good question.
i would think the analysis with a DB is faster. You can set Indizes and optimize the analysis.
But it could take some time to load the CSV into the Database.
To analyse the CSV without a Db it could take some time. You have to create a concrete algorithm and this may be a lot of work :)
So I think u have to proof it both and take the best performance... evaluate them ;-)

Related

Reading a file or searching in a database?

I am creating a web-based app for android and I came to the point of the account system. Previously I stored all data for a person inside a text file, located users/<name>.txt. Now thinking about doing it in a database (like you probably should), wouldn't that take longer to load since it has to look for the row where the name is equal to the input?
So, my question is, is it faster to read data from a text file, easy to open because it knows its location, or would it be faster to get the information from a database, although it would have to first scan line by line untill it reaches the one with the correct name?
I don't care about the safety, I know the first option is not save at all. It doesn't really matter in this case.
Thanks,
Merijn

In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data the might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.

Abit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as
operating systems maintain a sort of index. However, the contents of a
txt file won't be indexed, which is one of the main advantages of a
database. Another is understanding the relational model, so that data
doesn't need to be repeated over and over. Another is understanding
types. If you have a txt file, you'll need to parse numbers, dates,
etc.
So - the file system might work for you in some cases, but certainly
not all.

That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)

It is quite a simple solution - use database.
Not because its faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen and you will lose a user profile info.
With database engine - its much more difficult to lose data like that.
EDIT:
Also, a big question - is this about server side or app side??
Because, for app side, realistically you wont have more than 100 users per smartphone... More likely you will have 1-5 users, who share the phone and thus need their own profiles, and for the majority - you will have a single user.

Why do forums store posts in a database?

From looking at the way some forum softwares are storing data in a database (eg. phpBB uses MySQL databases for storing just about everything) I started to wonder why they do it that way? Couldn't it be just as fast and efficient to use.. maybe xsl with xslt to store forum topics and posts? Or to at least store the posts in a topic?

There are loads of reasons why they use databases and not flat files. Here are a few off the top of my head.
Referential integrity
Indexes and efficient searching
SQL Joins
Here are a couple more posts you can look at for more information :
If i can store my data in text files and easily can deal with these files, why should i use database like Mysql, oracle etc
Why use MySQL over flatfiles?
Why use SQL database?

But this is exactly what databases have been designed and optimized for, storage and retrieval of data. Using a database allows the forum designer to focus on their problem and not worry about implementing storage as well. It wouldn't make sense to ignore all the work that has been done in the database world and instead implement your own solution. It would take more time, be more buggy, and not run as quickly.

Database engines handle all the problems of concurrency. Imagine that, two users try to write in your forum at the same time. If you store the post in files, the first attempt will lock the file so the second has to wait for the first to finish.
Otherwise if you want to search, it's much faster to do it in database than scanning all the files.
So basically, it's not a good idea to store data wich can be modified by useres simultaneously, and searching is much more efficient in database.

Simply, easy access to data. It's a lot easier to find posts between a date, created by a user, or with certain keywords. You could do all of the above with flat file storage, but this would be IO intensive and slow. If you had the idea of storing each post in its own file, you'd then have the problem of running out of disk space, not because of lack of capacity, but because you'd have consumed all the available inodes.
Software such as this usually has a static caching feature - pages that don't change are written out to static HTML files, and those are served instead of hitting the database.
Mixing static caching with relational DB storage provides the best of both worlds.

What is more expensive for template reading: Database query or File reading?

My question is fairly simple; I need to read out some templates (in PHP) and send them to the client.
For this kind of data, specifically text/html and text/javascript; is it more expensive to read them out a MySQL database or out of files?
Kind regards
Tom
inb4 security; I'm aware.
PS: I read other topics about similar questions but they either had to do with other kind of data, or haven't been answered.

Reading from a database is more expensive, no question.
Where do the flat files live? On the file system. In the best case, they've been recently accessed so the OS has cached the files in memory, and it's just a memory read to get them into your PHP program to send to the client. In the worst case, the OS has to copy the file from disc to memory before your program can use it.
Where does the data in a database live? On the file system. In the best case, they've been recently accessed so MySQL has that table in memory. However, your program can't get at that memory directly, it needs to first establish a connection with the server, send authentication data back and forth, send a query, MySQL has to parse and execute the query, then grab the row from memory and send it to your program. In the worst case, the OS has to copy from the database table's file on disk to memory before MySQL can get the row to send.
As you can see, the scenarios are almost exactly the same, except that using a database involves the additional overhead of connections and queries before getting the data out of memory or off disc.

There are many factors that would affect how expensive both are.
I'll assume that since they are templates, they probably won't be changing often. If so, flat-file may be a better option. Anything write-heavy should be done in a database.
Reading a flat-file should be faster than reading data from the database.
Having them in the database usually makes it easier for multiple people to edit.
You might consider using memcache to store the templates after reading them, since reading from memory is always faster than reading from a db or flat-file.

It really doesnt make enough difference to worry you. What sort of volume are you working with? Will you have over a million page views a day? If not I'd say pick whichever one is easiest for you to code with and maintain and dont worry about the expense of the alternatives until it becomes a problem.
Specifically, if your templates are currently in file form I would leave them there, and if they are currently in DB form I'd leave them there.

in this case Csv or Mysql?

I am starting new project. In my project I will need to use local provinces and local city names. I do not want to have many mysql tables unless I have to have or csv is fast. For province-city case I am not sure which one to use.
I have job announcements related with cities, provinces. For Csv case I will keep the name of city in announcements table, so when I do search I send selected city name to db in query.
can anyone give me better idea on how to do this? csv or mysql? why?
Thanks in advance.

Database Pros
Relating cities to provinces and job announcements will mean less redundant data, and consistently formatted data
The ability to search/report data is much simpler, being [relatively] standardized by the use of SQL
More scalable, accommodating GBs of data if necessary
Infrastructure is already in place, well documented in online resources
Flat File (CSV) Pros
I'm trying, but I can't think of any. Reading from a csv means loading the contents into memory, whether the contents will be used or not. As astander mentioned, changes while the application is in use would be a nightmare. Then there's the infrastructure to pull data out, searching, etc.
Conclusion
Use a database, be it MySQL or the free versions of Oracle or SQL Server. Basing things off a csv is coding yourself into a corner, with no long term benefits.

If you use CSV you will run into problems eventually if you are planning on a lot of traffic. If you are just going to use this personally on your machine or with a couple people in an office then CSV is probably sufficient.

I would recomend keeping it in the db. If you store the names in the annoucements table, any changes to the csv will not be updated in the queries.
DBs are meant to hanle these issues.

If you don't want to use a database table, use an hardcoded array directly in PHP: if the performances are so critic I don't know any way faster than this one (and I don't see a single advantage in using CSV too).
Apart of that I think this is a clear premature optimization. You should make your application extensible, especially at the planning stage. Not using a table will make the overall structure rigid.

While people often get worried about the the proliferation of tables inside a database they are under management. Management by the DBMS. This means that you can control the data control task like updating and it also takes you down the route of organising the data properly, i.e. normalisation.
Large collections of CSV or XML files can get extremely unwieldy unless you are prepared to write management systems arounf them (that already come with the DBMS for, as it were, free).
There can be good reason for not using DBMS's but i have not found many and certainly not in mainstream development.

CSV vs MySQL performance

Lets assume the same environments for PHP5 working with MySQL5 and CSV files. MySQL is on the same host as hosted scripts.
Will MySQL always be faster than retriving/searching/changing/adding/deleting records to CSV?
Or is there some amount of data below which PHP+CSV performance is better than using database server?

CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.

No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.

As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answer to the questions listed previously. However, you can generally use the following as a guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySql, as I said previously, in most cases it will be faster.

From a pure performance standpoint, it completely depends on the operation you're doing, as #MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).

Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have 3rd option -- SQLite, which comes embedded in PHP. It gives you ease of use like regular files, but robustness of RDBMS.

Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.

CSV is an incredibly brittle format and requires your app to do all the formatting and calcuations. If you need to update a spesific record in a csv you will have to first read the entire csv file, find the entry in memory would need to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write once, readd once type apps.

If you want to import swiftly like a thief in the night, use SQL format.
If you are working in production server, CSV is slow but it is the safest.
Just make sure the CSV file doesn't have a Primary Key which will override your existing data.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How to import data to analyse it the fastest way - php

I think the DB with indexes may be a better solution in case you need to filter the data. Actually this is the idea of DB to optimize your work with data. But you could profile you work and measure the performance. Then you just choose..

Related

Reading a file or searching in a database?

Why do forums store posts in a database?

What is more expensive for template reading: Database query or File reading?

in this case Csv or Mysql?

CSV vs MySQL performance

Categories

Resources