Reading a file or searching in a database? - php

I am creating a web-based app for Android and I have come to the point of building the account system. Previously I stored all the data for a person inside a text file, located at users/<name>.txt. Now I am thinking about doing it in a database (like you probably should), but wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because the app knows its location, or would it be faster to get the information from a database, even though it would have to scan line by line until it reaches the one with the correct name?
I don't care about the safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn

In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
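To make the trade-off concrete, here is a minimal sketch of both lookups in PHP. The users/<name>.txt layout comes from the question; the users table, its indexed name column, and the PDO connection are assumptions for illustration:

<?php
// File approach: the path is derived directly from the name, so there is
// no scanning at all - one open, one read.
function loadUserFromFile(string $name): ?string
{
    $path = 'users/' . basename($name) . '.txt'; // basename() blocks path traversal
    $data = is_file($path) ? file_get_contents($path) : false;
    return $data === false ? null : $data;
}

// Database approach: with an index on `name`, the engine walks a B-tree
// straight to the row instead of scanning line by line.
function loadUserFromDb(PDO $pdo, string $name): ?array
{
    $stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
    $stmt->execute([$name]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}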

A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database. Another is understanding the relational model, so that data doesn't need to be repeated over and over. Another is understanding types. If you have a txt file, you'll need to parse numbers, dates, etc.
So - the file system might work for you in some cases, but certainly not all.

That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
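For example, the index the linked answer describes can be added with a single statement; a sketch, assuming the users table and name column from the question above and a PDO connection:

$pdo->exec('CREATE INDEX idx_users_name ON users (name)');
// From now on, lookups by name walk the index instead of scanning every row:
$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
$stmt->execute(['merijn']);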

It is quite a simple solution - use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss and corruption.
A failed write to a text file can happen, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question - is this about the server side or the app side?
Because, for the app side, realistically you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.
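As a sketch of the mechanism this answer alludes to: with an InnoDB table and a transaction, a write either lands completely or not at all, so a mid-write failure cannot leave a half-written profile. Connection details and the profiles table are placeholders:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$userName = 'merijn';             // example values
$newData  = 'profile contents';

$pdo->beginTransaction();
try {
    $stmt = $pdo->prepare('UPDATE profiles SET data = ? WHERE name = ?');
    $stmt->execute([$newData, $userName]);
    $pdo->commit();   // the change becomes visible only here
} catch (Exception $e) {
    $pdo->rollBack(); // on failure, the old profile survives intact
    throw $e;
}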

Related

storing large data in files vs tables

So I am working on this website where people can post articles. My colleague suggested storing all the meta-data of an article (user, title, dates etc.) in a table and the actual article body as a file on the server.
A data structure would look like:
post_id  post_user_id  post_title     post_body  post_date   etc
----------------------------------------------------------------
1        1             My First Post  1_1.txt    2014-07-07  ...
2        1             My First Post  2_1.txt    2014-07-07  ...
Now we would get the record of the post and then locate where it is by
$post_id . "_" . $post_user_id . ".txt";
He says that this will reduce the size of the tables and, in the long run, make it faster to access. I am not sure about this and wanted to ask if there are any issues with this design.
The first risk that pops into my mind is data corruption. Following the design, you are splitting the information into two fragments, even though both pieces are dependent on one another:
A file has to exist for each metadata entry (or you'll end up with not-found errors for entries supposed to exist).
A metadata entry has to exist for each file (or you'll end up with garbage).
Using only the database has one big advantage: it is very probably relational. This means that you can actually set up rules to prevent the two scenarios above from occurring (you could use an SQL ON DELETE CASCADE for instance, or put every piece of information in one table). Keeping these relations between two data backends is going to be a tricky thing to set up.
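As a sketch of the 'one table' option, with the body kept next to its metadata: MySQL's MEDIUMTEXT holds up to 16 MB, far more than a typical article. Column names come from the question; sizes, the users table and the PDO connection are assumptions:

$pdo->exec('
    CREATE TABLE posts (
        post_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        post_user_id INT UNSIGNED NOT NULL,
        post_title   VARCHAR(255) NOT NULL,
        post_body    MEDIUMTEXT   NOT NULL,   -- the article itself, no companion file
        post_date    DATE         NOT NULL,
        FOREIGN KEY (post_user_id) REFERENCES users (user_id)
            ON DELETE CASCADE                 -- removing a user removes their posts
    ) ENGINE=InnoDB
');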
Another important thing to remember : data stored in a SQL database isn't sent to a magical place far from your drive. When you add an entry into your database, you write to your database files. For instance, those files are stored in /var/lib/mysql for MySQL engines. Writing to other files does not make that much of a difference...
Next thing: time. Accessing a database is fast once it's open; all it takes is query processing. Accessing files (once per article, that is) may be heavier: files need to be opened (including privilege checks, ...), read (line by line, according to your buffer size) and closed. Of course, you can add to that all the programming it would take to link those files to their metadata...
To me, this design adds unnecessary complexity to the application. You could store everything in the database and centralise it. You'll use pretty much the same amount of disk space in both cases, yet looking up/accessing each article file separately (while keeping it connected to its DB metadata) will certainly waste some time.
Design for simplicity; add complexity only where you must. (Eric S. Raymond)
This could look like a good idea if those posts are NEVER edited. Access to a file could take a while, and if your user wants to edit his post many times, storing the content in a file is not a great idea. SQL supports large text values well (such as WYSIWYG text), so don't be afraid to store them in your Post table.
Additionally, your filesystem will take far more time to read and write data stored in files than the database will.
Everything will depend on the number of posts you want to store, and on whether your users can edit their posts or not.
I would agree, in a production environment it is generally recommended to let the file system keep track of files and the database to hold on to the metadata.
However, I have mostly heard this philosophy applied to BLOB types and images. Because even large articles are relatively small, a TEXT data type can suffice and even make it easier to edit, draw from, and search as needed.
(Hence I agree with Rémi Delhaye, who answered this just as I was writing this post.)
Filesystem is much more likely to have higher latency and files can 'go missing' where a database record is less likely to.
If the contents of the field are too large in the case of SQL Server, then you could look at the FILESTREAM feature in newer versions.
Really though, either approach is valid in my opinion. With a file you don't have to worry about the database mangling the content if you make a mistake during escaping or something.
Beware if you're writing your code on a case-insensitive filesystem and running on a case-sensitive one in production - filename case matters, so it can be another way to lose access to your files unexpectedly once the application is deployed.

Should a very long function/series of functions be in one php file, or broken up into smaller ones?

At the moment I am writing a series of functions for fetching Dota 2 matches from the Steam API. When someone fetches their games, I have to (for my use) take a history of all of their games (let's say 3 API calls), then all the details from each of those games (so if there are 200 games, another 200 API calls). This takes a long time, and so far I'm programming all of the above in one PHP file, "FetchMatchHistory.php", which is run when the user clicks a button on the web page.
Another thing that makes me feel it should be in one file is that I imagine it is probably good practice to put all of the information (in this case, match history, match details, IDs etc.) into the database all at once, so that there don't have to be null values in the database?
My question is whether a function that takes a very long time should be in just one PHP file ("should" meaning: is it generally considered good practice), or whether I should break the separate functions down into smaller files. This is very context dependent, I know, so please forgive me.
Is it common to have API calls spanning several PHP files if that is what you are making? Is there a security/reliability issue with having only one file doing all the leg-work (so to speak)?
Good practice is to have a number of related functions grouped together in a PHP file that describes them, both to organize them better and for caching reasons, since some parts get updated more slowly than others.
But speaking of performance, I doubt you'll get the improvements you seek just by moving code between files.
Personally, I used to have the habit of putting everything in one file, which consistently gave me:
fat files
hard-to-update code
hard-to-read code
a hard time finding the thing I want (Ctrl+F meltdown)
wasted bandwidth uploading parts that did not need to be updated
virtually disabled caching on the server
I don't know if any of the above applies to your app, but breaking code up into relevant files/places made my life easier.
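A rough sketch of what that split can look like; every file and function name below is invented for illustration:

// FetchMatchHistory.php - thin entry point; each concern lives in its own file.
require_once __DIR__ . '/steam_api.php';   // hypothetical wrappers for the Steam API calls
require_once __DIR__ . '/match_store.php'; // hypothetical functions that persist matches

$accountId = 76561198000000000;            // example Steam account ID
foreach (fetchMatchHistory($accountId) as $match) {           // from steam_api.php
    storeMatchDetails(fetchMatchDetails($match['match_id'])); // from match_store.php
}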
UPDATE:
About the database practice: you're going to query only the parts you want updated.
I don't understand why you would split that logic across files; that's not going to give you performance. What is going to give you performance is updating only the relevant parts and having tables with relevant content. Speaking of which, multiple tables make a lot more sense, since you can use them as pointers to the large data contained in other tables, reducing the waste of having just one table.
Also, don't forget that a single table has limitations; I personally try to have as few columns as possible. Keep adding more, and one day you can't add any more because of the row-size limit. There is a maximum number of columns in general, but that limit is rarely hit by developers; the growing per-row content itself will eat up the limit first.
Whether to split server-side code into multiple files or keep it in a single one is an organizational issue more than a security/reliability one...
I don't think it's more secure to keep your code in separate source files.
It's entirely a matter of how you prefer to organize and maintain your code base.
Usually, I separate it when I can find some kind of "categories" in my code.
Obviously, if you write OO code, the most common choice is to keep each class in a single file...

Is it faster or better to use MySQL instead of text files or file names for order of images with PHP?

I have images stored in folders related to articles on my PHP web site, and I would like to set the order in which the images are displayed based on author input. I started by naming the files with a number in front of them, but was considering recording the order in a text file instead, to avoid renaming every file and to retain their original file names, or possibly storing the order in a MySQL table.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them. I was hoping to get some input about which method would be best and fastest.
For example, is it much slower to read a list of file names in a folder with PHP, or open a text file and read the contents, compared to making MySQL query and update statements?
I'd say a lot depends on your base hardware/filesystem/MySQL connection performance. A single disk access, just to read the images, is most likely going to be your quickest option. But you'd need to name your files manually ahead of time.
MySQL requires a TCP or *nix socket connection, and this might slow things down (a lot depends on the number of pictures you have, and on the "quality" of your DB link, though). If you have a lot of files, the performance hit might be negligible. Just reading from a file might be quicker nevertheless, without bothering to set up a DB connection; you'd still have to write down the ID/filename correspondence for the ordering, though.
Something I'd try out in your situation is the PHP stat() function, to see if it can help you sort the pictures. Depending on the number of pictures you have (it works better with lower numbers), you might not take a serious performance hit, and you'd be able NOT to keep a separate list of picture/creation-date tuples. As the number of pictures grows, the file-list approach seems to me like a reasonable way to solve the problem. Only benchmarking as the number of pictures increases can tell you the truth, though, since I think you can expect a lot of variability depending on your specific context.
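A sketch of that stat()-based idea using filemtime(), which reads the same stat data; the folder path is illustrative:

$files = glob('articles/42/*.jpg');          // hypothetical image folder
usort($files, function ($a, $b) {
    return filemtime($a) - filemtime($b);    // oldest first
});
foreach ($files as $file) {
    echo '<img src="' . htmlspecialchars($file) . '" alt="">' . "\n";
}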
If your concern is performance, why don't you save the list (maybe already formatted in HTML) to a file? When your page is loaded, just read the file with
$code = file_get_contents("cache_file.html")
and output it to the user. The fastest solution is to store the file as .html and let Apache serve it directly, but this works only if your page doesn't have any other dynamic part.
To ensure that your cache file is up to date, you can invalidate and recreate it after some time (the specific interval depends on the frequency of image changes), or check whether the directory has changed since the cache file's creation date. If you can trap the changes to the image directory (for example, if the changes are made by a piece of code that you wrote), you can always ensure that your cache is refreshed when the images change.
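A sketch of that second check; on most filesystems a directory's mtime changes whenever a file is added or removed in it. The paths and the generator function are made up for the example:

$cache = 'cache_file.html';
$dir   = 'images/article42';                 // hypothetical image directory

// Rebuild if the cache is missing or older than the directory's last change.
if (!is_file($cache) || filemtime($cache) < filemtime($dir)) {
    $html = buildImageListHtml($dir);        // hypothetical HTML generator
    file_put_contents($cache, $html);
}
echo file_get_contents($cache);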
Hope this helps
This smells like premature optimization.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them.
So? A query like "select filename, title from images where articleId=$articleId order by `order`" will execute within a fraction of a second. It really doesn't matter. Do whatever is easiest, and might I suggest that being the SQL option.
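One caveat if you carry that query into real code: interpolating $articleId directly is an SQL injection risk. A prepared-statement version, assuming a PDO connection:

$stmt = $pdo->prepare(
    'SELECT filename, title FROM images WHERE articleId = ? ORDER BY `order`'
);
$stmt->execute([$articleId]);                // bound parameter, never interpolated
$images = $stmt->fetchAll(PDO::FETCH_ASSOC);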
IMHO, using MySQL would be slower, but oh so much easier. If the MySQL server is hosted on the same server (or within dedicated space on the same server, like CloudLinux), then skipping it probably won't save too much time anyway.
EDIT:
If you want to do a test, you can use the microtime() function to time exactly how long it takes to append and sort the files, and how long it takes to get it all from MySQL.
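A minimal version of that timing harness; the two branches below are placeholders for the real file and MySQL code being compared, and $pdo is an assumed connection:

$start = microtime(true);
$files = glob('images/*.jpg');               // filesystem branch: list and sort
sort($files);
printf("filesystem: %.4f s\n", microtime(true) - $start);

$start = microtime(true);                    // MySQL branch
$rows = $pdo->query('SELECT filename FROM images ORDER BY `order`')
            ->fetchAll(PDO::FETCH_COLUMN);
printf("mysql: %.4f s\n", microtime(true) - $start);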

Configuration storage setup [file vs. database]

I see programmers putting a lot of information into databases when it could otherwise be put in a file that holds arrays. Instead of arrays, they'll use many SQL tables, which, I believe, is slower.
CitrusDB has a table in the database called "holiday". This table consists of just one date column called "holiday_date" that holds dates that are holidays. The idea is to let the user add holidays to the table. Citrus and the programmers I work with at my workplace prefer to put all this information in tables because it is "standard".
I don't see why this would be true unless you are allowing the user, through a user interface, to add holidays. I have a feeling there's something I'm missing.
Sometimes you want to design in a bit of flexibility to a product. What if your product is released in a different country with different holidays? Just tweak the table and everything will work fine. If it's hard coded into the application, or worse, hard coded in many different places through the application, you could be in a world of pain trying to get it to work in the new locale.
By using tables, there is also a single way of accessing this information, which probably makes the program more consistent, and easier to maintain.
Sometimes efficiency/speed is not the only motivation for a design. Maintainability, flexibility, etc are very important factors.
The main advantage I have found of storing 'configuration' in a database, rather than in a property file or a file full of arrays, is that the database is usually centrally stored, whereas an application may often be spread across a farm of several, or even hundreds, of servers.
I have implemented such a solution in a corporate environment, and being able to change configuration at a single point of access, knowing that it will immediately propagate to all servers without any deployment process, is very powerful and something we have come to rely on quite heavily.
The actual dates of some holidays change every year. The flexibility to update the holidays with a query or with a script makes putting it in the database the easiest way. One could easily implement a script that updates the holidays each year for their country or region when it is stored in the database.
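For instance, a yearly refresh script can stay tiny. The holiday table is from the question; the floating-holiday math and the PDO connection are assumptions:

$year   = (int) date('Y') + 1;
$insert = $pdo->prepare('INSERT INTO holiday (holiday_date) VALUES (?)');

// Fixed-date holiday: New Year's Day.
$insert->execute([sprintf('%d-01-01', $year)]);

// Floating holiday: fourth Thursday of November (US Thanksgiving).
$insert->execute([date('Y-m-d', strtotime("fourth thursday of november $year"))]);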
Theoretically, databases are designed and tuned to provide faster access to data than doing a disk read from a file. In practice, for small to mid-sized applications this difference is minuscule. Best practices, however, are typically oriented at larger scale. By implementing best practices on your small application, you create one that is capable of scaling up.
There is also the consideration of the accessibility of the data in terms of other aspects of the project. Where is most of the data in a web-based application? In the database. Thus, we try to keep ALL the data in the database, or as much as is feasible. That way, in the future, if you decide that you now need to join the holiday dates against a list of events (for example), all the data is in a single place. This segmenting of disparate layers creates tiers within your application. When each tier can be devoted to exclusive handling of the roles within its domain (database handles data, HTML handles presentation, etc.), it is again easier to change or scale your application.
Last, when designing an application, one must consider the "hit by a bus principle". So you, Developer 'A', put the holidays in a PHP file. You know they are there, and when you work on the code it doesn't create a problem. Then.... you get hit by a bus. You're out of commission. Developer 'B' comes along, and now your boss wants the holiday dates changed - we don't get President's Day off any more. Um. Johnny Next Guy has no idea about your PHP file, so he has to dig. In this example, it sounds a little trivial, maybe a little silly, but again, we always design with scalability in mind. Even if you KNOW it isn't going to scale up. These standards make it easier for other developers to pick up where you left off, should you ever leave off.
The answer lies in many realms. I used to write my own software to read and write my own flat-file database format. For small systems with few fields, it may seem worth it. Once you learn SQL, you'll probably use it for even the smallest things.
File parsing is slow. String readers, comparing characters, looking for character sequences - all of it takes time. SQL databases do have files, but they are read and then cached, both more efficiently.
Updating and saving arrays requires you to read everything, rebuild everything, write everything, save everything, then close the file.
Options: SQL has many built-in features to do powerful things, from putting results in order to returning only rows x through y.
Security.
Synchronization: say the same page is accessed twice at the same time. PHP will read from your flat file, process, and write at the same time; the two requests will overwrite each other, resulting in data loss.
The number of features SQL provides, the ease of access, and the amount of code you don't have to write all contribute to why hard-coded arrays aren't as good.
The answer is it depends on what kind of lists you are dealing with. It seems that here, your list consists of a small, fixed set of values.
For many valid reasons, database administrators like having value tables for enumerated values. It helps with data integrity and with dealing with ETL, as two examples of why you would want it.
At least in Java, for these kinds of short, fixed lists, I usually use Enums. In PHP, you can use what seems to be a good way of doing enums in PHP.
The benefit of doing this is that the value is an in-memory lookup, but you still get the data integrity that DBAs care about.
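A sketch of that pattern using class constants (PHP 8.1 later added native enums, but constants work everywhere); the values are invented for the example:

final class HolidayType
{
    const FIXED    = 'fixed';
    const FLOATING = 'floating';

    // All allowed values, e.g. to validate what comes back from the DB.
    public static function all(): array
    {
        return (new ReflectionClass(self::class))->getConstants();
    }
}

$type  = 'fixed';                                   // example input
$valid = in_array($type, HolidayType::all(), true); // in-memory lookup, no query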
If you need to find a single piece of information out of 10, reading a file vs. querying a database may not give a serious advantage either way. Reading a single piece of data from hundreds or thousands, etc, has a serious advantage when you read from a database. Rather than load a file of some size and read all the contents, taking time and memory, querying from the database is quick and returns exactly what you query for. It's similar to writing data to a database vs text files - the insert into the database includes only what you are adding. Writing a file means reading the entire contents and writing them all back out again.
If you know you're dealing with very small numbers of values, and you know that requirement will never change, put data into files and read them. If you're not 100% sure about it, don't shoot yourself in the foot. Work with a database and you're probably going to be future proof.
This is a big question. The short answer would be, never store 'data' in a file.
First, you have to deal with read/write file permission issues, which introduce security risks.
Second, you should always plan on an application growing. When the 'holiday' array becomes very large, or needs to be expanded to include holiday types, you're going to wish it was in the DB.
I can see other answers rolling in, so I'll leave it at that.
Generally, application data should be stored in some kind of storage (not flat files).
Configuration/settings can be stored in a key-value store (such as Redis) and then accessed via a REST API.

CSV vs MySQL performance

Let's assume the same environment for PHP5 working with MySQL5 and CSV files, with MySQL on the same host as the hosted scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in CSV?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, the database has other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
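The usual PHP guard for that multi-process case is flock(); a sketch of a safe append, with the file name and record invented for the example:

$fh = fopen('data.csv', 'ab');                        // append mode, binary-safe
if ($fh !== false) {
    if (flock($fh, LOCK_EX)) {                        // exclusive lock: writers queue up
        fputcsv($fh, [42, 'Alice', 'alice@example.com']); // one quoted/escaped record
        fflush($fh);                                  // flush buffers before unlocking
        flock($fh, LOCK_UN);
    }
    fclose($fh);
}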
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as @MarkR says. Appending to a flat file is very fast. So is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example, for configuration or language files, CSV might do better.
Anyway, if you're using PHP5, you have a third option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting, etc.) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV, you will have to first read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
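A sketch of why that hurts: changing one field means every record passes through memory and the whole file is rewritten. The file name and column layout are assumed:

$rows = array_map('str_getcsv', file('users.csv', FILE_IGNORE_NEW_LINES)); // read everything
foreach ($rows as &$row) {
    if ($row[0] === '42') {                  // locate the one record to change...
        $row[2] = 'new@example.com';         // ...and update a single field
    }
}
unset($row);

$fh = fopen('users.csv', 'wb');              // then rewrite the entire file
foreach ($rows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);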
If you want to import swiftly, like a thief in the night, use the SQL format.
If you are working on a production server, CSV is slow but it is the safest option.
Just make sure the CSV file doesn't carry a primary key that will override your existing data.
