XML or SQL - Only Reading data frequently - php

I have a PHP function that returns 'true' or 'false' depending on whether a particular piece of data exists in a database or an XML file.
This search query will be performed on each page load of the website.
Which option will be better: an SQL database or data stored in XML?
Which will put less load on the server and have less effect on its performance, given the high number of page loads/visits?
I don't need to modify or insert data; it's just searching.
Thank you
------- update --------
Another way I was thinking of is to store all the information (strings) in the form of file names [e.g. a directory having 1000 files with different names].
Now I can search for a particular file (e.g. 457.txt) in the directory, to see if that file exists.
<?php
// Check whether the marker file is present in the current directory.
if (file_exists('457.txt')) {
    echo "file exists";
}
How much load will this put on the server, compared to an XML query or an SQL query?
The main aim is to see whether a particular piece of data already exists on the server.

If you have reason to be concerned about user/system load, I would argue against the XML file solution, but not for the reason you are thinking of. It's more about concurrency: what would happen if several processes were to update the file at the same time? Writing to a file raises locking issues that you might run into soon enough.
The file name idea is interesting but will probably scale badly, and it will not let you do much more than a single-item existence check with acceptable performance unless you dive deep into OS, file system and storage layer intricacies. It all depends on the volume of data you intend to store and how you intend to access it.
All in all, the database is the most versatile storage solution for this kind of need, and it lets you run queries in an easy, explicit language, among many other advantages (ACID transactions [ https://en.wikipedia.org/wiki/ACID ], scalability...).
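As a rough illustration of the database option, here is a minimal sketch of an existence check with PDO. The table name `items`, the column `name`, the helper `itemExists()` and the in-memory SQLite database are all assumptions made up for this example (it needs the pdo_sqlite extension); with MySQL only the DSN would change.

```php
<?php
// Sketch only: `items`, `name` and itemExists() are hypothetical names,
// and the in-memory SQLite database stands in for a real server.

function itemExists(PDO $db, string $name): bool
{
    // A prepared statement keeps the lookup safe against SQL injection.
    $stmt = $db->prepare('SELECT 1 FROM items WHERE name = ? LIMIT 1');
    $stmt->execute([$name]);
    // fetchColumn() returns false when no row matched.
    return $stmt->fetchColumn() !== false;
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// The PRIMARY KEY gives the lookup an index, so it does not scan the table.
$db->exec('CREATE TABLE items (name TEXT PRIMARY KEY)');
$db->exec("INSERT INTO items (name) VALUES ('457'), ('458')");

$found   = itemExists($db, '457'); // row is present
$missing = itemExists($db, '999'); // row is absent
var_dump($found, $missing);
```

The function returns a plain boolean, so it can replace the file_exists() call from the question without changing the calling code.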

Related

Reading a file or searching in a database?

I am creating a web-based app for Android and I came to the point of the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now I am thinking about doing it in a database (like you probably should), but wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because its location is known, or would it be faster to get the information from a database, although it would first have to scan line by line until it reaches the one with the correct name?
I don't care about safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage of a database is that it gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
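To make the cost of the file approach concrete, here is a minimal sketch of the line-by-line scan described above. The helper `nameInFile()` and the one-name-per-line file layout are assumptions for illustration only. The work grows with the number of lines, which is exactly what a database index avoids.

```php
<?php
// Sketch only: nameInFile() and the one-name-per-line layout are hypothetical.

function nameInFile(string $path, string $name): bool
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false; // file missing or unreadable
    }
    try {
        // Read one line at a time; in the worst case every line is visited.
        while (($line = fgets($handle)) !== false) {
            if (rtrim($line, "\r\n") === $name) {
                return true;
            }
        }
        return false;
    } finally {
        fclose($handle);
    }
}

// Small demonstration with a temporary file.
$path = tempnam(sys_get_temp_dir(), 'names');
file_put_contents($path, "alice\nbob\nmerijn\n");
$hit  = nameInFile($path, 'merijn');
$miss = nameInFile($path, 'zed');
var_dump($hit, $miss);
unlink($path);
```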
A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as
operating systems maintain a sort of index. However, the contents of a
txt file won't be indexed, which is one of the main advantages of a
database. Another is understanding the relational model, so that data
doesn't need to be repeated over and over. Another is understanding
types. If you have a txt file, you'll need to parse numbers, dates,
etc.
So - the file system might work for you in some cases, but certainly
not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
It is quite a simple solution: use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A write to the text file can fail, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
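One of the mechanisms meant here is the transaction. The following is a minimal sketch, assuming hypothetical `users` and `profiles` tables and an in-memory SQLite database (pdo_sqlite extension): if any step fails, the whole change is rolled back, so no half-written profile is left behind.

```php
<?php
// Sketch only: the tables, columns and saveProfile() are hypothetical.

function saveProfile(PDO $db, string $name, string $email): bool
{
    $db->beginTransaction();
    try {
        $stmt = $db->prepare('INSERT INTO users (name) VALUES (?)');
        $stmt->execute([$name]);
        $stmt = $db->prepare('INSERT INTO profiles (user_name, email) VALUES (?, ?)');
        $stmt->execute([$name, $email]);
        $db->commit();   // both rows become visible together
        return true;
    } catch (PDOException $e) {
        $db->rollBack(); // undo the first insert if the second one failed
        return false;
    }
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (name TEXT PRIMARY KEY)');
$db->exec('CREATE TABLE profiles (user_name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)');

$firstSaved  = saveProfile($db, 'merijn', 'm@example.com');
// Same e-mail again: the profile insert violates UNIQUE, so nothing is kept.
$secondSaved = saveProfile($db, 'anna', 'm@example.com');
$userCount   = (int) $db->query('SELECT COUNT(*) FROM users')->fetchColumn();
var_dump($firstSaved, $secondSaved, $userCount);
```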
EDIT:
Also, a big question: is this about the server side or the app side?
Because on the app side, realistically, you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

storing large data in files vs tables

So I am working on this website where people can post articles. My colleague suggested storing all the meta-data of an article (user, title, dates etc) in a table and the actual article body as a file in the server.
A data structure would look like:
post_id  post_user_id  post_title     post_body  post_date   etc
-----------------------------------------------------------------
1        1             My First Post  1_1.txt    2014-07-07  ...
2        1             My First Post  2_1.txt    2014-07-07  ...
-----------------------------------------------------------------
Now we would get the record of the post and then locate where it is by
$post_id . "_" . $post_user_id . ".txt";
He says that this will reduce the size of the tables and on the long run make it faster to access. I am not sure about this and wanted to ask if there are any issues in this design.
The first risk that pops into my mind is data corruption. Following the design, you are splitting the information into two fragments, even though both pieces are dependent on one another:
A file has to exist for each metadata entry (or you'll end up with not-found errors for entries that are supposed to exist).
A metadata entry has to exist for each file (or you'll end up with garbage).
Using only the database has one big advantage: it is very probably relational. This means that you can actually set up rules to prevent the two scenarios above from occurring (you could use an SQL ON DELETE CASCADE constraint for instance, or put every piece of information in one table). Keeping these relations consistent between two data backends is going to be tricky to set up.
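A minimal sketch of such a rule, assuming a hypothetical split into a `posts` table (metadata) and a `post_bodies` table (article text), run against an in-memory SQLite database (pdo_sqlite extension). Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; MySQL's InnoDB enforces them by default.

```php
<?php
// Sketch only: table and column names are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('PRAGMA foreign_keys = ON'); // SQLite-specific switch

$db->exec('CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    post_title TEXT NOT NULL
)');
$db->exec('CREATE TABLE post_bodies (
    post_id INTEGER PRIMARY KEY REFERENCES posts(post_id) ON DELETE CASCADE,
    body    TEXT NOT NULL
)');

$db->exec("INSERT INTO posts VALUES (1, 'My First Post')");
$db->exec("INSERT INTO post_bodies VALUES (1, 'Full article text...')");

// Scenario 1: a body without metadata is rejected by the foreign key.
$orphanRejected = false;
try {
    $db->exec("INSERT INTO post_bodies VALUES (99, 'No such post')");
} catch (PDOException $e) {
    $orphanRejected = true;
}

// Scenario 2: deleting the metadata removes the body along with it.
$db->exec('DELETE FROM posts WHERE post_id = 1');
$remaining = (int) $db->query('SELECT COUNT(*) FROM post_bodies')->fetchColumn();
var_dump($orphanRejected, $remaining);
```

With the body stored as a file, neither guarantee is available; the application would have to enforce both by hand.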
Another important thing to remember: data stored in an SQL database isn't sent to a magical place far from your drive. When you add an entry to your database, you write to your database files. For instance, those files are stored in /var/lib/mysql for MySQL engines. Writing to other files does not make that much of a difference...
Next thing: time. Accessing a database is fast once it's opened; all it takes is query processing. Accessing files (once per article, that is) may be heavier: files need to be opened (including privilege checks, ...), read (line-by-line according to your buffer size) and closed. Of course, you can add to that all the programming it would take to link those files to their metadata...
To me, this design adds unnecessary complexity to the application. You could store everything in the database, centralised. You'll use pretty much the same amount of disk space in both cases, yet looking up and accessing each article file separately (while keeping it connected with its DB metadata) will certainly waste some time.
Design for simplicity; add complexity only where you must. (Eric S. Raymond)
This could look like a good idea if those posts are NEVER edited. Access to a file could take a while, and if your users want to edit their posts many times, storing the content in a file is not a great idea. SQL databases handle large text values (such as WYSIWYG text) well, so don't be afraid to store them in your Post table.
Additionally, your filesystem will take way more time to read and write data stored in files than the database will.
Everything will depend on the number of posts you want to store, and on whether your users can edit their posts.
I would agree, in a production environment it is generally recommended to let the file system keep track of files and the database to hold on to the metadata.
However, I have mostly heard this philosophy applied to BLOB types and images. Because even large articles are relatively small, a TEXT data type can suffice and even make it easier to edit, draw from, and search as needed.
(hence I agree with Rémi Delhaye, who answered this just as I was writing this post)
Filesystem is much more likely to have higher latency and files can 'go missing' where a database record is less likely to.
If the contents of the field is too large in the case of SQL Server then you could look at the FileStream API in newer versions.
Really though, either approach is valid in my opinion. With a file you don't have to worry about the database mangling the content if you make a mistake during escaping or something.
Beware if you're writing your code on a case-insensitive filesystem and running it on a case-sensitive one in production: filename case matters there, so this can be another way to lose access to your files unexpectedly once the application is deployed.

Is there a way of keeping database data in PHP while server is running?

I'm making a website that (essentially) lets the user submit a word, matches it against a MySQL database, and returns the closest match found. My current implementation is that whenever the user submits a word, the PHP script is called, it reads the database information, scans each word one-by-one until a match is found, and returns it.
I feel like this is very inefficient. I'm about to make a program that stores the list of words in a tree structure for much more effective searching. If there are tens of thousands of words in the database, I can see the current implementation slowing down quite a bit.
My question is this: instead of having to write another, separate program, and use PHP to just connect to it with every query, can I instead save an entire data tree in memory with just PHP? That way, any session, any query would just read from memory instead of re-reading the database and rebuilding the tree over and over.
I'd look into running an instance of memcached on your server. http://www.memcached.org.
You should be able to store the compiled tree of data in memory there and retrieve it for use in PHP. You'll have to load it into PHP to perform your search, though, as well as architect a way for the tree in memcached to be updated when the database changes (assuming the word list can be updated, since there's not a good reason to store it in a database otherwise).
Might I suggest looking at the memory table type in mysql: http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html
You can then still use mysql's searching features on fast "in memory" data.
PHP really isn't a good language for large memory structures. It's just not very memory efficient and it has a persistence problem, as you are asking about. Typically with PHP, people would store the data in some external persistent data store that is optimized for quick retrieval.
Usually people use a two fold approach:
1) Store data in the database as optimized as possible for standard queries
2) Cache results of expensive queries in memcached
If you are dealing with a lot of data that cannot be indexed easily by relational databases, then you'd probably need to roll your own daemon (e.g., written in C) that kept a persistent copy of the data structure in memory for fast querying capabilities.
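The "cache results of expensive queries" step can be sketched as a cache-aside helper. The `ArrayCache` class below is a hypothetical in-process stand-in that only lives for one request; in a real setup a memcached client would take its place so that the cached value survives between requests. The word list and the use of `levenshtein()` as the "closest match" measure are likewise assumptions for the example.

```php
<?php
// Sketch only: ArrayCache and cached() are hypothetical helpers.

final class ArrayCache
{
    private array $items = [];

    public function get(string $key)
    {
        return $this->items[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }
}

// Cache-aside: return the cached value, or compute and store it on a miss.
function cached(ArrayCache $cache, string $key, callable $compute)
{
    $hit = $cache->get($key);
    if ($hit !== null) {
        return $hit;
    }
    $value = $compute();
    $cache->set($key, $value);
    return $value;
}

$words = ['apple', 'apply', 'angle', 'maple'];
$calls = 0; // counts how often the expensive search really runs

$closest = function (string $input) use ($words, &$calls): string {
    $calls++;
    $best = $words[0];
    foreach ($words as $word) {
        if (levenshtein($input, $word) < levenshtein($input, $best)) {
            $best = $word;
        }
    }
    return $best;
};

$cache  = new ArrayCache();
$first  = cached($cache, 'closest:appel', fn () => $closest('appel'));
$second = cached($cache, 'closest:appel', fn () => $closest('appel'));
var_dump($first, $second, $calls);
```

The second lookup is answered from the cache, so the word list is scanned only once for the same input.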

If I have the choice, should webpage contents be saved on the file system or in MySQL?

I am in the planning stages of writing a CMS for my company. I find myself having to make the choice between saving page contents in a database or in folders on a file system. I have learned that PHP performs admirably well reading and writing to file systems, way better in fact than running SQL queries. But when it comes to saving pages and their data on a file system, there'll be a lot more involved than just reading and writing. Since pages will be drawn using a PHP class, the data for each page will be just data, no HTML. Therefore a parser for the files would have to be written. Also I doubt that all the data from a page will be saved in just one file, it would rather be saved in one directory, with content boxes and data in separated files.
All this would be done so much easier with MySQL, so what I want to ask you experts:
Will all the extra dilly-dally with file system saving outweigh its speed and resource advantage over MySQL?
Thanks for your time.
Go for MySQL. I'd say the only time you should think about using the file system is when you are storing files (BLOBs) of several megabytes; databases (at least the ones you typically use with a PHP website) are generally less performant when storing that kind of data. For the rest I'd say: always use a relational database. (Assuming you are dealing with data that has relations of course; if it is random data there is not much benefit in using a relational database ;-)
Addition: If you define your own file-structure, and even your own way of cross referencing files you've already started building a 'database' yourself, that is not bad in itself -- it might be loads of fun! -- but you probably will not get the performance benefits you're looking for unless your situation is radically different than the other 80% of 'standard' websites on the web (a couple of pages with text and images on them). (If you are building google/youtube/flickr/facebook ... you've got a different situation and developing your own unique storage solution starts making sense)
Things to consider:
Race conditions on file writes if two users edit the same piece of content.
Distributing files across multiple servers if the CMS grows; replication latency will cause data integrity problems.
Search performance: grep over files in multiple directories will be very slow.
Too many files in the same directory will hurt server performance, especially on Windows.
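The first point in the list above (two users writing the same file) is usually handled with an exclusive lock around the read-modify-write cycle. A minimal sketch using PHP's `flock()`; the helper `updateFileLocked()` is a hypothetical name, and the lock is advisory, so it only protects against other code that also takes the lock.

```php
<?php
// Sketch only: updateFileLocked() is a hypothetical helper.

function updateFileLocked(string $path, callable $modify): bool
{
    // 'c+' opens for reading and writing, creates the file if needed,
    // and does not truncate it before the lock is held.
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        return false;
    }
    try {
        if (!flock($handle, LOCK_EX)) { // blocks until the lock is granted
            return false;
        }
        $current = stream_get_contents($handle);
        $updated = $modify($current);
        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, $updated);
        fflush($handle);                // flush before releasing the lock
        flock($handle, LOCK_UN);
        return true;
    } finally {
        fclose($handle);
    }
}

$path = tempnam(sys_get_temp_dir(), 'cms');
updateFileLocked($path, fn (string $text) => $text . "edit by user A\n");
updateFileLocked($path, fn (string $text) => $text . "edit by user B\n");
$result = file_get_contents($path);
echo $result;
unlink($path);
```

A database gives the same protection through transactions and row locks without any of this code.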
Assuming you have a low-traffic, single-server environment here…
If you expect to ever have to manage those entries outside of the CMS, my opinion is that it's much, much easier to do so with existing tools than with database access tools.
For example, there's huge value in being able to use awk, grep, sed, sort, uniq, etc. on textual data. Proxying that through a database makes this hard but not impossible.
Of course, this is just opinion based on experience.
S
Storing Data on the filesystem may be faster for large blobs that are always accessed as one piece of information. When implementing a CMS, you typically don't only have to deal with such blobs but also with structured information that has internal references (like content fields belonging to a certain page that has links to other pages...). SQL-Databases provide an easy way to access structured information, files on your filesystem do not (except of course simple hierarchical structures that can be represented with folders).
So if you wanted to store the structured data of your cms in files, you'd have to use a file format that allows you to save the internal references of your data, e.g. XML. But that means that you would have to parse those files, which is not only a lot of work but also makes the process of accessing the data slow again.
In short, use MySQL
Use a database and you have lots of important properties from the beginning "for free" without inventing them in some suboptimal ways if you go the filesystem way. If you don't want to be constrained to MySQL only you can make use of e.g. the database abstraction layer of the doctrine project.
Additionally you have tools like phpMyAdmin for easy lookup or manipulation of your data versus the texteditor.
Keep in mind that the result of your database queries can almost always be cached in memory or even in the filesystem so you have the benefit of easier management with well known tools and similar performance.
When it comes to minor modifications of website contents (e.g. fixing a typo or updating external links), I find it much easier to connect to the server using SSH and use various tools (text editors, grep etc.) on files, rather than having to use the CMS interface to update each file manually (our CMS has such an interface).
Yet there are several questions to analyze and answer, mentioned above - do you plan for scalability, concurrent modification of data etc.
No, it will not be worth it.
And there is no advantage to using the filesystem over a database unless you are the only user on the system (in which case the advantage would be lost anyway). As soon as the transactions start rolling in and updates cascade to multiple pages and multiple files, you will regret that you didn't use the database from the beginning :)
If you are set on using caching, experiment with some of the existing frameworks first. You will learn a lot from it. Maybe you can steal an idea or two for your CMS?

Is file_exists() in PHP a very expensive operation?

I'm adding avatars to a forum engine I'm designing, and I'm debating whether to do something simple (forum image is named .png) and use PHP to check if the file exists before displaying it, or to do something a bit more complicated (but not much) and use a database field to contain the name of the image to show.
I'd much rather go with the file_exists() method personally, as that gives me an easy way to fall back to a "default" avatar if the current one doesn't exist (yet), and it's simple to implement code-wise. However, I'm worried about performance, since this will be run once per user shown per page load on the forum read pages. So I'd like to know: does the file_exists() function in PHP cause any major slowdowns that would cause significant performance hits in high-traffic conditions?
If not, great. If it does, what is your opinion on alternatives for keeping track of a user-uploaded image? Thanks!
PS: The code difference I can see is that the file-checking version lets the files do the talking, while the database version trusts that the database is accurate and doesn't bother to check. (It's just a URL that gets passed to the browser, of course.)
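The file_exists() variant from the question could be sketched like this. The naming scheme `<user id>.png`, the `default.png` fallback and the helper `avatarUrl()` are assumptions for illustration; the question does not spell out the actual file names.

```php
<?php
// Sketch only: avatarUrl(), "<user id>.png" and "default.png" are hypothetical.

function avatarUrl(int $userId, string $avatarDir, string $baseUrl): string
{
    $file = $userId . '.png';
    // Fall back to the default image when the user has not uploaded one.
    if (file_exists($avatarDir . '/' . $file)) {
        return $baseUrl . '/' . $file;
    }
    return $baseUrl . '/default.png';
}

// Demonstration with a temporary directory holding one uploaded avatar.
$dir = sys_get_temp_dir() . '/avatars_' . uniqid();
mkdir($dir);
touch($dir . '/42.png');

$withAvatar    = avatarUrl(42, $dir, '/avatars');
$withoutAvatar = avatarUrl(7, $dir, '/avatars');
echo $withAvatar, "\n", $withoutAvatar, "\n";

unlink($dir . '/42.png');
rmdir($dir);
```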
As well as what the other posters have said, the result of file_exists() is automatically cached by PHP (in its per-request stat cache, and only for files that exist) to improve performance.
However, if you're already reading user info from the database, you may as well store the information in there. If the user is only allowed one avatar, you could just store a single bit in a column for "has avatar" (1/0), and then have the filename the same as the user id, and use something like SELECT CONCAT(IF(has_avatar, id, 'default'), '.png') AS avatar FROM users
You could also consider storing the actual image in the database as a BLOB. Put it in its own table rather than attaching it as a column to the user table. This has the benefit that it makes your forum very easy to back up - you just export the database.
Since your web server will already be doing a lot of (the equivalent of) file_exists() operations in the process of showing your web page, one more run by your script probably won't have a measurable impact. The web server will probably do at least:
one for each subdirectory of the web root (to check existence and for symlinks)
one to check for a .htaccess file for each subdirectory of the web root
one for the existence of your script
This is not considering more of them that PHP might do itself.
In actual performance testing, you will discover file_exists() to be very fast. As it is, in PHP, when the same path is stat'd twice, the second call is just pulled from PHP's internal stat cache.
And that's just in the php run scope. Even between runs, the filesystem/os will tend to aggressively put the file into the filesystem cache, and if the file is small enough, not only will the file exists test come straight out of memory, but the entire file will too.
Here's some real data to back my theory:
I was just doing some performance tests of the Linux command line utilities "find" and "xargs". In the process, I performed a file-exists test on 13,000 files, 100 times each, in under 30 seconds, so that's averaging about 43,000 stat tests per second. Sure, on the fine scale it's slow if you're comparing it to, say, the time it takes to divide 9 by 8, but in a real-world scenario you would need to be doing this an awful lot of times to see a notable performance problem.
If you have 43 thousand users concurrently accessing your page, during the period of a second, I think you are going to have much bigger concerns than the time it takes to copy the status of the existence of a file more-or-less out of memory on the average case scenario.
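The figure above follows from 13,000 × 100 = 1,300,000 checks in under 30 seconds, i.e. roughly 43,000 per second. To get a number for your own machine, a rough micro-benchmark along these lines may help. The helper `benchmarkFileExists()` is a hypothetical name, and the loop clears PHP's stat cache on each pass so that every call actually reaches the operating system; the time spent in `clearstatcache()` is included in the measurement, so treat the result as a lower bound.

```php
<?php
// Sketch only: benchmarkFileExists() is a hypothetical helper.

function benchmarkFileExists(string $path, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        clearstatcache(true, $path); // bypass PHP's per-request stat cache
        file_exists($path);
    }
    // Guard against a zero interval on very coarse clocks.
    $elapsedSeconds = max((hrtime(true) - $start) / 1e9, 1e-9);
    return $iterations / $elapsedSeconds; // checks per second
}

$path = tempnam(sys_get_temp_dir(), 'bench');
$rate = benchmarkFileExists($path, 10000);
printf("about %d file_exists() calls per second\n", (int) $rate);
unlink($path);
```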
At least with PHP4, I've found that a call to file_exists() was definitely killing our application. It was made very repeatedly deep in a library, so we really had to use a profiler to find it. Removing the call made some pages compute about a dozen times faster (the call was made verrry repeatedly).
It may be possible that in PHP5 they cache file_exists, but at least with PHP4 that was not the case.
Now, if you are not in a loop, obviously, file_exists won't be a big deal.
file_exists() is not slow per se. The real issue is how your system is configured and where the performance bottlenecks are. Remember, databases have to store things on disk too, so either way you're potentially facing disk activity. On the other hand, both databases and file systems usually have some form of transparent caching to optimize repeat access.
You could easily go either way, since chances are your performance bottleneck will be elsewhere. The only place where I can see it being an obvious choice would be if you're on some kind of oversold shared hosting where there's a ton of disk contention, but maybe database access is on a separate cluster and faster (or vice versa).
In the past I've stored the image metadata in a database (including its name) so that we could generate useful stats. More importantly, storing image data (not the file itself, just the metadata) is conducive to change. What if in the future you need to "approve" the image, or you want to delete it without deleting the file?
As per the "default" avatar... well if the record isn't found for that user, just use the default one.
Either way, file_exists() or db, it shouldn't be much of a bottleneck to worry about. One solution, however, is much more expandable.
If performance is your only consideration, a file_exists() will be much less expensive than a database lookup.
After all, this is just a directory lookup using system calls. After the first execution of the script most of the relevant directory will be cached in memory, so there is very little actual I/O involved, and file_exists() is such a common operation that it and the underlying system calls will be highly optimised on any common PHP/OS combination.
As John II noted, if extra functionality and user interface features are a priority then a database would be the way to go.
