I'm adding avatars to a forum engine I'm designing, and I'm debating whether to do something simple (the avatar image is named after the user) and use PHP to check if the file exists before displaying it, or to do something a bit more complicated (but not much) and use a database field to contain the name of the image to show.
I'd much rather go with the file_exists() method personally, as that gives me an easy way to fall back to a "default" avatar if the current one doesn't exist (yet), and it's simple to implement code-wise. However, I'm worried about performance, since this check will run once per user shown, per page load, on the forum read pages. So I'd like to know: is the file_exists() function in PHP slow enough to cause significant performance hits under high-traffic conditions?
If not, great. If it does, what is your opinion on alternatives for keeping track of a user-uploaded image? Thanks!
PS: The code difference I can see is that the file-checking version lets the files do the talking, while the database version trusts that the database is accurate and doesn't bother to check. (It's just a URL that gets passed to the browser, of course.)
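For reference, here is a minimal sketch of the file_exists() approach I have in mind (the paths and the default file name are just placeholders):

// Minimal sketch of the file_exists() fallback; paths and names are placeholders.
function avatar_url($user_id) {
    $path = "avatars/" . (int)$user_id . ".png";
    if (file_exists($path)) {
        return $path;              // the user has uploaded an avatar
    }
    return "avatars/default.png";  // fall back to the default avatar
}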
As well as what the other posters have said, the result of file_exists() is automatically cached by PHP to improve performance.
However, if you're already reading user info from the database, you may as well store the information in there. If the user is only allowed one avatar, you could just store a single bit in a column for "has avatar" (1/0), have the filename be the same as the user id, and use something like:

SELECT CONCAT(IF(has_avatar, id, 'default'), '.png') AS avatar FROM users
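A rough sketch of using that from PHP, assuming mysqli and the column names above (the connection details are placeholders):

// Rough sketch; the mysqli connection details and users table layout are assumptions.
$mysqli = new mysqli('localhost', 'user', 'pass', 'forum');
$result = $mysqli->query(
    "SELECT CONCAT(IF(has_avatar, id, 'default'), '.png') AS avatar
       FROM users WHERE id = 42"
);
$row = $result->fetch_assoc();
echo '<img src="/avatars/' . htmlspecialchars($row['avatar']) . '" alt="avatar">';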
You could also consider storing the actual image in the database as a BLOB. Put it in its own table rather than attaching it as a column to the user table. This has the benefit that it makes your forum very easy to back up - you just export the database.
Since your web server will already be doing a lot of (the equivalent of) file_exists() operations in the process of showing your web page, one more run by your script probably won't have a measurable impact. The web server will probably do at least:
one for each subdirectory of the web root (to check existence and for symlinks)
one to check for a .htaccess file for each subdirectory of the web root
one for the existence of your script
And that's not counting any that PHP might do itself.
In actual performance testing, you will discover file_exists to be very fast. As it is, in PHP, when the same path is stat'd twice, the second call is just pulled from PHP's internal stat cache.
And that's just within the scope of a single PHP run. Even between runs, the filesystem/OS will tend to aggressively put the file into the filesystem cache, and if the file is small enough, not only will the file-exists test come straight out of memory, but the entire file will too.
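To see that in-request stat cache in action, here is a tiny sketch (the path is just an example):

// Tiny illustration of PHP's stat cache; the path is only an example.
$path = '/tmp/example.png';

var_dump(file_exists($path));  // first call actually stats the filesystem
var_dump(file_exists($path));  // second call is answered from the stat cache

clearstatcache();              // clear the cache...
var_dump(file_exists($path));  // ...so this call hits the filesystem again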
Here's some real data to back my theory:
I was just doing some performance tests of the Linux command-line utilities find and xargs. In the process, I performed a file-exists test on 13,000 files, 100 times each, in under 30 seconds. That averages out to over 43,000 stat tests per second. Sure, on a fine scale it's slow if you compare it to, say, the time it takes to divide 9 by 8, but in a real-world scenario you would need to be doing this an awful lot of times before you see a notable performance problem.
If you have 43 thousand users concurrently accessing your page in the space of a second, I think you are going to have much bigger concerns than the time it takes to check the existence of a file, which in the average case comes more or less straight out of memory.
At least with PHP4, I found that a call to file_exists was definitely killing our application - it was called very repeatedly, deep inside a library, so we really had to use a profiler to find it. Removing the call made some pages compute a dozen times faster (the call was made very repeatedly).
It may be that PHP5 caches file_exists, but at least with PHP4 that was not the case.
Now, if you are not in a loop, obviously, file_exists won't be a big deal.
file_exists() is not slow per se. The real issue is how your system is configured and where the performance bottlenecks are. Remember, databases have to store things on disk too, so either way you're potentially facing disk activity. On the other hand, both databases and file systems usually have some form of transparent caching to optimize repeat access.
You could easily go either way, since chances are your performance bottleneck will be elsewhere. The only place where I can see it being an obvious choice would be if you're on some kind of oversold shared hosting where there's a ton of disk contention, but maybe database access is on a separate cluster and faster (or vice versa).
In the past I've stored the image metadata in a database (including its name) so that we could generate useful stats. More importantly, storing image data (not the file itself, just the metadata) is conducive to change. What if in the future you need to "approve" the image, or you want to delete it without deleting the file?
As per the "default" avatar... well if the record isn't found for that user, just use the default one.
Either way, file_exists() or db, it shouldn't be much of a bottleneck to worry about. One solution, however, is much more expandable.
If performance is your only consideration, a file_exists() will be much less expensive than a database lookup.
After all, this is just a directory lookup using system calls. After the first execution of the script, most of the relevant directory information will be cached, so there is very little actual I/O involved, and file_exists() is such a common operation that it and the underlying system calls will be highly optimised on any common PHP/OS combination.
As John II noted, if extra functionality and user interface features are a priority, then a database would be the way to go.
Related
I'm storing some highres files on a server (100K+ of them, if that matters) and I organize them into different galleries. When somebody accesses a gallery I only show thumbnails and a lowres version of the images, which in some cases is watermarked and in other cases not. Because I'm talking about a huge number of pictures, the low-resolution version displayed on the gallery page is purged from the server after X days. If somebody accesses the gallery and the lowres version of a file doesn't exist on the server, it is generated on the fly; however, when I generate the lowres version I may or may not need to watermark it.
Currently, the script which displays the images doesn't make any SQL calls; it's all based on the filesystem (file exists, etc.), and the decision whether to watermark an image or not is based on:

if (strpos($file_name, "FREE") === false) { /* add watermark */ } else { /* just resize */ }
My logic says this is more performant than doing an SQL query against the filename or file id and checking whether it should be a non-watermarked image. However, I find it a bit of an inconvenience to have the file names contain the word FREE.
How much of a performance difference can I expect if I use an SQL query instead of the strpos?
EDIT/UPDATE
To summarize up the answers and the comments:
The system is being designed to keep working for a few years, with all the galleries added over time remaining accessible. This means the storage requirement is just HUGE, and the high-res images of older albums will be moved off-site to slow, cheap dedicated storage, so the suggestion to keep all the thumbnails around as extra overhead is a severe no-go. Last year I needed to store over 3TB of images (and that is only the highres size).
I'm on Lighttpd and I intend to use rewrite-if-not-file to get the best performance for the existing thumbnails.
I know about the I/O write penalty and I intend to keep it to a minimum, writing only when necessary and preferring reads. However, the comment from @N.B. actually got me thinking about storing the lowres images on an SSD, so even when I need to create and write them to disk I get much better I/O performance than with a normal HDD.
It will actually be difficult to do tests (@Steve E.); I'm behind schedule and the system has to go live by the end of this month (I just got the bombshell today that they are pulling the plug on the old system). Yes, flexibility is the main reason I'm tempted to go with SQL, but I'm expecting the SQL database to grow significantly; besides the file information there is plenty of other information I need to store as well (tagging, purchases, downloads, etc.), so I'm also trying to make sure that I'm not putting too much pressure on the SQL server when I can offload some of that with a good structure and filesystem access.
Without doing tests, it is hard to be certain which approach would be faster. Simple logic may suggest that PHP accessing disk is faster, but that is based on a lot of assumptions.
In a well configured system, data that is required frequently will be in the RAM cache rather than read from disk. This applies to caching of the filesystem as well as to MySQL caching its indexes. The impact of caching and other mechanisms can give results different from what you might expect.
In many scenarios, either solution will work and be adequate since the time taken for either request should be minimal in a well designed system and the extra performance of one approach may not be worth the inconvenience you find with using 'FREE' in the filename. It would not be too hard to trial both methods and measure performance.
In the long term also consider that MySQL provides more flexibility for adding additional functionality that would get complicated if all the state was stored in file names.
If performance really is a significant issue, then look at using the webserver to check for the file on disk (or in a cache like memcache) and return that if it exists before passing the request to PHP at all. Both Nginx and Apache can do this, it's a common acceleration approach for high traffic websites.
You've already done the hard part. An SQL query in your case will only slow you down...
Here's how you're doing it now:
user--->php-->filesystem-->php--->user
If MySQL comes in, this is how it goes:
user--->php--->mysql--->filesystem--->mysql-->php--->user
So you're already saving yourself some time by not using MySQL...
If somebody does access the gallery and the lowres version of the file doesn't exist on the server it is generated on the fly
If the highres versions are not stored in a database but as server files, this means that the lowres thumbnails occupy very little space in proportion to the highres images. For example, let's assume the lowres images are 10% of the size of the highres ones. Keeping all the lowres images available on your server only adds 10% to your storage needs, and if you haven't got 10% spare capacity, then you need to go shopping for more storage, not try programming workarounds.
From the comment, it appears you already store some information about the file in the database. If that is the case, then you should be able to add a column indicating whether the image is free or not, and fetch that extra column at the same time as you query for the other info, adding little to no overhead.
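A rough sketch of what that lookup could look like (the PDO connection, images table and is_free column are made up for illustration):

// Rough sketch; the connection details, images table and is_free column are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');
$stmt = $pdo->prepare('SELECT file_name, is_free FROM images WHERE file_id = ?');
$stmt->execute([$fileId]);
$image = $stmt->fetch(PDO::FETCH_ASSOC);

if ($image['is_free']) {
    // just resize
} else {
    // add watermark
}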
I am creating a web-based app for Android, and I have come to the point of building the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now I'm thinking about doing it in a database (like you probably should); wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because its location is known, or would it be faster to get the information from a database, even though it would have to first scan line by line until it reaches the one with the correct name?
I don't care about the safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage of a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
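To make the comparison concrete, here is a hypothetical sketch of both lookups (the file layout, table and column names are assumptions, not your actual app):

// Hypothetical sketch; file layout, table and column names are assumptions.

// Flat file: scan line by line until the name matches.
$data = null;
foreach (file('users.txt') as $line) {
    if (strpos($line, $name . ':') === 0) {            // line starts with "name:"
        $data = trim(substr($line, strlen($name) + 1));
        break;
    }
}

// Database: let an index on the name column do the work.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('SELECT data FROM users WHERE name = ?');
$stmt->execute([$name]);
$data = $stmt->fetchColumn();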
A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database. Another is understanding the relational model, so that data doesn't need to be repeated over and over. Another is understanding types. If you have a txt file, you'll need to parse numbers, dates, etc.

So - the file system might work for you in some cases, but certainly not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
It is quite a simple solution - use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen, and you will lose a user's profile info.
With a database engine it's much more difficult to lose data like that.
EDIT:
Also, a big question - is this about the server side or the app side?
Because, for the app side, realistically you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and for the majority you will have a single user.
OK, I am in the midst of developing a shared system/service of sorts, where people will be able to upload their own media to the server(s). I am using PHP and MySQL for the majority of the build, and am currently using a single-server environment. However, I need this to be scalable, as I do intend on moving the media to a cluster of servers in the next 6 months, leaving the site/service on its own server. Anyway, that's a moot point.
My goal, or hope rather, is to come up with an extremely low-risk naming convention that has as little possibility as possible of ever running into a collision with another file when renaming the file upon upload. I have read many concepts to date and find that a UUID (GUID) is the best candidate for my overall needs, as it has a number of possibilities so high that I don't think I could ever reach that many shared images.
My problem is coming up with a function that generates a UUID, preferably v3 or v5 (I understand they are the same, but v5 currently doesn't comply 100% with the UUID standard). Knowing little about UUIDs and the constraints thereof that make them unique and/or valid when trying to regex over them later if needed, I can't seem to come up with a viable solution. Nor do I know whether I should really go with v3, v5, or v4 for that matter. So I am looking for advice, as well as help on a function that will return the desired UUID version.
Save your breath - I haven't tried anything yet, as I don't know where to begin currently. With that said, I intend on saving these files across many folders to offset the load caused by large directory listings, so I am also reducing my risk of collision there as well. I am also storing these names in a DB with their associated folders and other information tied to each image. Another problem I see there is that when I randomly generate a UUID for a file to be renamed, I don't want to query the DB multiple times in the event of a collision, so I may actually want to return maybe 5 UUIDs per function call and see which, if any, have a match in my query, where I'll use the first one that doesn't have a match.
Anyway, I know this is a lot of reading, and I know there's no code with it; hopefully you don't end up downvoting this because there's too much reading and assume this is a poor question/discussion. I would seriously like to know how to tackle this from the beginning so I can scale up as needed with as little hassle as possible.
If you are going to store a reference to each file in the database anyway... why don't you use the MySQL auto_increment id to name your files? If you scale the DB to a cluster, the ID is still unique (being a PK, it must be unique!), so why waste precious CPU time on UUID generation and the like? This is not what UUIDs are made for.
I'd go for the easiest way (and I've seen this in many other systems, too):
1. upload the file
2. when the upload succeeded, insert a DB reference (with the path determined by step 3); fetch the auto_incremented $ID
3. rename the file to ${YEAR}/${MONTH}/${DAY}/${ID} (adjust if you need a more granular path when too many files are uploaded per day)
4. if the rename failed, delete the DB reference and show an error message
5. update the DB reference with the actual path in the file system
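A rough PHP sketch of those steps (the connection details, files table and column names are assumptions):

// Rough sketch of the steps above; connection details, table and columns are made up.
$pdo = new PDO('mysql:host=localhost;dbname=media', 'user', 'pass');

$pdo->prepare('INSERT INTO files (original_name) VALUES (?)')
    ->execute([$_FILES['upload']['name']]);
$id   = $pdo->lastInsertId();          // the auto_incremented $ID
$dir  = date('Y/m/d');                 // ${YEAR}/${MONTH}/${DAY}
$path = $dir . '/' . $id;

if (!is_dir($dir)) {
    mkdir($dir, 0755, true);
}
if (!move_uploaded_file($_FILES['upload']['tmp_name'], $path)) {
    // rename failed: delete the DB reference and show an error message
    $pdo->prepare('DELETE FROM files WHERE id = ?')->execute([$id]);
    exit('Could not store the uploaded file.');
}
// update the DB reference with the actual path in the file system
$pdo->prepare('UPDATE files SET path = ? WHERE id = ?')->execute([$path, $id]);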
My goal, or hope rather, is to come up with an extremely low-risk naming convention that has as little possibility as possible of ever running into a collision with another file when renaming the file upon upload. I have read many concepts to date and find that a UUID (GUID) is the best candidate for my overall needs, as it has a number of possibilities so high that I don't think I could ever reach that many shared images.
You could build a number (which you would then implement as a UUID) made up of:
Date (YYYYMMDD)
Server (NNN)
Counter of images uploaded on that server that day
This number will never generate any collisions, since it always increments, and it can scale up to one thousand servers. Say that you get at most one million images per day on each server; that's around 43 bits of information. Add another 32 bits of randomness so that a UUID can't be guessed (in fewer than 2^31 attempts on average). You have some fifty-odd bits left to allow for further scaling.
Or you could store some digits in BCD to make them human-readable:
20120917-0172-4123-8456-7890d0b931b9
could be image 1234567890, random d0b931b9, uploaded on server 0172 on September 17th, 2012.
The scheme might even double as a "directory spreading" scheme: once an image has a UUID which maps to, say, 20120917-125-00001827-d0b931b9, that means server 125, and you can store it in a directory structure called d0/b9/31/b9/20120917-125-00001827.jpg.
The naming convention ensures uniqueness, and the random bits ensure that the directory structure stays "flat" (filling evenly, with no directory much fuller than the others), optimizing retrieval time.
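Here is a hedged PHP sketch of that scheme (the server number and per-day counter would come from your own config and database; random_bytes() needs PHP 7, otherwise something like openssl_random_pseudo_bytes() could stand in):

// Sketch only: the server number and counter are assumed to come from config/DB.
function make_image_id($serverNo, $counter) {
    $random = bin2hex(random_bytes(4));            // 32 bits of randomness
    return sprintf('%s-%03d-%08d-%s', date('Ymd'), $serverNo, $counter, $random);
}

$id = make_image_id(125, 1827);                    // e.g. 20120917-125-00001827-d0b931b9

// "Directory spreading": derive the path from the random part so directories fill evenly.
$dir  = implode('/', str_split(substr($id, -8), 2));   // e.g. d0/b9/31/b9
$path = $dir . '/' . $id . '.jpg';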
I have images stored in folders related to articles on my PHP web site, and I would like to set the order in which the images are displayed based on author input. I started by naming the files with a number in front of them, but I was considering recording the order in a text file instead, to avoid renaming every file and to retain the original file names, or possibly storing the order in a MySQL table.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them. I was hoping to get some input about which method would be best and fastest.
For example, is it much slower to read a list of file names in a folder with PHP, or open a text file and read the contents, compared to making MySQL query and update statements?
I'd say a lot depends on your underlying hardware, filesystem, and MySQL connection performance. A single disk access just to read the images is most likely going to be your quickest option, but you'd need to name your files manually ahead of time.
MySQL requires a TCP or *NIX socket connection, and this might slow things down (a lot depends on the number of pictures you have and the "quality" of your DB link, though). If you have a lot of files, the connection overhead might be negligible. Just reading from a file might be quicker nevertheless, without bothering to set up a DB connection; you'd still have to write down the ID/filename correspondence for the ordering, though.
Something I'd try out in your situation is the PHP stat command, to see if it can help you sort the pictures. Depending on the number of pictures you have (it works better with lower numbers), performance might not take a serious hit, and you wouldn't have to keep a separate list of picture/creation-date tuples. As your number of pictures grows, the file-list approach seems to me like a reasonable way to solve the problem. Just benchmarking the thing as the number of pictures increases will tell you the truth, though, since I think you can expect a lot of variability depending on your specific context.
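For instance, a quick sketch of sorting a folder's images by modification time via stat information (the folder path is just an example):

// Quick sketch: sort a folder's images by modification time; the path is an example.
$files = glob('articles/123/*.jpg');
usort($files, function ($a, $b) {
    return filemtime($a) - filemtime($b);   // oldest first; reverse for newest first
});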
If your concern is performance, why don't you save the list (maybe already formatted in HTML) to a file? When your page is loaded, just read the file with
$code = file_get_contents("cache_file.html");
and output it to the user. The fastest solution is to store the file as .html and let Apache serve it directly, but this only works if your page doesn't have any other dynamic part.
To ensure that your cache file is up to date, you can invalidate it and recreate it after some time (the specific interval depends on how frequently the images change), or check whether the directory has changed after the cache file's creation date. If you can trigger the changes in the image directory (for example, if the changes are made by a piece of code you wrote), you can always ensure that your cache is refreshed when the images change.
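A small sketch of that staleness check (the file names are placeholders, and build_image_list_html() is a hypothetical helper that regenerates the list):

// Sketch of the "is the cache stale?" check; the names below are placeholders.
$cache = 'cache_file.html';
$dir   = 'images/';

if (!file_exists($cache) || filemtime($dir) > filemtime($cache)) {
    // the directory changed after the cache was written: rebuild it
    file_put_contents($cache, build_image_list_html($dir));  // hypothetical helper
}
echo file_get_contents($cache);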
Hope this helps
This smells like premature optimization.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them.
So? A query like "select filename, title from images where articleId = $articleId order by `order`" will execute within a fraction of a second. It really doesn't matter. Do whatever is easiest, and might I suggest that be the SQL option.
IMHO, using MySQL would be slower, but oh so much easier. If the MySQL server is hosted on the same server (or within dedicated space on the same server, like CloudLinux), the filesystem approach probably won't save you much time anyway.
EDIT:
If you want to do a test, you can use the microtime function to time exactly how long it takes to read and sort the files, and how long it takes to get it all from MySQL.
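Something along those lines, as a rough sketch (paths, table and column names are made up):

// Rough microtime() benchmark sketch; paths, table and column names are made up.
$start  = microtime(true);
$files  = glob('articles/123/*.jpg');
sort($files);
$fsTime = microtime(true) - $start;

$pdo    = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$start  = microtime(true);
$rows   = $pdo->query('SELECT filename FROM images WHERE article_id = 123 ORDER BY sort_order')
              ->fetchAll();
$dbTime = microtime(true) - $start;

printf("filesystem: %.5fs, mysql: %.5fs\n", $fsTime, $dbTime);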
I'm on an optimization crusade for one of my sites, trying to cut down as many MySQL queries as I can.
I'm implementing partial caching, which writes .txt files for various modules of the site and updates them on demand. I've come across one module that cannot remain static for all users, so the .txt file that's written to disk will need to be altered on the fly via PHP.
Which is done via:
flush();                     // send anything already buffered to the browser
ob_start();                  // start capturing output
include('file.txt');         // pull the cached module into the buffer
$contents = ob_get_clean();  // grab the buffer contents and stop capturing
Then I modify the html in the $contents variable, and echo it out for different users.
Alternatively, I can leave it as it is, which runs a MySQL query against a small table that holds category names (about 13 of them).
Which one is less expensive: running a query every single time, or the method I posted above, injecting HTML code on the fly into a static .txt file?
Reading the file (save in very weird setups) will be minutely faster than querying the DB (no network interaction, &c), but the difference will hardly be measurable -- just try and see if you can measure it!
Optimize your queries first! Then use memcache or a similar caching system for data that is accessed frequently, and then you can add file caching. We use all three combined and it runs very smoothly. Small optimized queries aren't so bad. If your DB is on a local server, the network is not an issue. And don't forget to use the MySQL query cache (I guess you do use MySQL).
Where is your performance bottleneck?
If you don't know the bottleneck, you can't make any sensible assessment about optimisations.
Collect some metrics, and optimise accordingly.
Try both and choose the one that is either a clear winner or, failing that, more maintainable. This depends on where the DB is, how much load it's getting, and whether you'll need to run more than one application instance (in which case they'd need to share the file over the network and it wouldn't be local anymore).
Here are the patterns that work for me when I'm refactoring PHP/MySQL site code.
The number of queries per page is absolutely critical - one complex query with joins is fastest as long as indexes are proper. A single page can almost always be generated with five or fewer queries in my experience, plus good use of classes and arrays of classes. Often one query for the session and one query for the app.
After indexes the biggest thing to work on is the caching configuration parameters.
Never have queries in loops (see the sketch at the end of this answer).
Moving database queries to files has never been a useful strategy, especially since it often ends up screwing up your query integrity.
Alex and the others are right about testing. If your pages are noticeably slow, then they are slow for a reason (or reasons) - don't even start changing anything until you know what the reasons are and can measure the consequences of your changes. Refactoring by guessing is always a losing strategy, especially when (as in your case) you're adding complexity.
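As an illustration of the "no queries in loops" point above (this is a sketch; the table and column names are made up, and it assumes an existing PDO connection in $pdo and a list of ids in $postIds):

// Sketch only; table/column names are made up, $pdo is an existing PDO connection.

// Slow: a query per id, inside the loop.
foreach ($postIds as $id) {
    $post = $pdo->query('SELECT body, user_id FROM posts WHERE id = ' . (int)$id)->fetch();
}

// Better: one query with a join fetches all posts and their authors at once.
$posts = $pdo->query(
    'SELECT p.id, p.body, u.name
       FROM posts p
       JOIN users u ON u.id = p.user_id
      WHERE p.thread_id = 42'
)->fetchAll();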