PHP performance strpos filename or MySQL query - php

I'm storing some highres files on a server (100K+, if that matters) and I organize them in different galleries. When somebody accesses a gallery I only show thumbnails and a lowres version of the images, which in some cases is watermarked and in other cases not. Because I'm dealing with a huge number of pictures, the low resolution version displayed on the gallery page is purged from the server after X days. If somebody accesses the gallery and the lowres version of a file doesn't exist on the server, it is generated on the fly; however, when I generate the lowres version I may or may not need to watermark it.
Currently, the script which displays the images doesn't make any SQL calls; it's all based on the filesystem (if file exists, etc.), and the decision to watermark an image or not is based on:
if (strpos($file_name, "FREE") === false) { /* add watermark */ } else { /* just resize */ }
My logic says this is more performant than doing an SQL query against the filename or file ID and checking whether it should be a non-watermarked image. However, I find it a bit inconvenient to have file names containing the word FREE.
How much of a performance difference can I expect if I use an SQL query instead of the strpos?
EDIT/UPDATE
To summarize the answers and the comments:
The system is being designed to work for several years, with all the galleries added over time remaining accessible. This means the storage requirement is just HUGE, and the highres images of older albums will be moved off-site to slow, cheap dedicated storage, so the suggestion to keep all the thumbnails around as additional overhead is a definite no-go. Last year I needed to store over 3TB of images (and that is only the highres size).
I'm on Lighttpd and I intend to use rewrite-if-not-file to get the best performance for the existing thumbnails.
I know about the I/O write penalty and I intend to keep it to a minimum, writing only when necessary and preferably reading. However, the comment from #N.B. did get me thinking about storing the lowres images on an SSD, so that even when I need to create them and write them to disk I get much better I/O performance than with a normal HDD.
It will actually be difficult to do some tests (#Steve E.): I'm behind schedule and the system has to go live by the end of this month. (I just got the bombshell today that they are pulling the plug on the old system.) Yes, flexibility is the main reason I'm tempted to go with SQL, but I'm expecting the SQL database to grow significantly: besides the file information, there is a lot of other information I need to store as well (tagging, purchases, downloads, etc.), so I'm also trying to make sure I'm not putting too much pressure on SQL when I can offload some of that with a good structure and filesystem access.

Without doing tests, it is hard to be certain which approach would be faster. Simple logic may suggest that PHP accessing disk is faster, but that is based on a lot of assumptions.
In a well configured system, variables that are required frequently will be in the RAM cache rather than on disk. This applies to caching of the filesystem as well as MySQL caching indexes. The impact of caching and other mechanisms can give results different to what might be expected.
In many scenarios, either solution will work and be adequate since the time taken for either request should be minimal in a well designed system and the extra performance of one approach may not be worth the inconvenience you find with using 'FREE' in the filename. It would not be too hard to trial both methods and measure performance.
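A trial like that could start from a small timing harness. The following is only a minimal sketch, assuming PHP 7.3+ for hrtime(); timeIt(), needsWatermark() and the sample file name are hypothetical, and the second callable is just a stand-in marking where a real MySQL lookup would go.

```php
<?php
// Hypothetical helper: average time per call of $fn, in microseconds.
function timeIt(callable $fn, int $iterations = 100000): float
{
    $start = hrtime(true);               // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $iterations / 1000;
}

// Same decision as in the question: no "FREE" in the name means watermark.
function needsWatermark(string $fileName): bool
{
    return strpos($fileName, 'FREE') === false;
}

$fileName = 'gallery42/IMG_0001_FREE.jpg';   // made-up sample name

$strposCost = timeIt(function () use ($fileName) {
    needsWatermark($fileName);
});

// Stand-in for the SQL path: replace the body with a real query
// (for example a prepared SELECT by file id) to compare on your own setup.
$lookup = ['gallery42/IMG_0001_FREE.jpg' => true];
$lookupCost = timeIt(function () use ($lookup, $fileName) {
    isset($lookup[$fileName]);
});

printf("strpos: %.3f us/call, stand-in lookup: %.3f us/call\n", $strposCost, $lookupCost);
```

The numbers only mean something when measured on the production-like machine with warm caches, which is the point made above about caching changing the expected result.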
In the long term also consider that MySQL provides more flexibility for adding additional functionality that would get complicated if all the state was stored in file names.
If performance really is a significant issue, then look at using the webserver to check for the file on disk (or in a cache like memcache) and return that if it exists before passing the request to PHP at all. Both Nginx and Apache can do this, it's a common acceleration approach for high traffic websites.

You've already done the hard part. SQL query in your case will only slow you down...
here's how you're doing it
user--->php-->filesystem-->php--->user
if mysql comes in this is how it goes
user--->php--->mysql--->filesystem--->mysql-->php--->user
so you're already saving yourself some time while not using mysql...

If somebody does access the gallery and the lowres version of the file doesn't exist on the server it is generated on the fly
If the highres versions are not stored in a database but as server files, this means that the low res thumbnails occupy very little space in proportion to the high res images. For example, let's assume the low res images are 10% of the size of the high res ones. Keeping all the low res images available on your server only adds 10% to your storage needs, and if you haven't got 10% spare capacity, then you need to go shopping for more storage, not try programming workarounds.
From the comment, it appears you already store some information about the file in the database. If this is the case then you should be able to add a column to determine if it is free or not and get the extra column at the same time as you query for the other info, adding little to no overhead.

Related

What should I do for storing images in PHP and MySQL? [duplicate]

So I'm using an app that stores images heavily in the DB. What's your outlook on this? I'm more of a type to store the location in the filesystem, than store it directly in the DB.
What do you think are the pros/cons?
I'm in charge of some applications that manage many TB of images. We've found storing file paths in the database to be best.
There are a couple of issues:
database storage is usually more expensive than file system storage
you can super-accelerate file system access with standard off the shelf products
for example, many web servers use the operating system's sendfile() system call to asynchronously send a file directly from the file system to the network interface. Images stored in a database don't benefit from this optimization.
things like web servers, etc, need no special coding or processing to access images in the file system
databases win out where transactional integrity between the image and metadata are important.
it is more complex to manage integrity between db metadata and file system data
it is difficult (within the context of a web application) to guarantee data has been flushed to disk on the filesystem
As with most issues, it's not as simple as it sounds. There are cases where it would make sense to store the images in the database.
You are storing images that are changing dynamically, say invoices, and you wanted to get an invoice as it was on 1 Jan 2007?
The government wants you to maintain 6 years of history
Images stored in the database do not require a different backup strategy. Images stored on filesystem do
It is easier to control access to the images if they are in a database. Idle admins can access any folder on disk. It takes a really determined admin to go snooping in a database to extract the images
On the other hand there are problems associated
Require additional code to extract and stream the images
Latency may be slower than direct file access
Heavier load on the database server
File store. Facebook engineers had a great talk about it. One take away was to know the practical limit of files in a directory.
Needle in a Haystack: Efficient Storage of Billions of Photos
This might be a bit of a long shot, but if you're using (or planning on using) SQL Server 2008 I'd recommend having a look at the new FileStream data type.
FileStream solves most of the problems around storing the files in the DB:
The Blobs are actually stored as files in a folder.
The Blobs can be accessed using either a database connection or over the filesystem.
Backups are integrated.
Migration "just works".
However SQL's "Transparent Data Encryption" does not encrypt FileStream objects, so if that is a consideration, you may be better off just storing them as varbinary.
From the MSDN Article:
Transact-SQL statements can insert, update, query, search, and back up FILESTREAM data. Win32 file system interfaces provide streaming access to the data.
FILESTREAM uses the NT system cache for caching file data. This helps reduce any effect that FILESTREAM data might have on Database Engine performance. The SQL Server buffer pool is not used; therefore, this memory is available for query processing.
File paths in the DB is definitely the way to go - I've heard story after story from customers with TB of images that it became a nightmare trying to store any significant amount of images in a DB - the performance hit alone is too much.
In my experience, sometimes the simplest solution is to name the images according to the primary key. So it's easy to find the image that belongs to a particular record, and vice versa. But at the same time you're not storing anything about the image in the database.
The trick here is to not become a zealot.
One thing to note here is that no one in the pro file system camp has listed a particular file system. Does this mean that everything from FAT16 to ZFS handily beats every database?
No.
The truth is that many databases beat many files systems, even when we're only talking about raw speed.
The correct course of action is to make the right decision for your precise scenario, and to do that, you'll need some numbers and some use case estimates.
In places where you MUST guarantee referential integrity and ACID compliance, storing images in the database is required.
You cannot transactionally guarantee that the image and the metadata about that image stored in the database refer to the same file. In other words, it is impossible to guarantee that the file on the filesystem is only ever altered at the same time and in the same transaction as the metadata.
As others have said SQL 2008 comes with a Filestream type that allows you to store a filename or identifier as a pointer in the db and automatically stores the image on your filesystem which is a great scenario.
If you're on an older database, then I'd say that if you're storing it as blob data, then you're really not going to get anything out of the database in the way of searching features, so it's probably best to store an address on a filesystem, and store the image that way.
That way you also save space on your filesystem, as you are only going to save the exact amount of space, or even compacted space on the filesystem.
Also, you could decide to save with some structure or elements that allow you to browse the raw images in your filesystem without any db hits, or transfer the files in bulk to another system, hard drive, S3 or another scenario - updating the location in your program, but keep the structure, again without much of a hit trying to bring the images out of your db when trying to increase storage.
Probably, it would also allow you to throw some caching element, based on commonly hit image urls into your web engine/program, so you're saving yourself there as well.
Small static images (not more than a couple of megs) that are not frequently edited, should be stored in the database. This method has several benefits including easier portability (images are transferred with the database), easier backup/restore (images are backed up with the database) and better scalability (a file system folder with thousands of little thumbnail files sounds like a scalability nightmare to me).
Serving up images from a database is easy, just implement an http handler that serves the byte array returned from the DB server as a binary stream.
Here's an interesting white paper on the topic.
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
The answer is "It depends." Certainly it would depend upon the database server and its approach to blob storage. It also depends on the type of data being stored in blobs, as well as how that data is to be accessed.
Smaller sized files can be efficiently stored and delivered using the database as the storage mechanism. Larger files would probably be best stored using the file system, especially if they will be modified/updated often. (blob fragmentation becomes an issue in regards to performance.)
Here's an additional point to keep in mind. One of the reasons supporting the use of a database to store the blobs is ACID compliance. However, the approach that the testers used in the white paper, (Bulk Logged option of SQL Server,) which doubled SQL Server throughput, effectively changed the 'D' in ACID to a 'd,' as the blob data was not logged with the initial writes for the transaction. Therefore, if full ACID compliance is an important requirement for your system, halve the SQL Server throughput figures for database writes when comparing file I/O to database blob I/O.
One thing that I haven't seen anyone mention yet but is definitely worth noting is that there are issues associated with storing large amounts of images in most filesystems too. For example if you take the approach mentioned above and name each image file after the primary key, on most filesystems you will run into issues if you try to put all of the images in one big directory once you reach a very large number of images (e.g. in the hundreds of thousands or millions).
One common solution to this is to hash them out into a balanced tree of subdirectories.
Something nobody has mentioned is that the DB guarantees atomic actions, transactional integrity and deals with concurrency. Even referential integrity is out of the window with a filesystem - so how do you know your file names are really still correct?
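A minimal sketch of that idea, assuming an MD5 prefix is used to choose the subdirectories; shardedPath() and the /var/images base directory are hypothetical names, not something taken from the answers here.

```php
<?php
// Hypothetical helper: spread files over nested directories using a hash prefix.
// Each level uses 2 hex characters, so a directory holds at most 256 subdirectories.
function shardedPath(string $baseDir, string $fileName, int $levels = 2): string
{
    $hash = md5($fileName);              // used for spreading only, not for security
    $parts = [];
    for ($i = 0; $i < $levels; $i++) {
        $parts[] = substr($hash, $i * 2, 2);
    }
    return rtrim($baseDir, '/') . '/' . implode('/', $parts) . '/' . $fileName;
}

echo shardedPath('/var/images', '12345.jpg'), "\n";
```

Because the path is derived from the file name alone, it can be recomputed at any time without storing it. Two levels give 65,536 leaf directories, which keeps each one small even with millions of files.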
If you have your images in a file-system and someone is reading the file as you're writing a new version or even deleting the file - what happens?
We use blobs because they're easier to manage (backup, replication, transfer) too. They work well for us.
The problem with storing only filepaths to images in a database is that the database's integrity can no longer be forced.
If the actual image pointed to by the filepath becomes unavailable, the database unwittingly has an integrity error.
Given that the images are the actual data being sought after, and that they can be managed easier (the images won't suddenly disappear) in one integrated database rather than having to interface with some kind of filesystem (if the filesystem is independently accessed, the images MIGHT suddenly "disappear"), I'd go for storing them directly as a BLOB or such.
At a company where I used to work we stored 155 million images in an Oracle 8i (then 9i) database. 7.5TB worth.
Normally, I'm strongly against taking the most expensive and hardest to scale part of your infrastructure (the database) and putting all the load onto it. On the other hand, it greatly simplifies the backup strategy, especially when you have multiple web servers and need to somehow keep the data synchronized.
Like most other things, It depends on the expected size and Budget.
We have implemented a document imaging system that stores all its images in SQL2005 blob fields. There are several hundred GB at the moment and we are seeing excellent response times and little or no performance degradation. In addition, for regulatory compliance, we have a middleware layer that archives newly posted documents to an optical jukebox system which exposes them as a standard NTFS file system.
We've been very pleased with the results, particularly with respect to:
Ease of Replication and Backup
Ability to easily implement a document versioning system
If this is web-based application then there could be advantages to storing the images on a third-party storage delivery network, such as Amazon's S3 or the Nirvanix platform.
Assumption: Application is web enabled/web based
I'm surprised no one has really mentioned this ... delegate it out to others who are specialists -> use a 3rd party image/file hosting provider.
Store your files on a paid online service like
Amazon S3
Moso Cloud Storage
Another StackOverflow threads talking about this here.
This thread explains why you should use a 3rd party hosting provider.
It's so worth it. They store it efficiently. No bandwidth is used on your servers to answer client requests, etc.
If you're not on SQL Server 2008 and you have some solid reasons for putting specific image files in the database, then you could take the "both" approach and use the file system as a temporary cache and use the database as the master repository.
For example, your business logic can check if an image file exists on disc before serving it up, retrieving from the database when necessary. This buys you the capability of multiple web servers and fewer sync issues.
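The "both" approach described above could look roughly like this. It is a sketch under assumptions: fetchImage() is a hypothetical name, and $loadFromDb stands in for whatever code reads the image bytes from the master repository in the database.

```php
<?php
// Hypothetical helper: serve from the disk cache, falling back to the master copy.
// $loadFromDb stands in for whatever reads the image bytes from the database.
function fetchImage(string $cacheDir, string $name, callable $loadFromDb): string
{
    $path = rtrim($cacheDir, '/') . '/' . basename($name);  // basename() strips any ../ parts
    if (is_file($path)) {
        return file_get_contents($path);      // cache hit: no database access
    }
    $bytes = $loadFromDb($name);              // cache miss: go to the master repository
    // Write to a temp file and rename it, so readers never see a half-written file.
    $tmp = $path . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $bytes);
    rename($tmp, $path);
    return $bytes;
}
```

Each web server can keep its own cache directory this way, since a missing file is simply refilled from the database on the next request.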
I'm not sure how much of a "real world" example this is, but I currently have an application out there that stores details for a trading card game, including the images for the cards. Granted, the record count for the database is only 2851 records to date, but given that certain cards are released multiple times and have alternate artwork, it was actually more efficient size-wise to scan the "primary square" of the artwork and then dynamically generate the border and miscellaneous effects for the card when requested.
The original creator of this image library created a data access class that renders the image based on the request, and it does so quite fast when viewing an individual card.
This also eases deployment/updates when new cards are released, instead of zipping up an entire folder of images and sending those down the pipe and ensuring the proper folder structure is created, I simply update the database and have the user download it again. This currently sizes up to 56MB, which isn't great, but I'm working on an incremental update feature for future releases. In addition, there is a "no images" version of the application that allows those over dial-up to get the application without the download delay.
This solution has worked great to date since the application itself is targeted as a single instance on the desktop. There is a web site where all of this data is archived for online access, but I would in no way use the same solution for this. I agree the file access would be preferable because it would scale better to the frequency and volume of requests being made for the images.
Hopefully this isn't too much babble, but I saw the topic and wanted to provide some of my insights from a relatively successful small/medium scale application.
SQL Server 2008 offers a solution that has the best of both worlds : The filestream data type.
Manage it like a regular table and have the performance of the file system.
It depends on the number of images you are going to store and also their sizes. I have used databases to store images in the past and my experience has been fairly good.
IMO, Pros of using database to store images are,
A. You don't need FS structure to hold your images
B. Database indexes perform better than FS trees when a large number of items is stored
C. A smartly tuned database does a good job of caching query results
D. Backups are simple. It also works well if you have replication set up and content is delivered from a server near to user. In such cases, explicit synchronization is not required.
If your images are going to be small (say < 64k) and the storage engine of your db supports inline (in record) BLOBs, it improves performance further as no indirection is required (Locality of reference is achieved).
Storing images may be a bad idea when you are dealing with a small number of huge images. Another problem with storing images in the db is that metadata like creation and modification dates must be handled by your application.
I have recently created a PHP/MySQL app which stores PDFs/Word files in a MySQL table (as big as 40MB per file so far).
Pros:
Uploaded files are replicated to backup server along with everything else, no separate backup strategy is needed (peace of mind).
Setting up the web server is slightly simpler because I don't need to have an uploads/ folder and tell all my applications where it is.
I get to use transactions for edits to improve data integrity - I don't have to worry about orphaned and missing files
Cons:
mysqldump now takes a looooong time because there is 500MB of file data in one of the tables.
Overall not very memory/cpu efficient when compared to filesystem
I'd call my implementation a success, it takes care of backup requirements and simplifies the layout of the project. The performance is fine for the 20-30 people who use the app.
In my experience I have had to manage both situations: images stored in the database, and images on the file system with the path stored in the db.
The first solution, images in database, is somewhat "cleaner" as your data access layer will have to deal only with database objects; but this is good only when you have to deal with low numbers.
Obviously database access performance when you deal with binary large objects is degrading, and the database dimensions will grow a lot, causing again performance loss... and normally database space is much more expensive than file system space.
On the other hand having large binary objects stored in file system will cause you to have backup plans that have to consider both database and file system, and this can be an issue for some systems.
Another reason to go for the file system is when you have to share your image data (or sounds, video, whatever) with third parties: these days I'm developing a web app that uses images which have to be accessed from "outside" my web farm in such a way that database access to retrieve binary data is simply impossible. So sometimes there are also design considerations that will drive you to a choice.
Consider also, when making this choice, if you have to deal with permission and authentication when accessing binary objects: these requisites normally can be solved in an easier way when data are stored in db.
I once worked on an image processing application. We stored the uploaded images in a directory that was something like /images/[today's date]/[id number]. But we also extracted the metadata (exif data) from the images and stored that in the database, along with a timestamp and such.
In a previous project i stored images on the filesystem, and that caused a lot of headaches with backups, replication, and the filesystem getting out of sync with the database.
In my latest project i'm storing images in the database, and caching them on the filesystem, and it works really well. I've had no problems so far.
Second the recommendation on file paths. I've worked on a couple of projects that needed to manage large-ish asset collections, and any attempts to store things directly in the DB resulted in pain and frustration long-term.
The only real "pro" I can think of regarding storing them in the DB is the potential for easy access control over individual image assets. If there are no file paths to use, and all images are streamed straight out of the DB, there's no danger of a user finding files they shouldn't have access to.
That seems like it would be better solved with an intermediary script pulling data from a web-inaccessible file store, though. So the DB storage isn't REALLY necessary.
The word on the street is that unless you are a database vendor trying to prove that your database can do it (like, let's say Microsoft boasting about Terraserver storing a bajillion images in SQL Server) it's not a very good idea. When the alternative - storing images on file servers and paths in the database is so much easier, why bother? Blob fields are kind of like the off-road capabilities of SUVs - most people don't use them, those who do usually get in trouble, and then there are those who do, but only for the fun of it.
Storing an image in the database still means that the image data ends up somewhere in the file system but obscured so that you cannot access it directly.
+ves:
database integrity
its easy to manage since you don't have to worry about keeping the filesystem in sync when an image is added or deleted
-ves:
performance penalty -- a database lookup is usually slower than a filesystem lookup
you cannot edit the image directly (crop, resize)
Both methods are common and practiced. Have a look at the advantages and disadvantages. Either way, you'll have to think about how to overcome the disadvantages. Storing in database usually means tweaking database parameters and implement some kind of caching. Using filesystem requires you to find some way of keeping filesystem+database in sync.

Practical method of renaming files for high volume shared media server PHP / mySQL

OK, I am in the midst of developing a shared system/service of sorts, where people will be able to upload their own media to the server(s). I am using PHP and MySQL for the majority of the build, and am currently using a single-server environment. However, I need this to be scalable, as I intend to move the media to a cluster of servers in the next 6 months, leaving the site/service on its own server. Anyway, that's a moot point.
My goal, or rather my hope, is to come up with an extremely low-risk naming convention that has little chance of ever colliding with another file when renaming the file upon upload. I have read about many concepts to date and find that UUID (GUID) is the best candidate for my overall needs, as the number of possibilities is so high that I don't think I could ever reach that many shared images.
My problem is coming up with a function that generates a UUID, preferably v3 or v5 (I understand they are the same, but v5 currently doesn't comply 100% with the UUID standard). Knowing little about UUIDs and the constraints that make them unique and/or valid when trying to regex over them later if needed, I can't seem to come up with a viable solution. Nor do I know which version I should really go with: v3, v5, or v4 for that matter. So I am looking for advice as well as help with a function that will return a UUID of the desired version.
Save your breath: I haven't tried anything yet, as I don't currently know where to begin. That said, I intend to save these files across many folders to offset the load caused by large directory listings, which also reduces my risk of collision. I am also storing these names in a DB with their associated folders and other information tied to each image. Another problem I see is that when I randomly generate a UUID for a file to be renamed, I don't want to query the DB multiple times in the event of a collision, so I may actually want to return maybe 5 UUIDs per function call, check which of them, if any, already exist in one query, and use the first one that doesn't have a match.
Anyway, I know this is a lot of reading and that there's no code with it. Hopefully you don't end up downvoting this because there's too much reading, or assume this is a poor question/discussion. I would seriously like to know how to tackle this from the beginning so I can scale up as needed with as little hassle as possible.
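For reference, v3 and v5 UUIDs are name-based (an MD5 or SHA-1 hash of a namespace plus a name), so the same input always produces the same UUID; for random file names the random variant, v4, is the usual fit. Below is a minimal sketch of a v4 generator together with a regex for validating one later, assuming PHP 7+ for random_bytes(); uuidV4() and UUID_V4_PATTERN are hypothetical names.

```php
<?php
// Sketch of a random (version 4) UUID following the RFC 4122 layout.
function uuidV4(): string
{
    $bytes = random_bytes(16);                           // CSPRNG, PHP 7+
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);     // set version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);     // set RFC 4122 variant
    // 32 hex characters split into 8 groups of 4, joined as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Pattern for checking that a string is a lowercase v4 UUID.
const UUID_V4_PATTERN = '/^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/';

echo uuidV4(), "\n";
```

A v4 UUID carries 122 random bits, so a unique index on the column plus a retry on the rare insert failure is normally enough; pre-generating several candidates per call should not be needed.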
If you are going to store a reference to each file in the database anyway .. why don't you use the MySQL auto_increment id to name your files? If you scale the DB to a cluster, the ID is still unique (being a PK, it must be unique!), so why waste precious CPU time with the UUID generation and stuff? this is not what UUIDs are made for.
I'd go for the easiest way (and i've seen that in many other systems, though):
1. Upload the file.
2. When the upload has succeeded, insert the DB reference (with the path determined by step 3) and fetch the auto-incremented $ID.
3. Rename the file to ${YEAR}/${MONTH}/${DAY}/${ID} (adjust if you need a more granular path because too many files are uploaded per day).
4. If the rename fails, delete the DB reference and show an error message.
5. Update the DB reference with the actual path in the file system.
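The filesystem side of these steps might be sketched as follows. The database insert, update and delete are left to the caller; uploadPath(), storeUpload() and the /srv/media base directory are hypothetical, and for a real HTTP upload move_uploaded_file() would take the place of rename().

```php
<?php
// Hypothetical helper for the rename step: build the date-based target path.
function uploadPath(string $baseDir, int $id, DateTimeInterface $when): string
{
    return sprintf('%s/%s/%d', rtrim($baseDir, '/'), $when->format('Y/m/d'), $id);
}

// Hypothetical helper: move the temp file into place, creating directories on demand.
// Returns false on failure so the caller can delete the DB reference again.
function storeUpload(string $tmpFile, string $target): bool
{
    $dir = dirname($target);
    // The second is_dir() covers another request creating the directory first.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        return false;
    }
    return rename($tmpFile, $target);
}

echo uploadPath('/srv/media', 1827, new DateTimeImmutable('2012-09-17')), "\n";
// prints /srv/media/2012/09/17/1827
```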
My goal, or rather my hope, is to come up with an extremely low-risk naming convention that has little chance of ever colliding with another file when renaming the file upon upload. I have read about many concepts to date and find that UUID (GUID) is the best candidate for my overall needs, as the number of possibilities is so high that I don't think I could ever reach that many shared images.
You could build a number (which you would then implement as UUID) made up of:
Date (YYYYMMDD)
Server (NNN)
Counter of images uploaded on that server that day
This number will never generate any collisions since it always increments, and can scale up to one thousand servers. Say that you get at most one million images per day on each server; that's around 43 bits of information. Add another 32 bits of randomness so that a UUID can't be guessed (in less than 2^31 attempts on average). You have some fifty-odd bits left to allow for further scaling.
Or you could store some digits in BCD to make them human-readable:
20120917-0172-4123-8456-7890d0b931b9
could be image 1234567890, random d0b931b9, uploaded on server 0172 on September 17th, 2012.
The scheme might even double as "directory spreading" scheme: once an image has an UUID which maps to, say, 20120917-125-00001827-d0b931b9, that means server 125, and you can store it in a directory structure called d0/b9/31/b9/20120917-125-00001827.jpg.
The naming convention ensures uniqueness, and the random bits ensure that the directory structure is "flat" (filling evenly, with no directory much fuller than the others), optimizing retrieval time.
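The directory-spreading mapping in the example above can be sketched as a small function. spreadPath() is a hypothetical name, and the 8-hex-digit random suffix and the .jpg default are assumptions taken from the sample ID.

```php
<?php
// Hypothetical helper: derive the storage path from an ID of the form
// <date>-<server>-<counter>-<random hex>, using the random part for directories.
function spreadPath(string $imageId, string $ext = 'jpg'): string
{
    $pos = strrpos($imageId, '-');
    if ($pos === false) {
        throw new InvalidArgumentException("Unexpected image ID: $imageId");
    }
    $random = substr($imageId, $pos + 1);     // e.g. d0b931b9
    $rest   = substr($imageId, 0, $pos);      // e.g. 20120917-125-00001827
    if (!preg_match('/^[0-9a-f]{8}$/', $random)) {
        throw new InvalidArgumentException("Unexpected random part: $random");
    }
    // Two hex characters per directory level: d0/b9/31/b9
    return implode('/', str_split($random, 2)) . '/' . $rest . '.' . $ext;
}

echo spreadPath('20120917-125-00001827-d0b931b9'), "\n";
// prints d0/b9/31/b9/20120917-125-00001827.jpg
```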

Is it faster or better to use MySQL instead of text files or file names for order of images with PHP?

I have images being stored in folders related to articles on my PHP web site, and would like to set the order to display the images based on author input. I started by naming the files with a number in front of them, but was considering recording the order in a text file instead to avoid renaming every file and retaining their original file names, or possibly storing the order in a MySQL table.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them. I was hoping to get some input about which method would be best and fastest.
For example, is it much slower to read a list of file names in a folder with PHP, or open a text file and read the contents, compared to making MySQL query and update statements?
I'd say a lot depends on your base hardware/filesystem/mysql connection performances. A single access to disk, just to read images is most likely going to be your quickest option. But you'd need to name your files manually ahead.
Mysql requires a TCP or *NIX socket connection, and this might slow down things (a lot depends on the number of pictures you have, and the "quality" of your db link, though). If you have a lot of files, performance hit might be negligible. Just reading from a file might be quicker nevertheless, without bothering to set up a DB connection; you'd still have to write down ID/filename correspondence for the ordering though.
Something I'd try in your situation is the PHP stat function, to see whether it can help you sort the pictures. Depending on the number of pictures you have (it works better with lower numbers), performance might not take a serious hit, and you would NOT have to keep a separate list of picture/creation date tuples. As your number of pictures grows, the file list approach seems to me like a reasonable way to solve the problem. Only benchmarking as the number of pictures increases can tell you the truth, though, since I think you can expect a lot of variability depending on your specific context.
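As a rough sketch of the stat-based idea, the function below orders the files of a directory by modification time using filemtime(), which reads the same data as stat(). imagesByMtime() is a hypothetical name, and modification time is used as an assumption, since a true creation time is not reliably available on every filesystem.

```php
<?php
// Hypothetical helper: list the files of a directory ordered by modification time.
function imagesByMtime(string $dir): array
{
    $files = [];
    foreach (scandir($dir) as $entry) {
        $path = $dir . '/' . $entry;
        if (is_file($path)) {                     // skips ".", ".." and subdirectories
            $files[$entry] = filemtime($path);
        }
    }
    asort($files);                                // oldest first, keys preserved
    // Purely numeric names become integer keys, so cast them back to strings.
    return array_map('strval', array_keys($files));
}
```

Note that this only reproduces upload order; an order chosen by the author would still have to be stored somewhere.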
If your concern is performance, why don't you save the list (maybe already formatted as HTML) to a file? When your page is loaded, just read the file with
$code = file_get_contents("cache_file.html");
and output it to the user. The fastest solution is to store the file as .html and let Apache serve it directly, but this works only if your page doesn't have any other dynamic part.
To ensure that your cache file is up to date, you can invalidate it and recreate it after some time (the specific time depends on how often the images change), or check whether the directory has changed since the cache file was created. If you control the changes to the image directory (for example, if the changes are made from a piece of code that you wrote), you can always ensure that your cache is refreshed when the images are changed.
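A rough sketch of the "check if the directory changed" variant, assuming the cache file lives outside the image folder. The name cached_gallery_html and the $build callback are placeholders, not anything from the original answer. Keep in mind that a directory's modification time changes when files are added, removed or renamed, not when an existing file is overwritten in place.

```php
<?php
// Hypothetical helper: return cached HTML for an image folder, rebuilding it
// when the folder has been modified after the cache file was written.
function cached_gallery_html(string $imageDir, string $cacheFile, callable $build): string
{
    clearstatcache(); // make sure we compare fresh modification times
    $stale = !is_file($cacheFile)
        || filemtime($imageDir) > filemtime($cacheFile);

    if ($stale) {
        $html = $build($imageDir);                      // caller-supplied renderer
        file_put_contents($cacheFile, $html, LOCK_EX);  // rewrite the cache
        return $html;
    }
    return file_get_contents($cacheFile);
}
```

Timestamps usually have one-second resolution, so a change made in the same second the cache was written can be missed; a time-based expiry on top of this would cover that case.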
Hope this helps
This smells like premature optimization.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them.
So? A query like "select filename, title from images where articleId=$articleId order by `order`" will execute within a fraction of a second (note the backticks: order is a reserved word in MySQL, and writing it as 'order' would sort by a constant string). It really doesn't matter. Do whatever is easiest, and might I suggest that's the SQL option.
IMHO, using MySQL would be slower, but oh so much easier. If the MySQL server is hosted on the same server (or within dedicated space on the same server, like CloudLinux), then avoiding it probably won't save much time.
edit
If you want to do a test, you can use the microtime() function to time exactly how long it takes to collect and sort the files, and how long it takes to get it all from MySQL.
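Something along these lines would do for a quick comparison. time_it is a made-up helper name, and the closure below only times the filesystem side; you would pass a second closure that runs your MySQL query to compare the two.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $fn.
function time_it(callable $fn, int $iterations = 100): float
{
    $start = microtime(true);            // current time as float seconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

// Example: time listing and sorting the files in the current script's folder.
$dir = __DIR__;
$perCall = time_it(function () use ($dir) {
    $files = scandir($dir);
    sort($files);
});
printf("scandir + sort: %.6f s per call\n", $perCall);
```

Run it a few times and with a realistic number of files; the first run is often slower because nothing is in the OS cache yet.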

Is file_exists() in PHP a very expensive operation?

I'm adding avatars to a forum engine I'm designing, and I'm debating whether to do something simple (forum image is named .png) and use PHP to check if the file exists before displaying it, or to do something a bit more complicated (but not much) and use a database field to contain the name of the image to show.
I'd much rather go with the file_exists() method personally, as that gives me an easy way to fall back to a "default" avatar if the current one doesn't exist (yet), and it's simple to implement code-wise. However, I'm worried about performance, since this will be run once per user shown per page load on the forum read pages. So I'd like to know: does the file_exists() function in PHP cause any major slowdowns that would lead to significant performance hits under high traffic?
If not, great. If it does, what is your opinion on alternatives for keeping track of a user-uploaded image? Thanks!
PS: The code difference I can see is that the file-checking version lets the files do the talking, while the database version trusts that the database is accurate and doesn't bother to check. (It's just a URL that gets passed to the browser, of course.)
As well as what the other posters have said, the result of file_exists() is automatically cached by PHP (in its stat cache) to improve performance, at least for files that exist; results for missing files are not cached.
However, if you're already reading user info from the database, you may as well store the information in there. If the user is only allowed one avatar, you could just store a single bit in a column for "has avatar" (1/0), and then have the filename the same as the user id, and use something like SELECT CONCAT(IF(has_avatar, id, 'default'), '.png') AS avatar FROM users
You could also consider storing the actual image in the database as a BLOB. Put it in its own table rather than attaching it as a column to the user table. This has the benefit that it makes your forum very easy to back up - you just export the database.
Since your web server will already be doing a lot of (the equivalent of) file_exists() operations in the process of showing your web page, one more run by your script probably won't have a measurable impact. The web server will probably do at least:
one for each subdirectory of the web root (to check existence and for symlinks)
one to check for a .htaccess file for each subdirectory of the web root
one for the existence of your script
This is not considering more of them that PHP might do itself.
In actual performance testing, you will discover file_exists() to be very fast. As it is, in PHP, when the same path is stat'd twice, the second call is just pulled from PHP's internal stat cache.
And that's just in the php run scope. Even between runs, the filesystem/os will tend to aggressively put the file into the filesystem cache, and if the file is small enough, not only will the file exists test come straight out of memory, but the entire file will too.
Here's some real data to back my theory:
I was just doing some performance tests of the Linux command line utilities "find" and "xargs". In the process, I performed a file-exists test on 13,000 files, 100 times each, in under 30 seconds, which averages out to about 43,000 stat tests per second. So sure, on the fine scale it's slow if you're comparing it to, say, the time it takes to divide 9 by 8, but in a real-world scenario you would need to be doing this an awful lot of times to see a notable performance problem.
If you have 43 thousand users concurrently accessing your page within a single second, I think you are going to have much bigger concerns than the time it takes to copy the status of the existence of a file more or less out of memory in the average case.
At least with PHP4, I've found that a call to file_exists() was definitely killing our application - it was made very repeatedly deep in a library, so we really had to use a profiler to find it. Removing the call made some pages compute a dozen times faster (the call was made verrry repeatedly).
It may be possible that in PHP5 they cache file_exists, but at least with PHP4 that was not the case.
Now, if you are not in a loop, obviously, file_exists won't be a big deal.
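If you do end up calling it in a loop and the profiler points at it, one option is to remember the answer yourself for the rest of the request. This is only a sketch; file_exists_once is a made-up name, and the trade-off is that it will not notice a file being created or deleted later in the same request.

```php
<?php
// Hypothetical wrapper: remember file_exists() results for the rest of the request.
function file_exists_once(string $path): bool
{
    static $seen = [];                       // path => bool, kept between calls
    if (!array_key_exists($path, $seen)) {
        $seen[$path] = file_exists($path);   // only the first call touches the filesystem
    }
    return $seen[$path];
}
```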
file_exists() is not slow per se. The real issue is how your system is configured and where the performance bottlenecks are. Remember, databases have to store things on disk too, so either way you're potentially facing disk activity. On the other hand, both databases and file systems usually have some form of transparent caching to optimize repeat access.
You could easily go either way, since chances are your performance bottleneck will be elsewhere. The only place where I can see it being an obvious choice would be if you're on some kind of oversold shared hosting where there's a ton of disk contention, but maybe database access is on a separate cluster and faster (or vice versa).
In the past I've stored the image metadata in a database (including its name) so that we could generate useful stats. More importantly, storing image data (not the file itself, just the metadata) is conducive to change. What if in the future you need to "approve" the image, or you want to delete it without deleting the file?
As per the "default" avatar... well if the record isn't found for that user, just use the default one.
Either way, file_exists() or db, it shouldn't be much of a bottleneck to worry about. One solution, however, is much more expandable.
If performance is your only consideration, a file_exists() will be much less expensive than a database lookup.
After all, this is just a directory lookup using system calls. After the first execution of the script, most of the relevant directory will be cached in memory, so there is very little actual I/O involved, and file_exists() is such a common operation that it and the underlying system calls will be highly optimised on any common PHP/OS combination.
As John II noted, if extra functionality and user interface features are a priority, then a database would be the way to go.

PHP to store images in MySQL or not?

I have built a small web application in PHP where users must first log in. Once they have logged in, I intend on showing a small thumbnail as part of their "profile".
I will have to ensure the image is below a particular size to conserve space, or ensure it is a particular resolution, or both, or even perhaps use something like image magick to scale it down.
Not sure what the best approach for that is yet, any ideas welcome.
Also, I have been trying to work out if it is better to store the image in the users table of MySQL as a blob, or maybe a separate images table with a unique id, and just store the appropriate image id in the users table, or simply save the uploaded file on the server (via an upload page as well) and save the file as theUsersUniqueUsername.jpg.
Best option?
I found a tutorial on saving images to mysql here:
http://www.phpriot.com/articles/images-in-mysql
I am only a hobby programmer, and haven't ever done anything like this before, so examples, and/or a lot of detail is greatly appreciated.
It always depends on context, but usually I store a user image on the filesystem in a folder called /content/user/{user_id}.jpg and try to bother the database as little as possible.
I would recommend storing the image as a file and then have the file URI in the database. If you store all the images in the database, you might have some problems with scaling at a later date.
Check out this answer too:
Microsoft's advice for SQL Server used to be, for speed and size, store images in the file system, with links in the database. I think they've softened their preference a bit, but I still consider it a better idea certainly for size, since it will take up no space in the database.
The overhead using BLOB is a lot less than most people would have you believe, especially if you set it up right. If you use a separate server just running the DB to store binary files then you can in fact use no file-system at all and avoid any overhead from the file-system
That said the easiest/best way unless you have a couple of servers to yourself is storing them in the filesystem
Do not store the absolute URL of the file in your DB, just the unique part (and possibly a folder or two), e.g. 2009/uniqueImageName.jpg or just uniqueImageName.jpg.
Then in your pages just add the host and other folders onto the front, that way you have some flexibility in moving your images - all you'll need to change is a line or two in your PHP/ASP.NET page.
There is no need to store outside the document root for security - a .htaccess file with DENY FROM ALL will work the same and provide more flexibility
No need to 'shunt' images so much for security, just have a getImage.php page or something, and then instead of inserting the actual URL in the src of the image, use something like getImage.php?file=uniqueImageName.jpg.
Then the getImage.php file can check if the user is authorised and grab the image (or not).
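A bare-bones sketch of the lookup part of such a getImage.php, assuming the images sit in one flat folder. resolve_image, $userIsAuthorised and the /var/app/images path are placeholders for illustration; the important bit is refusing anything that is not a plain file name, so that ?file=../../config.php gets nowhere.

```php
<?php
// Hypothetical helper for a getImage.php script: map the "file" parameter to a
// path inside $imageDir, or return null if the name is unsafe or missing.
function resolve_image(string $imageDir, string $requested): ?string
{
    // basename() drops any directory part, so "../../etc/passwd" becomes "passwd".
    $name = basename($requested);
    if ($name === '' || $name !== $requested) {
        return null; // empty, or contained a directory component
    }
    $path = rtrim($imageDir, '/') . '/' . $name;
    return is_file($path) ? $path : null;
}

// Sketch of how getImage.php might use it ($userIsAuthorised is a placeholder):
// $path = $userIsAuthorised ? resolve_image('/var/app/images', $_GET['file'] ?? '') : null;
// if ($path === null) { http_response_code(404); exit; }
// header('Content-Type: ' . mime_content_type($path));
// readfile($path);
```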
Use a name which is guaranteed to be unique (preferably an integer, i.e. the primary key) when storing. Some file systems (e.g. on Windows) are case-insensitive, so JoeBloggs.jpg and joebloggs.jpg are unique for the database but not for the file system, and one will overwrite the other.
Use a separate table for the images, and store the primary key of the image in the users table. If you ever want to add more fields or make changes in future it will be easier - it's also good practice.
If you are worried about SEO and things like that, store the image's original file name in another field when you are uploading, you can then use this in your output (such as in the alt tag).
Challenging the Conventional Wisdom!
Of course it is context dependent, but I have a very large application with thousands of images and documents stored as BLOBs in a MySQL database (average size = 2MB), and the application runs fine on a server with 256MB of memory. The secret is correct database structure. Always keep two separate tables: one stores the basic information about the file, and the other should just contain the blob plus a primary key for accessing it. All basic queries are run against the details table, and the other table is only accessed when the file is actually needed, and it is accessed using an indexed key, so performance is extremely good.
The advantages of storing files in the database are multiple:
Much easier backup systems are required, as you do not need to back up the file system
Controlling file security is much easier, as you can validate before releasing the binary (yes, you can store the file in a non-public directory and have a script read and regurgitate the file, but performance will not be noticeably faster).
(Similar to #1) It cleanly separates "user content" and "system content", making migrations and cloning easier.
Easier to manage files, track/store version changes, etc, as you need fewer script modifications to add version controls in.
If performance is a big issue and security and backups aren't (or if you have a good FS backup system) then you can store it in the FS, but even then I often store files (in the case of images) in the DB and build a caching script that writes the image to a cache folder after the first time it's used (yes, this uses more HD space, but that is almost never a limiting factor).
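The caching script can be very small. A sketch, assuming integer image ids and a writable cache folder; cached_image_path and the $fetchBlob callback (standing in for whatever query reads the BLOB) are invented for the example.

```php
<?php
// Hypothetical helper: return the path of a cached copy of image $id, writing
// it on first use. $fetchBlob stands in for whatever reads the BLOB from the DB.
function cached_image_path(int $id, string $cacheDir, callable $fetchBlob): string
{
    $path = rtrim($cacheDir, '/') . '/' . $id . '.img';
    if (!is_file($path)) {
        // Cache miss: this is the only time the database is asked for the binary.
        file_put_contents($path, $fetchBlob($id), LOCK_EX);
    }
    return $path;
}
```

When an image is replaced in the database, the cached copy has to be deleted as well, otherwise the old version keeps being served.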
Anyway, obviously FS works well in many instances, but I personally find DB management much easier and more flexible, and if written well the performance penalties are extremely small.
We created a shop that stored images in the DB. It worked great during development, but once we tested it on the production servers the page load time was far too high, and it added unnecessary load to the DB servers.
While it seems attractive to store binary files in the DB, fetching and manipulating them adds extra complexity that can be avoided by just keeping files on the file system and storing paths / metadata in the DB.
This is one of those eternal debates, with excellent arguments on both sides, but for my money I would keep images away from the DB.
I recently saw this tip's list: http://www.ajaxline.com/32-tips-to-speed-up-your-mysql-queries
Tip 17:
For your web application, images and other binary assets should normally be stored as files. That is, store only a reference to the file rather than the file itself in the database.
So just save the file path to the image :)
I have implemented both solutions (file system and database-persisted images) in previous projects. In my opinion, you should store images in your database. Here's why:
File system storage is more complicated when your app servers are clustered. You have to have shared storage. Even if your current environment is not clustered, this makes it more difficult to scale up when you need to.
You should be using a CDN for your static content anyways, and set your app up as the origin. This means that your app will only be hit once for a given image, then it will be cached on the CDN. CloudFront is dirt cheap and simple to set up...there's no reason not to use it. Save your bandwidth for your dynamic content.
It's much quicker (and thus cheaper) to develop database persisted images
You get referential integrity with database persisted images. If you're storing images on the file system, you will inevitably have orphan files with no matching database records, or you'll have database records with broken file links. This WILL happen...it's just a matter of time. You'll have to write something to clean these up.
Anyways, my two cents.
What's the blob datatype for anyway, if not for storing files?
If your application involves authorisation prior to accessing the files, chances are that you're a) storing the files outside of DOCUMENT_ROOT (so they can't be accessed directly), and possibly b) sending the entire contents of the files through the application (of course, maybe you're doing some sort of temporarily-move-to-hashed-but-publicly-accessible-filename thing). So the memory overhead is there anyway, and you might as well be retrieving the data from the database.
If you must store files in a filesystem, do as Andreas suggested above, and name them using something you already know (i.e. the primary key of the relevant table).
I think that most database engines are so advanced already that storing BLOBs of data does not produce any disadvantages (bloated DB etc). One advantage is that you don't have any broken links when the image is in the database already. That being said, I have myself always stored the file on disk and given the URI to the database. It depends on the usage. It may be easier to handle img-in-db if the page is very dynamic and changes often - no namespace problems. I have to say that it comes down to what you prefer.
I would suggest you do not store the image in your db. Instead since every user will be having a unique id associated with his/her profile in the db, use that id to store the image physically on the server.
e.g. if a user has id 23, you can store an image in www.yourname.com/users/profile_images/23.jpg. Then to display, you can check if the image exists, and display it accordingly else display your generic icon.
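In code that check is only a few lines. A sketch, where profile_image_url, the directory arguments and the default.jpg file name are assumptions for the example:

```php
<?php
// Hypothetical helper: URL of a user's profile image, or a generic icon.
function profile_image_url(int $userId, string $imageDir, string $baseUrl): string
{
    $file = $userId . '.jpg';
    if (is_file(rtrim($imageDir, '/') . '/' . $file)) {
        return rtrim($baseUrl, '/') . '/' . $file;
    }
    return rtrim($baseUrl, '/') . '/default.jpg'; // assumed name of the generic icon
}
```

For user 23 you would call something like profile_image_url(23, '/path/to/users/profile_images', '/users/profile_images') and put the result in the img src.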
As the others suggested:
Store the images in the filesystem
Do not bother to store the filename, just use the user id (or anything else that "you already know")
Put static data on a different server (even if you just use "static.yourdomain.com" as an alias to your normal server)
Why ?
The bigger your database gets the slower it will get.
Storing your image in the database will increase your database size.
Storing the filename will increase your database size.
Putting static data on a different server (alias):
Makes scaling up a lot easier
Browsers limit the number of simultaneous connections to a single host (older ones to as few as two), so by putting static data on a "second" server you speed up the loading
After researching for days, I made a system storing images and binaries on the database.
It was just great. I have now 100% control over the files, like access control, image sizing (I don't scale the images dynamically, of course), statistics, backup and maintenance.
In my speed tests, the system is now 10x slower. However, it's still not in production and I will implement a system cache and other optimizations.
Check this real example, still in development, on a SHARED host, using a MVC:
http://www.gt8.com.br/salaodocalcado/calcados/meia-pata/
In this example, if a user is logged, he can see different images. All products images and others binaries are in DB, not cached, not in FS.
I have run some tests on a dedicated server and the results were far beyond expectations.
So, in my personal opinion, although it takes a major effort to achieve, storing images in the DB is worth it, and the benefits far outweigh the cons.
As everybody else told you, never store images in a database.
A filesystem is used to store files -> images are files -> store them in filesystem :-)
I just tested my images as BLOBs. That solution works slower than images stored as files on the server. Loading time should be the same whether the images come from the DB or over plain HTTP, but it isn't. Why? I'm fairly sure that when images are files on the server, the browser can cache them and load them only once, the first time. When an image comes from the DB, it is loaded again every time. That's my opinion. Maybe I'm wrong about the browser caching, but the BLOB version does work slower. Sorry for my English ;P
These are the pros of both solutions
In BLOBS :
1) Ease of managing clusters, since you do not have to handle tricky points like file syncs between servers
2) DB backups are exhaustive as well
In files :
1) Native caching is handled for you (and that's the point missing from the previous comments), with refresh and headers that you won't have to redesign in the DB (DBs do not track last modification time by default)
2) Ease of resizing later on
3) Ease of moderation (just go through your folders to check if everything is correct)
For all these reasons, and since the two pros of databases are easier to replicate on the file system, I strongly recommend files !
In my case, I store files in the file system. In my images folder I create a new folder for each item, named after the item id (the row from the DB), and name the images in order starting from 0. So if I have a table named Items like this:
Items
|----|-------|------|
| ID | Name  | Date |
|----|-------|------|
| 29 | Test1 | 2014 |
|----|-------|------|
| 30 | Test2 | 2015 |
|----|-------|------|
| 31 | Test3 | 2016 |
|----|-------|------|
my images directory looks something like:
images/
29/
30/
31/
images/29/0.png
images/29/1.jpeg
images/29/2.gif
etc.
