database management for images - php

A little background: the previous project manager was fired for not delivering the project on time. I have little coding experience, but am now leading the team to finish the website.
The website is similar to eBay, where an item is added for sale. Images and documents are associated with the item, but are hosted in folders that are created when the image is uploaded. The dev team has asked me "how to manage the folders with the documents in relation to the item listing". There will be 1-10 images/documents uploaded per item, and 1000-2000 items listed at any one time (if not more).
From looking around, I believe the easiest solution is to name the folder by the item number and store the reference in MySQL. Each item will have its own item number, so there should be no duplicates. Are there better solutions for the folder management?

As mister said, images could be renamed as productid-docid-imageid-timestamp.
If the images are not retrieved very often, storing them in the DB as BLOBs and serving each image under a different name may help.

What you want to be careful with is that some filesystems limit how many entries a directory can hold; on Linux with ext3, for example, a directory can contain only about 32,000 subdirectories. With the numbers you give there should be little concern there, but you should still plan for the system to be future-proof.
I have found it to be quite useful to store images by their hash. For instance, create a SHA1 hash of the image, e.g.: cce7190663c547d026a6bf8fc8d2f40b3b1b9ea5. Then store the image in a directory structure based on this hash with a few levels of folders:
cce/719/066/3c5/cce7190663c547d026a6bf8fc8d2f40b3b1b9ea5.jpg
This uses the first 12 characters of the hash to form a folder structure 4 levels deep, then the file name is the entire hash. Increase or decrease the folder depth as necessary. This allows you to store quite a lot of images (((16^3)^4) * limit) without hitting the filesystem limits. You then save this path in a database with other information about the image and which items it belongs to. This method also effectively de-duplicates your data storage: you'll never store the same image twice.
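The hash-based layout described above can be sketched as follows. This is a minimal Python sketch, not the answerer's code; the same logic ports to PHP. The function names `hashed_path` and `leaf_directories` are made up for illustration, and the defaults (4 levels of 3 characters) simply mirror the example path.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(data: bytes, ext: str, levels: int = 4, width: int = 3) -> str:
    """Build a relative path like cce/719/066/3c5/<full sha1>.<ext>.

    The folder names are consecutive slices of the SHA-1 hex digest of the
    file contents; the file name is the whole digest.
    """
    digest = hashlib.sha1(data).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(*parts, f"{digest}.{ext}"))


def leaf_directories(levels: int = 4, width: int = 3) -> int:
    """Number of distinct deepest-level folders the scheme can address."""
    return (16 ** width) ** levels
```

Because the path depends only on the file contents, uploading the same bytes twice yields the same path, which is where the de-duplication comes from.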

It used to be that filesystem performance would deteriorate if there were too many files in a directory, so the common wisdom was to limit to ~1,000 items in any directory.
Try creating a directory structure around the item_id (padded), so #1002003 might be 001002003, which could be found in 001/002/001002003.jpg.
Since you're storing more than one image per item, you might have one more level, e.g. 001/002/003/001002003_1.jpg.
Use the full ID as the item's name in the final directory (001002003.jpg, not 003.jpg). It'll come in handy later.
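A small Python sketch of the padded-ID layout, assuming IDs are padded to nine digits and split into three-digit folders as in the example above. The helper name `item_image_path` is hypothetical.

```python
def item_image_path(item_id: int, image_no: int, ext: str = "jpg") -> str:
    """Map item 1002003, image 1 to 001/002/003/001002003_1.jpg."""
    if not 0 <= item_id < 10 ** 9:
        raise ValueError("item_id must fit in nine digits")
    padded = f"{item_id:09d}"
    # Three folder levels of three digits each; the file keeps the full ID.
    first, second, third = padded[0:3], padded[3:6], padded[6:9]
    return f"{first}/{second}/{third}/{padded}_{image_no}.{ext}"
```

For example, `item_image_path(1002003, 1)` gives `001/002/003/001002003_1.jpg`, matching the path in the answer.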
Hope that helps.

Related

Storing massive amounts of images. Storage schema and consequences? or is there a better way?

I am working on an auction listing project. Part of that project allows users to upload photos of the items (sometimes hundreds of items) they will have for sale on a given date. So I have multiple questions that all lead to how to do this efficiently and still allow easy deletion after the date has passed.
1. Folder management: My thought is to give each date its own folder. Inside that folder, each user with a listing for that day would have a folder, and each listing would have its own folder with a unique identifier in case the user has multiple listings for the same date. So the folder schema would be: date/user/uniquefoldername/images.jpg. Thoughts? Is there a better way, keeping in mind that I want to be able to easily delete the folder and its contents after the date has passed?
2. Consequences: It's possible that I could have up to 1500 listings on any given date, and if each user adds 100 photos (worst case scenario) that is 150,000 for any given date or 4,500,000 in a month (users are allowed to post listings up to a month in advance). Given the schema above, what, if any, would be the consequences, other than a large amount of "holy crap, I need more storage!"? (We will deal with storage issues as needed.)
3. Storage and retrieval: My thought is to store only the path to the image folder and then write a script to retrieve all of the images for display in a lightbox-like setup (think Facebook). Thoughts? Is there a way to do this more efficiently?
4. Removal: My thought is to run a daily cron job that removes all folders where at least 24 hours have passed since the date of the listing, and then updates the database to reflect the removal, so that people don't try to view images that have been removed. Can this be done effectively? Meaning, should I remove them all in one shot, or break it up and run it on an hourly basis to avoid overloading the server?
5. Is there a better way all around? At some point we will be adding an image server or exploring other storage avenues, but until then I need to do this as efficiently as possible to keep the server load from back-end goings-on to a minimum.
6. Given the scale of data processing and the removal process is this possible to do without sacrificing server performance?
Some background: We are looking at roughly 400k page views per day, so 12 million page views a month. We are expecting in excess of 10k members with the capability of posting listings, well over 100k registered users from the general public (registration is only required to view certain aspects of the site), and 12-15 million unique visitors a year. Each listing, when presented to the public, will interact with 4 tables in the database, as it is now on paper; obviously this may change as the scope becomes clearer. There will be several cron jobs running daily, 2 of which we see as server intensive: the newsletter and photo deletion. The newsletter interacts extensively with the DB, as it will contain zip-code-specific data drawn from a table in the DB on a per-user basis. So with all this in mind you can see why I am concerned about getting the image storage schema correct the first time.
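The removal step in point 4 could look roughly like the Python sketch below. It assumes the top-level folders are named by ISO date (YYYY-MM-DD), which the question does not specify, and the names `expired_date_dirs`, `purge`, and `batch_size` are illustrative. Capping each run at `batch_size` folders is one way to spread the deletion over several cron runs instead of doing it all in one shot.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def expired_date_dirs(root: Path, today: date, grace_days: int = 1) -> list:
    """Return date-named folders whose listing day ended over grace_days ago."""
    cutoff = today - timedelta(days=grace_days)
    expired = []
    for child in root.iterdir():
        if not child.is_dir():
            continue
        try:
            listing_date = date.fromisoformat(child.name)
        except ValueError:
            continue  # not a date folder, leave it alone
        # A listing dated D is over at the end of day D, so a full 24 hours
        # have passed once today is at least D + 2 days.
        if listing_date < cutoff:
            expired.append(child)
    return sorted(expired)


def purge(root: Path, today: date, batch_size: int = 50) -> list:
    """Delete at most batch_size expired folders; return the removed names."""
    removed = []
    for folder in expired_date_dirs(root, today)[:batch_size]:
        shutil.rmtree(folder)
        removed.append(folder.name)
    return removed
```

The returned folder names can then be used to mark the matching rows as removed in the database, so the two stay in step.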

single folder or many folders for storing 8 million images of hundreds of stores?

I am developing a price comparison site with hundreds of stores and thousands of product images for each store.
In total there are 8 million images.
Each image has this file name format: StoreID-productID.jpg
Each image size is less than 10KB. average 7KB.
The site is developed with php and mysql.
I have a dedicated Linux server.
I have a lot of disk space, so that is not an issue.
My questions are:
Should I keep all the images of all the stores in a single folder?
So, 8 million images will be in a single folder.
Will it affect efficiency in retrieving them?
At present I am using this method and would like to know if there are any disadvantages of this method.
Should I keep a folder for each store and keep the images of that store in that folder?
like:
images/storeID1/
images/storeID2/
....
Please suggest.
Thank you
It depends on the file system you are using (ext3, NTFS, FAT, etc.).
Each will have different folder size limits and performance characteristics.
For 8 million files it will be safest to spread them across as many folders as you can. Then you have the option to put them on different physical drives if you run into scaling issues.
If you're on Linux, it's easy to mount another drive right among your existing folder structure. You could alternatively use a symbolic link.
images/Store1/FILES...
images/Store2 ---> /mount/SDA01/Store2/ (symbolic link to a separate drive)
Limits
See this SuperUser question for more detail about different File System limits: https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
Note these are the absolute limits of what the system can handle. Performance at the upper bound of those limits will definitely suffer.
On one of my recent projects I had to store a large number of images as they came in over time. We decided that the best thing to do would be to separate them into numbered directories with a set number of images in each directory. Our database would store the file name, directory, foreign key of the item it was associated with, a flag to allow us to disable it without needing to go and delete stuff, and the date it was downloaded. It's worked great since.

Image Gallery System - Which Approach is Better?

I am implementing an image upload system in PHP, The following are required:
Have categories
Allow users to comment on images
Allow rating of images
For that, I have 2 approaches in mind:
1. Implement the categorization by folders
Each category will have its own folder, and PHP will detect categories via those folders.
Pros
Structured look, easily locatable images.
Use of native PHP functions to manipulate and collect information about folders and files
Cons
Multiple categorization is a pain
Need to save the full path in the database
2. Implement the categorization by database
Each image in the database will have a catID (or multiple catIDs), and PHP will query the database to get the images
Pros
Easily implemented multi-categories
Only image name is saved
Cons
Seems more messy
Need to query the database a lot.
Which do you think is better? Or is there a third, completely different, approach that I'm missing?
Just a note, I don't need code, I can implement that myself, I'm looking to find what to implement.
Would love to hear from you.
I believe that the second option is better. A DB gives you much more flexibility and, I think, better performance than the file system if you set the right indexes.
In the filesystem approach you are limited to only one category per image, whereas in the DB you can set multiple categories on an image.
As for the con that the DB is more messy: sorry, I can't find a reason why it would be more messy in the DB. Maybe you mean that the files are not organized on the file system, but you still need to organize the files on the file system and divide them into multiple folders for better performance. And if you want to get all the images that have been uploaded, you query the DB for all of them, which will be much faster than running ls on all the category folders.
By organizing the files on the file system when using the DB approach, I mean that you need to divide them into several folders. How exactly depends on how you predict the images will be uploaded:
If you predict that the uploads will be spread over a long time, then I think it is better to put the files in directories per time range (day, week, month). For example, if I upload an image now it will go to
"/web_path/uploaded_photos/week4_2012/[some_generated_string].jpg"
If you don't know how to predict the uploads, then I suggest you divide the files into folders on something generic, like the first two characters of the MD5 hash of the image name. For example, if my file name is "photo_2012.jpg" the hash will be "c02d73bb3219be105159ac8e38ebdac2", so the path in the file system will be "/web_path/uploaded_photos/c/0/[some_generated_string].jpg"
The second con, that you need to query the DB a lot, is not quite true, because you would need the same number of queries against the file system, which are far slower.
Good luck.
PS
Don't forget to generate a new file name for every image that is uploaded, so there are no collisions when different users (or the same user) upload images with the same name.
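To make the last two points concrete, here is a rough Python sketch that generates a fresh file name per upload and then picks the two-level folder from an MD5 hash. Hashing the generated name rather than the original name is an assumption on my part, chosen so the path can be recomputed from the stored name alone. The extension whitelist and the helper names `generated_name` and `sharded_path` are also illustrative.

```python
import hashlib
import secrets
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}  # assumed whitelist
UPLOAD_ROOT = "/web_path/uploaded_photos"  # root used in the answer above


def generated_name(original_name: str) -> str:
    """Return a random name that keeps only the (lower-cased) extension."""
    ext = PurePosixPath(original_name).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext!r}")
    # 16 random bytes give 32 hex characters, so collisions are not a
    # practical concern.
    return secrets.token_hex(16) + ext


def sharded_path(stored_name: str) -> str:
    """Place the file under <root>/<1st hex char>/<2nd hex char>/."""
    digest = hashlib.md5(stored_name.encode("utf-8")).hexdigest()
    return f"{UPLOAD_ROOT}/{digest[0]}/{digest[1]}/{stored_name}"
```

Only the generated name needs to go into the database; two users uploading "photo_2012.jpg" end up with two different stored names.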
I'd be inclined to go with the database approach. You list the need to query the database a lot as a con, but that's what databases are built for. As you pointed out yourself, a hierarchical structure has serious limitations when it comes to items that fall into more than one category, and while you can use native PHP functions to navigate the tree, would that really be quicker or more efficient than running SQL queries?
Naturally the actual file data needs to go somewhere, and BLOBS are problematic to put it mildly, so I'd store the actual files in the filesystem, but all the data about the images (the metadata) would be better off in a database. The added flexibility the database gives you is worth the work involved.
The second solution (database) is actually a TAG/LABEL system of categorizing data.
And that is the way to go, the biggest examples being Gmail and Stack Overflow.
The only thing you need to be careful about is how you model tags. If the tags are not normalized properly, querying the database becomes expensive.
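One common way to normalize tags is a separate tags table plus a join table, sketched below. The example uses Python's built-in SQLite module so it can run anywhere; the table and column names are made up, and a MySQL schema would look much the same apart from type names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE images (id INTEGER PRIMARY KEY, file_name TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE image_tags (
    image_id INTEGER NOT NULL REFERENCES images(id),
    tag_id   INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (image_id, tag_id)
);
-- Lets "all images with tag X" be answered from the index.
CREATE INDEX image_tags_by_tag ON image_tags (tag_id, image_id);
"""


def images_with_tag(conn: sqlite3.Connection, tag_name: str) -> list:
    """Return the file names of all images carrying the given tag."""
    rows = conn.execute(
        "SELECT i.file_name FROM images i "
        "JOIN image_tags it ON it.image_id = i.id "
        "JOIN tags t ON t.id = it.tag_id "
        "WHERE t.name = ? ORDER BY i.id",
        (tag_name,),
    )
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with two images and two tags."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO images VALUES (?, ?)",
                     [(1, "beach001.jpg"), (2, "city001.jpg")])
    conn.executemany("INSERT INTO tags VALUES (?, ?)",
                     [(1, "beach"), (2, "summer")])
    conn.executemany("INSERT INTO image_tags VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 2)])
    return conn
```

Each tag name is stored once, and an image can carry any number of tags without duplicating rows in the images table.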
Use folders only to make file storage reliable, storing certain amount of files per folder, i.e.
/b/e/beach001.jpg
As for your dilemma, it is not a question at all.
From your conditions you can say for yourself that the database is the only solution.
Since you need a database to store comments and ratings, you should store categories in database as well. Sometime later you may also want to store image captions and description; database allows you to do that. And I would not worry about querying the database a lot.
Whether to store the image itself in database or filesystem is a separate issue which is discussed here.
Note about storing images in filesystem: do not store thousands of images in a single directory; it could cause performance issues for the OS. Instead invent a way to organize images in sub directories. You can group them by dates, filenames, randomly etc. Some conventions:
upload date: year/month
/uploaded_images
    /2010/01
    /2010/02
upload date: year-month
/uploaded_images
    /2010-01
    /2010-02
md5 hash of image name: first character
/uploaded_images
    /0/
    /1/
    ...
    /e/
    /f/
batches of thousands
/uploaded_images
    /00001000/
    /00002000/
    /00003000/
I eventually went with the best answer of this question: Efficiently storing user uploaded images on the file system.
It works like a charm. Thanks for all of the answers!

PHP Should I store image paths in a database?

I will have a website with a bunch of companies and each company will be able to upload their logo. Is it a good idea to just create a folder for each company who signs up, so it would be
companies/user1/logo.jpg and companies/user2/logo.jpg and just store everyone in a folder, that way I don't need the path to reference the image?
Or should I store them in one folder like company_logos/gaegha724252.jpg and they will all be random file names, and the path would be stored in the database associated with that company?
What are the advantages and disadvantages?
Thanks!
Using Folders for Organization
Advantages: They are logically clear to someone fiddling with the system on the back end. That's about it, really.
Disadvantages: 'Harder' to clean up when you delete a company, etc., and you have to make sure none of your directory names overlap. Generally more work from the get-go.
Using Images in One Folder
Advantages: It's technically a bit easier to clean up and not all that much work.
Disadvantages: You'll have to write at minimum a very basic collision detection algorithm and a very basic 'random name generator'.
Using the Database to Store Images
Caution: Many lives have been lost in this argument!
Advantages: Referential integrity, backing up/restoring is simpler, categorization
Disadvantages: Fraught with pitfalls, potentially slower, more advanced storage/retrieval techniques, potential performance issues and increase of network requests. Also, most cheap hosting providers' databases are way too terrible for this to be a good idea.
I highly recommend just using a hashed file name and storing it (the filename) in the database and then storing the images in a folder (or many folders) on disk. This should be much easier in the long run and perform better in general without getting too complicated.
I would go even further: calculate MD5 sum of each file before storing it to filesystem. You may use first two characters as the directory name of 1st level, next two characters as a directory of 2nd level:
vv 1st level
61f57fe906dffc16597b7e461e5fce6d.jpg
  ^^ 2nd level
As the hashing algorithm has a uniform distribution, this will distribute your files equally among folders (the idea comes from how Squid organizes its file cache). The server should return a URL like this (i.e. with no notion of directories):
http://server.com/images/61f57fe906dffc16597b7e461e5fce6d.jpg
and you may apply mod_rewrite to actually rewrite this url to something like this:
/storage/images/61/f5/7fe906dffc16597b7e461e5fce6d.jpg
This will also add some degree of anonymity and hide the real image name. Moreover, if your clients upload the same contents, they will end up in the same file, which will save you disk space. Beware when removing the file for one client: it may also be used by others!
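The mapping that the rewrite rule performs, plus the "used by others" caveat, can be sketched in Python as below. This is not the mod_rewrite rule itself. The `/storage/images` root comes from the example above; the function names and the in-memory `ref_counts` dictionary (which would really be a database column) are assumptions for illustration.

```python
import hashlib
import re

# 32 hex characters split 2 + 2 + 28, followed by an extension.
HASH_NAME = re.compile(r"([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{28})\.([a-z0-9]+)")


def public_name(data: bytes, ext: str) -> str:
    """Name exposed in URLs: the MD5 of the contents plus the extension."""
    return f"{hashlib.md5(data).hexdigest()}.{ext}"


def storage_path(name: str, root: str = "/storage/images") -> str:
    """Translate a public name into its two-level location on disk."""
    match = HASH_NAME.fullmatch(name)
    if match is None:
        raise ValueError("not a hashed image name")
    first, second, rest, ext = match.groups()
    return f"{root}/{first}/{second}/{rest}.{ext}"


def release(ref_counts: dict, name: str) -> bool:
    """Drop one client's reference to a shared file.

    Returns True only when nobody else uses the file, i.e. when it is
    safe to delete it from disk.
    """
    ref_counts[name] -= 1
    if ref_counts[name] <= 0:
        del ref_counts[name]
        return True
    return False
```

Rejecting anything that is not exactly a hash plus extension also means the rewritten path cannot be steered outside the storage root.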
Store them as "company_logos/125.jpg", where 125 is a unique id (the primary key in your database).
Depending on how many companies you expect, creating a folder for each company could quickly get ridiculous. Also, reading a folder structure from disk will be much slower than reading from a database.
You could store the image location in the database, or you could use the ID solution. You could also store the image itself in the database if you wanted, using the "blob" type. Although other questions have tackled this issue:
Storing Images in DB - Yea or Nay?
I think it would be best to either store the image name in the database, or use the ID method.
If it is going to be just a few hundred records or so, I wouldn't bother storing the pics outside the db.

PHP to store images in MySQL or not?

I have built a small web application in PHP where users must first log in. Once they have logged in, I intend on showing a small thumbnail as part of their "profile".
I will have to ensure the image is below a particular size to conserve space, or ensure it is a particular resolution, or both, or even perhaps use something like image magick to scale it down.
Not sure what the best approach for that is yet, any ideas welcome.
Also, I have been trying to work out if it is better to store the image in the users table of MySQL as a blob, or maybe a separate images table with a unique id, and just store the appropriate image id in the users table, or simply save the uploaded file on the server (via an upload page as well) and save the file as theUsersUniqueUsername.jpg.
Best option?
I found a tutorial on saving images to mysql here:
http://www.phpriot.com/articles/images-in-mysql
I am only a hobby programmer, and haven't ever done anything like this before, so examples, and/or a lot of detail is greatly appreciated.
It always depends on context, but usually I store a user image on the filesystem in a folder called /content/user/{user_id}.jpg and try to bother the database as little as possible.
I would recommend storing the image as a file and then have the file URI in the database. If you store all the images in the database, you might have some problems with scaling at a later date.
Check out this answer too:
Microsoft's advice for SQL Server used to be, for speed and size, store images in the file system, with links in the database. I think they've softened their preference a bit, but I still consider it a better idea certainly for size, since it will take up no space in the database.
The overhead using BLOB is a lot less than most people would have you believe, especially if you set it up right. If you use a separate server just running the DB to store binary files then you can in fact use no file-system at all and avoid any overhead from the file-system
That said the easiest/best way unless you have a couple of servers to yourself is storing them in the filesystem
Do not store the absolute URL of the file in your DB, just the unique part (and possibly a folder or two), e.g. 2009/uniqueImageName.jpg or just uniqueImageName.jpg.
Then in your pages just add the host and other folders onto the front, that way you have some flexibility in moving your images - all you'll need to change is a line or two in your PHP/ASP.NET page.
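A tiny Python sketch of that idea: the database holds only the relative part, and the host is prepended at render time. The base URL here is a placeholder and `image_url` is a hypothetical helper.

```python
from urllib.parse import quote

IMAGE_BASE_URL = "https://static.example.com/images"  # placeholder host


def image_url(stored_part: str, base: str = IMAGE_BASE_URL) -> str:
    """Join the configurable base URL with the relative part from the DB."""
    # quote() leaves "/" alone, so folder prefixes such as "2009/" survive.
    return f"{base.rstrip('/')}/{quote(stored_part.lstrip('/'))}"
```

Moving the images to another host or folder then means changing `IMAGE_BASE_URL` only, with no database update.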
There is no need to store outside the document root for security - a .htaccess file with DENY FROM ALL will work the same and provide more flexibility
No need to 'shunt' images so much for security, just have a getImage.php page or something, and then instead of inserting the actual URL in the src of the image, use something like getImage.php?file=uniqueImageName.jpg.
Then the getImage.php file can check if the user is authorised and grab the image (or not).
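The check such a gatekeeper script needs can be sketched as below, in Python rather than PHP. The function name `resolve_image` and its parameters are invented for illustration, and the authorisation decision is assumed to have been made elsewhere. The important extra step is refusing any requested name that resolves outside the storage folder, since the file parameter comes straight from the URL.

```python
from pathlib import Path
from typing import Optional


def resolve_image(storage_root: Path, requested: str,
                  user_is_authorised: bool) -> Optional[Path]:
    """Return the file to send, or None if the request must be refused."""
    if not user_is_authorised:
        return None
    root = storage_root.resolve()
    candidate = (root / requested).resolve()
    # Rejects "../" tricks and absolute paths: the resolved file must
    # still live somewhere below the storage root.
    if root not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

A caller would stream the returned file with the right content type, or respond with 403/404 when it gets `None`.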
Use a name which is guaranteed to be unique (preferably an integer, i.e. the primary key) when storing. Some file systems (e.g. on Windows) are case-insensitive, so JoeBloggs.jpg and joebloggs.jpg are distinct to the database but not to the file system, and one will overwrite the other.
Use a separate table for the images, and store the primary key of the image in the users table. If you ever want to add more fields or make changes in future it will be easier - it's also good practice.
If you are worried about SEO and things like that, store the image's original file name in another field when you are uploading, you can then use this in your output (such as in the alt tag).
Challenging the Conventional Wisdom!
Of course it is context dependent, but I have a very large application with thousands of images and documents stored as BLOBs in a MySQL database (average size = 2MB) and the application runs fine on a server with 256MB of memory. The secret is correct database structure. Always keep two separate tables, one of which stores the basic information about the file, while the other contains just the blob plus a primary key for accessing it. All basic queries are run against the details table, and the other table is only accessed when the file is actually needed, using an indexed key, so performance is extremely good.
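The two-table structure described above might look like this. It is a sketch using Python's built-in SQLite module so that it runs as-is, not the answerer's MySQL schema; the table names, column names, and helper functions are all assumptions.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE file_details (
    id         INTEGER PRIMARY KEY,
    file_name  TEXT NOT NULL,
    mime_type  TEXT NOT NULL,
    size_bytes INTEGER NOT NULL
);
CREATE TABLE file_blobs (
    file_id INTEGER PRIMARY KEY REFERENCES file_details(id),
    content BLOB NOT NULL
);
"""


def store_file(conn: sqlite3.Connection, file_name: str,
               mime_type: str, content: bytes) -> int:
    """Insert the metadata row, then the blob row under the same key."""
    cur = conn.execute(
        "INSERT INTO file_details (file_name, mime_type, size_bytes) "
        "VALUES (?, ?, ?)",
        (file_name, mime_type, len(content)),
    )
    file_id = cur.lastrowid
    conn.execute("INSERT INTO file_blobs (file_id, content) VALUES (?, ?)",
                 (file_id, content))
    return file_id


def list_files(conn: sqlite3.Connection) -> list:
    """Listing query: touches only the small details table."""
    return conn.execute(
        "SELECT id, file_name, size_bytes FROM file_details ORDER BY id"
    ).fetchall()


def load_content(conn: sqlite3.Connection, file_id: int) -> Optional[bytes]:
    """Fetch the binary by primary key, only when it is actually needed."""
    row = conn.execute("SELECT content FROM file_blobs WHERE file_id = ?",
                       (file_id,)).fetchone()
    return None if row is None else row[0]
```

Searches, listings, and joins never read the large rows; only `load_content` does, and it uses a primary-key lookup.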
The advantages of storing files in the database are multiple:
1. Much simpler backup systems are required, as you do not need to back up the file system.
2. Controlling file security is much easier, as you can validate before releasing the binary (yes, you can store the file in a non-public directory and have a script read and regurgitate the file, but performance will not be noticeably faster).
3. (Similar to #1) It cleanly separates "user content" and "system content", making migrations and cloning easier.
4. Easier to manage files, track/store version changes, etc., as you need fewer script modifications to add version controls in.
If performance is a big issue and security and backups aren't (or if you have a good FS backup system) then you can store it in the FS, but even then I often store files (in the case of images) in the DB and build a caching script that writes the image to a cache folder after the first time it's used (yes, this uses more HD space, but that is almost never a limiting factor).
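The caching script mentioned above could work along these lines. This is a Python sketch under assumptions: `load_from_db` stands in for whatever query fetches the blob, and the `.img` cache file naming is arbitrary.

```python
from pathlib import Path
from typing import Callable


def cached_image(cache_dir: Path, image_id: int,
                 load_from_db: Callable[[int], bytes]) -> Path:
    """Return a file path for the image, hitting the DB only on a miss."""
    target = cache_dir / f"{image_id}.img"
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so a concurrent request never
        # sees a half-written file, then rename into place.
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(load_from_db(image_id))
        tmp.replace(target)
    return target
```

When an image is changed in the database, the matching cache file has to be deleted so that the next request rebuilds it.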
Anyway, obviously FS works well in many instances, but I personally find DB management much easier and more flexible, and if written well the performance penalties are extremely small.
We created a shop that stored images in the DB. It worked great during development but once we tested it on the production servers the page load time was far too high, and it added unnecessary load to the DB servers.
While it seems attractive to store binary files in the DB, fetching and manipulating them adds extra complexity that can be avoided by just keeping files on the file system and storing paths / metadata in the DB.
This is one of those eternal debates, with excellent arguments on both sides, but for my money I would keep images away from the DB.
I recently saw this list of tips: http://www.ajaxline.com/32-tips-to-speed-up-your-mysql-queries
Tip 17:
For your web application, images and other binary assets should normally be stored as files. That is, store only a reference to the file rather than the file itself in the database.
So just save the file path to the image :)
I have implemented both solutions (file system and database-persisted images) in previous projects. In my opinion, you should store images in your database. Here's why:
File system storage is more complicated when your app servers are clustered. You have to have shared storage. Even if your current environment is not clustered, this makes it more difficult to scale up when you need to.
You should be using a CDN for your static content anyways, and set your app up as the origin. This means that your app will only be hit once for a given image, then it will be cached on the CDN. CloudFront is dirt cheap and simple to set up...there's no reason not to use it. Save your bandwidth for your dynamic content.
It's much quicker (and thus cheaper) to develop database persisted images
You get referential integrity with database persisted images. If you're storing images on the file system, you will inevitably have orphan files with no matching database records, or you'll have database records with broken file links. This WILL happen...it's just a matter of time. You'll have to write something to clean these up.
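For anyone who does keep files on disk, the clean-up job mentioned in the last point can be sketched like this in Python. It assumes the database stores paths relative to one storage root; `files_on_disk` and `reconcile` are hypothetical names.

```python
from pathlib import Path


def files_on_disk(root: Path) -> set:
    """Collect every file under root as a forward-slash relative path."""
    return {p.relative_to(root).as_posix()
            for p in root.rglob("*") if p.is_file()}


def reconcile(db_paths: set, disk_paths: set) -> tuple:
    """Compare the two sides and report what does not match.

    Returns (orphan_files, broken_links): files with no database record,
    and database records whose file is missing.
    """
    orphan_files = sorted(disk_paths - db_paths)
    broken_links = sorted(db_paths - disk_paths)
    return orphan_files, broken_links
```

Run periodically, the first list can be deleted or archived and the second flagged for repair.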
Anyways, my two cents.
What's the blob datatype for anyway, if not for storing files?
If your application involves authorisation prior to accessing the files, the chances are that you're a) storing the files outside of DOCUMENT_ROOT (so they can't be accessed directly), and possibly b) sending the entire contents of the files through the application (of course, maybe you're doing some sort of temporarily-move-to-hashed-but-publicly-accessible-filename thing). So the memory overhead is there anyway, and you might as well be retrieving the data from the database.
If you must store files in a filesystem, do as Andreas suggested above, and name them using something you already know (i.e. the primary key of the relevant table).
I think that most database engines are advanced enough by now that storing BLOBs does not bring any disadvantages (bloated DB, etc.). One advantage is that you don't have any broken links when the image is already in the database. That being said, I have always stored the file on disk myself and put the URI in the database. It depends on the usage. It may be easier to handle images in the DB if the page is very dynamic and changes often, since there are no naming problems. I have to say that it comes down to what you prefer.
I would suggest you do not store the image in your db. Instead since every user will be having a unique id associated with his/her profile in the db, use that id to store the image physically on the server.
e.g. if a user has id 23, you can store an image in www.yourname.com/users/profile_images/23.jpg. Then to display, you can check if the image exists, and display it accordingly else display your generic icon.
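The "display it if it exists, otherwise the generic icon" step is small enough to show in full. This is a Python sketch; the URL prefix mirrors the example path above, while the default icon URL and the helper name are assumptions.

```python
from pathlib import Path


def profile_image_url(user_id: int, image_dir: Path,
                      url_prefix: str = "/users/profile_images",
                      default_url: str = "/images/generic.png") -> str:
    """Return the user's picture URL, or the generic icon if none exists."""
    if (image_dir / f"{user_id}.jpg").is_file():
        return f"{url_prefix}/{user_id}.jpg"
    return default_url
```

No database lookup is involved: the user id already known from the session is enough to find the file.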
As the others suggested:
Store the images in the filesystem
Do not bother to store the filename, just use the user id (or anything else that "you already know")
Put static data on a different server (even if you just use "static.yourdomain.com" as an alias to your normal server)
Why?
The bigger your database gets the slower it will get.
Storing your image in the database will increase your database size.
Storing the filename will increase your database size.
Putting static data on a different server (alias):
Makes scaling up a lot easier
Browsers limit the number of simultaneous connections to a single host (older browsers to as few as two), so by putting static data on a "second" server you speed up the loading
After researching for days, I made a system storing images and binaries on the database.
It was just great. I have now 100% control over the files, like access control, image sizing (I don't scale the images dynamically, of course), statistics, backup and maintenance.
In my speed tests, the system is now 10x slower. However, it's still not in production and I will implement a system cache and other optimizations.
Check this real example, still in development, on a SHARED host, using a MVC:
http://www.gt8.com.br/salaodocalcado/calcados/meia-pata/
In this example, if a user is logged in, he can see different images. All product images and other binaries are in the DB, not cached, not in the FS.
I have run some tests on a dedicated server and the results were far beyond expectations.
So, in my personal opinion, although it takes a major effort to achieve, storing images in the DB is worth it, and the benefits far outweigh the cons.
As everybody else told you, never store images in a database.
A filesystem is used to store files -> images are files -> store them in filesystem :-)
I just tested my images as BLOBs. This solution works slower than serving the images as files on the server. Loading time should be the same whether the images come from the DB or over plain HTTP, but it isn't. Why? I believe that when images are files on the server, the browser can cache them and load them only once, the first time. When an image comes from the DB, it is loaded again every time. That's my opinion. Maybe I'm wrong about browser caching, but the BLOB approach does work slower.
These are the pros of both solutions.
In BLOBs:
1) Clusters are easier to manage, since you do not have to handle tricky points like file syncs between servers
2) DB backups will also be exhaustive
In files:
1) Native caching is handled for you (and that's the point missing from previous comments), with refresh and headers that you won't have to redesign as you would for the DB (databases do not track last modification time by default)
2) Ease of resizing later on
3) Ease of moderation (just go through your folders to check that everything is correct)
For all these reasons, and since the two pros of databases are easy to replicate on the file system, I strongly recommend files!
In my case, I store files in the file system. In my images folder I create a new folder for each item, named after the item ID (the row ID from the DB), and name the images in order starting from 0. So if I have a table named Items like this:
Items
|----|-------|------|
| ID | Name  | Date |
|----|-------|------|
| 29 | Test1 | 2014 |
|----|-------|------|
| 30 | Test2 | 2015 |
|----|-------|------|
| 31 | Test3 | 2016 |
|----|-------|------|
my images directory looks something like:
images/
    29/
    30/
    31/
images/29/0.png
images/29/1.jpeg
images/29/2.gif
etc.
