File / Image Replication

I have a simple question and wish to hear others' experiences regarding which is the best way to replicate images across multiple hosts.
I have determined that storing images in the database and then using database replication over multiple hosts would result in maximum availability.
The worry I have with the filesystem is the difficulty of synchronising the images (e.g. I don't want 5 servers all hitting the same server for images!).
Now, the only concerns I have with storing images in the database are the extra queries hitting the database and the extra handling I'd have to put in place in Apache if I wanted 'virtual' image links to point to database entries (e.g. AddHandler).
As far as my understanding goes:
If you have a script serving up the images: each image would require a database call.
If you display the images inline as binary data: this could be done in a single database call.
To provide external / linkable images, you would have to add an AddHandler for the extension you wish to 'fake' and point it to your scripting language (e.g. PHP, ASP).
I might have missed something, but I'm curious if anyone has any better ideas?
Edit:
Tom has suggested using mod_rewrite to avoid needing an AddHandler, and I have accepted that as a solution to the AddHandler issue; however, I don't feel like I have a complete solution yet, so please, please, keep answering ;)
A few have suggested using lighttpd over Apache. How different are the ISAPI modules for lighttpd?

If you store images in the database, you take an extra database hit plus you lose the innate caching/file serving optimizations in your web server. Apache will serve a static image much faster than PHP can manage it.
In our large app environments, we use up to 4 clusters:
App server cluster
Web service/data service cluster
Static resource (image, documents, multi-media) cluster
Database cluster
You'd be surprised how much traffic a static resource server can handle. Since it's not really computing (no app logic), a response can be optimized like crazy. If you go with a separate static resource cluster, you also leave yourself free to change just that portion of your architecture. For instance, in some benchmarks lighttpd is even faster at serving static resources than Apache. If you have a separate cluster, you can change your HTTP server there without changing anything else in your app environment.
I'd start with a 2-machine static resource cluster and see how that performs. That's another benefit of separating functions - you can scale out only where you need it. As far as synchronizing files, take a look at existing file synchronization tools versus rolling your own. You may find something that does what you need without having to write a line of code.

Serving the images from wherever you decide to store them is a trivial problem; I won't discuss how to solve it.
Deciding where to store them is the real decision you need to make. You need to think about what your goals are:
Redundancy of hardware
Lots of cheap storage
Read-scaling
Write-scaling
The last two are not the same and will definitely cause problems.
If you are confident that the size of this image library will not exceed the disc you're happy to put on your web servers, then you can get very good read-scaling by placing a copy of all the images on every web server. (Say 200G at the time of writing, that being the largest high-speed server-grade disc that can be obtained; I assume you want to use 1U web servers, so you won't be able to store more than that in RAID 1, depending on your vendor.)
Of course you might want to keep a master copy somewhere too, and have a daemon or process which syncs them from time to time, and have monitoring to check that they remain in sync and this daemon works, but these are details. Keeping a copy on every web server will make read-scaling pretty much perfect.
But keeping a copy everywhere will ruin write-scalability, as every single web server will have to write every changed / new file. Therefore your total write throughput will be limited to that of the slowest single web server in the cluster.
"Sharding" your image data between many servers will give good read/write scalability, but is a nontrivial exercise. It may also allow you to use cheap(ish) storage.
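To make the sharding idea a bit more concrete, here is a minimal sketch in Python (used here only for brevity; the same logic ports directly to PHP). The host names and the `shard_for` helper are hypothetical, and plain modulo hashing is just the simplest possible scheme, not a recommendation.

```python
import hashlib

# Hypothetical storage hosts; replace with your own servers.
SHARDS = ["img1.example.internal", "img2.example.internal", "img3.example.internal"]

def shard_for(image_key, shards=SHARDS):
    """Map an image key to one shard using a stable hash of the key."""
    digest = hashlib.sha1(image_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take it modulo
    # the number of shards so every web server computes the same answer.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

print(shard_for("user42/avatar.jpg"))
```

Part of what makes this nontrivial in practice: with modulo hashing, adding or removing a shard remaps most keys, so files have to be moved. Schemes such as consistent hashing, or a lookup table recording where each file lives, reduce that cost.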
Having a single central server (or active/passive pair or something) with expensive IO hardware will give better write-throughput than using "cheap" IO hardware everywhere, but you'll then be limited by read-scalability.

Having your images in a database doesn't necessarily mean a database call for each one; you could cache these separately on each host (e.g. in temporary files) when they are retrieved. The source images would still be in the database and easy to synchronise across servers.
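A rough sketch of that per-host cache, written in Python with SQLite standing in for the replicated database. The `images` table, its columns, and the `cached_image_path` helper are assumptions made up for illustration.

```python
import os
import sqlite3
import tempfile

def cached_image_path(conn, image_id, cache_dir):
    """Return a local file path for the image, querying the DB only on a cache miss."""
    path = os.path.join(cache_dir, "%d.bin" % int(image_id))
    if os.path.exists(path):
        return path  # cache hit: no database call at all
    row = conn.execute("SELECT data FROM images WHERE id = ?", (image_id,)).fetchone()
    if row is None:
        raise KeyError(image_id)
    # Write to a temporary file and rename it into place, so a concurrent
    # reader never sees a half-written image.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(row[0])
    os.replace(tmp_path, path)
    return path
```

Once the file is in the cache directory, the web server can serve it like any other static file. The sketch ignores invalidation: if images can change after upload, the cache needs a version or timestamp check as well.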
You also don't really need to add Apache handlers to serve an image through a PHP script whilst maintaining nice URLs: you can make URLs like http://server/image.php/param1/param2/param3.JPG and read the parameters through $_SERVER['PATH_INFO']. You could also remove the 'image.php' portion of the URL (if you needed to) using mod_rewrite.

What you are looking for already exists and is called MogileFS
The target setup involves mogilefsd, replicated MySQL databases and lighttpd/perlbal for serving files. It will bring you failover and fine-grained file replication (for example, you can decide to duplicate end-user images on several physical devices, and to keep only one physical instance of thumbnails). Load balancing can also be achieved quite easily.

Related

Store Image as BLOB or upload file and store URL? [duplicate]

This question already has answers here:
Storing Images in DB - Yea or Nay?
(56 answers)
Closed 9 years ago.
In the context of a web application, my old boss always said put a reference to an image in the database, not the image itself. I tend to agree that storing a URL vs. the image itself in the DB is a good idea, but where I work now, we store a lot of images in the database.
The only reason I can think of is perhaps it's more secure? You don't want someone having a direct link to a URL? But if that is the case, you can always have the web site/server handle images, like handlers in ASP.NET, so that a user needs to authenticate to view the image. I'm also thinking performance would be hurt by pulling out the images from the database. Any other reasons why it might be a good/not so good idea to store images in a database?
Exact Duplicate: User Images: Database or filesystem storage?
Exact Duplicate: Storing images in database: Yea or nay?
Exact Duplicate: Should I store my images in the database or folders?
Exact Duplicate: Would you store binary data in database or folders?
Exact Duplicate: Store pictures as files or or the database for a web app?
Exact Duplicate: Storing a small number of images: blob or fs?
Exact Duplicate: store image in filesystem or database?

Pros of putting images in a Database.
Transactions. When you save the blob, you can commit it just like any other piece of DB data. That means you can commit the blob along with any of the associate meta-data and be assured that the two are in sync. If you run out of disk space? No commit. File didn't upload completely? No commit. Silly application error? No commit. If keeping the images and their associated meta data consistent with each other is important to your application, then the transactions that a DB can provide can be a boon.
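As an illustration of the transactional point, here is a small sketch using Python's built-in SQLite module; the table names, columns and the `save_image` helper are invented for the example, and a real deployment would use whatever database you already run.

```python
import sqlite3

def save_image(conn, name, mime_type, data):
    """Insert metadata and blob in one transaction: both commit or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO image_meta (name, mime_type) VALUES (?, ?)",
            (name, mime_type),
        )
        conn.execute(
            "INSERT INTO image_blob (image_id, data) VALUES (?, ?)",
            (cur.lastrowid, data),
        )
        return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_meta (id INTEGER PRIMARY KEY, name TEXT, mime_type TEXT)")
conn.execute("CREATE TABLE image_blob (image_id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
```

If the second insert fails (for example because the upload produced no data and the NOT NULL constraint fires), the metadata row is rolled back too, so you never end up with metadata that points at a missing image.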
One system to manage. Need to back up the meta data and blobs? Back up the database. Need to replicate them? Replicate the database. Need to recover from a partial system failure? Reload the DB and roll the logs forward. All of the advantages that DBs bring to data in general (volume mapping, storage control, backups, replication, recovery, etc.) apply to your blobs. More consistency, easier management.
Security. Databases have very fine grained security features that can be leveraged. Schemas, user roles, even things like "read only views" to give secure access to a subset of data. All of those features work with tables holding blobs as well.
Centralized management. Related to #2, but basically the DBAs (as if they don't have enough power) get to manage one thing: the database. Modern databases (especially the larger ones) work very well with large installations across several machines. Single source of management simplifies procedures, simplifies knowledge transfer.
Most modern databases handle blobs just fine. With first class support of blobs in your data tier, you can easily stream blobs from the DB to the client. While there are operations that you can do that will "suck in" the entire blob all at once, if you don't need that facility, then don't use it. Study the SQL interface for your DB and leverage its features. No reason to treat them like "big strings" that are treated monolithically and turn your blobs in to big, memory gobbling, cache smashing bombs.
Just like you can set up dedicated file servers for images, you can set up dedicated blob servers in your database. Give them dedicated disk volumes, dedicated schemas, dedicated caches, etc. All of your data in the DB isn't the same, or behaves the same, no reason to configure it all the same. Good databases have the fine level of control.
The primary nit regarding serving up a blob from a DB is ensuring that your HTTP layer actually leverages all of the HTTP protocol to perform the service.
Many naive implementations simply grab the blob and dump it wholesale down the socket. But HTTP has several important features well suited to streaming images, etc. Notably caching headers, ETags, chunked transfer, and range requests, which allow clients to request "pieces" of the blob.
Ensure that your HTTP service is properly honoring all of those requests, and your DB can be a very good Web citizen. By caching the files in a filesystem for serving by the HTTP server, you gain some of those advantages "for free" (since a good server will do that anyway for "static" resources), but make sure if you do that, that you honor things like modification dates etc. for images.
For example, someone requests spaceshuttle.jpg, an image created on Jan 1, 2009. That ends up cached on the file system on the request date, say, Feb 1, 2009. Later, the image is purged from the cache (FIFO policy, or whatever), and someone requests it again on Mar 1, 2009. Well, now it has a Mar 1, 2009 "create date", even though the entire time its create date was really Jan 1. So, you can see, especially if your cache turns over a lot, clients that use If-Modified-Since headers may be getting more data than they actually need, since the server THINKS the resource has changed, when in fact it has not.
If you keep the cache creation date in sync with the actual creation date, this can be less of a problem.
But the point is that it's something to think through across the entire problem in order to be a "good web citizen", and potentially save you and your clients some bandwidth etc.
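One way to keep the cached file's date in sync is to stamp it with the image's real creation time when it is written, and compare against that when a client sends If-Modified-Since. The sketch below is in Python rather than the Java used in that project; `write_cache_file` and `not_modified` are hypothetical helper names, and the creation timestamp is assumed to come from the database row.

```python
import calendar
import os
import tempfile
from email.utils import formatdate, parsedate_to_datetime

def write_cache_file(path, data, created_ts):
    """Write a cached copy and stamp it with the image's real creation time."""
    with open(path, "wb") as f:
        f.write(data)
    os.utime(path, (created_ts, created_ts))  # (access time, modification time)

def not_modified(path, if_modified_since):
    """True if the client's copy, dated by If-Modified-Since, is still current."""
    if not if_modified_since:
        return False
    try:
        client_ts = parsedate_to_datetime(if_modified_since).timestamp()
    except (TypeError, ValueError):
        return False  # unparseable header: just send the full response
    return int(os.path.getmtime(path)) <= int(client_ts)

# The image was created on Jan 1, 2009 (UTC), whenever it happens to be re-cached.
created = calendar.timegm((2009, 1, 1, 0, 0, 0))
path = os.path.join(tempfile.mkdtemp(), "spaceshuttle.jpg")
write_cache_file(path, b"fake image bytes", created)
header = formatdate(created, usegmt=True)  # what a client would send back
print(not_modified(path, header))
```

When `not_modified` returns true the handler can answer 304 with no body. An ETag derived from the row's id and version would serve the same purpose and avoids date parsing altogether.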
I've just gone through all this for a Java project serving videos from a DB, and it all works a treat.

It can make sense if you only occasionally need to retrieve an image and it has to be available on several different web servers. But I think that's pretty much it.
If it doesn't have to be available on several servers, it's always better to put them in the file system.
If it has to be available on several servers and there's actually some kind of load in the system, you'll need some kind of distributed storage.
We're talking an edge case here, where you can avoid adding an additional level of complexity to your system by leveraging the database.
Other than that, don't do it.

I understand that the majority of database professionals will cross their fingers and hiss at you if you store images in the database (or even mention it). Yes, there are definitely performance and storage implications when using the database as the repository for large blocks of binary data of any kind (images just tend to be the most common bits of data that can't be normalized). However, there are most certainly circumstances where database storage of images is not only allowable but advisable.
For instance, in my old job we had an application where users would attach images to several different points of a report that they were writing, and those images had to be printed out when it was done. These reports were moved about via SQL Server replication, and it would have introduced a HUGE headache to try to manage these images and file paths across multiple systems and servers with any sort of reliability. Storing them in the database gave us all of that "for free," and the reporting tool didn't have to go out to the file system to retrieve the image.

My general advice would be not to limit yourself to one approach or the other - go with the technique that fits the situation. File systems are very good at storing files, and databases are very good at providing bite-sized chunks of data on request. On the other hand, one of my company's products has a requirement to store the entire state of the application in the database, which means that file attachments go in there as well. With our DB server (SQL Server 2005) I've yet to run into observable performance problems even with large customers and databases.
Microsoft's SQL 2008 gives you the best of both worlds with the FileStream feature - might be worth checking out. http://technet.microsoft.com/en-us/library/bb933993.aspx

One of the advantages of storing images in the database is that the data is portable across systems and independent of the filesystem layout.

The simplest / most performant / most scalable solution is to store your images on the file system. If security is a concern, put them in a location that is not accessible by the web server and write a script that handles security and serves up the files.
Assuming your web/app server and DB server are different machines, you will take a few hits by putting images in the DB: (1) Network latency between the two machines, (2) DB connection overhead, (3) consuming an additional DB connection for each image served. I would be more concerned about the last point: if your site serves a lot of images, your web servers are going to be consuming many DB connections and could exhaust your connection pools.

If your application runs on multiple servers, I'd store the reference copy of your images in the database and then cache them on demand on the filesystems. Doing so is just way less of an error prone pain in the ass than trying to sync filesystems laterally.
If your application is on a single server, then yeah, stick to the filesystem and have the database maintain a path to the data.

Most SQL databases are of course not designed with serving up images in mind, but there is a certain amount of convenience associated with having them in the database.
For example, if you already have a database running and have replication configured, you instantly have an HA image store rather than trying to work out some rsync or NFS based filesystem replication. Also, having a bunch of web processes (or designing some new service) to write files to disk increases your complexity a bit. Really it's just more moving parts.
At the very least, I would recommend keeping 'meta' data about the image (such as any permissions, who owns it, etc) and the actual data separated into different tables so it will be fairly easy to switch to a different data store down the line. That coupled with some sort of CDN or caching should give you pretty good performance up to a point, so I suppose it depends on how scalable this application needs to be and how you balance that with ease of implementation.

You don't have to store the URL (if you feel this is unsafe). You can just store a unique id that references the image elsewhere.
Database storage tends to be more expensive and costly to maintain than a file system - hence I wouldn't store LOTS of images in a database.

database for data
filesystem for files

Disaster recovery is absolutely no fun when you have terabytes of image data stored in the database. You're better off finding a better way to distribute your data to make it more reliable etc. Of course all the overhead (mentioned above) is multiplied when replicating and so on.
Just don't do it!

This really seems like a KISS (keep it simple, stupid) problem. File systems are made to easily handle storing picture files, whereas it is not easy to do in a database and it is easy to mess up the data. Why take a performance hit and all the difficulty in the SQL and rendering when you can just worry about file security? You can also handle mixed systems with NFS or CIFS. File systems are mature technologies. Much simpler, more robust.

I stored images in a database for a demonstration application. The reason I did it was security - deleting a record that I shouldn't have wasn't a big problem, but deleting a file I shouldn't have might have been a problem!
If performance became an issue, I would have investigated whether rogue file deletion was a real possibility or not.

If the images are pulled out of the database on a regular basis, I would always try to use the filesystem.
If the images only need to be pulled out once in a while, and saving them in the database makes life easier, I have no problem at all with this.

images/videos/mp3s on Network file system using php

I did some Google searches and can't seem to find what I want. I'm designing my web site to use MySQL and PHP web servers. Multiple web servers with load balancers and a MySQL Cluster for scaling are planned so far. But then I get to images/videos/mp3s. I need a file system multiple servers can read files from and write files to, so that initially one server can run MySQL, the networked file system and the web server, but as the site scales it can be switched to multiple servers. Does anyone have any examples, tutorials or resources to help me on this? The site runs on Ubuntu servers. My original idea was to just store the images in MySQL (I know how to do that and have working examples) so all servers could read/write, but other people told me that's a bad idea and I should use a file system (but I don't want to use the local one, as I don't think it can scale for large sites).

There are three systems that come to mind: MogileFS, MongoDB GridFS and a cloud-based storage solution.
MogileFS (OMG Files!) was developed for LiveJournal and stores metadata in MySQL. It uses that to find the actual disk with the appropriate file and streams it out.
MongoDB GridFS is a lot newer, and probably easier to get going, certainly for a smaller system. It uses a new 'NoSQL' database to store parts of files across its database, assembling them as required. Searching around will turn up plenty of information.
Finally, you could simply avoid the whole issue and just upload images into Amazon's S3, or Rackspace Cloudfiles. I've done the latter before (though the site was already running inside Rackspace's system) and it's not very difficult, again with plenty of examples around.
For S3 there is also a command-line tool, s3cmd, that can be set to sync (or, better, upload and then delete) a directory full of files into an S3 'bucket'.

First, storing images/large files is not really practical with MySQL in a default configuration because of the maximum packet size limitation.
To quote this answer: Choosing data type for MySQL?
MySQL is incapable of working with any data that is larger than max_allowed_packet (default: 1M) in size, unless you construct complicated and memory intense workarounds at the server side. This further restricts what can be done with TEXT/BLOB-like types, and generally makes the LONGTEXT/LONGBLOB type useless in a default configuration.
Now, for storage and upgrade compatibility, why not just store them on a NAS or RAID system that you can continue to tack drives onto? Then in your DB just store a path to the file. That is much less DB-intensive and allows for decent scalability.

Storing profile pictures. (Database or Filesystem?) [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Storing Images in DB - Yea or Nay?
I need to store profile pictures. Lots of them.
I'm not sure if I should store them in the database. I'm not even sure if that's a good idea to begin with, or if I should just store them in a separate directory on the server, and disallow access to them with HTAccess.
But I'm not overly familiar with HTAccess and when I have used snippets to disallow access to a folder, it has just never worked.
I am using winhost.com to host my sites, so I would assume that HTAccess would work.
Can anyone suggest which way would be better for storing tens of thousands of profile pictures on a single server? I have read many blogs, forum posts etc that I've found on Google, and am a little bit more confused since half of them suggest one thing, and the other half disagree and suggest using a database would be perfectly fine.

Personal experience says that storing lots of images in a database makes the database very slow to back up. That can be irritating when you come to run repeatable tests, or update the DB schema and you want to take an ad-hoc backup, as well as in the general case. Also, depending on the database, storing blobs (which inevitably means that you're storing rows of non-fixed length) can make querying the table quite slow, although that can easily be fixed with appropriate indexing.
If you store them in the filesystem and serve them directly with your webserver as you suggest, one problem you will find is how to appropriately access-control them if you want only logged-in users to see them. That will depend on the design of your application and may not be a problem.
Two other options:
you can store them in the filesystem and serve them with an application page, so that it can e.g. check access control before fetching the image and sending it to the client.
you can use X-SendFile: headers if your webserver supports them to serve a file on the filesystem - the application page tells the webserver the file to fetch, and the webserver will fetch the file and send it. Potentially the application and the image files can live on different machines if you use e.g. FastCGI, and the image is never sent over the FastCGI connection.
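The two options above can be combined: an application page checks access and then hands the actual transfer to the web server via X-Sendfile. Below is a minimal sketch as a Python WSGI app; the `IMAGE_ROOT` path and `image_app` name are hypothetical, authentication is assumed to have been done upstream (exposed as `REMOTE_USER`), and the exact header name depends on your server (Apache's mod_xsendfile and lighttpd use X-Sendfile, nginx uses X-Accel-Redirect with an internal location).

```python
import mimetypes

IMAGE_ROOT = "/srv/images"  # hypothetical directory outside the document root

def image_app(environ, start_response):
    """Check access in the application, then let the web server send the file."""
    user = environ.get("REMOTE_USER")  # assumes authentication happened upstream
    name = environ.get("PATH_INFO", "").lstrip("/")
    if not user:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"forbidden"]
    # Accept plain file names only, so a request cannot escape IMAGE_ROOT.
    if not name or "/" in name or name.startswith("."):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("X-Sendfile", IMAGE_ROOT + "/" + name),  # the web server streams this file
    ])
    return [b""]  # empty body: the server substitutes the file contents
```

The application never reads the image bytes itself, so it stays cheap even for large files.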
You may also want to consider caching: if you write any programmatic way to send the file, you'll need to add additional logic so that the image can be cached by the browser, or you'll just end up serving the image over and over again and upping your bandwidth costs.

There is a trade-off; it will depend on your exact situation and needs. The benefits of each include:
Filesystem
Performance, especially caching and I/O
Database
Easier to scale out to multiple web servers
Easier to administer (backup, security etc)
I'm guessing that you are using MySQL, but on the off chance that you have a SQL 2008 DB, have a look at FileStream in this SO article - this gives the best of both worlds.

I'd definitely root for storing only the image path in the database. Storing the image data will slow your site down and put extra strain on your system.
The only case I could imagine an advantage in storing the image data inside the database would be, if you're planning on moving the site around. Then you wouldn't have to worry about filepaths etc..

Serving images through HTTP. Load balanced highly available architecture

I'm planning a system for serving image files from a server cluster with load-balancing. I'm battling with the architecture and whether to save the actual image files as blobs in the database or in the filesystem.
My problem is that the database connection is required anyway, as the users need to be authenticated. Different users have access only to the contents of their friends and items uploaded by themselves. Since the connection is required anyway, maybe the images could be retrieved from there as well?
Images should be stored with no single point of failure. And obviously, the system should be fast.
For database approach:
The database is separate from the rest of my application, so my application's main database won't get bloated by all the images. The database would be easy to scale, as I just need to add more servers to the cluster. The problem is that I've heard this might be slow for a website with millions, even billions of photos.
For filesystem:
I would be really interested in knowing how could one design a system, where the webservers are load balanced, and none of them is too important for the overall system. All the servers should use a common storage, so they can access the same files in the cluster.
What do you think? Which is the best solution in this case?
What kind of overall architecture and servers would you recommend for an image serving cluster? Note: This cluster only serves images. Application servers are a whole different story.

I definitely wouldn't store them in the database. If you need to use PHP for authentication, then do that as quickly as possible and use X-SendFile to hand over the actual image serving to your web server.
For the filesystem it sounds like MogileFS would be a good fit.
For the web server I'd suggest nginx. If you can adapt your authentication mechanism to use one of the existing modules, or write your own module for it, you could omit PHP completely (there's already a MogileFS client module).

What's the best way to manage multiple media servers, and file allocations between them?

I have a file host website that's burning through 2 Gbit of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multiple server setup with a large amount of files? Preferably through PHP only.
Currently, I only have around 100 GB of files, so I could get a 2nd server, mirror all content between them, and then round robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
The idea that I had was to have a list of media servers stored in the DB with the amount of free space left on each server. Once a file is uploaded, PHP will choose which server the file is actually uploaded to, and spread out all the files evenly among the servers.
Was hoping to get some more input/inspiration.
Can't use any 3rd party services like Amazon. The files range from several bytes to a gigabyte.
Thanks

You could try MogileFS. It is a distributed file system. Has a good API for PHP. You can create categories and upload a file to that category. For each category you can define on how many servers it should be distributed. You can use the API to get a URL to that file on a random node.

If you are doing as much data transfer as you say, it would seem whatever it is you are doing is growing quite rapidly.
It might be worth your while to contact your hosting provider and see if they offer any sort of shared storage solution via iSCSI, NAS, or other means. Ideally the storage would not only start out large enough to store everything you have on it, but would also be able to grow dynamically as your needs do. I know my hosting provider offers a solution like this.
If they do not, you might consider colocating your servers somewhere that either does offer a service like that, or would allow you to install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).
Once you have a centralized storage platform, you could then add web servers to your heart's content and load balance them based on load, all while accessing the same central storage repository.
Not only is this the correct way to do it, it would offer you much more redundancy and expandability in the future if your endeavor continues to grow at its current pace.
The other solutions offered, using a database repository of what is stored where, would work, but they add not only an extra layer of complexity but also an extra layer of processing between your visitors and the data they wish to access.
What if you lost a hard disk, do you lose 1/3 or 1/2 of all your data?
Should the heavy IO's of static content be on the same spindles as the rest of your operating system and application data?

Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.)
One way to do that is to move your content into the cloud. Mosso's CloudFiles or Amazon's S3 will both allow you to store an almost infinite amount of files. All your content is then accessible through an API. If you want, you can then use MySQL to track meta-data for easy searching, and let the service handle the actual storage of the files.

i think your own idea is not the worst one. get a bunch of servers, and for every file store which server(s) it's on. if new files are uploaded, use most-free-space first*. every server handles its own delivery (instead of piping through the main server).
pros:
use multiple servers for a single file. e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on the server load (or randomly, or alternating, ...)
if you're careful you may be able to move around files automatically depending on the current load. so if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
you could do this with a cron-job. just log the downloads for the last xx minutes. try some formula like (downloads-per-minute * filesize * (product of serverloads)) for weighting. pick thresholds for increasing/decreasing the number of servers those files are distributed to.
if you add a new server, it's relatively painless (just add the address to the server pool)
cons:
homebrew solutions are always risky
your load distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere)
constantly moving files around for balancing adds additional server load
* or use a mixed weighting algorithm: free-space, server-load, file-popularity
disclaimer: never been in the situation myself, just guessing.
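a rough sketch of the selection logic described above, in Python (the same few lines translate directly to PHP). the `MediaServer` fields and all function names are made up for illustration, how you measure "load" is up to you, and the weighting function reads the formula above as a plain product.

```python
import math
from dataclasses import dataclass

@dataclass
class MediaServer:
    host: str
    free_bytes: int
    load: float  # any comparable load figure; lower means less busy

def pick_upload_server(servers, file_size):
    """Most-free-space first, skipping servers the file would not fit on."""
    candidates = [s for s in servers if s.free_bytes >= file_size]
    if not candidates:
        raise RuntimeError("no media server has enough free space")
    return max(candidates, key=lambda s: s.free_bytes)

def pick_download_server(replicas):
    """Among the servers holding a copy, send the client to the least loaded one."""
    return min(replicas, key=lambda s: s.load)

def popularity_weight(downloads_per_minute, file_size, server_loads):
    """Weight used to decide whether a file deserves more (or fewer) replicas."""
    return downloads_per_minute * file_size * math.prod(server_loads)
```

the server list would normally come from the database table that tracks free space, and a cron job would compare `popularity_weight` against thresholds you pick to add or drop replicas.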

Consider HDFS, which is part of Apache's Hadoop. This will integrate with PHP, but you'll be setting up a second application. This will also solve all your points about balancing among servers and handling things when your file space usage exceeds one server's capacity. It's not purely in PHP, but I don't think that's what you meant when you said "pure" anyway.
See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. They cover the whole idea of how it handles large files, many files, replication, etc.
