CakePHP High-Availability Server Farm setup - php

I am currently working on configuring my CakePHP (1.3) based web app to run in an HA setup. I have 4 web boxes running the app itself and a MySQL cluster for the database backend. Users upload 12,000 - 24,000 images a week (35-70 GB). The app then generates 2 additional files from each original, a thumbnail and a medium-size image for preview. That means a total of 36,000 - 72,000 possible files added to the repositories each week.
What I am trying to wrap my head around is how to handle the large number of static file requests coming from users trying to view these images. I mean, I can have multiple web boxes serving only static files, with a load balancer dispatching the requests.
But does anyone on here have any ideas on how to keep all static file servers in sync?
If any of you have experiences you would like to share, or any useful links for me, it would be much appreciated.
Thanks,
serialk

It's quite a thorny problem.
Technically you can get a high-availability shared directory through something like NFS (or SMB if you like), using DRBD and Linux-HA for an active/passive setup. Such a setup gives good availability against the loss of a single server; however, it is quite wasteful and not easy to scale - you'd have to have the app itself decide which server(s) to go to, configure NFS mounts, etc., and it all gets rather complicated.
So I'd probably argue for not keeping the images in a filesystem at all - or at least, not the conventional kind. I am assuming that you need this to be flexible enough to add more storage in the future - if you can keep the storage and IO requirements constant, DRBD with HA NFS is probably a good system.
For storing files in a flexible "cloud", consider either:
Tahoe-LAFS
or, at a push, Cassandra, which would require a bit more integration but may be better in some ways.
MySQL Cluster is not great for big blobs, as it (mostly) keeps the data in RAM; also, the high consistency it provides requires a lot of locking, which makes updates scale (relatively) badly under heavy workloads.
But you could still consider putting the images in MySQL Cluster anyway, particularly as you have already set it up - it would require no extra operational overhead.
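If you do end up putting the images in MySQL Cluster, the PHP side is simple enough. Here is a rough sketch using PDO - the table name and columns are just assumptions for illustration, and you would still need to keep max_allowed_packet above the size of your largest originals:

    <?php
    // Illustrative only: store an uploaded image as a BLOB in an assumed
    // "images" table on the existing MySQL Cluster backend.
    $pdo = new PDO('mysql:host=mysql-cluster;dbname=app', 'user', 'pass');

    $stmt = $pdo->prepare(
        'INSERT INTO images (filename, mime_type, data) VALUES (?, ?, ?)'
    );
    $stmt->bindValue(1, $_FILES['photo']['name']);
    $stmt->bindValue(2, $_FILES['photo']['type']);
    $stmt->bindValue(3, file_get_contents($_FILES['photo']['tmp_name']), PDO::PARAM_LOB);
    $stmt->execute();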

Related

images/videos/mp3s on Network file system using php

I did some Google searches and can't seem to find what I want. I'm designing my web site to use MySQL and PHP web servers. Multiple web servers with load balancers and a MySQL cluster are planned so far for scaling. But then I get to images/videos/mp3s. I need a file system that multiple servers can read files from and write files to, so that one web server can run MySQL, the networked file system and the web server, but as the site scales it can be split across multiple servers. Does anyone have any examples, tutorials or resources to help me with this? The site runs on Ubuntu servers. My original idea was to just store the images in MySQL (I know how to do that and have working examples) so all servers could read/write, but other people told me that's a bad idea and that I should use a file system - just not the local one, as I don't think it can scale for large sites.
There are three systems that come to mind - MogileFS, MongoDB GridFS and a cloud-based storage solution.
MogileFS ("OMG Files!") was developed for LiveJournal and stores metadata in MySQL. It uses that to find the actual disk with the appropriate file and streams it out.
MongoDB GridFS is a lot newer, and probably easier to get going, certainly for a smaller system. It uses a 'NoSQL' database to store files in chunks across the database, reassembling them as required. Searching around will turn up plenty of information.
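For a feel of what GridFS looks like from PHP, here is a hedged sketch using the legacy mongo extension's MongoGridFS class - the database name and metadata fields are made up for the example:

    <?php
    // Illustrative sketch only: store and retrieve an image via GridFS.
    $mongo  = new Mongo('mongodb://mongo1,mongo2');   // legacy PECL "mongo" driver
    $gridfs = $mongo->selectDB('media')->getGridFS();

    // Store an uploaded file; GridFS chunks it inside the database.
    $id = $gridfs->storeFile($_FILES['photo']['tmp_name'], array(
        'filename' => $_FILES['photo']['name'],
        'mime'     => $_FILES['photo']['type'],
    ));

    // Later, stream it back out.
    $file = $gridfs->findOne(array('_id' => $id));
    header('Content-Type: ' . $file->file['mime']);
    echo $file->getBytes();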
Finally, you could simply avoid the whole issue and just upload images to Amazon S3 or Rackspace Cloud Files. I've done the latter before (though the site was already running inside Rackspace's system) and it's not very difficult, again with plenty of examples around.
For S3 there is also a command-line tool, s3cmd, that can be set to sync (or, better, upload and then delete) a directory full of files into an S3 'bucket'.
First, storing images/large files in MySQL is not really practical because of the maximum size limitation.
To quote this answer to 'Choosing data type for MySQL?':
MySQL is incapable of working with any data that is larger than max_allowed_packet (default: 1M) in size, unless you construct complicated and memory-intense workarounds at the server side. This further restricts what can be done with TEXT/BLOB-like types, and generally makes the LONGTEXT/LONGBLOB types useless in a default configuration.
Now, for storage and upgrade compatibility, why not just store them on a NAS or RAID system that you can keep adding drives to? Then in your DB just store a path to the file. Much less DB-intensive, and it allows for decent scalability.
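A bare-bones sketch of that approach (the NAS mount point and the table are assumptions, not anything from the question):

    <?php
    // Save the upload to shared storage, record only its path in the database.
    $storageRoot = '/mnt/nas/images';                  // same mount on every web server
    $name        = uniqid('img_', true) . '.jpg';      // avoid name collisions across boxes
    $dest        = $storageRoot . '/' . $name;

    if (move_uploaded_file($_FILES['photo']['tmp_name'], $dest)) {
        $pdo  = new PDO('mysql:host=db;dbname=app', 'user', 'pass');
        $stmt = $pdo->prepare('INSERT INTO images (path, uploaded_at) VALUES (?, NOW())');
        $stmt->execute(array($dest));
    }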

Serving images through HTTP. Load balanced highly available architecture

I'm planning a system for serving image files from a server cluster with load balancing. I'm wrestling with the architecture and whether to save the actual image files as blobs in the database or in the filesystem.
My problem is that a database connection is required anyway, as users need to be authenticated. Different users have access only to content from their friends and items uploaded by themselves. Since the connection is required anyway, maybe the images could be retrieved from there as well?
Images should be stored with no single point of failure. And obviously, the system should be fast.
For the database approach:
The database is separate from the rest of my application, so my application's main database won't get bloated by all the images. The database would be easy to scale, as I just need to add more servers to the cluster. The problem is that I've heard this might be slow for a website with millions, even billions, of photos.
For the filesystem:
I would be really interested in knowing how one could design a system where the web servers are load-balanced and none of them is too important for the overall system. All the servers should use common storage, so they can access the same files in the cluster.
What do you think? Which is the best solution in this case?
What kind of overall architecture and servers would you recommend for an image-serving cluster? Note: this cluster only serves images. Application servers are a whole different story.
I definitely wouldn't store them in the database. If you need to use PHP for authentication, then do that as quickly as possible and use X-SendFile to hand over the actual image serving to your web server.
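As a sketch of that flow (the is_authorized() check and the image directory are placeholders; with Apache this needs mod_xsendfile, and nginx's equivalent header is X-Accel-Redirect):

    <?php
    // Authenticate in PHP, then hand the actual file transfer back to the web server.
    session_start();
    if (empty($_SESSION['user_id']) || !is_authorized($_SESSION['user_id'], $_GET['image'])) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }

    $path = '/var/www/images/' . basename($_GET['image']);  // basename() blocks path traversal
    header('Content-Type: image/jpeg');
    header('X-Sendfile: ' . $path);                          // Apache + mod_xsendfile
    // header('X-Accel-Redirect: /protected/' . basename($_GET['image'])); // nginx variant
    exit;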
For the filesystem it sounds like MogileFS would be a good fit.
For the web server I'd suggest nginx. If you can adapt your authentication mechanism to use one of the existing modules, or write your own module for it, you could omit PHP completely (there's already a MogileFS client module).

cloud drive vs. cloud files (or should we not bother?)

The web application is in the process of moving from a standalone server to a pair of servers behind a load balancer, and contains a 50 GB directory of user-created data that is growing rapidly. On Rackspace, the only way to add disk space dynamically is by also doubling RAM and monthly cost, which isn't necessary. So, to Cloud Files it is (unless anyone has another solution in mind?). Using JungleDisk, I can move the files to a Cloud Files container, mount the container on both servers, and create a symbolic link from the directories where the content was to the mounted drive. This would require no code modification. Alternatively, I could interface directly with Cloud Files using their PHP API, but this would require massive code changes (all the paths? really?). Is there any inherent problem with taking the easy way out in this case? I set up a model and it seems to work well, but I usually seem to be missing something.
Thanks,
Brandon
I think mounting the drive makes a lot of sense for your scenario, but to be honest I haven't tried it under any load. The good news is that you could always try the easy approach and then refactor if it doesn't perform under load. I'd hope Rackspace accounted for and tested this exact scenario; it seems logical to me.
For some additional information, we faced the same question here and did a cost comparison of using Cloud Sites vs Cloud Files. We had to factor both bandwidth and the amount of storage into the costs, because communication between Sites/Servers and Cloud Files still incurs bandwidth charges. In other words, do you have a lot of files that sit around, or do you have a few files that get accessed often?
We spent a lot of time talking with Rackspace support about the performance and scalability differences between Cloud Sites and Cloud Files - I'd recommend giving them a call. We ultimately chose to just use Sites for our needs; the cost difference was pretty insignificant as it scaled. Also, the Cloud Files API didn't have the granular security that we needed, so we would have had to write a gateway service anyway.

What's the best way to manage multiple media servers, and file allocations between them?

I have a file-host website that's burning through 2 Gbit of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multiple-server setup with a large number of files? Preferably through PHP only.
Currently, I only have around 100 GB of files, so I could get a 2nd server, mirror all content between them, and then round-robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
The idea that I had was to have a list of media servers stored in the DB with the amount of free space left on each server. Once a file is uploaded, PHP will choose which server the file is actually uploaded to, spreading the files evenly among the servers.
I was hoping to get some more input/inspiration.
Can't use any third-party services like Amazon. The files range from several bytes to a gigabyte.
Thanks
You could try MogileFS. It is a distributed file system with a good API for PHP. You can create categories and upload a file to a category; for each category you can define across how many servers it should be replicated. You can then use the API to get a URL to that file on a random node.
If you are doing as much data transfer as you say, it would seem whatever it is you are doing is growing quite rapidly.
It might be worth your while to contact your hosting provider and see if they offer any sort of shared storage solution via iSCSI, NAS, or other means. Ideally the storage would not only start out large enough to store everything you have, but also be able to grow dynamically beyond your needs. I know my hosting provider offers a solution like this.
If they do not, you might consider colocating your servers somewhere that either does offer a service like that, or would allow you to install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).
Once you have a centralized storage platform, you can then add web servers to your heart's content and load-balance them, all while accessing the same central storage repository.
Not only is this the correct way to do it, it would offer you much more redundancy and expandability in the future if your endeavor continues to grow at its current pace.
The other solutions offered - using a database repository of what is stored where - would work, but they not only add an extra layer of complexity, they also put an extra layer of processing between your visitors and the data they wish to access.
What if you lost a hard disk, do you lose 1/3 or 1/2 of all your data?
Should the heavy I/O of static content be on the same spindles as the rest of your operating system and application data?
Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.)
Your best bet is to move your content into the cloud. Mosso's Cloud Files or Amazon's S3 will both let you store an almost unlimited number of files. All your content is then accessible through an API. If you want, you can then use MySQL to track metadata for easy searching, and let the service handle the actual storage of the files.
I think your own idea is not the worst one: get a bunch of servers, and for every file store which server(s) it's on. When new files are uploaded, use most-free-space first*. Every server handles its own delivery (instead of piping through the main server).
Pros:
You can use multiple servers for a single file, e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on server load (or randomly, or alternating, ...).
If you're careful, you may be able to move files around automatically depending on the current load. So if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
You could do this with a cron job: just log the downloads for the last xx minutes, try some formula like (downloads-per-minute * filesize * (product of server loads)) for weighting, and pick thresholds for increasing/decreasing the number of servers those files are distributed to.
If you add a new server, it's relatively painless (just add its address to the server pool).
Cons:
Homebrew solutions are always risky.
Your load-distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere).
Constantly moving files around for balancing adds additional server load.
* Or use a mixed weighting algorithm: free space, server load, file popularity.
Disclaimer: I've never been in this situation myself, just guessing.
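For what it's worth, a guessy sketch of the most-free-space-first selection described above (the media_servers/files tables and the form field names are invented for the example):

    <?php
    // Pick the media server with the most free space for the next upload.
    $pdo    = new PDO('mysql:host=db;dbname=app', 'user', 'pass');
    $server = $pdo->query(
        'SELECT id, hostname, free_bytes FROM media_servers ORDER BY free_bytes DESC LIMIT 1'
    )->fetch(PDO::FETCH_ASSOC);

    // ... push the file to $server['hostname'] (scp, FTP, HTTP PUT - whatever you use) ...

    // Record the placement and decrement that server's free space.
    $size = filesize($_FILES['upload']['tmp_name']);
    $pdo->prepare('INSERT INTO files (name, server_id) VALUES (?, ?)')
        ->execute(array($_FILES['upload']['name'], $server['id']));
    $pdo->prepare('UPDATE media_servers SET free_bytes = free_bytes - ? WHERE id = ?')
        ->execute(array($size, $server['id']));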
Consider HDFS, which is part of Apache Hadoop. It will integrate with PHP, but you'll be setting up a second application. It will also solve all your points about balancing among servers and handling things when your file-space usage exceeds one server's capacity. It's not purely PHP, though, but I don't think that's what you meant by "PHP only" anyway.
See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. They cover the whole idea of how it handles large files, many files, replication, etc.

File / Image Replication

I have a simple question and wish to hear others' experiences regarding which is the best way to replicate images across multiple hosts.
I have determined that storing images in the database and then using database replication over multiple hosts would result in maximum availability.
The worry I have with the filesystem is the difficulty of synchronising the images (e.g. I don't want 5 servers all hitting the same server for images!).
Now, the only concerns I have with storing images in the database are the extra queries hitting the database and the extra handling I'd have to put in place in Apache if I wanted 'virtual' image links to point to database entries (e.g. AddHandler).
As far as my understanding goes:
If you have a script serving up the images: each image would require a database call.
If you display the images inline as binary data: this could be done in a single database call.
To provide external/linkable images, you would have to add an AddHandler for the extension you wish to 'fake' and point it to your scripting language (e.g. PHP, ASP).
I might have missed something, but I'm curious if anyone has any better ideas?
Edit:
Tom has suggested using mod_rewrite to avoid the AddHandler, which I have accepted as a proposed solution to the AddHandler issue; however, I don't feel I have a complete solution yet, so please keep answering ;)
A few have suggested using lighttpd over Apache. How different are the ISAPI modules for lighttpd?
If you store images in the database, you take an extra database hit plus you lose the innate caching/file serving optimizations in your web server. Apache will serve a static image much faster than PHP can manage it.
In our large app environments, we use up to 4 clusters:
App server cluster
Web service/data service cluster
Static resource (images, documents, multimedia) cluster
Database cluster
You'd be surprised how much traffic a static resource server can handle. Since it's not really computing (no app logic), a response can be optimized like crazy. If you go with a separate static resource cluster, you also leave yourself open to changing just that portion of your architecture. For instance, in some benchmarks lighttpd is even faster at serving static resources than Apache. If you have a separate cluster, you can change your HTTP server there without changing anything else in your app environment.
I'd start with a 2-machine static resource cluster and see how that performs. That's another benefit of separating functions - you can scale out only where you need it. As far as synchronizing files, take a look at existing file synchronization tools versus rolling your own. You may find something that does what you need without having to write a line of code.
Serving the images from wherever you decide to store them is a trivial problem; I won't discuss how to solve it.
Deciding where to store them is the real decision you need to make. You need to think about what your goals are:
Redundancy of hardware
Lots of cheap storage
Read-scaling
Write-scaling
The last two are not the same and will definitely cause problems.
If you are confident that the size of this image library will not exceed the disk you're happy to put in your web servers (say, 200 GB at the time of writing, the largest high-speed server-grade disks obtainable; I assume you want to use 1U web servers, so you won't be able to store more than that in RAID 1, depending on your vendor), then you can get very good read-scaling by placing a copy of all the images on every web server.
Of course you might want to keep a master copy somewhere too, and have a daemon or process which syncs them from time to time, and have monitoring to check that they remain in sync and that the daemon works, but these are details. Keeping a copy on every web server will make read-scaling pretty much perfect.
But keeping a copy everywhere will ruin write-scalability, as every single web server will have to write every changed / new file. Therefore your total write throughput will be limited to the slowest single web server in the cluster.
"Sharding" your image data between many servers will give good read/write scalability, but is a nontrivial exercise. It may also allow you to use cheap(ish) storage.
Having a single central server (or active/passive pair or something) with expensive IO hardware will give better write-throughput than using "cheap" IO hardware everywhere, but you'll then be limited by read-scalability.
Having your images in a database doesn't necessarily mean a database call for each one; you could cache these separately on each host (e.g. in temporary files) when they are retrieved. The source images would still be in the database and easy to synchronise across servers.
You also don't really need to add Apache handlers to serve an image through a PHP script while maintaining nice URLs - you can make URLs like http://server/image.php/param1/param2/param3.JPG and read the parameters through $_SERVER['PATH_INFO']. You could also remove the 'image.php' portion of the URL (if you needed to) using mod_rewrite.
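A quick sketch combining both points - reading the path from PATH_INFO and caching the database-stored image as a local temp file on each host (the table and cache directory are assumptions):

    <?php
    // image.php - e.g. http://server/image.php/42/medium/photo.jpg
    $relative  = ltrim($_SERVER['PATH_INFO'], '/');            // "42/medium/photo.jpg"
    $cacheFile = '/var/cache/images/' . md5($relative) . '.jpg';

    if (!file_exists($cacheFile)) {
        // Cache miss: pull the blob from the (replicated) database once per host.
        $pdo  = new PDO('mysql:host=db;dbname=app', 'user', 'pass');
        $stmt = $pdo->prepare('SELECT data FROM images WHERE path = ?');
        $stmt->execute(array($relative));
        file_put_contents($cacheFile, $stmt->fetchColumn());
    }

    header('Content-Type: image/jpeg');
    readfile($cacheFile);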
What you are looking for already exists and is called MogileFS.
A typical setup involves mogilefsd, replicated MySQL databases and lighttpd/Perlbal for serving files. It gives you failover and fine-grained file replication (for example, you can decide to duplicate end-user images across several physical devices while keeping only one physical instance of thumbnails). Load balancing can also be achieved quite easily.
