Maintaining user-uploaded images on a website becomes very tough as the number of images increases. In the long run the available disk space will drop to 0 bytes.
Amazon effectively provides unlimited space with its S3 service. If we want to provide unlimited space on our own website, what are the possible ways?
Shard your servers that host the images. Once one fills up, add another. You'll have to write some custom code so the upload process knows where the uploaded files live and when a server is almost full.
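As a very rough sketch of that bookkeeping (the files/image_servers tables, their columns and the 5 GB threshold below are placeholders, not a real schema):

<?php
// Sketch only: table and column names are invented for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

// Serving: look up which shard a previously uploaded image lives on.
$stmt = $pdo->prepare(
    'SELECT s.hostname
       FROM files f
       JOIN image_servers s ON s.id = f.server_id
      WHERE f.name = :name'
);
$stmt->execute([':name' => 'photo123.jpg']);
$imageUrl = 'http://' . $stmt->fetchColumn() . '/images/photo123.jpg';

// Uploading: check whether the shard currently taking uploads is almost full,
// so you know when it is time to add the next server to the pool.
$freeBytes = (int) $pdo->query(
    'SELECT free_bytes FROM image_servers WHERE accepts_uploads = 1 LIMIT 1'
)->fetchColumn();

if ($freeBytes < 5 * 1024 * 1024 * 1024) {
    // Flag this shard as full and start directing new uploads at the next server.
}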
The only way I know of is to minimize the size of all your files (e.g. index.php) and, when your website begins to make money, buy another HDD. I think that's all Amazon does, anyway.
I'd like a little guidance from you all. I have a multimedia-based site hosted on traditional Linux-based LAMP hosting. The site is mostly image/video content: there are around 30,000+ posts, the database is only about 20-25 MB, but file system usage is around 10 GB and roughly 800-900 GB of bandwidth (of the allowed 1 TB) gets used every month.
Now, after a little brainstorming and looking at my alternatives, I have come up with two options:
Increase / Get a bigger hosting plan.
Get my static content stored on Amazon S3.
While the first option would be simple, I am actually leaning towards the second one, i.e. storing my static content on Amazon S3. The website I have is totally custom-coded and based on PHP+MySQL. I went through http://undesigned.org.za/2007/10/22/amazon-s3-php-class/ and it gave me a fair idea.
I would love to know the pros and cons of hosting static content on S3.
Please give your inputs.
Increase / Get a bigger hosting plan.
I would not do that. The reason is, storage is cheap, while the other components of a "bigger hosting plan" will cost you dearly without providing an immediate benefit (more memory is expensive if you don't need it).
Get my static content stored on Amazon S3.
This is the way to go. S3 is very inexpensive; it's a no-brainer. Having said that, since we are talking video here, I would recommend a third option:
[3.] Store video on AWS S3 and serve through CloudFront. It is still rather inexpensive by comparison, given the spectacular bandwidth and global distribution. CloudFront is Amazon's CDN for blazing fast speeds to any location.
If you want to save on bandwidth, you may also consider using Amazon Elastic Transcoder for high-quality compression (to minimize your bandwidth usage).
Traditional hosting is way too expensive for this.
Bigger Hosting Plan
Going for a bigger hosting plan is not a permanent solution, because:
Static content (images/videos) always grows. Your need is 1 TB today; next time it will be more, so you will be back in the same situation.
As users and static content grow, your bandwidth usage will also increase and cost you more.
Your database is not very big, and we can assume you are not using a lot of CPU power or memory. So you would only be using more disk space while paying for CPU and memory you don't need.
Technically it is not good to serve all your requests from a single server; browsers limit the number of simultaneous requests per domain.
S3/ Cloud storage for static content
S3 or other cloud storage is a good option for static content. The benefits:
You don't need to worry about storage space; it auto-scales and is available in abundance.
If your site is accessed from different locations worldwide, you can add a CDN to serve content from the location nearest to each user.
Bandwidth is very cheap compared to traditional hosting.
It also takes load off your own server, since files are uploaded to and served from S3.
These are some of the benefits of using S3 over traditional hosting, as S3 is specifically built to serve static content. The decision is yours :)
If you're looking at the long term, at some point you might not be able to afford a server that will hold all of your data. I think S3 is a good option for a case like yours for the following reasons:
You don't need to worry about large file uploads tying down your server. With Cross-Origin Resource Sharing (CORS), you can upload files directly from the client to your S3 bucket (see the sketch after this list).
Modern browsers will often load parallel requests when a webpage requests content from different domains. If you have your pictures coming from yourbucket.s3.amazonaws.com and the rest of your website loading from yourdomain.com, your users might experience a shorter load time since these requests will be run in parallel.
At some point, you might want to use a Content Distribution Network (CDN) to serve your media. When this happens, you could use Amazon's CloudFront with out-of-the-box support for S3, or you can use another CDN - most popular CDNs these days do support serving content from S3 buckets.
It's a problem you'll never have to worry about. Amazon takes care of redundancy, availability, backups, failovers, etc. That's a big load off your shoulders leaving you with other things to take care of knowing your media is being stored in a way that's scalable and future-proof (at least the foreseeable future).
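On the direct-upload point above, a rough sketch with the AWS SDK for PHP v3 (the bucket name, region and key are placeholders, and the bucket would also need a CORS rule allowing PUT requests from your site's origin):

<?php
// Sketch: generate a short-lived pre-signed PUT URL that the browser can use
// to send the file body straight to S3, bypassing your own server.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

$cmd = $s3->getCommand('PutObject', [
    'Bucket' => 'yourbucket',                                   // placeholder
    'Key'    => 'uploads/' . bin2hex(random_bytes(8)) . '.jpg', // placeholder key
]);

$request = $s3->createPresignedRequest($cmd, '+15 minutes');

// Hand this URL to the client; it does an XHR PUT of the file to S3 directly.
echo (string) $request->getUri();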
I use Jelastic to host a PHP application. Editors can upload pictures through the application that are stored in the file system. These pictures are stored within the document root and are served on the frontend as e.g. http://example.com/uploads/123/picture.jpeg
For the NGINX application server, I have enabled vertical scaling but have a single node, i.e. no horizontal scaling.
Picture uploads are not reliable. When I update picture #1 through my PHP admin interface and then update another one, picture #1 changes back to the old version.
My question: Are picture uploads sync'ed across multiple cloudlets on a single node? What will happen if I scale horizontally to multiple nodes?
My question: Are picture uploads sync'ed across multiple cloudlets on a single node?
I think there is a terminology problem here.
Cloudlet: A composite resource unit composed of RAM and CPU usage. 1 Cloudlet = 128MB RAM and approx. 200MHz CPU. A server (Jelastic refers to this as a 'node') typically uses multiple cloudlets; e.g. it may use several GB RAM and/or several GHz CPU at any given moment.
More details at http://kb.layershift.com/introducing-cloudlets
Each node is a self-contained (virtual) server, with its own filesystem. So if you have a single NGINX PHP application server, it doesn't matter if it uses 1 or 100 cloudlets (remember, this is only a measurement of RAM and CPU consumption!), it has 1 filesystem and all of the files that you successfully write there will be available for any subsequent requests.
What will happen if I scale horizontally to multiple nodes?
Right, you have to be careful here. If your application is writing to the local filesystem, you have a problem when dealing with multiple horizontally scaled servers. This is a very typical scaling problem that every application must deal with.
If we're simply talking about static resources (e.g. images), one of the best and simplest ways to handle this issue is to upload all of those to a single server. For example if you have 4 NGINX PHP servers - let's say they load balance your-application.com - you might make one of those servers (or perhaps a completely separate environment) images.your-application.com
So you perform the uploads to images.your-application.com, and reference that directly in your HTML when you wish to display those uploaded images.
Remember, images.your-application.com is only responsible for serving the actual images; so it's really lightweight and should handle a decent volume simply with vertical scaling - which is completely automatic on Jelastic.
When you need to scale images.your-application.com, the easy way is to take a CDN service (CloudFlare, Incapsula etc.). This will leave images.your-application.com only handling the uploads and the small amount of download traffic which is not already cached at the CDN.
I was having the same issue; please read this Jelastic tutorial.
In summary, Jelastic has a script which helps you with synchronization; you just have to execute the script and indicate the folders you want to sync across all nodes.
Then, every time you upload a file to those folders, within seconds or minutes the file will be available on all nodes; the time depends on the file size.
I would appreciate your comments to help me decide on the following.
My requirements:
I have a site hosted on a shared server and I'm going to provide content to my users: about 60 GB of content (about 2,000 files of 30 MB each; users will have access to only 20 files at a time). I calculate about 100 GB of monthly bandwidth usage.
Once a user registers for the content, download links will become accessible to them. But I want the links to expire in 7 days, with the possibility of extending the expiration time.
I think the disk space and bandwidth call for a service like Amazon S3 or Rackspace Cloud Files (or is there an alternative?).
To manage the expiration, I plan either to somehow obtain links that expire (I think S3 has that feature, but not Rackspace), OR to control the expiration date in my database and have a batch process that renames all 200 files on the cloud and in my database on a daily basis (so that if a user copied the direct link, it won't work the next day; only my web page will have the updated links). PHP is used for programming.
So what do you think? Is cloud file hosting the way to go? Which one? Does managing the links that way make sense, or is it too difficult to do through programming (sending commands to the cloud server...)?
EDIT:
Some hosting companies offer unlimited space and bandwidth on their shared plans. I asked their support staff and they said that they really honor the "unlimited" deal. So 100 GB of transfer a month is OK; the only thing to watch out for is CPU usage. So shared hosting is one more alternative to choose from.
FOLLOWUP:
Digging more into this, I found that the TOS of the unlimited plans say it is not permitted to use the space primarily to host multimedia files. So I decided to go with Amazon S3 and the solution provided by Tom Andersen.
Thanks for the input.
I personally don't think you necessarily need a cloud-based solution for this; it may be a little costly. You could simply get a dedicated server instead. One provider that comes to mind gives 3,000 GB/month of bandwidth on some of their lowest-level plans. That is on a 10 Mbit uplink; you can upgrade to 100 Mbps for $10/mo or 1 Gbit for $20/mo. I won't mention any names, but you can search for dedicated servers and possibly find one to your liking.
As for expiring the files, just implement that in PHP backed by a database. You won't have to move files around: store all the files in a directory not accessible from the web, and use a PHP script to determine whether the link is valid and, if so, read the contents of the file and pass them through to the browser. If the link is invalid, show an error message instead. It's a pretty simple concept, and there are plenty of pre-written scripts that do this, but depending on your needs it isn't too difficult to do it yourself.
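A bare-bones sketch of that pass-through idea (the download_links table, its columns and the storage path are made up for illustration):

<?php
// Sketch: validate the link against the database, then stream the file.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

$token = $_GET['token'] ?? '';

$stmt = $pdo->prepare(
    'SELECT filename FROM download_links
      WHERE token = :token AND expires_at > NOW()'
);
$stmt->execute([':token' => $token]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row === false) {
    http_response_code(403);
    exit('This download link has expired.');
}

// Files live outside the document root, so this script is the only way in.
$path = '/var/private-files/' . basename($row['filename']);

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . basename($path) . '"');
header('Content-Length: ' . filesize($path));
readfile($path);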
Cloud hosting has advantages, but right now I think it's costly, and if you aren't trying to spread the load geographically, don't plan on supporting thousands of simultaneous users, and don't need the elasticity of the cloud, you could use a dedicated server instead.
Hope that helps.
I can't speak for S3 but I use Rackspace Cloud files and servers.
It's good in that you don't pay for incoming bandwidth, so uploads are super cheap.
I would do it like this (a rough sketch of the copy step follows the steps):
Upload all the files you need to a 'private' container
Create a public container with CDN enabled
That'll give you a special url like http://c3214146.r65.ce3.rackcdn.com
Make your own CNAME DNS record for your domain pointing to that, e.g. http://cdn.yourdomain.com
When a user requests a file, use the COPY API operation with a long random filename to do a server-side copy from the private container to the public container.
Store the filename in a MySQL DB for your app.
Once the file expires, use the DELETE API operation, then the PURGE API operation to get it out of the CDN; finally, delete the record from the MySQL table.
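A very rough sketch of the copy step against the Swift-style REST API that Cloud Files exposes (the storage URL, auth token and container names are placeholders; in practice you'd take them from the authentication response or use an SDK such as php-opencloud):

<?php
// Sketch: server-side COPY from the private container to the CDN-enabled
// public container under a long random name.
$storageUrl = 'https://storage101.example.clouddrive.com/v1/MossoCloudFS_xxxx'; // placeholder
$authToken  = 'your-auth-token';                                                // placeholder

$privateObject = 'private-container/video123.mp4';
$publicName    = bin2hex(random_bytes(16)) . '.mp4'; // long random filename

$ch = curl_init($storageUrl . '/' . $privateObject);
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST  => 'COPY',
    CURLOPT_HTTPHEADER     => [
        'X-Auth-Token: ' . $authToken,
        'Destination: public-container/' . $publicName,
    ],
    CURLOPT_RETURNTRANSFER => true,
]);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status === 201) {
    // Store $publicName (and its expiry) in the MySQL table for your app.
}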
With the PURGE command, I heard it doesn't work 100% of the time and may leave the file around for an extra day; also, the docs say to reserve its use for emergencies only.
Edit: I just heard there's a limit of 25 purges per day.
However, personally I've just used DELETE on objects and found that took them out of the CDN straight away. In summary, the worst case would be that the file is still accessible on some CDN nodes for 24 hours after deletion.
Edit: You can change the TTL (caching time) on the CDN nodes. The default is 72 hours, so it might pay to set it to something lower, but not so low that you lose the advantage of the CDN.
The advantages I find with the CDN are:
It pushes content right out to end users far away from the USA servers and gives super fast download times for them
If you have a super popular file, it won't take out your site when 1,000 people start trying to download it, as they'd all get copies pushed out from whatever CDN node they were closest to.
You don't have to rename the files on S3 every day. Just make them private (which is the default) and hand out time-limited URLs, good for a day or a week, to anyone who is authorized.
I would consider making the links only good for 20 mins, so that a user has to re-login in order to re-download the files. Then they can't even share the links they get from you.
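For example, a short sketch with the AWS SDK for PHP v3 (bucket and key are placeholders):

<?php
// Sketch: hand out a link that S3 itself stops honouring after 20 minutes.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

$cmd = $s3->getCommand('GetObject', [
    'Bucket' => 'yourbucket',           // placeholder
    'Key'    => 'content/file-017.zip', // placeholder
]);

// After 20 minutes the signature expires and the user has to log in again
// to get a fresh link, so shared links quickly become useless.
$request = $s3->createPresignedRequest($cmd, '+20 minutes');
echo (string) $request->getUri();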
Currently I am looking to move my websites' images to a storage service. I have two websites, developed in PHP and ASP.NET.
Using the Amazon S3 service we can host all our images and videos to serve our web pages. But there are some limitations when using S3 to serve images:
If the website needs thumbnails at different sizes generated from the original image, it is tough. We would also need to subscribe to EC2. Though data transfer from S3 to EC2 is free, it takes time to transfer the data before the resize operation can run.
Uploading a number of files as a ZIP and unzipping them in S3, to reduce the number of uploads, is not possible.
Downloading multiple files from S3 in one go is not possible, in case we want to shift to another provider.
Image names are case-sensitive in S3, so an image will not load if its name does not exactly match the request.
Among all these, the first one is the most important, since image resizing is a general requirement.
Which provider is best suited to achieve my goal? Can I move to Google App Engine just for image hosting, or is there another vendor who can provide the above services?
I've stumbled upon a nice company called Cloudinary that provides a CDN image storage service. They also provide a variety of on-the-fly image manipulations (cropping will mainly concern you, since we're talking about different-sized thumbnails).
I'm not sure how they compare with other companies like MaxCDN in terms of site speed, but from what I can see they have more options when it comes to image manipulation.
S3 is really slow and also not distributed. CloudFront, in comparison, is also one of the slowest and most expensive CDNs you can get. The only advantage is that if you're already using other AWS services you'll get one bill.
I blogged about different CDNs and ran some tests:
http://till.klampaeckel.de/blog/archives/100-Shopping-for-a-CDN.html
As for the setup, I'd suggest something that uses origin pull: you host the images yourself, and the CDN requests a copy the first time each image is requested.
This also means you can use a script to "dynamically" generate the images, because they'll only be pulled once or so; you just have to set appropriate cache headers. The images are then cached until you purge the CDN's cache.
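A minimal sketch of what the origin script might send (the path and the one-week TTL are arbitrary choices):

<?php
// Sketch: serve a (possibly just-generated) image with cache headers so an
// origin-pull CDN keeps its own copy instead of hitting you again.
$path = '/var/www/generated/' . basename($_GET['img'] ?? '');

if (!is_file($path)) {
    http_response_code(404);
    exit;
}

$maxAge = 604800; // one week, in seconds

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime($path)) . ' GMT');
readfile($path);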
HTH
I've just come across CloudFlare - from what I understand from their site, you shouldn't need to make any changes to your website. Apparently all you need to do is change your DNS settings. They even provide a free option.
If you're using EC2, then S3 is your best option. The "best practice" is to simply pre-render the image in all sizes and upload with different names. I.e.:
/images/image_a123.large.jpg
/images/image_a123.med.jpg
/images/image_a123.thumb.jpg
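A rough sketch of the pre-rendering using PHP's GD extension (the pixel widths are arbitrary examples):

<?php
// Sketch: generate the three sizes shown above from one source image.
$sizes = ['large' => 1024, 'med' => 512, 'thumb' => 128];

$source = imagecreatefromjpeg('image_a123.jpg');

foreach ($sizes as $label => $width) {
    // imagescale keeps the aspect ratio when the height is given as -1.
    $resized = imagescale($source, $width, -1);
    imagejpeg($resized, "image_a123.{$label}.jpg", 85);
    imagedestroy($resized);
}

imagedestroy($source);
// The three files can then be uploaded to S3 under those names.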
This practice is in use by Digg, Twitter (once upon a time, maybe not with twimg...), and a host of other companies.
It may not be ideal, but it's the fastest and simplest way to do it. As for switching to another provider, you'll likely never do that anyway because of the amount of work involved in transferring all of the files. Whether you've got 1,000,000 or 3,000,000 images, you've still got a huge volume of files to move.
Fortunately, S3 has an import/export service. You can send them an empty hard drive and they'll format it and download your data to it for a small fee.
In terms of your concern about case sensitivity, you won't find a provider that doesn't have case sensitivity. If your code is written properly, you'll normalize all names to uppercase or lowercase, or use some sort of base 64 ID system that takes care of case for you.
All in all, S3 is going to give you the best "bang for your buck", and it has CloudFront support if you want to speed it up. Not using S3 because of reasons 3 and 4 is nonsense, as they'll likely apply anywhere you go.
I have a file-hosting website that's burning through 2 Gbit of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multiple-server setup with a large number of files? Preferably through PHP only.
Currently I only have around 100 GB of files, so I could get a second server, mirror all content between them, and then round-robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
The idea I had was to keep a list of media servers in the DB along with the amount of free space left on each. Once a file is uploaded, PHP chooses which server the file actually goes to, spreading the files evenly among the servers.
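Roughly what I have in mind (the servers/files tables and their columns are just placeholders):

<?php
// Sketch: pick the server with the most free space, then record the placement.
$pdo = new PDO('mysql:host=localhost;dbname=filehost', 'user', 'pass');

$fileSize = filesize($_FILES['upload']['tmp_name']);

$server = $pdo->query(
    'SELECT id, hostname, free_bytes FROM servers ORDER BY free_bytes DESC LIMIT 1'
)->fetch(PDO::FETCH_ASSOC);

if ($server === false || $server['free_bytes'] < $fileSize) {
    exit('No server has enough space left -- time to add another one.');
}

// ... push the file to $server['hostname'] here (FTP, rsync, HTTP PUT, ...) ...

// Keep the free-space figure roughly up to date.
$stmt = $pdo->prepare('UPDATE servers SET free_bytes = free_bytes - :size WHERE id = :id');
$stmt->execute([':size' => $fileSize, ':id' => $server['id']]);

// Remember which server holds the file so it can be served from there later.
$stmt = $pdo->prepare('INSERT INTO files (name, server_id) VALUES (:name, :server)');
$stmt->execute([':name' => $_FILES['upload']['name'], ':server' => $server['id']]);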
I was hoping to get some more input/inspiration.
I can't use any third-party services like Amazon. The files range from several bytes to a gigabyte.
Thanks
You could try MogileFS. It is a distributed file system with a good API for PHP. You can create categories and upload files into them; for each category you can define how many servers the files should be distributed across. You can then use the API to get a URL to a file on a random node.
If you are doing as much data transfer as you say, it would seem whatever it is you are doing is growing quite rapidly.
It might be worth your while to contact your hosting provider and see if they offer any sort of shared storage solution via iSCSI, NAS, or other means. Ideally the storage would not only start out large enough to store everything you have, but would also be able to grow dynamically beyond your needs. I know my hosting provider offers a solution like this.
If they do not, you might consider colocating your servers somewhere that either does offer a service like that, or would allow you to install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).
Once you have a centralized storage platform, you could then add web servers to your heart's content and load-balance them based on load, all while accessing the same central storage repository.
Not only is this the correct way to do it, it would offer you much more redundancy and expandability in the future if your endeavor continues to grow at its current pace.
The other solutions offered, using a database to record what is stored where, would work, but they add not only an extra layer of complexity but also an extra layer of processing between your visitors and the data they wish to access.
What if you lost a hard disk? Would you lose 1/3 or 1/2 of all your data?
Should the heavy I/O of static content be on the same spindles as the rest of your operating system and application data?
Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.)
Your best bet is to move your content into the cloud. Mosso's CloudFiles or Amazon's S3 will both allow you to store an almost infinite number of files. All your content is then accessible through an API. If you want, you can then use MySQL to track metadata for easy searching, and let the service handle the actual storage of the files.
I think your own idea is not the worst one: get a bunch of servers, and for every file store which server(s) it's on. When new files are uploaded, use most-free-space first*. Every server handles its own delivery (instead of piping through the main server).
Pros:
You can use multiple servers for a single file, e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on server load (or randomly, or alternating, ...).
If you're careful you may be able to move files around automatically depending on the current load. So if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
You could do this with a cron job: just log the downloads for the last xx minutes, try some formula like (downloads-per-minute * filesize * product-of-server-loads) for weighting, and pick thresholds for increasing/decreasing the number of servers those files are distributed to.
If you add a new server, it's relatively painless (just add its address to the server pool).
Cons:
Homebrew solutions are always risky.
Your load-distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere).
Constantly moving files around for balancing adds additional server load.
* Or use a mixed weighting algorithm: free space, server load, file popularity.
Disclaimer: I've never been in this situation myself, just guessing.
Consider HDFS, which is part of Apache's Hadoop. This will integrate with PHP, but you'll be setting up a second application. This will also solve all your points of balancing among servers and handling things when your file space usage exceeds one server's ability. It's not purely in PHP, though, but I don't think that's what you meant when you said "pure" anyway.
See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. They cover the whole idea of how it handles large files, many files, replication, etc.