Importing, resizing and uploading millions of images to Amazon S3 - php

We are using PHP with CodeIgniter to import millions of images from hundreds of sources, resizing them locally and then uploading the resized version to Amazon S3. The process is however taking much longer than expected, and we're looking for alternatives to speed things up. For more details:
1. A lookup is made in our MySQL database table for images which have not yet been resized. The result is a set of images.
2. Each image is imported individually using cURL and temporarily hosted on our server during processing. They are imported locally because the library doesn't allow resizing/cropping of external images. According to some tests, the speed difference when importing from different external sources has been between 80-140 seconds (for the entire process, using 200 images per test), so the external source can definitely slow things down.
3. The current image is resized using the image_moo library, which creates a copy of the image.
4. The resized image is uploaded to Amazon S3 using a CodeIgniter S3 library.
5. The S3 URL for the new resized image is then saved in the database table before starting with the next image.
The process is taking 0.5-1 second per image, meaning all current images would take a month to resize and upload to S3. The major problem with that is that we are constantly adding new sources for images, and expect to have at least 30-50 million images before the end of 2011, compared to current 4 million at the start of May.
I have noticed an answer on Stack Overflow which might be a good complement to our solution, where images are resized and uploaded on the fly, but since we don't want any unnecessary delay when people visit pages, we need to make certain that as many images as possible are already uploaded. Besides this, we want multiple size formats of the images, and currently only upload the most important one because of this speed issue. Ideally, we would have at least three size formats (for example one thumbnail, one normal and one large) for each imported image.
Someone suggested making bulk uploads to S3 a few days ago - any experience in how much this could save would be helpful.
Replies to any part of the question would be helpful if you have experience with a similar process. Part of the code (simplified):
$newpic = $picloc.'-'.$width.'x'.$height.'.jpg';
$pic = $this->image_moo
    ->load($picloc.'.jpg')
    ->resize($width, $height, TRUE)
    ->save($newpic, 'jpg');

if ($this->image_moo->errors)
{
    // Do stuff if something goes wrong, for example if image no longer exists - this doesn't happen very often so is not a great concern
}
else
{
    if (S3::putObject(
        S3::inputFile($newpic),
        'someplace',
        str_replace('./upload/', '', $newpic),
        S3::ACL_PUBLIC_READ,
        array(),
        array(
            "Content-Type" => "image/jpeg",
        )))
    {
        // Save URL to resized image in database, unlink files etc., then start next image
    }
}
Why not add some wrapping logic that lets you define ranges or groups of images and then run the script several times on the server? If you can have four of these processes running at the same time on different sets of images then it'll finish four times faster!
If you're stuck trying to get through a really big backlog at the moment you could look at spinning up some Amazon EC2 instances and using them to further parallelize the process.
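For what it's worth, a minimal sketch of that wrapping logic as a CLI script (the images table, its columns and the process_image() helper are placeholders standing in for the question's existing fetch/resize/upload code):

<?php
// Hypothetical CLI wrapper: run e.g. "php resize_worker.php 0 4" ... "php resize_worker.php 3 4"
// so four copies each work on their own slice of the backlog.
$worker  = isset($argv[1]) ? (int) $argv[1] : 0; // this worker's index (0-based)
$workers = isset($argv[2]) ? (int) $argv[2] : 1; // total number of workers

$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $db->prepare(
    'SELECT id, source_url FROM images WHERE resized = 0 AND MOD(id, :workers) = :worker'
);
$stmt->execute(array('workers' => $workers, 'worker' => $worker));

foreach ($stmt as $row) {
    // Each worker only touches ids in its own modulo class, so the
    // processes never fight over the same image.
    process_image($row['id'], $row['source_url']); // the question's existing logic
}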

I suggest you split your script into two scripts which run concurrently. One would fetch remote images to local storage, doing so for any and all images that have not yet been processed or cached locally. Since the remote sources add a fair bit of delay to your requests, you will benefit from constantly fetching remote images rather than only doing so as you process each one.
Concurrently, you use a second script to resize any locally cached images and upload them to Amazon S3. Alternatively, you can split this part of the process as well, using one script for resizing to a local file and another to upload any resized files to S3.
The first part (fetching the remote source images) would greatly benefit from running multiple concurrent instances, as James C suggests above.
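A rough sketch of what the fetching half could look like (the table/column names, batch size and the ./upload/ directory are assumptions based on the question):

<?php
// Keep pulling remote originals down to a local cache directory so the
// resize/upload script never waits on the network.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $db->query("SELECT id, source_url FROM images WHERE fetched = 0 LIMIT 500");

foreach ($rows as $row) {
    $local = './upload/' . $row['id'] . '.jpg';

    $fp = fopen($local, 'wb');
    $ch = curl_init($row['source_url']);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // stream straight to disk
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);

    if ($ok) {
        $db->prepare("UPDATE images SET fetched = 1 WHERE id = ?")
           ->execute(array($row['id']));
    } else {
        unlink($local);                             // don't leave partial files behind
    }
}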

Related

converting image to all sizes on upload vs resizing on request via php

So I have a platform for users which allows them to upload a fair amount of pictures. At the moment, I have my server resizing and saving all images individually to my CDN (so I can pick the best option in order to reduce load time when a user requests to view it), but it seems very wasteful with regard to server storage.
The images are being converted into resolutions of 1200px, 500px, 140px, 40px and 24px.
What I'm wondering is, would it be more efficient to just save the file at 1200px, then serve it via PHP at the requested size using something like ImageMagick? Would there be any major trade-offs and if so, is it worth it?
What I'm doing right now:
https://v1x-3.hbcdn.net/user/filename-500x500.jpg
An example of what I could do:
https://v1x-3.hbcdn.net/image.php?type=user&file=filename&resolution=500
Cheers.
No it's not, because:
you have a small number of sizes
if you do not use caching (generating the image on the first request only), you can DDoS yourself (image processing is a CPU-intensive process)
you have to do extra work if you use a CDN like Cloudflare for HTTP caching
It makes sense if you have a lot of image sizes, for example an API that supports multiple Android/iOS devices: say the iPhone 3 only supports a 320x320 image; if you don't have users with such a device, your server never creates that image.
Advice:
During image generation, use optimization; it reduces image size with an imperceptible loss of quality.
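As a sketch of the "generate on first request only" caching mentioned above, an image.php along these lines only pays the resize cost once per size (the paths, the whitelist and the Imagick calls are illustrative assumptions, not taken from the question):

<?php
// Generate a requested size once, cache it on disk, and serve the cached copy afterwards.
$allowed = array(1200, 500, 140, 40, 24);            // the sizes mentioned in the question
$file    = basename(isset($_GET['file']) ? $_GET['file'] : '');
$size    = isset($_GET['resolution']) ? (int) $_GET['resolution'] : 0;

$source = '/var/www/originals/' . $file . '.jpg';
$cached = '/var/www/cache/' . $file . '-' . $size . '.jpg';

if (!in_array($size, $allowed, true) || !is_file($source)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

if (!is_file($cached)) {
    $img = new Imagick($source);
    $img->thumbnailImage($size, 0);                  // 0 height preserves the aspect ratio
    $img->writeImage($cached);
    $img->destroy();
}

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=31536000');   // let browsers/CDN keep it
readfile($cached);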

Images upload with PHP: how to optimize space, bandwidth and performance

I'm planning to develop an area where users can upload pictures. I know how to upload a picture to the server using PHP, but the problem is what the best practice is for building a performant system.
The idea is to display thumbs on different pages, and I would like to know whether it's better to save two different images (thumb + original) on the server or to save just the original and create all the thumbs on the fly. Thumb + original means more space on the server, whereas the "thumbs on the fly" option most likely means server overload.
I found a couple of good scripts for resizing and cropping on the fly, but I'm not sure it's a good idea to use them, especially if the website has a few thousand visitors per day (or maybe more in the future, just to be optimistic/pessimistic).
Absolutely generate and save the thumbnails on disk. Storage is cheap.
You can generate some thumbnails and save them on disk but in the long term that's problematic due to different devices needing different sizes, different formats, etc.
If you are already saving the uploaded images on S3, Azure Storage, or Google Cloud, I recommend using an on-the-fly image processing service like imglab or Cloudinary.
With these services you can generate many different types of crops and serve them in different (modern) formats like WebP or AVIF, so you don't need to generate them beforehand. SEO will be improved with this option too.
Additionally, the images will be behind a global CDN, so users will get them quickly regardless of their location.

Load speed on an image-intensive webpage

I'm currently working on a portfolio site wherein I would have to load a large quantity of photos on the same page.
These images are loaded dynamically through PHP, and at the moment I have opted to save thumbnail versions of these images beforehand and load these.
My concern, however, is that this might not be an optimal solution should the site have a large number of users: I would basically be keeping a duplicate of every image users have uploaded.
My question for you is whether there are better solutions to this problem. Is there a way to load the page as fast as possible without sacrificing too much storage space?
Thanks a ton.
Loading a webpage full of images will be slow. The reason for this is the bandwidth needed to transfer those images.
You have a couple of options:
1. Load full images in tiled mode. These images will be full images, just resized to fit in a "thumbnail" view. The advantage of this is that you are only saving one image, but that image is full sized and will take a long time to load.
2. Load the thumbnails as you said you are doing. The advantage of this is performance, but you need to store two copies of each image. Also, depending on how you tackle the thumbnail creation, you may require users to upload two copies of the image to provide their own thumbnail... which could stink.
3. Load thumbnails, but dynamically generate them on upload. You are essentially keeping two copies of the image on disk, but you are dynamically creating the second through some PHP image modification API. This will load faster, but still eats up disk space. It also minimizes user/administrative requirements to provide a thumbnail.
4. Load thumbnails on demand, as the page is requested. This approach would take some testing, as I've never tried it. Basically, you would invoke the PHP image modification API (or better yet, out-source to a native solution!) to create a one-time-use (or cached) thumbnail to be used. You might say "OMG, that'll take so long!". I think this approach might actually be usable if you apply an appropriate caching mechanism so you aren't constantly recreating the same thumbnails. It will keep down on bandwidth, and since the limiting factor here is the network connection, it might be faster than just sending the full images (since the limiting factor of creating the thumbnails would now be CPU/memory/hard disk).
I think #4 is an interesting concept, and might be worth exploring.
What I think is:
A. Take advantage of caching (system cache, cache headers, HTTP cache... all caches)
B. Don't generate thumbs all the time
C. Use a job queuing system such as Gearman or beanstalkd to generate thumbs so that you don't have to do it instantly
D. Use Imagick; it is more efficient
E. Paginate
F. For example, only generate a thumb when the original file has been modified:
$file = "a.jpg" ;
$thumbFile = "a.thumb.jpg" ;
$createThumb = true;
if(is_file($thumbFile))
{
if((filemtime($file) - 10) < filemtime($thumbFile));
{
$createThumb = false;
}
}
if($createThumb === true)
{
$thumb = new Imagick();
$thumb->readImage($file);
$thumb->thumbnailImage(50, null);
$thumb->writeImage($thumbFile);
$thumb->destroy();
}
Consider using a sprite or a collage of all the images, so that only one larger image is loaded, saving bandwidth and decreasing page load time.
Also, as suggested already, pagination and async loading can improve it some.
References:
http://css-tricks.com/css-sprites/
http://en.wikipedia.org/wiki/Sprite_(computer_graphics)#Sprites_by_CSS
Yes, caching the thumbnails is a good idea, and will work out fine when done right. This kind of thing is a great place for horizontal scaling.
You should look into using a CDN (or possibly implementing something similar). The caching system will generate images of commonly-requested sizes, and groom those off if they become infrequently requested at a later time.
You could also scale horizontally and transplant this service to another server or vserver in your network or cloud.
It is a disk-intensive and bandwidth-intensive thing, image delivery. Not as bad as video!
Generate thumbnails of large images on the fly. (Some hint)
Load images asynchronously (Try JAIL)
Paginate
Caching is also an option, though it only helps from the second time the site is loaded.

PHP Image Optimizer

We are building a web app which will have a lot of images being uploaded. What is the best solution for optimizing these images and storing them on the website?
Also, is there a way I can auto-enhance the images that are being uploaded?
1. Do not store images in the DB; store them in the file system (as real files). You'll probably need to store information about them in the DB though, e.g., filename, time of upload, size, owner etc.
2. Filenames must be unique. You might use yyyymmddhhiissnnnn, where yyyymmdd is the year, month and date, hhiiss the hour, minutes and seconds, and nnnn the number of the image within that second, i.e., 0001 for the first image, 0002 for the second image etc. This gives you unique filenames with fine ordering.
3. Think about making some logical directory structure. Storing millions of images in a single folder is not a good idea, so you will need something like images/<x>/<y>/<z>/<filename> (see the sketch after this list). This could also be spanned across multiple servers.
4. Keep the original images. You never know what you will want to do with them after a year or two. You can convert them to some common format though, i.e., if you allow uploading JPG, PNG and other formats, you might store all of them as JPG.
5. Create and store all kinds of resized images that are necessary on your website. For example, social networks often have 3 kinds of resized images: one for displaying with user comments in various places (very small), one for displaying on the profile page (quite small, but not icon-sized; maybe some 240x320 pixels) and one for "full size" viewing (often smaller than the original). Filenames of these related images should be similar to the filenames of the originals, e.g., suffixes _icon, _profile and _full might be added. Depending on your resources and the amount of images being uploaded at the same time, you can do this either in real time (in the same HTTP request) or with some background processing (a cron job that continuously checks whether there are new images to convert).
6. As for auto-enhancing images: it is possible, but only if you know exactly what must be done to them. I think that analyzing every image and deciding what should be done with it might be too complex and take too many resources.
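A small illustrative sketch of points 2 and 3 (the per-second counter, the hash-based nesting and its depth are arbitrary choices here, not the only way to do it):

<?php
// Build a yyyymmddhhiissnnnn filename (point 2) and a nested directory path
// (point 3) so no single folder ends up holding millions of files.
// How you obtain $counter (the per-second sequence number) is up to your application.
function make_image_path($counter)
{
    $name = date('YmdHis') . sprintf('%04d', $counter);   // e.g. 201105031422100001
    $hash = md5($name);
    $dir  = 'images/' . substr($hash, 0, 2) . '/'          // e.g. images/a3/f0/9c
                      . substr($hash, 2, 2) . '/'
                      . substr($hash, 4, 2);

    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);                            // create nested folders as needed
    }
    return $dir . '/' . $name . '.jpg';
}

// Usage:
// $path = make_image_path(1);
// move_uploaded_file($_FILES['image']['tmp_name'], $path);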
All good suggestions from binaryLV above. Adding to his suggestion #5, you also probably want to optimize the thumbnails you create. When images are uploaded, they are likely to have metadata that is unnecessary for the thumbnails to have. You can losslessly remove the metadata to make the thumbnail sizes smaller, as suggested here: http://code.google.com/speed/page-speed/docs/payload.html#CompressImages. I personally use jpegtran on the images for my website to automatically optimize my thumbnails whenever they are created. If you ever need the metadata, you can get it from the original image.
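For example, a thumbnail-optimization step along these lines (assuming the jpegtran binary is installed on the server; the flags shown are its standard lossless options for stripping metadata and optimizing Huffman tables):

<?php
// Losslessly shrink a freshly generated thumbnail with jpegtran.
function optimize_thumbnail($path)
{
    $tmp = $path . '.opt';
    $cmd = sprintf(
        'jpegtran -copy none -optimize -outfile %s %s',
        escapeshellarg($tmp),
        escapeshellarg($path)
    );
    exec($cmd, $output, $status);

    if ($status === 0 && filesize($tmp) < filesize($path)) {
        rename($tmp, $path);                // keep the smaller file
    } elseif (is_file($tmp)) {
        unlink($tmp);                       // optimization failed or didn't help
    }
}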
Something else to consider if you plan to display these images for users is to host your images on a cookie-free domain or sub-domain as mentioned here: http://developer.yahoo.com/performance/rules.html#cookie_free. If the images are hosted on a domain or sub-domain that has cookies, then every image will send along an unnecessary cookie. It can save a few KB per image requested, which can add up to a decent amount, especially on restricted bandwidth connection such as on a mobile device.

Image caching vs. image processing in PHP and S3

Here is the thing. Right now I have this e-commerce web site where people can upload a lot of pictures for their products. All the images are stored on Amazon S3. When we need a thumbnail or something, I check on S3 whether one is available. If not, I process one, send it to S3 and display it in the browser. Every different-sized thumbnail gets stored on S3, and checking thumbnail availability on every request is kind of money consuming. I'm afraid I'll pay a lot once the site starts to get more attention (if it does...).
Thinking about alternatives, I was considering keeping only the original images on S3 and processing the images on the fly at every request. I imagine that this way I would pay in CPU usage instead, but I haven't run any benchmarks to see how far I can go. The thing is that I wouldn't spend money making requests and storing more images on S3, and I could cache everything in the user's browser. I know it's not that safe to do that, so that is why I'm bringing this question here.
What do you think? How do you think I could solve this?
I would resize at the time of upload and store all versions in S3.
For example, if you have a larger image (1200x1200, ~200 KB) and create 3 resized versions (300x300, 120x120 and 60x60), you only add about 16% or 32 KB (for my test image, YMMV). Let's say you need to store a million images; that is roughly 30 GB more, or $4.5 extra a month. Flickr reported having 2 billion images (in 2007); that is ~$9k extra a month, not too bad if you are that big.
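For reference, the arithmetic behind those figures, assuming S3 storage at roughly $0.14-0.15 per GB-month (the going rate at the time): 200 KB x 16% ~ 32 KB of extra storage per image; 32 KB x 1,000,000 images ~ 30 GB, i.e. about $4.5/month; 32 KB x 2,000,000,000 images ~ 60 TB, i.e. on the order of $9k/month.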
Another major advantage is you will be able to use Amazon's CloudFront.
If you're proxying from S3 to your clients (which it sounds like you're doing), consider two optimizations:
At upload time, resize the images at once and upload as a package (tar, XML, whatever)
Cache these image packages on your front end nodes.
The 'image package' will reduce the number of PUT/GET/DELETE operations, which aren't free in S3. If you have 4 image sizes, you'll cut down by 4.
The cache will further reduce S3 traffic, since I figure the work flow is usually see a thumbnail -> click it for a larger image.
On top of that, you can implement a 'hot images' cache that is actively pushed to your web nodes so it's pre-cached if you're using a cluster.
Also, I don't recommend using Slicehost<->S3. The transit costs are going to kill you. You should really use EC2 to save a ton of bandwidth (money!!).
If you aren't proxying, but handing your clients S3 URL's for the images, you'll definitely want to preprocess all of your images. Then you don't have to check for them, but just pass the URL's to your client.
Re-processing the images every time is costly. You'll find that if you can assume that all images are resized, the amount of effort on your web nodes goes down and everything will speed up. This is especially true since you aren't firing off multiple S3 requests.
Keep a local cache of:
Which images are in S3
A cache of the most popular images
Then in both circumstances you have a local reference. If the image isn't in the popular-images cache, you can check the local index to see whether it is in S3 before making a request. This saves S3 traffic for your most popular items and saves latency when checking S3 for an item not in the local cache.
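A minimal sketch of that local "which images are in S3" index, assuming a MySQL table is used as the index (the table and helper names are made up for illustration):

<?php
// Consult a local table of already-uploaded keys before ever calling S3.
// The s3_objects table should have a unique index on object_key.
function thumbnail_exists_in_s3(PDO $db, $key)
{
    $stmt = $db->prepare('SELECT 1 FROM s3_objects WHERE object_key = ? LIMIT 1');
    $stmt->execute(array($key));
    return (bool) $stmt->fetchColumn();
}

function remember_s3_upload(PDO $db, $key)
{
    $db->prepare('INSERT IGNORE INTO s3_objects (object_key) VALUES (?)')
       ->execute(array($key));
}

// Usage: only generate/upload when the local index says the thumbnail is missing
// if (!thumbnail_exists_in_s3($db, 'thumbs/123-300x300.jpg')) {
//     ... generate, upload to S3, then remember_s3_upload($db, $key); ...
// }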
