PHP - best way to cache and serve image and video files

I'm running an app like 9gag where users can upload and watch images and videos, so the same images and videos are requested up to 100 times per minute. That puts a big workload on the SSD, so storing the most recently used media in RAM and serving it from there would be better.
I've read that Memcached and Redis aren't good for this, but without a good explanation of why not. Can someone explain? Is Varnish a better solution, and does it work with PHP?
I need the best solution, preferably using PHP.

I would definitely not advise you to store these types of workloads in Memcached or Redis, and I would also not advise you to have these workloads processed by PHP.
Varnish is indeed the way to go here.
Why not Memcached & Redis?
Memcached and Redis are distributed key-value stores. They are extremely fast and scalable and are perfect for storing small values that change on a regular basis.
Image and video files are quite large and don't really fit well in these memory-only databases. Keep in mind that Redis and Memcached aren't directly accessible from the web; they are caches that you would call from a web application.
That means there is additional latency running them through an application runtime like PHP.
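To make that concrete, here's a minimal sketch (assuming the phpredis extension; the media key scheme and JPEG content type are placeholders) of what serving an image out of Redis through PHP looks like. Every request still has to spin up PHP and copy the whole blob through it:

    <?php
    // Anti-pattern sketch: every image request passes through PHP and Redis.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $key  = 'media:' . basename($_GET['id']); // hypothetical key scheme
    $blob = $redis->get($key);

    if ($blob === false) {
        // Cache miss: read from disk and cache the blob for an hour.
        $path = '/var/media/' . basename($_GET['id']);
        if (!is_file($path)) {
            http_response_code(404);
            exit;
        }
        $blob = file_get_contents($path);
        $redis->setex($key, 3600, $blob);
    }

    header('Content-Type: image/jpeg'); // assumes JPEG for brevity
    echo $blob;

Even on a cache hit, PHP, the Redis protocol, and a full in-memory copy of the file sit between the client and the bytes.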
Why not PHP?
Don't get me wrong, I'm a huge PHP fan and have been part of the PHP community since 2007. PHP is great for building web pages, but not so great for processing binary data.
These types of workloads that you're looking to process can easily overwhelm a PHP-FPM or PHP-CLI runtime.
It is possible to use PHP, but you'll need so many servers to handle video and image processing at large scale, that it will become an operational burden.
Why Varnish?
Varnish is a reverse caching proxy that sits in front of your web application, unlike distributed caches like Memcached and Redis that sit behind your web application.
This means you can just store images and videos on the disk of your webserver, and Varnish will cache requested content in memory without having to access the webserver on every request.
Varnish is built for large-scale HTTP processing and is extremely good at handling HTTP responses of any size at large scale.
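By default Varnish takes its cache lifetime from the standard Cache-Control headers your backend sends, so the PHP side can be as small as handing out the file once with the right headers. A sketch, where the media path and the one-day TTL are placeholder assumptions:

    <?php
    // Origin-side sketch: PHP serves the file once with caching headers,
    // then Varnish answers subsequent requests from memory.
    $path = '/var/media/' . basename($_GET['id']); // placeholder layout
    if (!is_file($path)) {
        http_response_code(404);
        exit;
    }
    header('Content-Type: ' . mime_content_type($path)); // needs the fileinfo extension
    header('Cache-Control: public, max-age=86400');      // cache for one day
    header('Content-Length: ' . filesize($path));
    readfile($path); // streams the file without loading it fully into memory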
Varnish is software that is used by CDNs and OTT video streaming platforms to deliver imagery and online video.
Using video protocols like HLS, MPEG-DASH or CMAF, these streaming videos are chunked up in segments and indexed in manifest files.
A single Varnish server can serve these with sub-millisecond latency with a bandwidth up to 500 Gbps and a concurrency of about 100,000 requests.
The number of machines you need will be far lower than if you did this in PHP.
The Varnish Configuration Language, which is the domain-specific programming language that comes with Varnish, can also be used to perform certain customization tasks within the request/response flow.
The VCL code is only required to extend standard behavior, whereas in regular development languages like PHP you have to define all the behavior in code.
Here are a couple of Varnish-related resources:
The Varnish Developer Portal: https://www.varnish-software.com/developers/
The Varnish documentation: http://varnish-cache.org/docs/
The Varnish 6 By Example book that I wrote: https://info.varnish-software.com/resources/varnish-6-by-example-book
Maybe even Varnish Enterprise?
The only challenge is caching massive amounts of image/video content. Because Varnish stores everything in memory, you'll need enough memory to store all the content.
Although you can scale Varnish horizontally and use consistent hashing algorithms to balance the content across multiple Varnish servers, you'll probably still need quite a number of servers. This depends on the amount of content that needs to be stored in cache at all times.
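For illustration, consistent hashing can be as small as this toy PHP ring (virtual nodes included; the node names are made up), which keeps most URLs pinned to the same Varnish server even when servers are added or removed:

    <?php
    // Toy consistent-hash ring: maps a request URL to one of several
    // Varnish nodes; virtual nodes smooth out the balance.
    function buildRing(array $nodes, int $vnodes = 64): array {
        $ring = [];
        foreach ($nodes as $node) {
            for ($i = 0; $i < $vnodes; $i++) {
                $ring[crc32($node . '#' . $i)] = $node;
            }
        }
        ksort($ring);
        return $ring;
    }

    function pickNode(array $ring, string $url): string {
        $h = crc32($url);
        foreach ($ring as $point => $node) {
            if ($point >= $h) {
                return $node;
            }
        }
        return reset($ring); // wrapped past the last point: take the first
    }

    $ring = buildRing(['varnish-1', 'varnish-2', 'varnish-3']);
    echo pickNode($ring, '/media/cat-video-42.mp4');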
If your origin web platform is powerful enough to handle requests for uncached long-tail content, Varnish could store the hot content in memory and take cache misses for that long-tail content. That way you might not need a lot of caching servers. This mainly depends on the traffic patterns of your platform.
The open source version of Varnish does have a file storage engine, but it behaves really poorly and is prone to disk fragmentation at large scale. This will slow you down quite significantly as write operations increase.
To tackle this issue Varnish Software, the commercial entity behind the open source project, came up with the Massive Storage Engine (MSE). MSE tackles the typical issues that come with file caching in a very powerful way.
The technology is used by some of the biggest video streaming platforms in the world.
See https://docs.varnish-software.com/varnish-cache-plus/features/mse/ for more information about MSE.
Varnish Enterprise and MSE are not free and open source. It's up to you to figure out what would be the cheaper solution from a total cost of ownership point of view: managing a lot of memory-based open source Varnish servers or paying the license fees of a limited amount of Varnish Enterprise servers with MSE.

Related

Server side minify leads to PHP process bottleneck on high traffic site. What are my options?

I am currently tasked with finding a solution for a serious PHP bottleneck which is apparently caused by server-side minification of CSS and JS when our sites are under high load.
Some details and what I have found out so far
I inherited a web application running on WordPress which uses a complex constellation of Doctrine, Memcached and W3 Total Cache for minification and caching. When under heavy load our application begins to slow down rapidly. So far we have narrowed part of the problem down to the server-side minification process. Preliminary analysis has shown that the number of PHP processes starts to stack up under load and, when the limit of 500 processes is reached, everything slows down. Something which is also mentioned by the author of the minify library.
Solutions I have evaluated so far
Pre-minification
The most logical solution would be to pre-minify any of the files before going live. Unfortunately our workflow demands that non-developers should be able to edit said files on our production servers (i.e. after the web app has gone live). Therefore I think that pre-processing is out of the question, as it limits the editability of minified files.
Serving unminified files
75% of our users access our web application with their mobile devices, especially smartphones. Unminified JS and CSS amounts to 432KB and is reduced by 60-80% in size when minified. Therefore serving unminified files, while it would solve the performance and editability problems, is out of the question for the sake of mobile users.
I understand that this is as much a technical problem as it is a workflow problem and I guess we are open to working on both as long as we end up with a better overall performance.
My questions
Is there a reasonable compromise which solves the PHP bottleneck problem, allows non-devs to make changes to live CSS/JS, and still serves reasonably sized files to clients?
If there is no such one-size-fits-all solution, what can I do to better our workflow and/or server-side behaviour?
EDIT: Because there were some questions / comments regarding the server configuration: our servers run Debian and are equipped with 32 GB of RAM and 24-core CPUs.
You can run a CSS/JavaScript compilation service like Gulp or Grunt via Node.js that minifies all your JS and CSS assets on change.
This service can run in production, but that is not recommended without some architectural setup (having multiple versioned compiled files and auto-checking them via gulp or another extension).
I emphasize that patching features into production and directly editing it is strongly discouraged, as it can present live issues to your visitors, reducing your credibility.
http://gulpjs.com/
Using Gulp/Grunt would require you to change how you write your css/javascript files.
I would solve this with two changes: first, remove any WP-Cron operation that runs every time a user hits the application and move it to an actual cron job on the server. Second, use load balancing so that a single server is not taking the whole load. That is your real problem: even if you fix the perceived code issues, you are still faced with the load issue.
I don't believe you need to change your workflow at all or go down the road of major modification to your existing system.
The WP-Cron tasks that run each time a page is loaded cause significant load and slowness. You can shift this work from visitors' page loads to the server just running it on a schedule. This reduces load, and these tasks are most likely part of what you believe is slowing down the site.
See this guide:
http://www.inmotionhosting.com/support/website/wordpress/disabling-the-wp-cronphp-in-wordpress
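The usual recipe (sketched below; the domain and the five-minute interval are examples) is to disable WordPress's built-in pseudo-cron in wp-config.php and trigger wp-cron.php from the system crontab instead:

    <?php
    // In wp-config.php: stop WordPress from firing WP-Cron on every page load.
    define('DISABLE_WP_CRON', true);

    // Then trigger it from the real crontab instead, e.g. every 5 minutes:
    //   */5 * * * * curl -s https://example.com/wp-cron.php?doing_wp_cron > /dev/null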
Next - load balancing. Having a single server supplying all your users when you have a lot of traffic is a terrible idea. You need to split the webserver load.
I'm not sure where or how you are hosted, but I would move things to AWS and set the WordPress site up for load balancing there: http://www.mornin.org/blog/scalable-wordpress-amazon-web-services/
This will involve:
Load Balancer
EC2 Instances running PHP/Apache
RDS for your database storage for all EC2 instances
S3 storage for the site's media
For user sessions, I suggest you just set up stickiness on the load balancer so users are consistently served by the node they arrived on.
You can get a detailed guide on how to do this here:
http://www.mornin.org/blog/scalable-wordpress-amazon-web-services/
Or at server fault another approach:
https://serverfault.com/questions/571658/load-balancing-wordpress-on-amazon-web-services-managing-changes
The assumption here is that if you have high traffic, you are making revenue from that traffic, so any time your service responds slowly it will turn away users or discourage them from returning. Changing the software could help, but you're treating the symptom, not the illness. The illness is that your server comes under heavy load. This isn't uncommon with WordPress and high traffic, so you need to spread the load instead of trying to micro-optimize. The difference is that the optimizations will be small gains, while load balancing and spreading the load actually solve the problem.
Finally - consider using a CDN to serve all of your media. This loads media faster and removes load from your system by reducing the number of requests to the server and its output to the clients. It also loads pages consistently faster for people wherever they are visiting from, by supplying media from the nodes closest to them. At AWS this is called CloudFront. WordPress also offers this service free via Jetpack (I believe), but from my understanding it does not handle all media.
I like the idea of using GulpJS. One thing you might consider is a WP-Cron task, or even just a system cron job, that runs every 5 minutes or so and runs a gulp task to minify and concatenate your CSS and JS files.
Another option that doesn't require scheduling but is based on watching the file system for changes and then triggering a Gulp build is incron (inotify cron). Check out the incron man page. Incron is great in that it triggers actions based on file system events such as file changes. You could use this to trigger a gulp build when any CSS file changes on the file system.
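A hedged sketch of what such an incrontab entry could look like (the watched directory and the wrapper script that runs the gulp task are hypothetical; note that incron executes the command directly, without a shell):

    # On any completed write in the CSS directory, run a small wrapper
    # script that executes `gulp minify` in the project root.
    /var/www/site/assets/css IN_CLOSE_WRITE /var/www/site/bin/rebuild-assets.sh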
One caveat is that this is a Linux solution so if you're hosting on Windows you might have to look for something similar.
Edit:
Incron Documentation

What are the pros and cons to using AWS/S3 for static content?

I want a little guidance from you all. I have a multimedia-based site hosted on traditional Linux-based LAMP hosting. The site is mostly image/video content: there are around 30,000+ posts, and while the database is only around 20-25 MB, file system usage is about 10 GB and around 800-900 GB of the allowed 1 TB of bandwidth gets used every month.
Now, after a little brainstorming and looking at my alternatives here and there, I have come up with two options:
Increase / Get a bigger hosting plan.
Get my static content stored on Amazon S3.
While the first plan would be the simple option, I am actually leaning toward the second one, i.e. storing my static content on Amazon S3. The website is totally custom-coded and based on PHP + MySQL. I went through this http://undesigned.org.za/2007/10/22/amazon-s3-php-class/ and it gave me a fair idea.
I would love to know the pros/cons of hosting static content on S3.
Please give your inputs.
Increase / Get a bigger hosting plan.
I would not do that. The reason is that storage is cheap, while the other components of a "bigger hosting plan" will cost you dearly without providing an immediate benefit (more memory is expensive if you don't need it).
Get my static content stored on Amazon S3.
This is the way to go. S3 is very inexpensive; it is a no-brainer. Having said that, since we are talking video here, I would recommend a third option:
3. Store video on AWS S3 and serve it through CloudFront. It is still rather inexpensive by comparison, given the spectacular bandwidth and global distribution. CloudFront is Amazon's CDN for blazing fast speeds to any location.
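Using the S3 PHP class linked in the question, the upload side might look like this sketch (the credentials, bucket, file paths, and CloudFront domain are all placeholders):

    <?php
    require_once 'S3.php'; // the undesigned.org.za S3 class from the question

    $s3 = new S3('AWS_ACCESS_KEY', 'AWS_SECRET_KEY'); // placeholder credentials

    // Upload a video as publicly readable; bucket and key are examples.
    if (S3::putObject(S3::inputFile('/var/www/uploads/clip.mp4'),
                      'my-media-bucket', 'videos/clip.mp4', S3::ACL_PUBLIC_READ)) {
        // Serve it via S3 directly, or through a CloudFront distribution
        // mapped to the bucket (the distribution domain is hypothetical):
        $url = 'https://d1234example.cloudfront.net/videos/clip.mp4';
    }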
If you want to save on bandwidth, you may also consider using Amazon Elastic Transcoder for high-quality compression (to minimize your bandwidth usage).
Traditional hosting is way too expensive for this.
Bigger Hosting Plan
Going for a bigger hosting plan is not a permanent solution, because:
Static content (images/videos) always grows in size. This time your need is 1 TB; the next time it will be more, so you will be in the same situation again.
With the growth of users and static content, your bandwidth usage will also increase and cost you more.
Your database is not that big, and we can assume you are not using a lot of CPU power or memory. So you would only be using more disk space while paying for a larger CPU and more memory that you are not using.
Technically it is not good to serve all your requests from a single server; browsers have a limited number of simultaneous requests per domain.
S3 / Cloud storage for static content
S3 or other cloud storage is a good option for static content. The benefits:
You don't need to worry about storage space: it scales automatically and is available in abundance.
If your site is accessed from different locations worldwide, you can add a CDN to improve speed by delivering content from the location nearest the user.
Bandwidth is very cheap compared to traditional hosting.
It also takes load off your server, since your files are uploaded to and served from S3.
These are some of the benefits of using S3 over traditional hosting; S3 is specifically built to serve static content. The decision is yours :)
If you're looking at the long term, at some point you might not be able to afford a server that will hold all of your data. I think S3 is a good option for a case like yours for the following reasons:
You don't need to worry about large file uploads tying down your server. With Cross Origin Resource Sharing, you can upload files directly from the client to your S3 bucket.
Modern browsers will often load parallel requests when a webpage requests content from different domains. If you have your pictures coming from yourbucket.s3.amazonaws.com and the rest of your website loading from yourdomain.com, your users might experience a shorter load time since these requests will be run in parallel.
At some point, you might want to use a Content Distribution Network (CDN) to serve your media. When that happens, you could use Amazon's CloudFront with its out-of-the-box support for S3, or you can use another CDN; most popular CDNs these days support serving content from S3 buckets.
Reliability is a problem you'll never have to worry about: Amazon takes care of redundancy, availability, backups, failovers, etc. That's a big load off your shoulders, leaving you free to take care of other things, knowing your media is stored in a way that's scalable and future-proof (at least for the foreseeable future).

Scaling for TYPO3 site

I'm asked by a customer to deliver a TYPO3-based website with the following parameters:
- small amount of content (about 50 pages)
- very little change frequency
- average availability of about 95% per day
- 20% of pages are restricted, only available after login
- no requirements for fancy TYPO3 extensions or anything else (TYPO3 core only)
- medium-sized pages
- only limited digital assets (images etc.) included
I have the requirement to build an infrastructure that serves up to 1000 concurrent users. Assuming an average think time of 30 seconds, this works out to about 33 requests per second (1000 / 30 ≈ 33).
What could such an infrastructure look like?
I know that system scaling is a highly individual task depending on the implementation of the system and needs testing, but I need a first indication of where to start (single server, separating components onto different servers, ...).
Any ideas?
The easier solution is EXT:nc_staticfilecache. It saves the static pages as HTML, and your web server delivers them automatically through rewrite rules (in the case of Apache, through mod_rewrite). This works very well for static content and should already enable you to handle >100 req/s.
The even fancier way is to use Varnish Cache. Varnish is a reverse proxy server that holds your web site content in memory and can run on a dedicated host. If you configure it correctly (send correct cache headers!), it serves at line speed (some millions of req/s). There is also a TYPO3 extension, moc_varnish, which e.g. purges the Varnish cache when a page is changed in TYPO3. Support for Edge Side Includes also exists, e.g. to retrieve only the user-specific parts from TYPO3 and take the static parts of a page from the Varnish cache (everything except the "Welcome user Foo Bar"... ;)).
As mentioned: don't forget to configure correct cache headers (Expires etc.) for your assets. This already removes some load from your web server.
It's quite possible; I've already built something like this. You need at least one dedicated server with >= 8 GB of RAM.
If we are speaking about infrastructure, the minimal combination is :
nginx/Varnish for front/load balancing
Apache HTTP Server
MySQL, which could be on a standalone server or clustered
Performance optimization is very important in such cases.
Some links for further reading :
http://techblog.evo.pl/en/how-to-boost-speed-up-your-typo3-website-with-nginx/
http://www.fabrizio-branca.de/nginx-varnish-apache-magento-typo3.html
http://wiki.typo3.org/Performance_tuning
I'd put this on a single dedicated server (or a well-specified VPS), but maybe keep all the static assets on a third-party CDN so you can focus on the dynamic stuff. I don't know TYPO3, but I can't see any reason why you couldn't have your DB on the same server at this level of usage; there are sure to be caching options of various kinds. Or perhaps consider a cloud server, so if you need more oomph, you can just add more resources.
Edit: I don't think it is a good idea to build a scalable architecture just yet, e.g. proxy servers and all that. If it is slow and you find you really can't cope with one machine, scale up at that point. I'm of the view that you can make do with a much simpler architecture given your expected traffic.
I would look into a virtual server or a KVM instance and a good MySQL and PHP configuration. On a KVM I would tweak Linux and use iptables for traffic shaping. A dedicated root server would be nice, but it's expensive. Then I would think about using an nginx or lighttpd webserver with eAccelerator and memcache. If that doesn't help, I would try to compile PHP and MySQL with optimization flags, or compile them with the Intel C Compiler; ICC can optimize C code better than gcc. If the server has plenty of RAM, I would use a ramdisk.

When not to use memcache

We currently have a site which makes a lot of API calls to our parent site for user details and other data. We are planning to cache all the details on our side. I am planning to use memcached for this. As this is a live site, we are expecting heavier traffic in the coming days (not FB-like, but then again my server is not like theirs either ;)), so I need your opinion on what issues we could face with memcached, and opposing views on why we shouldn't go for it. Any other alternative would also help.
https://github.com/steveyen/community-site/blob/master/db_doc/main/WhyNotMemcached.wiki
Memcached is terrific! But not for every situation...
You have objects larger than 1MB.
Memcached is not for large media and streaming huge blobs (see the sketch after this list).
Consider other solutions like: http://www.danga.com/mogilefs
You have keys larger than 250 chars.
If so, perhaps you're doing something wrong?
And, see this mailing list conversation on key size for suggestions.
Your hosting provider won't let you run memcached.
If you're on a low-end virtual private server (a slice of a machine), virtualization tech like VMware or Xen might not be a great place to run memcached. Memcached really wants to take over and control a hunk of memory; if that memory gets swapped out by the OS or hypervisor, performance goes away. Using virtualization just to ease deployment across dedicated boxes is fine, though.
You're running in an insecure environment.
Remember, anyone can just telnet to any memcached server. If you're on a shared system, watch out!
You want persistence. Or, a database.
If you really just wish that memcached had a SQL interface, then you probably need to rethink your understanding of caching and memcached.
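As a quick illustration of the 1MB point above, here's a sketch (assuming the php-memcached extension and a local server with default settings) of a large blob failing to store:

    <?php
    // Sketch: memcached rejects values over its item size limit (1MB by default).
    $m = new Memcached();
    $m->addServer('127.0.0.1', 11211);

    $image = random_bytes(2 * 1024 * 1024); // stand-in for an incompressible 2MB image

    if (!$m->set('media:big-image', $image)) {
        // On a default server config this fails with an "item too big" error.
        echo $m->getResultMessage();
    }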
You should implement a generic caching layer for the API calls first. Behind that layer you can then change which backend you use; if you find that memcached doesn't fit, you can actually switch (and/or monitor, test-wise, how it compares with other backends).
Even better, you can build this on the filesystem quite easily at first, without the hurdle of relying on another daemon, and get started with caching right away. Quite possibly the filesystem is already enough for your caching needs.
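A minimal sketch of such a layer (the interface and the file backend are made-up names, and the API URL is hypothetical):

    <?php
    // Minimal swappable caching layer: code against the interface now,
    // start with a file backend, switch to memcached/Redis later.
    interface CacheBackend {
        public function get(string $key): ?string;
        public function set(string $key, string $value, int $ttl): void;
    }

    class FileCache implements CacheBackend {
        private $dir;

        public function __construct(string $dir) {
            $this->dir = $dir;
        }

        private function path(string $key): string {
            return $this->dir . '/' . sha1($key) . '.cache';
        }

        public function get(string $key): ?string {
            $file = $this->path($key);
            if (!is_file($file) || filemtime($file) < time()) {
                return null; // missing or expired
            }
            return file_get_contents($file);
        }

        public function set(string $key, string $value, int $ttl): void {
            file_put_contents($this->path($key), $value, LOCK_EX);
            touch($this->path($key), time() + $ttl); // expiry stored in the mtime
        }
    }

    // Usage: cache a (hypothetical) parent-site API response for 5 minutes.
    $cache = new FileCache('/tmp/api-cache');
    $user  = $cache->get('user:42');
    if ($user === null) {
        $user = file_get_contents('https://parent-site.example/api/user/42');
        $cache->set('user:42', $user, 300);
    }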
Memcache is fast, but it can also use a lot of memory if you want to get the most out of it. Whenever you hit the disk for I/O, you increase the latency of your application, so pull items that are frequently accessed and put them in memcache. For my large-scale deployments we cache sessions there, because the DB is slow and so is filesystem session storage.
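For the session case, pointing PHP's session handler at memcached is a small configuration change. A sketch assuming the php-memcached extension (the same settings can live in php.ini instead):

    <?php
    // Store PHP sessions in memcached instead of the DB or filesystem.
    // Host and port are examples.
    ini_set('session.save_handler', 'memcached');
    ini_set('session.save_path', '127.0.0.1:11211');
    session_start();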
A recommendation to add to your stack is APC. It caches PHP files and lessens the overall memory usage per page.
Alternative: Redis
Memcached is, obviously, limited by your available memory and will start to jettison data when memory thresholds are reached. You may want to look at Redis, which is as fast as memcached (faster in some benchmarks) but allows the use of both volatile and non-volatile keys, more complex data structures, and the option of using virtual memory to put least-recently-used (LRU) key values on disk.
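For example, with the phpredis extension (the key names are hypothetical), a volatile cached value can sit next to persistent data in the same store:

    <?php
    // phpredis sketch: volatile and non-volatile keys side by side.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Volatile: a cached API response that expires after an hour.
    $redis->setex('cache:user:42', 3600, '{"name":"Foo Bar"}');

    // Non-volatile: a persistent counter in the same store.
    $redis->incr('stats:pageviews');

    // Richer data structures than memcached offers, e.g. a sorted set:
    $redis->zIncrBy('popular:images', 1, 'cat-42.jpg');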

CakePHP High-Availability Server Farm setup

I am currently working on configuring my CakePHP (1.3) based web app to run in an HA setup. I have 4 web boxes running the app itself and a MySQL Cluster for the database backend. Users upload 12,000-24,000 images a week (35-70 GB). The app then generates 2 additional files from each original, a thumbnail and a medium-size image for preview. This means a total of 36,000-72,000 possible files added to the repositories each week.
What I am trying to wrap my head around is how to handle the large number of static file requests coming from users trying to view these images. I mean, I could have multiple web boxes serving only static files, with a load balancer dispatching the requests.
But does anyone on here have any ideas on how to keep all static file servers in sync?
If any of you have any experiences you would like to share, or any useful links for me, it would be very appreciated.
Thanks,
serialk
It's quite a thorny problem.
Technically you can get a high-availability shared directory through something like NFS (or SMB if you like), using DRBD and Linux-HA for an active/passive setup. Such a setup will have good availability against single-server loss; however, it is quite wasteful and not easy to scale - you'd have to have the app itself decide which server(s) to go to, configure NFS mounts, etc., and it all gets rather complicated.
So I'd probably avoid keeping the images in a filesystem at all, or at least not the conventional kind. I am assuming that you need this to be flexible enough to add more storage in the future; if you can keep the storage and IO requirements constant, DRBD with HA NFS is probably a good system.
For storing files in a flexible "cloud", consider either:
Tahoe-LAFS
or perhaps, at a push, Cassandra, which would require a bit more integration but may be better in some ways.
MySQL Cluster is not great for big blobs, as it (mostly) keeps the data in RAM; also, the high consistency it provides requires a lot of locking, which makes updates scale (relatively) badly at high workloads.
But you could still consider putting the images in MySQL Cluster anyway, particularly as you have already set it up; it would require no additional operational overhead.
