Scaling File Systems - PHP

This could be a question for Server Fault as well, but it also touches on topics from here.
I am building a new web site that consists of 6 servers: 1 MySQL, 1 web, 2 file-processing servers, and 2 file servers. In short, the file-processing servers process files and copy them to the file servers. In this case I have two options:
I can set up a web server on each file server and serve files directly from there, like file1.domain.com/file.zip. Some files (not all of them) will need authentication, so I will authenticate users via memcache on those servers. 90% of the requests won't need any authentication.
Or I can set up NFS and serve files from the web server, like www.domain.com/fileserve.php?id=2323 (a basic example).
As the project is heavily based on files, the second option might not be as effective as the first, as it will consume more memory on the web server (even if I split files into chunks while serving).
The setup will stay the same for a long time, so we won't be adding new file servers later.
What are your ideas? Which one is better, or is there a different approach?
Thanks in advance,

Just me, but I would actually put a set of reverse proxy rules on the "web server" and then proxy HTTP requests (possibly load balanced if they have equal filesystems) back to a lightweight HTTP server on the file servers.
This gives you flexibility and the ability to implement future caching, logging, filter chains, rewrite rules, authentication, etc. I find having a fronting web server as a proxy layer a very effective solution.
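For illustration, a minimal Go sketch of that fronting-proxy idea, using the standard library's httputil.ReverseProxy (the backend host names are made up; the answer above works equally well with nginx or Apache as the proxy):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
    "sync/atomic"
)

func main() {
    // Hypothetical internal addresses of the lightweight HTTP servers running
    // on the two file servers; the real names depend on your network.
    backends := []*url.URL{
        {Scheme: "http", Host: "file1.internal:8080"},
        {Scheme: "http", Host: "file2.internal:8080"},
    }

    var counter uint64
    proxy := &httputil.ReverseProxy{
        Director: func(r *http.Request) {
            // Naive round-robin; assumes both file servers hold identical files.
            n := atomic.AddUint64(&counter, 1)
            target := backends[n%uint64(len(backends))]
            r.URL.Scheme = target.Scheme
            r.URL.Host = target.Host
        },
    }

    // Caching, logging, rewrite rules and authentication checks can be hooked
    // in on this fronting server before the request is proxied.
    log.Fatal(http.ListenAndServe(":80", proxy))
}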

I recommend your option #1: allow the file servers to act as web servers. I have personally found NFS to be a little flaky when used under high volume.

You can also use a Content Delivery Network such as simplecdn.com; it can solve bandwidth and server-load issues.


How to handle video and image upload to storage servers?

I am in the process of developing an application (in Go, or possibly in PHP) where users need to upload photos and images.
I have set up a couple of ZFS (mirror) storage servers in different locations, but I am in doubt about how best to let users upload files. ZFS handles quotas and reservations.
I am running a replicated Galera database on all servers, both for safety and for easy access to user accounts from each server. In other words, each server has a local copy of the database at all times. All users are virtual users only.
So far I have tested the following setup options:
Solution 1
Running SFTP (ProFTPD with its SSH module) or FTPS (Pure-FTPd with TLS) on the storage servers with virtual users.
This gives people direct access to the storage servers using a client like Filezilla. At the same time users can also upload using our web GUI from our main web server.
One advantage with this setup is that the FTP server can handle virtual users. Our web application will also send files via SFTP or FTPS.
One disadvantage is that FTP is meh and annoying to firewall. Also, I much prefer FTP over SSH (SFTP) to FTP over TLS (FTPS). However, only ProFTPD has a module for SSH, and it has been a real pain to work with compared to PureFTPd (many problems with non-working configuration options and file permission errors), but PureFTPd only supports TLS.
Running with real SSH/SCP accounts and using PAM is not an option.
Solution 2
Mount the storage servers locally on the web server using NFS or CIFS (Samba is great at automatic resume in case a box goes down).
In this setup users can only upload via our main web server. The web server application and the application running on the storage servers then need to support resumable uploads. I have been looking into using the tus protocol.
A disadvantage with both the above setups is that storage capacity needs to be managed somehow. When storage server 1 reaches its maximum number of users, the application needs to know this and then only create virtual users for storage server 2, 3, etc.
I have calculated how many users each storage server can hold and then have the web application check the database with virtual users to see when it needs to move newly created users to the next storage server.
This is rather old school, but it works.
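A rough sketch of that capacity check, in Go with database/sql; the virtual_users table, storage_server column and the per-server limit are hypothetical names, not taken from the question:

package storage

import (
    "database/sql"
    "errors"
)

// pickStorageServer returns the first storage server that still has room for a
// new virtual user, based on a per-server user count. The virtual_users table
// and storage_server column are hypothetical names.
func pickStorageServer(db *sql.DB, servers []string, maxUsersPerServer int) (string, error) {
    for _, s := range servers {
        var count int
        err := db.QueryRow(
            "SELECT COUNT(*) FROM virtual_users WHERE storage_server = ?", s,
        ).Scan(&count)
        if err != nil {
            return "", err
        }
        if count < maxUsersPerServer {
            return s, nil
        }
    }
    return "", errors.New("all storage servers are full")
}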
Solution 3
Same as solution 2 (no FTP), but clone our web application's upload thingy to each storage server and then redirect users (or provide them with a direct link to the storage server, s1.example.com, s2.example.com, etc.)
The possible advantage of this setup is that users upload directly to the storage server they have been assigned to rather than going through our main web server (preventing it from becoming a possible bottleneck).
Solution 4
Use GlusterFS on the storage servers and build a cluster that can easily be expanded. I have tested out GlusterFS and it works very well for this purpose.
The advantage with this setup is that I don't really need to care about where files physically go on which storage servers, and I can easily expand storage by adding more servers to the cluster.
However, the disadvantage here is again that our main web server might become a bottleneck.
I have also considered adding a load balancer and then using multiple web servers in case our main web server becomes a bottleneck for uploading files.
In any case, I much prefer to keep it simple! I don't like adding stuff. I want it to be easy to maintain in the long run.
Any ideas, suggestions, and advice will be greatly appreciated.
How do you do it?
A web application should be agnostic of the underlying storage when we are talking about file storage: separation of concerns.
(S)FTP(S), on the other hand, is not a storage method. It is a communication protocol. It does not preclude you from having shared storage. See above.
ZFS does not come with shared-storage capability built in, so you are basically down to the following questions:
Which underlying filesystem?
Do I want to offer an additional access mode via (S)FTP(S)?
How do I make my filesystem available across multiple servers? GlusterFS, CIFS or NFS?
So, let us walk this through.
Filesystem
I know ZFS is intriguing, but here is the thing: xfs, for example, already has a maximum filesystem size of 8 exbibytes minus one byte. The specialist term for this is "a s...load". To give you a sense of scale: the Library of Congress holds about 20 TB of digital media - and would fit into that roughly 400k times. Even good ol' ext4 can hold 50k LoCs. And if you hold that much data, your FS is your smallest concern. Building the next couple of power plants to keep your stuff running presumably is.
Gist: Nice to think about, but use whatever you feel comfortable with. I personally use xfs (on LVM) for pretty much everything.
Additional access methods
Sure, why not? Aside from the security nightmare (privilege escalation, anyone?). And ProFTPD, with its built-in coffee machine and kitchen sink, is the last FTP server I would use for anything. It has a ginormous code base, which lends itself to accidentally introduced vulnerabilities.
Basically it boils down to the skills present in the project. Can you guys properly harden a system and an FTP server and monitor them for security incidents? Unless your answer is a confident "Yes, ofc, plenty of experience with it!", you should minimize the attack surface you present.
Gist: Don't, unless you really know what you are doing. And if you have to ask, you probably do not. No offense intended, just stating facts.
Shared filesystem
Personally, I have had... less than perfect experiences with GlusterFS. Its replication has quite demanding requirements when it comes to network latency and the like. In a nutshell: if we are talking about multiple availability zones, say EMEA, APAC and NCSA, it is close to impossible. You'd be stuck with geo-replication, which is less than ideal for the use case you describe.
NFS and CIFS on the other hand have the problem that there is no replication at all, and all clients need to access the same server instance in order to access the data - hardly a good idea if you think you need an underlying ZFS to get along.
Gist: Shared filesystems at a global scale with halfway decent replication lags and access times are very hard to do and can get very expensive.
Haha, Smartypants, so what would you suggest?
Scale. Slowly. In the beginning, you should be able to get along with a simple FS-based repository for your files, then evaluate various other means of large-scale shared storage and migrate when you need to.
Taking the turn towards implementation, I would even go a step further: you should make your storage an interface:
// Storer takes the source and stores its contents under path for further
// reading via Retriever.
type Storer interface {
    StreamTo(path string, source io.Reader) (err error)
}

// Retriever takes a path and streams the file it has stored under path to w.
type Retriever interface {
    StreamFrom(path string, w io.Writer) (err error)
}

// Repository is a composite interface. It requires a repository to accept
// and provide streams of files.
type Repository interface {
    Storer
    Retriever
    Close() error
}
Now, you can implement various storage methods quite easily:
// FileStore represents a filesystem-based file Repository.
type FileStore struct {
    basepath string
}

// StreamFrom satisfies the Retriever interface.
func (s *FileStore) StreamFrom(path string, w io.Writer) (err error) {
    f, err := os.OpenFile(filepath.Join(s.basepath, path), os.O_RDONLY|os.O_EXCL, 0640)
    if err != nil {
        // handleErr is an error-wrapping helper defined elsewhere in the repository.
        return handleErr(path, err)
    }
    defer f.Close()
    _, err = io.Copy(w, f)
    return err
}
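For symmetry, a possible StreamTo counterpart (not shown in the original answer, so purely a sketch in the same style; handleErr is assumed to be the same helper as above):

// StreamTo satisfies the Storer interface. It writes the contents of source to
// basepath/path, refusing to overwrite an existing file.
func (s *FileStore) StreamTo(path string, source io.Reader) (err error) {
    f, err := os.OpenFile(filepath.Join(s.basepath, path), os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0640)
    if err != nil {
        return handleErr(path, err)
    }
    defer f.Close()
    _, err = io.Copy(f, source)
    return err
}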
Personally, I think this would be a great use case for GridFS, which, despite its name is not a filesystem, but a feature of MongoDB. As for the reasons:
MongoDB comes with a concept called replica sets to ensure availability with transparent automatic failover between servers
It comes with a rather simple mechanism of automatic data partitioning, called a sharded cluster
It comes with an indefinite number of access gateways called mongos query routers to access your sharded data.
For the client, aside from the connection URL, all of this is transparent. So it makes almost no difference (aside from read preference and write concern) whether its storage backend consists of a single server or a globally replicated sharded cluster with 600 nodes.
If done properly, there is not a single point of failure, you can replicate across availability zones while keeping the "hot" data close to the respective users.
I have created a repository on GitHub which contains an example of the interface suggestion and implements a filesystem based repository as well as a MongoDB repository. You might want to have a look at it. It lacks caching at the moment. In case you would like to see that implemented, please open an issue there.
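To make the GridFS suggestion a bit more concrete, here is a rough sketch of how the Storer/Retriever interfaces above could be backed by GridFS, assuming the v1 mongo-go-driver's gridfs package; the actual implementation in the linked repository may look different:

package mongostore

import (
    "io"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/gridfs"
)

// GridFSStore streams files to and from a MongoDB GridFS bucket.
type GridFSStore struct {
    bucket *gridfs.Bucket
}

// NewGridFSStore creates a store backed by the default GridFS bucket of db.
func NewGridFSStore(db *mongo.Database) (*GridFSStore, error) {
    b, err := gridfs.NewBucket(db)
    if err != nil {
        return nil, err
    }
    return &GridFSStore{bucket: b}, nil
}

// StreamTo satisfies the Storer interface: it uploads source under path.
func (s *GridFSStore) StreamTo(path string, source io.Reader) (err error) {
    _, err = s.bucket.UploadFromStream(path, source)
    return err
}

// StreamFrom satisfies the Retriever interface: it streams the file stored
// under path to w.
func (s *GridFSStore) StreamFrom(path string, w io.Writer) (err error) {
    _, err = s.bucket.DownloadToStreamByName(path, w)
    return err
}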

Is NGINX safe for webapps that usually rely on htaccess?

First of all, you don't have to convince me that NGINX is good, I know that and want to use it for all web projects.
I was asked this question the other day, and didn't have a solid answer, as I haven't looked much at that specific situation.
The Question, elaborated
Old and new web software, such as Drupal and WordPress, comes with .htaccess files, and there are plenty of example nginx configs for them available with just a simple search. But:
How can I assure clients that nginx is secure for Drupal and WordPress respectively?
How can the nginx configs out there cover such complex systems and keep them just as safe as Apache + .htaccess?
I don't need complete answers here, just links to articles that try to answer this or similar questions.
I want to convince some clients to move to nginx for the huge performance boost it would give them, but they need more proof than I have regarding security with systems that usually rely on .htaccess.
What is an .htaccess file?
It's a partial web server configuration file that's dynamically parsed on each (applicable) request. It allows you to override certain aspects of the web server's behaviour on a per-directory basis by sprinkling configuration snippets throughout your directories.
Fundamentally it does nothing one central configuration file can't, it just decentralises that configuration and allows users to override that configuration without having direct access to the web server as such. That has proven very useful and popular in the shared-hosting model, where the host does not want to provide full access to the web server configuration and/or that would be impractical.
nginx fundamentally uses one configuration file only, but you can add to that by including other configuration files with the include directive, with which you could ultimately mimic a similar system. But nginx is also so ridiculously performant because it does not do dynamic configuration inclusion on a per-directory basis; that is a huge performance drain on every request in Apache. Any sane Apache configuration will disable .htaccess parsing to avoid that loss of performance, at which point there's no fundamental difference between it and nginx.
Even more fundamentally: what is a web server? It's a program that responds to HTTP requests with specific HTTP responses. Which response you get depends on the request you send. You can configure the web server to send specific responses to specific requests, based on a whole host of variables. Apache and nginx can equally be configured to respond to requests in the same way. As long as you do that, there's no difference in behaviour and thereby "security".

Multiple webservers (NGINX) behind Load Balancer, share settings (database connections etc.)

I'm designing a system that will require several web servers (NGINX) behind a load balancer.
My question is: what techniques do you suggest for sharing settings among all web servers (hosting a PHP app)? Let's say that I have to change the credentials for the database connection. In that case I don't want to log in to every single server and change all of the config files.
What do you suggest I do to be able to update those variables in one place so they are accessible to all web servers? I've considered having a small server in the middle that all servers read from (through an scp connection or such), but I don't want a single point of failure.
There are various solutions to automate server management, like Puppet and Chef, but if you are just getting started with two or three servers, consider a tool that broadcasts the same SSH session to multiple hosts. Terminator for GNOME is great, and there's also csshX for Mac (which I haven't tried).
It's great to be able to see both terminals at once in case there are small differences between the servers from prior manual maintenance. In Terminator, you can easily switch between broadcasting your commands to a group of terminals and typing into a specific one. You can set up a layout that automatically starts two joined sessions and SSHes into all the servers in a given cluster. In other words, it's as easy to use as an SSH session normally would be, but it works for multiple servers.
When project size and complexity grows, consider switching to more fully automated server management.

Websocket complications

This is complicated, and not necessarily one question. I'd appreciate any possible help.
I've read that it is possible to have WebSockets without server access, but I cannot seem to find any examples that show how. I've come to the conclusion that I believe I need this based on the following two things:
I've been struggling for the past several hours trying to figure out how to even get WebSockets to work with the WAMP server I have on my machine, to which I have root access. I installed Composer, but cannot figure out how to use the composer.phar file to install Ratchet. I have tried other PHP WebSocket implementations (I would prefer that it be in PHP), but still cannot get them to work.
My current web host that I'm using to test things out is a free host and doesn't allow SSH access. So, even if I could figure out how to get WebSockets working with root access, it is a moot point when it comes to the host.
I've also found free VPS hosts by googling (limited everything, of course, but full root access), but I'd prefer to keep something that allows more bandwidth (my free host is currently unlimited). And I've read that you can (and should) host the WebSocket server on a different subdomain than the HTTP server, and that it can even be run on a different domain entirely.
It also might eventually be cheaper to host my own site (of course I have no real clue about that), but in that case I'd need to figure out how to even get WebSockets working on my machine.
So, if anyone can understand what I'm asking, there are several questions here. Is it possible to use WebSockets without root access, and if so, how? If that isn't truly possible: how do I properly install Ratchet when I cannot figure out the composer.phar file (I have a composer.json with the Ratchet requirement in it, but I'm not sure if it's in the right directory)? And is it possible to have the WebSocket server on a VPS and the HTTP server on an entirely different domain, and if so, is there any documentation anywhere about it?
Of course, there is the option of using AJAX: forcing the browser to run a jQuery AJAX call every so often to update a series of divs regardless of whether anything has changed. But that could get complicated, and I'm not even sure it is possible (I don't see why it wouldn't be). Then again, I'd prefer WebSockets over that, since I hear they are much less resource-hungry than that sort of polling would be.
A plain PHP file running under vanilla LAMP (i.e. mod_php under Apache) cannot handle WebSocket connections. It wouldn't be able to perform the protocol upgrade, let alone actually perform real-time communication, at least through Apache. In theory, you could have a very-long-running web request to a PHP file which runs a TCP server to serve WebSocket requests, but this is impractical and I doubt a shared host will actually allow PHP to do that.
There may be some shared hosts that make WebSocket hosting with PHP possible, but they can't offer that without either SSH/shell access or some other way to run PHP outside the web server. If they're just giving you a directory to upload PHP files to, and serving them with Apache, you're out of luck.
As for your trouble with Composer, I don't know if it's possible to run composer.phar on a shared host without some kind of shell access. Some hosts (e.g. Heroku) have specific support for Composer.
Regarding running a WebSocket server on an entirely different domain: you can indeed do that. Just point your JavaScript at that domain, and make sure the WebSocket server accepts connections from your site's origin (the handshake carries an Origin header that the server can check).
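To make the "separate long-running process" point concrete, here is a minimal standalone WebSocket echo server, sketched in Go with the gorilla/websocket package rather than PHP/Ratchet; the shape is the same either way: one persistent process listening on its own port or (sub)domain, outside Apache, checking the browser's Origin during the handshake.

package main

import (
    "log"
    "net/http"

    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
    // The HTTP site can live on a completely different domain; allow it here
    // by checking the Origin header (replace with a real whitelist).
    CheckOrigin: func(r *http.Request) bool { return true },
}

// echo upgrades the HTTP connection to a WebSocket and echoes messages back.
func echo(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("upgrade:", err)
        return
    }
    defer conn.Close()
    for {
        mt, msg, err := conn.ReadMessage()
        if err != nil {
            return
        }
        if err := conn.WriteMessage(mt, msg); err != nil {
            return
        }
    }
}

func main() {
    http.HandleFunc("/ws", echo)
    // One persistent process on its own port; this is exactly what shared PHP
    // hosting without shell access usually cannot give you.
    log.Fatal(http.ListenAndServe(":8081", nil))
}

A PHP/Ratchet server plays the same role; the essential requirement is being allowed to run such a long-lived process at all, which is what shared hosting without shell access usually rules out.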
OK... you have a few questions, so I will try to answer them one by one.
1. What to use
You could use Socket.IO. It's a JavaScript library for developing realtime web applications. It consists of two parts: a client side (which runs in the visitor's browser) and a server side. Basic usage requires almost no background knowledge of Node.js. There is an example tutorial for a simple chat app on the official Socket.IO website.
2. Hosting
Most hosting providers have a control panel (cPanel) with the capability to install/activate different Apache plugins and so on. First you should check whether Node.js is already available; if not, you could contact support and ask them whether adding it would be an option.
If you don't have any luck with your current hosting provider, you can always switch hosts quickly, as there are a lot of good deals out there. Google will definitely help you here.
Here is a list containing a few of the (maybe) best options. Keep in mind that although some hosting deals may be paid, there are a lot of low-cost options to choose from.
3. Bandwidth
As you are worried about "resource hungry" code, maybe you can try hosting some of your content on Amazon CloudFront. It's a widely used content delivery network that guarantees quick connections and fast resource loading, as files are served from the server closest to the client. The best part is that you only pay for what you actually use, so if you don't have that much traffic it is really cheap to run and still reliable!
Hope this helps ;)

Scaling for TYPO3 site

I'm asked by a customer to deliver a TYPO3 based website with the following parameters:
- small amount of content (about 50 pages)
- very little change frequency
- average availability of about 95%/day
- 20% of pages are restricted, only available after login
- No requirements for fancy TYPO3 extensions or anything else (only the TYPO3 core)
- Medium sized pages
- Only limited digital assets (images etc.) included
I have the requirement to build an infrastructure that serves up to 1000 concurrent users. Assuming an average think time of 30 seconds, this would result in about 33 requests per second.
What could such an infrastructure look like?
I know that system scaling is a highly individual task that depends on the implementation of the system and needs testing, but I need a first indication of where to start (single server, separating components onto different servers, ...).
Any idea?
An easier solution is EXT:nc_staticfilecache. This saves static pages as HTML, and your web server automatically delivers them through rewrite rules (in the case of Apache, through mod_rewrite). This works very well for static content and should already enable you to do >100 req/s.
The even fancier way is to use Varnish Cache. Varnish is a reverse proxy server that holds your web site's content in memory and can run on a dedicated host. If you configure it correctly (send correct cache headers!), it serves at line speed (some millions of req/s). There is also a TYPO3 extension, moc_varnish, which e.g. purges the Varnish cache when a page is changed in TYPO3. Support for Edge Side Includes also exists, to e.g. retrieve only the user-specific data from TYPO3 and serve the static parts of a page from the Varnish cache (everything except the "Welcome user Foo Bar"... ;)).
As mentioned: don't forget to configure correct cache headers (Expires etc.) for your assets. This already removes some load from your web server.
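As a rough, stack-agnostic illustration of the cache-header advice (a Go sketch rather than TYPO3/Apache configuration; the paths and the one-day lifetime are made up):

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// cacheControl wraps a handler and marks its responses as cacheable for maxAge,
// so browsers and an upstream cache such as Varnish can serve them without
// hitting the application again.
func cacheControl(maxAge time.Duration, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%d", int(maxAge.Seconds())))
        w.Header().Set("Expires", time.Now().Add(maxAge).UTC().Format(http.TimeFormat))
        next.ServeHTTP(w, r)
    })
}

func main() {
    // Serve static assets from ./assets with a one-day cache lifetime.
    assets := http.FileServer(http.Dir("./assets"))
    http.Handle("/assets/", http.StripPrefix("/assets/", cacheControl(24*time.Hour, assets)))
    log.Fatal(http.ListenAndServe(":8080", nil))
}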
It's quite possible; I've already built something like this. You need at least one dedicated server with >= 8 GB of RAM.
If we are speaking about infrastructure, the minimal combination is:
nginx/Varnish for the front end / load balancing
Apache HTTP Server
MySQL, either on a standalone server or clustered
Performance optimization is very important in such cases.
Some links for further reading:
http://techblog.evo.pl/en/how-to-boost-speed-up-your-typo3-website-with-nginx/
http://www.fabrizio-branca.de/nginx-varnish-apache-magento-typo3.html
http://wiki.typo3.org/Performance_tuning
I'd put this on a single dedicated server (or a well-specified VPS), but maybe keep all the static assets on a third-party CDN so you can focus on the dynamic stuff. I don't know TYPO3, but I can't see any reason why you couldn't have your DB on the same server at this level of usage - there are sure to be caching options of various kinds. Or perhaps consider a cloud server, so if you need more oomph, you can just add more resources.
Edit: I don't think it is a good idea to build a scalable architecture just yet (proxy servers and all that stuff). If it is slow and you find you really can't cope with one machine, scale up at that point. I'm of the view that you can make do with a much simpler architecture given your expected traffic.
I would look into a virtual server or a KVM guest and a good MySQL and PHP configuration. On a KVM guest I would tweak Linux and use iptables for traffic shaping. A dedicated root server would be nice, but it's expensive. Then I would think about using an nginx or lighttpd web server with eAccelerator and memcache. If that doesn't help, I would try to compile PHP and MySQL with optimization flags, or compile them with the Intel C Compiler; ICC can optimize C code better than GCC. If the server has a lot of RAM, I would use a ramdisk.
