Will I run into load problems with this application stack? - php

I am designing a file download network.
The ultimate goal is to have an API that lets you upload a file directly to a storage server (no gateway or similar in between). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything similar at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down, it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that points to the current storage server is generated by the API (that's no problem so far; it is just a classic PHP website and MySQL database).
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
The PHP script sets the X-Sendfile header so that Apache2 mod_xsendfile serves the file.
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script.
Apache runs mod_logio, and the access log uses the Combined I/O log format, extended with the %D variable (the time taken to serve the request, in microseconds) so that I can calculate the transfer speed, spot bottlenecks in the network and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, creating the row if no row has been updated.
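The token check in the flow above (a hash over IP, timestamp and filename, verified with a shared secret and no database access) can be sketched as follows. This is only an illustration, written in Python rather than the PHP the setup would use; the function names, the `|` separator, the choice of HMAC-SHA256 and the secret value are all assumptions, not part of the described system.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical secret shared by the API and the storage servers


def sign(path: str, client_ip: str, expires: int, secret: bytes = SECRET) -> str:
    """Return a hex HMAC-SHA256 over the fields the storage server re-checks."""
    message = f"{path}|{client_ip}|{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(path, client_ip, expires, token, now=None, secret=SECRET) -> bool:
    """Validate a download request using only the shared secret (no database)."""
    now = time.time() if now is None else now
    if expires < now:
        return False  # the link has expired
    expected = sign(path, client_ip, expires, secret)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, token)
```

The API would call `sign()` when it generates the URL, and the storage server would call `is_authorized()` with the path, client IP and expiry taken from the incoming request before handing the file to mod_xsendfile.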
I have to mention that the servers will be low-budget machines with massive storage.
You can have a closer look at the intended setup in this thread on serverfault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500 MB on average or so!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220, 2x 2.80 GHz tray CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 MB or so.
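To put those figures in perspective, here is a rough back-of-the-envelope calculation of how many completed requests per second (and therefore access-log lines per second) a saturated Gigabit link produces. It is a sketch that assumes decimal megabytes and ignores protocol overhead; the function name is made up for illustration.

```python
LINK_BITS_PER_SECOND = 1_000_000_000         # Gigabit link, assumed fully saturated
BYTES_PER_SECOND = LINK_BITS_PER_SECOND / 8  # 125,000,000 bytes/s, ignoring protocol overhead
MB = 1_000_000                               # decimal megabyte


def requests_per_second(avg_transfer_bytes: float) -> float:
    """Completed transfers per second needed to keep the link full at this average size."""
    return BYTES_PER_SECOND / avg_transfer_bytes


print(requests_per_second(500 * MB))  # 0.25 (whole 500 MB files)
print(requests_per_second(50 * MB))   # 2.5  (50 MB average accesses)
```

Under these assumptions, even the pessimistic 50 MB case yields only a few log lines per second per server.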
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, so I am not as flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as that process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once served downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a second process that can handle delays? Or not do it at all and parse the log files at night instead? My thought is that I would like to have it as real-time as possible and not have accumulated data anywhere other than in the central database. I also don't want to keep track of jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3, one that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
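The "update the daily counter, and create the row if nothing was updated" idea from the question can be sketched like this. It is an illustration only: it is written in Python against SQLite so that it is self-contained, and the table name, column names and helper functions are hypothetical. `cursor.rowcount` plays the role that the affected-rows count would play in the PHP/MySQL version.

```python
import sqlite3


def open_stats_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the stats database and make sure the (hypothetical) counter table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_traffic ("
        " bucket TEXT NOT NULL, day TEXT NOT NULL,"
        " bytes_in INTEGER NOT NULL, bytes_out INTEGER NOT NULL,"
        " PRIMARY KEY (bucket, day))"
    )
    return conn


def bucket_from_path(url_path: str) -> str:
    """The first path segment is the bucket, e.g. '/bucket1/a/b.bin' -> 'bucket1'."""
    return url_path.lstrip("/").split("/", 1)[0]


def add_traffic(conn, bucket, day, bytes_in, bytes_out):
    """Increment today's counters; insert the row if the UPDATE matched nothing."""
    cur = conn.execute(
        "UPDATE daily_traffic SET bytes_in = bytes_in + ?, bytes_out = bytes_out + ?"
        " WHERE bucket = ? AND day = ?",
        (bytes_in, bytes_out, bucket, day),
    )
    if cur.rowcount == 0:  # no existing row for this bucket and day
        conn.execute(
            "INSERT INTO daily_traffic (bucket, day, bytes_in, bytes_out)"
            " VALUES (?, ?, ?, ?)",
            (bucket, day, bytes_in, bytes_out),
        )
    conn.commit()
```

Note that with several log processors writing concurrently, two of them could both see zero updated rows and both try to insert. In MySQL, a single `INSERT ... ON DUPLICATE KEY UPDATE` statement on a unique (bucket, day) key is one way to avoid that window.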

MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly what you want.

Related

MySQL service periodically goes offline and gives ERROR 2002 (HY000): Can't connect to local MySQL server [duplicate]

I am currently using an AWS micro instance as a web server for a website that allows users to upload photos. Two questions:
1) When looking at my CloudWatch metrics, I have recently noticed CPU spikes, the website receives very little traffic at the moment, but becomes utterly unusable during these spikes. These spikes can last several hours and resetting the server does not eliminate the spikes.
2) Although seemingly unrelated, whenever I post a link of my website on Twitter, the server crashes (i.e.,Error Establishing a Database Connection). Once restarting Apache and MySQL, the website returns to normal functionality.
My only guess would be that the issue is somehow the result of deficiencies of the micro instance. Unfortunately, when I upgraded to the small instance, the site was actually slower, because micro instances can burst to two EC2 compute units.
Any suggestions?
If you want to stay in the free tier of AWS (micro instance), you should offload as much as possible from your EC2 instance.
I would suggest uploading the images directly to S3 instead of going through your web server (see an example here: http://aws.amazon.com/articles/1434).
S3 can also be used to serve most of your static content (images, JS, CSS...) instead of your weak web server. You can also use these files in S3 as the origin of an Amazon CloudFront (CDN) distribution to improve your application's performance.
Another service that can help you offload work is SQS (Simple Queue Service). Instead of handling every user request online, you can send some requests (upload done, for example) as a message to SQS and have your reader process these messages at its own pace. This is a good way to handle momentary load caused by several users working with your service simultaneously.
Another service is DynamoDB (managed NoSQL DB service). You can put on dynamoDB most of your current MySQL data and queries. Amazon DynamoDB also has a free tier that you can enjoy.
With the combination of the above, you can have your micro instance handling the few remaining dynamic pages until you need to scale your service with your growing success.
Wait… I'm sorry. Did you say you were running both Apache and MySQL Server on a micro instance?
First of all, that's never a good idea. Secondly, as documented, micros have low I/O and can only burst to 2 ECUs.
If you want to continue using a resource-constrained micro instance, you need to (a) put MySQL somewhere else, and (b) use something like Nginx instead of Apache as it requires far fewer resources to run. Otherwise, you should seriously consider sizing up to something larger.
I had the same issue: As far as I understand the problem is that AWS will slow you down when you reach a predefined usage. This means that they allow for a small burst but after that things will become horribly slow.
You can test that by logging in and doing something. If you use the CPU for a couple of seconds then the whole box will become extremely slow. After that you'll have to wait without doing anything at all to get things back to "normal".
That was the main reason I went for VPS instead of AWS.

PHP Image Generation

So, for a simple test game, I'm working on generating user images based on their current in-game avatar. I got this idea from Club Penguin and GTA V. They both generate images of the current in-game avatar.
I created a script to simply put a few images together and print out the final image to the client. It's similar to how Club Penguin does it, I believe: http://cdn.avatar.clubpenguin.com/%7B13bcb2a5-2e21-442c-b8e4-10516be6abc6%7D/cp?size=300
As you can see, the penguin is wearing multiple clothing items. The items are each different images located at http://mobcdn.clubpenguin.com/game/items/images/paper/image/300/ (ex: http://mobcdn.clubpenguin.com/game/items/images/paper/image/300/210.png)
Anyway, I've already made the script and all, but I have a few questions.
When going to Club Penguin's or Grand Theft Auto's avatar generator, you'll notice it finishes the request so fast. Even when it's a new user, (so before it has a chance to cache the image since it hasn't been generated yet), it finishes in under a second.
How could I possibly speed up the image generation process? Right now I'm just using PHP, but I could definitely switch over to another language. I know a few others too and I'm willing to learn. Which language can provide the fastest web-image generator (it has to connect to a database first to grab the user avatar info)?
For server specs, how much RAM and all that fun stuff would be an okay amount? Right now I'm using an OVH cloud server (VPS Cloud 2) to test it and it's fine and all. But, if someone with experience with this could help, what might happen if I started getting a lot more traffic and there were people with 100+ image requests being made per client when they first log in (relationship system that shows their friend's avatar). I'll probably use Cloudflare and other caching tools to help so that most of them get cached for a maximum of 24 hours, but I can't completely rely on that.
tl;dr:
Two main questions:
What's the fastest way to generate avatars on the web (right now I'm using PHP)?
What are some good server specs for around 100+ daily unique clients (at minimum) using this server for generating these avatars?
Edit: Another question, which webserver could process more requests for this? Right now I'm using Apache for this server, but my other servers are using nginx for other API things (like logging users in, getting info, etc).
IMHO, the language is not the bottleneck. PHP is fast enough for real-time processing of small images. You just need the right algorithm. Also, check out bytecode caching engines such as APC or XCache, or even HHVM. They can significantly improve PHP performance.
I think any VPS can do the job until you have >20 concurrent requests. The more clients use the service at the same time, the more RAM you need. You can easily determine your script's memory needs and other performance info by using a profiler such as XHProf.
Nginx or Lighttpd in FastCGI mode use less RAM than the Apache HTTP server and can handle more concurrent connections. But that is not important until you have many concurrent connections.
Yes, PHP can do this job quickly and flexibly (for example: generate.php?size=32).
I know only German webspaces, but they have also an English interface. www.nitrado.net

Quick writing to log file after http request

I just finished building a web server whose main responsibility is simply to take the contents of the body of each HTTP POST request and write it to a log file. The POST data is obfuscated when received, so I'm deobfuscating it and writing it to a log file on the server. The deobfuscated content is a series of random key-value pairs that differ between every request. It is not fixed data.
The server is running Linux with 2.6+ kernel. Server is configured to handle heavy traffic (open files limit 32k, etc). The application is written in Python using web.py framework. The http server is Gunicorn behind Nginx.
After using Apache Benchmark to do some load testing, I noticed that it can handle up to about 600-700 requests per second without any log writing issues. Linux natively does a good job at buffering. Problems start to occur when more than this many requests per second attempt to write to the same file at same moment. Data will not get written and information will be lost. I know that "the writing directly to a file" design might not have been the right solution from the get go.
So I'm wondering if anyone can propose a solution to this problem that I can implement quickly, without altering too much infrastructure and code?
I have read about in memory storage like Redis, but I have realized that if data is sitting in memory during server failure then that data is lost. I have read in the docs that redis can be configured as a persistent store, there just needs to be enough memory on the server for Redis to do it. This solution would mean that I would have to write a script that would dump the data from Redis (memory) to the Log file at a certain interval.
I am wondering if there is even a quicker solution? Any help would be greatly appreciated!
One option I can think of is a separate logging process, so that your web.py application is shielded from the logging performance issue. This is the classical way of handling a logging module. You can use IPC or any other bus communication infrastructure. With this you will be able to address two issues:
Logging will not be a huge bottleneck for high-capacity call flows.
A separate module makes it possible to switch logging off and on.
As such, there would not be any significant process memory usage.
However, you should bear the following points in mind:
You need to be sure that logging is restricted to just logging. It must not be a data store for business processing. Otherwise you may get many synchronization problems in your business logic.
The logging process (here I mean an actual Unix process) will become critical and slightly complex (i.e. you may have to handle a form of IPC).
HTH!
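A minimal sketch of the single-writer idea, using only the Python standard library: request handlers put records on a queue and exactly one listener writes them to the file. Note that this variant uses a background thread inside one process rather than the separate Unix process suggested above; with several Gunicorn workers, each worker would still have its own writer, so a true cross-process setup would need some IPC channel between the workers and one logging process (for example `logging.handlers.SocketHandler`). The logger name and the function name are hypothetical.

```python
import logging
import logging.handlers
import queue


def start_queued_logger(log_path: str):
    """Return (logger, listener); callers only enqueue, one thread does the file I/O."""
    log_queue = queue.Queue(-1)  # unbounded queue between request handlers and the writer
    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(logging.Formatter("%(message)s"))
    # The listener thread is the only code that ever writes to the file.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("postlog")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

A request handler would then call `logger.info(decoded_body)`, which returns as soon as the record is queued, and the application would call `listener.stop()` on shutdown so that the remaining queued records are written out.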

Splitting form submissions to speed up transfer time

I have a simple CRM system that allows sales to put in customer info and upload appropriate files to create a project.
The system is already being hosted in the cloud. But the office internet upload speed is horrendous. One file may take up to 15 minutes or more to finish, causing a bottleneck in the sales process.
Upgrading our office internet is not an option; what other good solutions are out there?
I propose splitting the project submission form into two parts. The project info fields are posted directly to our cloud-hosted webapp and stored in the appropriate DB table. The file itself is submitted to a LAN server with a simple DB and an API that the cloud-hosted webapp can talk to in order to retrieve the file via a download link if it is ever needed again. Details need to be worked out for this setup, but this is what I want to do in general.
Is this a good approach to solving this slow upload problem? I've never done this before, so are there also any obstacles to this implementation (cross-domain restrictions is something that comes into mind, but I believe that can be fixed with using an iFrame)?
If bandwidth is the bottleneck, then you need a solution that doesn't chew up all your bandwidth. You mentioned that you can't upgrade your bandwidth - what about putting in a second connection?
If not, the files need to stay on the LAN a little longer. It sounds like your plan would be to keep the files on the LAN forever, but you can store them locally initially and then push them later.
When you do copy the files out to the cloud, be sure to compress them and also set up rate limiting (so that they take up maybe 10% of your available bandwidth during business hours).
Also put some monitoring in place to make sure the files are being sent in a timely manner.
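The "compress and rate-limit" advice above could look roughly like the following sketch, which gzip-compresses a stream while pausing so that the average output rate stays under a cap. It is an illustration under stated assumptions: the function name, the chunk size and the simple average-rate throttle are made up for this example, and `dst` stands in for whatever upload stream is actually used.

```python
import time
import zlib


def throttled_gzip_copy(src, dst, max_bytes_per_second, chunk_size=64 * 1024,
                        clock=time.monotonic, sleep=time.sleep):
    """Gzip src into dst, sleeping so the average output rate stays under the cap."""
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    start = clock()
    sent = 0

    def emit(data):
        nonlocal sent
        if not data:
            return  # the compressor may buffer input and return nothing yet
        dst.write(data)
        sent += len(data)
        # Earliest time at which 'sent' bytes are allowed to have gone out.
        earliest = start + sent / max_bytes_per_second
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        emit(compressor.compress(chunk))
    emit(compressor.flush())
    return sent  # number of compressed bytes written
```

For the 10% figure mentioned above, `max_bytes_per_second` would be set to roughly one tenth of the measured uplink speed during business hours and raised outside them.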
I hope nobody needs to download those files! :(

Apache/PHP test server throttle by request

I've recently set up a test server running in a virtual machine on my computer so I can do such things as interactive debugging with XDebug. For the most part it's pretty sweet, but I've run into a snag when running multiple requests to the server at once from the same client.
The problem is that the guest-host network connection doesn't really exist as a physical connection, so it will run as fast as the computer hardware will allow. This isn't usually a big issue, but I'm trying to implement APC file upload monitoring, and this requires an AJAX request to run in parallel with the file upload to monitor its progress. In the real world, the network would introduce lag and latency and suchlike, leaving enough unused bandwidth for the AJAX request to run in parallel with the file upload. However, on the test machine, the AJAX request can't fetch any data from the server until the upload is finished, as there's absolutely no bandwidth left available to it.
Is it possible to set up some kind of bandwidth management in the virtual machine (in Apache, PHP or some Linux utility) that could limit the bandwidth available per HTTP request? For example so that each request is limited to 1mbps, but several requests can exist between the client and the server at the same time? I'm hoping that if this can be done it will allow the AJAX request to fetch its data while the upload is progressing instead of being stalled until the upload actually completes.
I tried a utility called IPRelay, but I don't seem able to get it to work, or at least not in a way that limits per request.
What you're asking for is called Traffic Shaping.
Lighttpd (an alternative to Apache) supports this natively
For Apache, there are a few ways of doing it.
mod_bandwidth - A third-party module (that hasn't been updated recently) which appears to do the same thing.
mod_bwshare - A third-party module designed to combat DoS attacks, but it may be helpful.
Here's a ServerFault Question that may be relevant...
Thanks for the reply. However, I found a handy little utility for Linux called iprelay that lets me throttle connections, it seems to let me have multiple connections open with each connection throttled to the specified limit. That's what I've been using today for testing my APC code and it all seems to be working fine.
