Handling hundreds of simultaneous requests in Rails / PHP

I am writing a Ruby on Rails application, and one of the most important features of the website is live voting. We fully expect to get 10k voting requests in as little as one minute. Along with the rest of our traffic, that means we could see a huge spike in requests.
My initial idea is to set up the server to use Apache + Phusion Passenger; however, for the voting specifically I'm thinking about writing a separate PHP script on the side that reads and writes the vote data in memcached. The data only needs to persist for about 15 minutes, so writing to the database 10,000 times in one minute seems pointless. We also need to record each voter's IP so they can't vote twice, which makes the memcached approach a bit more complicated.
If anyone has any suggestions or ideas to make this work as well as possible, please help.

If you're architecting an app for this kind of massive influx, you're going to need to strip its essential components down to the absolute minimum.
Using a full Rails stack for that kind of intensity isn't really practical, nor necessary. It would be much better to build a very thin Rack layer that handles the voting by making direct DB calls, skipping even an ORM, basically acting as a wrapper around an INSERT statement. Sinatra and Sequel (which serves as an efficient query generator) can help with this.
You should also be sure to tune your database properly, plus run many load tests against it to be sure it performs as expected, with a healthy margin for higher load.
Making 10,000 DB calls in a minute isn't a big deal; each call will take only a fraction of a millisecond on a properly tuned stack. Memcached could offer higher performance, especially since the results are not intended to be permanent. Memcached has an atomic increment operation, which is exactly what you're looking for when simply tabulating votes. Redis is also a very capable temporary store.
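If you do end up going the memcached route from the question, a minimal PHP sketch could look like the following. This assumes the pecl Memcached extension; the key names, contest id and entry id are made up for illustration.

<?php
// Hypothetical memcached-backed vote tally. add() is atomic and fails if the
// key already exists, which doubles as the "has this IP voted?" check.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$contestId = 42;                       // hypothetical contest id
$entryId   = 7;                        // hypothetical entry being voted for
$ip        = $_SERVER['REMOTE_ADDR'];
$ttl       = 15 * 60;                  // data only needs to live ~15 minutes

if ($mc->add("voted:$contestId:$ip", 1, $ttl)) {
    $mc->add("votes:$contestId:$entryId", 0, $ttl);   // make sure the counter exists
    $mc->increment("votes:$contestId:$entryId");      // atomic tally
}

The same pattern maps directly onto Redis (SETNX plus INCR) if you go that way instead.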
Another idea is to scrap the DB altogether and write a persistent server process that speaks a simple JSON-based protocol. EventMachine is great for throwing these things together if you're committed to Ruby, as is Node.js if you're willing to build out a specialized tally server in JavaScript.
10,000 operations in a minute is easily achievable even on modest hardware using a specialized server process without the overhead of a full DB stack.
You will just have to be sure that your scope is very well defined so you can test and heavily abuse your implementation prior to deploying it.
Since what you're describing is, at the very core, something equivalent to a hash lookup, the essential code is simply:
contest = contests[contest_id]      # contests: in-memory hash of all contests
unless contest[:voted][ip]          # one vote per IP
  contest[:voted][ip] = true
  contest[:votes][entry_id] += 1    # votes hash should default to 0
end
Running this several hundred thousand times in a second is entirely practical, so the only overhead would be wrapping a JSON layer around it.

Related

Loader.io test "Maintain client load"

I'm just wondering how the "Maintain client load" test works and how to properly configure our environment (LAMP + nginx) to get the best results. Can anyone explain this test to me?
loader.io engineer here. I fully expect this question to be closed by noon, but I'll take a stab at explaining it anyway.
"maintain load" tests are kind of a strange beast. It may help to think of any loader test in terms of a "workload", which consists of the list of URLs you are testing.
In loader, you specify a number of clients for your test, and each client takes a copy of the workload and runs it. If the client is in "maintain load" mode, it iterates over the URLs in the workload repeatedly - maintaining its load. All the other clients do the same.
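To make that concrete, here's a rough sketch (my own illustration, not loader.io's code; the URLs and test length are made up) of what a single "maintain load" client does for the duration of the test:

<?php
// One simulated client: loop over its copy of the workload until time is up.
$workload   = ['http://example.com/', 'http://example.com/about'];
$testEndsAt = time() + 60;   // a one-minute test

while (time() < $testEndsAt) {
    foreach ($workload as $url) {
        file_get_contents($url);   // slower responses here mean fewer iterations overall
    }
}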
A loader.io blog post has a visualization of what this pattern of requests looks like.
This has some interesting side-effects. If you configure your test to ramp up the number of clients, what we see quite often is that response times at the beginning of a test are low, so clients are iterating fast over their workload. As more clients are added, responses get slower, effectively slowing down the request rate. This can make maintain load tests difficult to reason about, and that's why I personally don't recommend starting with maintain load tests.
As far as configuring your stack for best results, it depends on what "best results" means for you and what you are even doing with your stack. There is no silver bullet. If you're serving a static website then cache the heck out of it for best performance. If you have a complex app making database queries on every request and rendering things - profile your code, db queries, and everything else to tune your performance.
Define some requirements and set some performance goals - e.g. do you expect a hundred page views within an hour? A minute? Figure out what those requirements are and then go ahead and test against them.
Once you have your requirements, you can use loader.io and/or other load testing tools in a much more meaningful way. If your current performance doesn't match your requirements and goals, you can use these tools to check your progress. Start with small tests that your servers handle easily and increase it until things break. Then optimize your code/database queries/etc and test again to see how much you've improved.

What happens when my PHP website starts having a LOT of members?

This is something I am really curious about, and I do not really understand how it is possible.
So let's say I am the owner of Facebook (haha) and I have millions of people visiting my website every day, plus thousands and thousands of images, videos, logs, etc.
How do I store all this data?
Do I have multiple databases on different servers around the world and then connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example, I know that Facebook has a lot of data centers around the world and hundreds of servers.
How do they connect to these servers? Are the profiles stored in different locations, so that when I connect to my profile I'm served by that specific server? Or is there one main server backed by hundreds of other servers around the world?
Is there a way to use PHP so that I can connect to different servers and different MySQL databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I could one day end up working on a successful website, I really want to know what I would have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view, since their architecture is pretty well known.
The first thing you have to know is that you will have to distribute the workload of your web application. The question is how, so in order to determine what's going to be slow, you have to divide your app into segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache creates a thread per connected user, it requires a certain amount of memory for that operation. Eventually it will run out of memory, and then shit hits the fan: stuff stops working and your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache layer and you have a cluster of machines which can accept new machines in order to scale out. You've solved your first problem.
The next part is the actual layer that does the work. It accepts input from the user and saves it somewhere (MySQL), and that's the biggest problem you'll have. Why?
Due to the database.
Databases store their data on media such as hard drives. Hard drives, be they SSD or mechanical, are limited in how fast they can write or retrieve data. If I'm not mistaken, RAM operates at transfer rates of around 6 GB/sec, and its access time is also much, much lower than a hard drive's seek time.
Therefore, if you have X users asking for a piece of information and you can only deliver it at a certain rate, your app crashes or becomes unresponsive, and the layer handling database queries slows down because the hardware cannot match the speed at which you need the data.
What are the options here? There are many; I won't mention all of them:
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile; what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively: a system that stores small pieces of data in RAM, is easily scalable and whatnot. Most important, it's damn quick! (See the cache-aside sketch just below this list.)
Use different solutions alongside MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk writes (they keep a cached copy of the data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
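To make the caching option concrete, here's the cache-aside sketch mentioned above: check memcached first and only hit MySQL on a miss. The profiles table, key names and TTL are hypothetical, and it assumes the pecl Memcached extension plus PDO.

<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

function getProfile(Memcached $mc, PDO $pdo, int $userId): array
{
    $key     = "profile:$userId";
    $profile = $mc->get($key);
    if ($profile !== false) {
        return $profile;                    // cache hit: the db never sees this request
    }
    $stmt = $pdo->prepare('SELECT * FROM profiles WHERE user_id = ?');
    $stmt->execute([$userId]);
    $profile = $stmt->fetch(PDO::FETCH_ASSOC) ?: [];
    $mc->set($key, $profile, 60);           // cache miss: store the row for 60 seconds
    return $profile;
}

Even a TTL that short absorbs the bulk of reads for hot data.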
The topic of MySQL vs. "insert database or whatever here" is broad and I don't want to go into it, but remember: every single one of the data stores out there saves data to the hard drive eventually. The difference (a physical one, of course) is in how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up: the language gluing together the data store (MySQL) and the output (the HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization at the PHP level is small; even Facebook got modest results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively; it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get comes from replacing Apache with their own server built into HipHop. Some people claim "language X is better than language Y", and sometimes they're right, but it's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widespread, but it's slow for certain operations (implementing a trie with over 1 billion entries, for example). It's great for things like echoing some HTML after parsing the output from the db. It's quick at inserting and retrieving data from the database, and that's about 90% of PHP usage: talk to the db, display the data, done.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post reads more like an informative blog post; I know it's not filled with resources to back up my claims, but anyone who has done any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things I said probably won't sit well with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round-robin DNS. This means adding a DNS record with a number of different IP addresses; your DNS then hands out a different IP address each time the record is requested, so the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
When you have a website with that many users you'll already have enough experience to know the answer to the question; you'll also have a lot of money to pay people to find the optimal architecture for your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented (partitioned) database with lots of backups, and you'll have a few name servers which know the locations of the servers and some rules about the data stored on each one. When data is searched for, the query is sent to a name server, which finds the server(s) where the answer to that particular query can be found. I've also upvoted N.B.'s answer; I think he is mostly right.
For lots of users, you need a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with that many users should have 4-12 GB available. Also, save resources by not running a desktop environment on the server. If you have this many users, you might also want to consider a CDN and a database request queue.

Addressing scalability issues in web applications

This is a very general question from a newbie thinking about web application scalability. I am hosting my PHP-based web application on a single Microsoft IIS server. How do I determine the maximum number of connections that an IIS server can support without affecting performance? Also, the main performance criterion for a web application in this situation would be HTTP response time, correct? I have a MySQL database that does some expensive joins. So my question really is: how do I figure out the maximum number of connections the server can handle, and how do I speed up database performance? I'm looking for general recommendations.
Ufff, this is a really generic question.
Regarding the maximum number of requests the server can serve: try using some tool to stress test it. I would recommend JMeter.
Regarding scalability:
Use optimized indexes
Cache as much as you can: scripts, pages, images, etc.
Optimize your site
but remember that premature optimization is the root of all evil and can cost you more than you think
To stress test you can use: http://support.microsoft.com/kb/231282/en-us
As far as the database is concerned, the only way (if you want to stick with one server) is to do fewer queries per request and maybe use materialized views (be aware of table updates at that point).
The best, of course, is to cache your HTML so that when users request your pages you don't even need the db connection; you just send the cached HTML.
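A very simple way to do that in PHP is sketched below; the cache directory, the 5-minute TTL and the render_page() function are hypothetical placeholders.

<?php
// Serve a cached copy of the page if it's fresh; otherwise render, save, serve.
$cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$maxAge    = 300;   // treat cached copies as fresh for 5 minutes

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile);        // no rendering, no db connection
    exit;
}

ob_start();
render_page();                   // hypothetical: hits the db and echoes the HTML
file_put_contents($cacheFile, ob_get_contents());
ob_end_flush();                  // send the freshly rendered page to the client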
First you need to understand what performance is acceptable for your user experience. That usually breaks down to the response time of the server. If your maximum response time cannot exceed 1 second for users to have a good experience, then you figure out how many queries per second the server can handle, end to end, without violating the 1-second response time for 99% of the queries. Once it violates that, it's time to add more capacity in the form of servers.

Better Practice: Placing the load on SQL or Web server?

I'm the webmaster for a major US university. We have a great deal of traffic on our website, which I've built and been in charge of for the last 7 years or so. I've been building ever-more-complex features into our website, and it's always been my practice to put as much of the programming burden as possible on our multi-processor Microsoft SQL server - using stored procedures, views, etc. - and to fill in what can't be done there with PHP, ASP, or Perl on the IIS web server. Both servers are very powerful and capable machines. Since I've been doing this alone for so long without anyone else to brainstorm with, I'm curious whether my approach is ideal for the even higher load situations we'll have in the future.
My question is: Is it better practice to place more of the load burden on the SQL server using nested SELECT statements, views, stored procedures and aggregate functions, or should I be pulling multiple simpler queries and processing through them using server-side compile-time scripts like PHP? Keep on keepin' on or come up with a better way?
I've recently become more interested in performance after I did some load traces and learned just how much I've been putting on the shoulders of the SQL server. Both the web server and SQL server are fast and responsive throughout the day, almost without regard for how much I put on them, but I'd like to be ready, having trained myself and upgraded my existing code with optimized best practices in mind, by the time it becomes important.
Thanks for your advice and input.
You put each layer in your stack to use in the domain it fits best.
There is no use in having your database server send 1000 rows and using PHP to filter them if a WHERE-clause or GROUP-clause would suffice. It's not optimal to call the database to add two integers (SELECT 5+9 works fine, but php can do it itself, and you save the roundtrip).
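For example, with a hypothetical orders table and PDO, the contrast looks like this:

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

// Wasteful: ship every row to PHP, then throw most of them away.
$all  = $pdo->query('SELECT * FROM orders')->fetchAll(PDO::FETCH_ASSOC);
$open = array_filter($all, function ($row) { return $row['status'] === 'open'; });

// Better: let the WHERE clause do the filtering where the data lives.
$stmt = $pdo->prepare('SELECT * FROM orders WHERE status = ?');
$stmt->execute(['open']);
$open = $stmt->fetchAll(PDO::FETCH_ASSOC);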
You will probably want to look into scalability: what parts of your application can be divided across multiple processes? If you're still just using two layers (script & db), there is a lot of room for scaling there. But always start with the bottleneck first.
Some examples: host static content on a CDN, use caching for your pages, read about nginx and memcached, use NoSQL (MongoDB), consider sharding, consider replication.
My opinion is that it's generally (mostly) best to favor letting the web servers do the processing. Two points:
First is scalability. Once your application gets enough usage, you'll need to start worrying about load balancing. And it's a lot easier to drop in a couple of extra web servers pointing to a common database than it is to set up a distributed database cluster. So best to take as much strain away from the Database as you can and keep it on a single machine for as long as possible.
The second point I'd like to make is about optimizing the queries. This will depend a lot on the queries you are using and the database backend. When I first started working with databases, I fell into the trap of making elaborate SQL queries with multiple JOINs that fetched exactly the data I wanted, even if it came from four or five different tables. I reasoned, "That's what the database is there for - let's get it to do the hard work."
I quickly found that these queries took way too long to execute and often ended up blocking the database for other requests. While it may seem inefficient to split your query into multiple requests (for example in a for loop), you'll often find that executing multiple small queries against fast indexes makes your application run far more smoothly than trying to pass all the hard work to the database.
Firstly, you might want to check whether there is any load which can be removed entirely by client-side caching (.js, .css, static HTML and images), and by using technologies such as AJAX to do partial updates of screens - this will remove load on both the web and SQL servers.
Secondly, see if there is SQL load which can be reduced by web server caching - e.g. static or low-refresh data. If you have a lot of 'content' pages on your systems, have a look at common CMS caching techniques, which will scale to allow many more users to view the same data without rebuilding the page or hitting the database.
I tend to do as much as possible outside the db, viewing db calls as expensive/time-intensive.
For example, when performing a select on a user table with fields name_given and name_family, I could fatten the query to return a column called full_name built by concatenation. But that kind of thing can be easily done in a model on your server-side scripting language (PHP, Ruby, etc).
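To illustrate with a hypothetical users table (either version works; the second keeps the query lean and does the trivial work in PHP):

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

// Option 1: let the database build full_name.
$row = $pdo->query(
    "SELECT CONCAT(name_given, ' ', name_family) AS full_name FROM users WHERE id = 1"
)->fetch(PDO::FETCH_ASSOC);

// Option 2: fetch the raw columns and concatenate in the model layer.
$row = $pdo->query('SELECT name_given, name_family FROM users WHERE id = 1')
           ->fetch(PDO::FETCH_ASSOC);
$fullName = $row['name_given'] . ' ' . $row['name_family'];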
Of course, there are cases when the db is the more "natural" place to perform an operation. But, in general, I incline more towards putting the load on the web server and optimize there with many of the techniques noted in other answers.

How to load balance (scale) a simple PHP application?

I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced when I'm simply performing PHP read activities?
Would something like Nginx, set up as the front door to route traffic to multiple nodes running the same code to handle example.com/?page_id=X, be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, but for simplicity let's consider that out of scope for this question.
These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottleneck will be your DB, so cache recent pages so that you reduce DB activity, perhaps in something like memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions and store each partition in a separate MySQL DB. Craigslist, for example, partitions data by city, and in some cases by section within that. In your case, you could partition by id quite simply.
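As a rough sketch of what partitioning by id can look like (the shard list and the modulo rule are just illustrative assumptions, not a production scheme):

<?php
$shards = [
    'mysql:host=db1;dbname=cms',
    'mysql:host=db2;dbname=cms',
];

// The page id deterministically picks which MySQL server holds that row.
function shardFor(int $pageId, array $shards): PDO
{
    $dsn = $shards[$pageId % count($shards)];   // simple modulo partitioning
    return new PDO($dsn, 'user', 'secret');
}

$pageId = (int) ($_GET['page_id'] ?? 0);
$stmt   = shardFor($pageId, $shards)->prepare('SELECT content FROM pages WHERE id = ?');
$stmt->execute([$pageId]);
$page = $stmt->fetch(PDO::FETCH_ASSOC);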
Manage PHP sessions
Putting nginx in front of a PHP website will not work out of the box if you use sessions. Load balancing PHP does have issues, as sessions are persisted on local storage by default. Therefore you need to do session management explicitly. The traditional solution is to store session data in memcached, keyed by the session cookie.
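One common way to do that, assuming the pecl memcached extension is installed (the host below is a placeholder), is to point PHP's own session handler at memcached so every node behind the balancer sees the same sessions:

<?php
// Store sessions in memcached instead of local files.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', '127.0.0.1:11211');   // placeholder memcached host

session_start();
$_SESSION['user_id'] = 123;   // readable from any web node pointed at the same memcached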
Don't optimize prematurely.
Focus on getting your application out there so that the next order of magnitude of users gets the optimal experience.
Note: Your main potential pain points are discussed here on SO
No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes, it is relatively easy to scale read workloads, because you can simply perform reads against read-only database replicas. The challenge is to scale write workloads.
A lot of sites have few writes, even if they're really busy.
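A bare-bones sketch of that read/write split in PHP (the hostnames are placeholders; in practice you'd also have to account for replication lag):

<?php
$primary = new PDO('mysql:host=db-primary;dbname=app', 'user', 'secret');
$replica = new PDO('mysql:host=db-replica;dbname=app', 'user', 'secret');

// Reads go to a read-only replica (and can fan out across several of them)...
$posts = $replica->query('SELECT id, title FROM posts ORDER BY id DESC LIMIT 10')
                 ->fetchAll(PDO::FETCH_ASSOC);

// ...while every write goes to the primary and replicates out from there.
$stmt = $primary->prepare('INSERT INTO posts (title, body) VALUES (?, ?)');
$stmt->execute(['Hello', 'World']);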
The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a given user session only to a particular server, hence you don't have to worry about sessions and where they are stored at all. What you do have to worry about is how to share the filesystem if the two servers are running on two different machines, especially if you make heavy use of the filesystem. Hope the article above helps...
