I wish to test, like many others I'm sure, "how many simultaneous requests can my web server handle".
By using tools like ab or siege and hitting your Apache web server / MySQL database / PHP script with queries that represent real-life usage, how representative are the results you get back compared to real-life usage by actual users?
For instance, when testing with a utility, all the traffic comes from a single IP, while actual usage comes from many different IP addresses. Does this make a world of difference?
If ab says my web server can handle 1000 requests per second, is this directly transferable to saying that the web server would handle 1000 requests per second from actual users?
I know this is a fluffy area, so the more concrete and direct replies I can get, the better. The old "it depends" won't help much :)
Sorry, but "it depends" is the best answer here.
Firstly, the most valuable tool in answering this question is not ab or siege or JMeter (my favourite open source tool), it's a spreadsheet.
The number of requests your system can handle is determined by which bottleneck you hit first. Some of those bottlenecks will be hardware/infrastructure (bandwidth, CPU, the effectiveness of your load balancing scheme), some will be "off the shelf" software and the way it's configured (Apache's ability to serve static files, for instance), and some will be your own software (how efficiently your PHP scripts and database queries run). Some of the bottleneck resources may not be under your control - most sites hosted in Europe or the US are slow when accessed from China, for instance.
I've used a spreadsheet to model user journeys - this depends entirely on your particular case, but a user journey might be:
visit homepage
click "register/log in" link
register as new user
click "verify" link from email
access restricted content
Most sites support many user journeys - and at any one time, the mixture between those user journeys is likely to vary significantly.
For each user journey, I then assess the nature of the visitor requests - "visit homepage", for instance, might be "download 20 static files and 1 PHP script", while "register as new user" might require "1 PHP script", but with a fairly complex set of database scripts.
This process ends up as a set of rows in the spreadsheet showing the number of requests per type. For precision, it may be necessary to treat each dynamic page (PHP script) as its own request, but I usually lump all the static assets together.
That gives you a baseline to test, based on a whole bunch of assumptions. You can now create load testing scripts representing "20 percent new users, 50 percent returning users, 10 percent homepage only, 20 percent complete purchase route, 20 percent abandon basket" or whatever user journeys you come up with.
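To make the arithmetic concrete, here is a rough PHP sketch of that spreadsheet model. The journey names, per-journey request counts, mix percentages and visitor rate are all invented for illustration - substitute your own measurements:

    <?php
    // Rough sketch of the spreadsheet model: requests generated per user journey.
    // All figures below are invented examples - substitute your own measurements.
    $journeys = [
        // name => [requests generated by the journey, share of visitors]
        'homepage only'     => ['requests' => 21, 'share' => 0.10],
        'new user register' => ['requests' => 35, 'share' => 0.20],
        'returning user'    => ['requests' => 28, 'share' => 0.50],
        'complete purchase' => ['requests' => 60, 'share' => 0.20],
    ];

    $visitorsPerHour = 3600; // assumed peak traffic

    $requestsPerHour = 0;
    foreach ($journeys as $name => $journey) {
        $requestsPerHour += $visitorsPerHour * $journey['share'] * $journey['requests'];
    }

    printf("Estimated load: %.0f requests/hour (~%.1f requests/second)\n",
        $requestsPerHour, $requestsPerHour / 3600);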
Create a load testing script including the journeys and run it, ideally from multiple locations (there are several cheap ways to run JMeter from cloud providers). Measure response times, and see where the response time of your slowest request exceeds your quality threshold (I usually recommend 3 seconds) in more than 10% of cases.
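If you export the response times to a CSV file, the "3 seconds in more than 10% of cases" check is just a 90th-percentile calculation. A small PHP sketch, assuming a file with one response time in seconds per line (the filename and threshold are placeholders):

    <?php
    // Sketch: fail the test if the 90th percentile response time exceeds 3 seconds,
    // i.e. if more than 10% of requests are slower than the quality threshold.
    // Assumes results.csv contains one response time (in seconds) per line.
    $times = array_map('floatval', file('results.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    sort($times);

    $index = (int) ceil(0.90 * count($times)) - 1; // simple nearest-rank percentile
    $p90   = $times[$index];

    echo $p90 > 3.0
        ? "FAIL: 90th percentile is {$p90}s - more than 10% of requests exceed 3s\n"
        : "OK: 90th percentile is {$p90}s\n";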
Try varying the split between user journeys - an advertising campaign might drive a lot of new registrations, for instance. I'd usually recommend at least 3 or 4 different mixtures.
If any of the variations in user journeys gives results that are significantly below the average (15% or more), that's probably your worst case scenario.
Otherwise, average the results, and you will know, with a reasonable degree of certainty, that this is the minimum number of requests you can support. The more variations in user journey you can test, the more certain it is that the number is accurate. By "minimum", I mean that you can be reasonably sure that you can manage at least this many users. It does not mean you can handle at most this many users - a subtle difference, but an important one!
In most web applications, the bottleneck is the dynamic page generation - there's relatively little point testing Apache's ability to serve static files, or your hosting provider's bandwidth. It's good as a "have we forgotten anything" test, but you'll get far more value out of testing your PHP scripts.
Before you even do this, I'd recommend playing "hunt the bottleneck" with just the PHP files - the process I've outlined above doesn't tell you where the bottleneck is, only that there is one. As it's most likely to be the PHP (and of course all the stuff you do from PHP, like calling a database), instrumenting the solution to test for performance is usually a good idea.
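Instrumenting can be as simple as timing the suspect sections with microtime() and logging anything slow. A minimal sketch - the 0.5 second threshold and the commented-out query are placeholders:

    <?php
    // Minimal instrumentation sketch: time a suspect section and log slow cases.
    function timed(string $label, callable $fn)
    {
        $start   = microtime(true);
        $result  = $fn();
        $elapsed = microtime(true) - $start;

        if ($elapsed > 0.5) { // placeholder threshold - tune it to your own pages
            error_log(sprintf('[slow] %s took %.3fs', $label, $elapsed));
        }
        return $result;
    }

    // Hypothetical usage around a database call:
    // $rows = timed('load product list', function () use ($db) {
    //     return $db->query('SELECT ...')->fetchAll();
    // });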
You should also use a tool like YSlow to make sure your HTTP/HTML setup is optimized - setting cache headers for your static assets will have a big impact on your bandwidth bill, and may help with performance as perceived by the end user.
The short answer is no, probably not.
ab and friends, when run from the local machine, are not subject to network lag/bandwidth chokes.
Plus every real-life request requires different levels of processing - DB access/load, file includes etc etc.
Plus none of this takes into account the server load from other running background processes.
To get near-real results, I suggest you analyse typical user behaviour, create a siege URLs file with the URLs users actually visit, and run it with random delays. These results can't be transferred directly to the production environment, but they're the closest you can get on your own. You can also try web services that test web app performance, but they are usually paid if you need a complex test.
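For example (the hostname, paths and numbers below are made up), a siege URLs file is just one URL per line, and the -d flag adds a random delay between requests:

    # urls.txt - one URL per line, taken from your access logs / typical journeys
    http://www.example.com/
    http://www.example.com/login
    http://www.example.com/products
    http://www.example.com/checkout

    # 25 concurrent users, a random 1-5 second delay between requests,
    # URLs picked from the file at random (-i), for 10 minutes:
    siege -c 25 -d 5 -i -f urls.txt -t 10M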
But saying "it depends" doesn't help much, doesn't mean that the only valid answer isn't "it depends". Because it sort-of is.
Fact: Testing is not real-life usage.
Fact: Testing can come really close to real-life usage.
Problem: how do you know if it does?
It depends on what you do with the requests.
Your single IP won't be a problem for many applications, so that would not be the first thing I'd worry about. But it could be: if you do complicated statistics once for every IP (saving some information in a table you didn't design very well, for instance), you'll only do this once in testing, so you'll have a bad time when the real users come along with their annoyingly different IPs.
It depends on your test-system.
If all your requests come from a slow line (maybe it is slow because you are doing all these requests), you won't get a serious test. Basically, if you expect the incoming traffic to be more than your test system's connection can handle... you get the drift. The same will be true for CPU usage and the like.
It depends on how good your tests are.
If your requests are, for instance, hitting all pages, but your users only hit one specific page, you will obviously get different results. The same is true of frequency. If you hit the pages in an order that takes full advantage of things like caches (the query cache is a tricky one here, but also layers like memcached, Varnish, etc.), again, you will have a bad time. The simplest thing you can look at is the delay you can set on a siege test, but there are loads of other things you might want to take into account.
Writing good tests is hard, and the better your tests are, the closer you can get. But you need to know your system, know your users and know your tests. There really isn't much more to say than "it depends".
I'm creating a simple browser game with online transactions, but I'm wondering: "How can I guarantee that my site won't go down with too many players accessing it?"
I'm asking because I'll pay digital influencers to do the marketing, so I suppose many people will access it...
Should I get a VPS and run the backend with Node.js, or will pure PHP do a good job of keeping the site up?
Site stability has a lot of different factors. Two main points to consider:
If your site is static HTML and JS files, using a CDN like Cloudflare will provide very strong protection against the site ever going down.
Assuming there's a heavier lift than static files (like DB calls and server-side processing), this ultimately comes down to two factors:
The specs of your server (e.g. ram, CPUs)
The efficiency of your code
Books can be written about how hardware and code can be improved. Ultimately releasing it in the wild will show you how they handle the load. Great monitoring software (like AppOptics) can give you insights into when you're getting close to any limits and need to upgrade hardware or optimize code.
Practically speaking, if you're not expecting a giant load on day one (which, unless you have a fantastic marketing channel or a lot of followers, you likely won't have), you should be more concerned with building something of value than optimizing it. Optimizing comes later.
I have a website that has been hacked once to have its database stolen. I think it was done by an automated process that simply accessed the visible website using a series of searches, in the style of 'give me all things beginning with AA', then 'with AB', then 'with AC' and so on. The reality is a little more complicated than this, but that illustrates the principle of the attack. I found the thief and am now taking steps against them, but I want to prevent more attacks like this in the future.
I thought there must be some ready-made PHP scripts out there (PHP is what I use). Something that, for instance, recorded the IP addresses of the last (say) 50 visitors and tracked the frequency of their requests over the last (say) 5 minutes. It would ban them for (say) 24 hours if they exceeded a certain threshold of requests. However, to my amazement I can find no such class, library or example of code intended for this purpose anywhere online.
Am I missing a trick, or is there a solution here - like the one I imagine, or maybe an even simpler and more effective safeguard?
Thanks.
There are no silver bullets. If you are trying to brainstorm possible workarounds and solutions, there are none that are particularly easy, but here are some things to consider:
Most screen scrapers will be using curl to do their dirty work. There is some discussion such as here on SO about whether trying to block based on User-Agent (or lack thereof) is a good way to prevent screen scrapes. Ultimately, if it helps at all it is probably a good idea (and Google does it to prevent websites from screen scraping them). Because User-Agent spoofing is possible this measure can be overcome fairly easily.
Log user requests. If you notice an outlier that is far beyond your average number of user requests (it's up to you to determine what is unacceptable), you can serve them an HTTP 500 error until they drop back into an acceptable range (a rough sketch of this idea follows these points).
Check the number of broken links attempted. If a request to a broken link is served, add it to a log. A few of these are fine, but it should be pretty easy to spot someone who is fishing for data - if they are looking for AA, AB, AC, and so on. When that occurs, start to serve HTTP 500 errors for all of your pages for a set amount of time. You can do this by serving all of your page requests through a front controller, or by creating a custom 404 "file not found" page and redirecting requests there. The 404 page can log them for you.
Alert yourself when there is a sudden change in statistics. This is not to shut anyone down; it is just a prompt to investigate. The last thing you want to do is shut someone down by accident, because to them it will just look like the website is down. If you set up a script to e-mail you when there has been a sudden change in usage patterns, before you shut anyone down, it can help you adjust your decision making appropriately.
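To make that less abstract, here is a very rough PHP sketch of the logging-and-threshold idea from the points above, using APCu as the counter store. The window, threshold, ban length and key names are arbitrary examples; a database or memcached would work just as well:

    <?php
    // Rough per-IP throttling sketch using APCu. All numbers are example values.
    $ip        = $_SERVER['REMOTE_ADDR'];
    $window    = 300;   // count requests over the last 5 minutes
    $threshold = 200;   // "unacceptable" number of requests in that window
    $banTime   = 86400; // ban for 24 hours

    // Already banned? Serve an error and stop.
    if (apcu_exists("ban:$ip")) {
        http_response_code(500); // or 429 Too Many Requests
        exit('Service temporarily unavailable.');
    }

    // Create the counter with a 5-minute lifetime if it doesn't exist, then bump it.
    apcu_add("hits:$ip", 0, $window);
    $hits = apcu_inc("hits:$ip");

    if ($hits > $threshold) {
        apcu_store("ban:$ip", true, $banTime);
        http_response_code(500);
        exit('Service temporarily unavailable.');
    }

    // ...normal page handling continues here...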
These are all fairly broad concepts, and there are plenty of other solutions or tweaks on them that can work. To do this successfully you will need to monitor your own traffic patterns in order to determine safe thresholds. Crafting such a solution well is not a small undertaking.
A Caveat
This is important: security is always going to be counterbalanced by usability. If you do it right you won't be sacrificing too much security, and your users will never run into these issues. Extensive testing is important, and because of the nature of websites and how costly downtime is, perform extensive testing whenever you introduce a new security measure, before bringing it live. Otherwise, you will have a group of very unhappy people to deal with and a potential en masse loss of users. In the end, screen scraping is probably a better thing to deal with than angry users.
Another caveat
This could interfere with SEO for your web page, as search engines like Google employ screen scraping to keep records up to date. Again, the note on balance applies. I am sure there is a fix here that can be figured out but it would stray too far from the original question to look into it.
If you're using Apache, I'd look into mod_evasive:
http://www.zdziarski.com/blog/?page_id=442
mod_evasive is an evasive maneuvers module for Apache to provide evasive action in the event of an HTTP DoS or DDoS attack or brute force attack. It is also designed to be a detection and network management tool, and can be easily configured to talk to ipchains, firewalls, routers, and etcetera. mod_evasive presently reports abuses via email and syslog facilities.
...
"Detection is performed by creating an internal dynamic hash table of IP Addresses and URIs, and denying any single IP address from any of the following:
Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)"
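A typical configuration looks something like the sketch below. The directive names come from the module's documentation; the values are only illustrative, and the module filename in the IfModule line varies with your Apache version:

    <IfModule mod_evasive20.c>
        # The same page requested more than 5 times within a 1-second interval
        DOSPageCount        5
        DOSPageInterval     1
        # More than 50 requests to the whole site within a 1-second interval
        DOSSiteCount        50
        DOSSiteInterval     1
        # Block the offending IP for 60 seconds and send a notification
        DOSBlockingPeriod   60
        DOSEmailNotify      you@example.com
        DOSLogDir           /var/log/mod_evasive
        DOSHashTableSize    3097
    </IfModule>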
PHP - Apache with CodeIgniter
JS - typical, with jQuery and an in-house lib
The Problem: Determining (without forcing a download) a user's PC capability and/or virus issues
The Why: We put out software that is mostly used in clinics, but it can be used from home. However, we need to know, before they go to our main site, whether their PC can handle the demands of our web-based, browser-served software.
Progress: So far, we've come up with a decent way to test download speed, but that's about it.
What we've done: In PHP we create roughly a 2.5Gb array of data to send to the user in a view; from there the view calculates the time it took to get the data and then subtracts the PHP benchmark from this time in order to get a point of reference for upload/download time. This is not enough.
Some of our (local) users have been found to have "crappy" PCs or to be virus-infected, and this can lead to 2 problems. (1) They crash in the middle of performing a task in our program, or (2) their viruses could be trying to inject into our JS, creating a bad experience that may make us look bad to the average user (uneducated on how this stuff works), thus hurting "our" integrity.
I've done some googling around, but most plug-ins or advice forums/blogs I've found simply give ways to benchmark the speed of your JS, and that is simply not enough. I need a simple bit of code (with no visual interface included - another problem I found with one nice JS lib that did this, but it would take days to remove all of the author's personal visual code) that will allow me to test the following 3 things:
The user's data transfer rate (I think we have this covered, but if a better method is presented I won't rule it out)
The user's processing speed - how fast the computer is in general
A possible test for infection by malware, adware, or whatever else may be harmful to the user's experience
What we are not looking to do: repair their PC! We don't care if they have problems; we just don't want to lead them into our site if they have too many problems. If they can't do it from home, then they will be recommended to go to their nearest local office to use this software "in house", so to speak.
Further Explanation
We know you can't test the user-side stuff with PHP - we're not that stupid. PHP is mentioned because it can still be useful, either in determining connection speed or in delivering a script that may do what we want. Also, this is not software for just anyone on the net to sign up for and use; if you find it online, unless you are affiliated with a specific clinic and have a login name and so on, you are not meant to use the site, and if you get in otherwise, it's illegal. I can't really reveal a whole lot of information yet, as the site is not live yet. What I can say is that it is mostly used by clinics/offices for customers to perform a certain task. If they don't have the time/transport/or otherwise and need to do it from home, then the option is available. However, if their home PC is not "up to snuff", it will be nothing but a problem for them, and the 2-hour task they are meant to perform will become a 4-6 hour nightmare. That's the reason I'm at one of my favourite question sites asking if anyone has had experience with this before and may know a good way to test the user's PC, so they can have the best possible resolution: either do it from home (as their PC is suitable) or be told they need to go to their local office. Hopefully this clears things up enough that we can refrain from the "sillier" answers. I need a REAL, viable solution and/or suggestions, please.
PHP has (virtually) no access to information about the client's computer. Data transfer can just as easily be limited by network speed as computer speed. Though if you don't care which is the limiter, it might work.
JavaScript can reliably check how quickly a set of operations are run, and send them back to the server... but that's about it. It has no access to the file system, for security reasons.
EDIT: Okay, with that revision, I think I can offer a real suggestion - basically, compromise. You are not going to be able to gather enough information to absolutely guarantee one way or another that the user's computer and connection are adequate, but you can get a general idea.
As someone suggested, use a 10MB-20MB file and several smaller ones to test actual transfer rate; this will give you a reasonable estimate. Then, use JavaScript to test their system speed. But don't just stick with one test, because that can be heavily dependent on browser. Do the research on what tests will best give an accurate representation of capability across browsers; things like looping over arrays, manipulating (invisible) elements, and complex math. If there is a significant discrepancy between browsers, then use different thresholds; PHP does know what browser they're using, so you can give the system different "good enough" ratings depending on that. Limiting by version (like, completely rejecting IE6) may help in that.
Finally... inform the user. Gently. First let them know, "Hey, this is going to run a test to see if your network connection and computer are fast enough to use our system." And if it fails, tell them which part, and give them a warning. "Hey, this really isn't as fast as we recommend. You really ought to go down to the local clinic to perform this task; if you choose to proceed, it may take a lot longer than intended." Hopefully, at that point, the user will realize that any issues are on them, not on you.
What you've heard is correct: there's no way to effectively benchmark a machine from JavaScript - especially because the JavaScript engine mostly depends on the actual browser the user is using, among numerous other variables - and there are no file system permissions, etc. A computer is hardly going to let a browser's sub-process stress it anyway; the browser would simply crash first. PHP is obviously out, as it's server-side.
Sites like System Requirements Lab have the user download a Java applet to run in its own scope.
I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look for about 17,000 values on the browse function of blogger and (anonymously) save certain ones if they fit the right criteria.
I've been running the spider (written in PHP) and it works fine, but I don't want to have my IP blacklisted or anything like that. Does anyone have any knowledge on enterprise sites and the restrictions they have on things like this?
Furthermore, if there are restrictions in place, is there anything I can do to circumvent them? At the moment all I can think of to help slightly is adding a random delay between calls to the site (between 0 and 5 seconds) or running the script through random proxies to disguise the requests.
Having to resort to methods like those above makes me feel as if I'm doing the wrong thing. I would be annoyed if they were to block me, for whatever reason, because blogger.com is owned by Google and their main product is a web spider. Admittedly, their spider does not send its requests to just one website.
It's likely they have some kind of restriction, and yes there are ways to circumvent them (bot farms and using random proxies for example) but it is likely that none of them would be exactly legal, nor very feasible technically :)
If you are accessing blogger, can't you log in using an API key and query the data directly, anyway? It would be more reliable and less trouble-prone than scraping their page, which may be prohibited anyway, and lead to trouble once the number of requests is big enough that they start to care. Google is very generous with the amount of traffic they allow per API key.
If all else fails, why not write an E-Mail to them. Google have a reputation of being friendly towards academic projects and they might well grant you more traffic if needed.
Since you are writing a spider, make sure it reads the robots.txt file and acts accordingly. Also, one of the rules of HTTP is not to make more than 2 concurrent requests to the same server. Don't worry, Google's servers are really powerful. If you only read pages one at a time, they probably won't even notice. If you inject a 1-second interval, it will be completely harmless.
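A minimal sketch of what "one page at a time, with a delay" looks like in PHP. The URL list is a placeholder and the robots.txt check is deliberately naive - a proper robots.txt parser would be better:

    <?php
    // Polite fetching sketch: sequential requests, a fixed delay between them,
    // and a deliberately naive robots.txt check. The URL list is a placeholder.
    $base   = 'https://www.blogger.com';
    $robots = @file_get_contents($base . '/robots.txt'); // string, or false on failure

    function allowedByRobots($robots, string $path): bool
    {
        if ($robots === false) {
            return true; // could not read robots.txt - naive assumption
        }
        // Very naive: looks only at Disallow lines and ignores User-agent groups.
        foreach (preg_split('/\r?\n/', $robots) as $line) {
            if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)
                && strpos($path, $m[1]) === 0) {
                return false;
            }
        }
        return true;
    }

    $paths = ['/page-one', '/page-two']; // placeholder list of pages to visit

    foreach ($paths as $path) {
        if (!allowedByRobots($robots, $path)) {
            continue;
        }
        $html = file_get_contents($base . $path);
        // ...extract and (anonymously) save the values you care about...
        sleep(1); // one request at a time, at least a second apart
    }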
On the other hand, using a botnet or another distributed approach is considered harmful behavior, because it looks like a DDoS attack. You really shouldn't be thinking in that direction.
If you want to know for sure, write an e-mail to blogger.com and ask them.
You could request it through Tor; you would have a different IP each time, at a performance cost.
I'm in the process of developing my first major project. It's a light-weight content management system.
I have developed all of my own framework for the project. I'm sure that will attract many flames, and a few 'tut-tut's, but it seems to be doing quite well so far.
I'm seeing page generation times of anywhere from 5-15 milliseconds. (An example, just in case my numbers are wrong, is 0.00997686386108 seconds).
I want to make sure that the application is as efficient as possible. While it looks good in my testing environment, I want to be sure that it will perform as well as possible in the real world.
Should I be concerned about these numbers - and thus, take the time to fine tune MySQL and my interaction with it?
Edit: Additionally, are there some tools or methods that people can recommend for saturating a system, and reporting the results?
Additional Info: My 'testing' system is a spare web hosting account that I have over at BlueHost. Thus, I would imagine that any performance I see (positive or negative) would be roughly indicative of what I would see in the 'real world'.
Performing well in your testing environment is a good start, but there are other issues you'll need to think about as well (if you haven't already). Here are a couple I can think of off the top of my head:
How does your app perform as data sizes increase? Usually a test environment has very little data. With lots of data, things like poorly optimized queries, missing indexes, etc. start to cause issues where they didn't before. Performance can start to degrade exponentially with respect to data size if things are not designed well.
How does your app perform under load? Sometimes apps perform great with one or two users, but resource contention or concurrency issues start to pop up when lots of users get involved.
You're doing very well at 5-15 ms. You're not going to know how it performs under load by any method other than throwing load at it, though.
As mentioned in another question: what I often find missing is the fact that most websites could increase their speed enormously by optimizing their frontend, not their backend. Have a look at this superb list about speeding up your frontend from yahoo.com:
Minimize HTTP Requests
Use a Content Delivery Network
Add an Expires or a Cache-Control Header
Gzip Components
Put Stylesheets at the Top
Put Scripts at the Bottom
Avoid CSS Expressions
Make JavaScript and CSS External
Reduce DNS Lookups
Minify JavaScript and CSS
Avoid Redirects
Remove Duplicate Scripts
Configure ETags
Make Ajax Cacheable
Flush the Buffer Early
Use GET for AJAX Requests
Post-load Components
Preload Components
Reduce the Number of DOM Elements
Split Components Across Domains
Minimize the Number of iframes
No 404s
Reduce Cookie Size
Use Cookie-free Domains for Components
Minimize DOM Access
Develop Smart Event Handlers
Choose <link> over @import
Avoid Filters
Optimize Images
Optimize CSS Sprites
Don't Scale Images in HTML
Make favicon.ico Small and Cacheable
Keep Components under 25K
Pack Components into a Multipart Document
5-15 milliseconds is totally acceptable as a page generation time. But what matters most is how well your system performs with many people accessing your content at the same time. So you need to test your system under a heavy load, and see how well it scales.
About tuning: setting up a clever cache policy is often more effective than tuning MySQL, especially when your database and your HTTP server are on different machines. There are very good Qs and As about caching on StackOverflow if you need advice on that topic (I like that one, maybe because I wrote it :)
It depends on a few factors. The most important is how much traffic you're expecting the site to get.
If your site is going to be fairly low traffic (maybe 1,000,000 page views per day - an average of around 11 per second), it should be fine. You'll want to test this - use an HTTP benchmarking tool to run lots of requests in parallel, and see what kind of results you get.
Remember that the more parallel requests you're handling, the longer each request will take. The important numbers are how many parallel requests you can handle before the average time becomes unacceptable, and the rate at which you can handle requests.
Taking that 1,000,000 views per day example: you want to be able to handle far more than 11 requests per second, because traffic is never spread evenly across the day. Likely at least 20 requests per second, and at least 10 parallel requests.
You also want to test this with a representative dataset. There's no point benchmarking a CMS with one page, if you're expecting to have 100. Take your best estimate, double it, and test with a data set at least that large.
As long as you're not doing something stupid in your code, the single biggest improvement you can make is caching. If you make sure to set up appropriate caching headers, you can stick a reverse proxy (such as Squid) in front of your webserver. Squid will serve anything that's in its cache directly, leaving your PHP application to handle only unique or updated page requests.
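Setting those headers from PHP is straightforward. A small sketch, assuming a page that can safely be shared by a cache for five minutes (the lifetime is an arbitrary example value):

    <?php
    // Sketch: mark a dynamically generated page as cacheable by a shared cache
    // (such as Squid) for five minutes. The lifetime is an example value.
    $maxAge = 300;

    header('Cache-Control: public, max-age=' . $maxAge);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s') . ' GMT');

    // For per-user pages, do the opposite so the proxy never shares them:
    // header('Cache-Control: private, no-cache');

    // ...then generate and echo the page as usual...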