Related
I'm currently working on a social web application using python/django. Recently I heard about the PHP's weakness on large scale projects, and how hippo-php helped Facebook to overcome this barrier. Considering a python social web application with lot of utilization, could you please tell me if a similar custom tool could help this python application? In what way? I mean which portion (or layer) of application need to be written for example in c++? I know that it's a general question but someone with relevant experience I think that could help me.
Thank you in advance.
The portion to rewrite in C++ is the portion that is too slow in Python. You need to figure out where your bottleneck is, which you can do by load testing or just waiting until users complain.
Of course, even rewriting in C++ might not help. Your bottleneck might be the database (move to a separate, faster DB server or use sharding) or disk, or memory, or anything. Find bottleneck, work out how to eliminate bottleneck, implement. With 'test' inbetween all those phases. General advice.
There's normally no magic bullet, and I imagine Facebook did a LOT of testing and analysis of their bottlenecks before they tried anything.
Don't try to scale too early! Of course you can try to be prepared but most times you can not really know where you need to scale and therefore spend a lot of time and money in wrong direction before you recognize it.
Start your webapp and see how it goes (agreeing with Spacedman here).
Though from my experience the language of your web app is less likely going to be the bottleneck. Most of the time it starts with the database. Many times it simply a wrong line of code (be it just a for loop) and many other times its something like forgetting to use sth. like memcached or task management. As said, find out where it is. In most cases its better to check something else before blaming the language speed for it (since its most likely not the problem!).
You can think about PostgreSQL as Oracle, so from what I've found on the internet (because I am also a beginner) here is the order of DBs from smaller projects, to biggest:
SQLite
MySql
PostgreSQL
Oracle
Aside from scalability issues, has anyone here actually stumbled upon a web-development problem where PHP just didn't cut it and had to go for another language / platform?
I am interested in particular scenarios and the ways they have been handled.
Thanks.
I've been working as a PHP developper (sometimes on projects not too small) for more than 3 years, and I have never really met anything that PHP wouldn't allow me to do.
Of course, you sometimes/always have to use several servers, some other piece of software (database, reverse proxy, cache, ...) ; but that's part of the game ;-)
Actually, the best thing about PHP is its "glue" nature : what PHP does is allow you to glue stuff together, to build your application using different components.
And PHP does that really well.
Sometimes, you'd have to programm in C to code a PHP extension, to glue to something no-one else has ever used (there are a lot of PHP extension which do that already like mysql or curl, to say only two names) ; but there are so many extensions that already exist that I've never had to do that -- even if I'll probabaly do so one day or another, just for fun ;-)
An important thing to note is that there is probably always a solution to your problems :
You're speaking about scalability ; what about caching ? using several servers ? using a reverse proxy ? PHP has no problem with that.
And, as you can see on SO (and in so many places) : PHP has a great community !
If I had to think about one thing that PHP is not well suited for, I'd say "comet" : PHP model of one process per request is not good for long polling and the like...
PHP is not quite good for long running batches, too ; and you often have some of those alongside your web application ; and using the same language allows you to reuse code -- still, I've always found a (not too difficult) solution.
Oh, and I'd also say : PHP is great for web applications... But not that great when is comes to desktop applications -- even if it's possible (see PHP-GTK for example).
The only limit I've ever reached as a programmer was that of my own abilities. I've worked on sites that did a few hundred hits a day and maintained software for a network of sites doing over 1M uniques/day total and both ran on the same piece of the software. The pressure wasn't on PHP, the pressure was on me to make PHP, the servers and the databases all work together to make use of them properly and to do things in a scalable way.
Query optimization
Database optimization
HTTP Caching
Memcache/APC
Server optimization
Master/Slave database setup
Profiling
Choosing just the right amount of DB normalization
Proper logging (only logging what's useful and not so much logging that it becomes useless)
Attention to detail
Proper testing
Code organization
Version control with good branching/tagging
File servers vs web servers vs database servers
Reverse proxies
General security (preventing SQL Injection, preventing XSS attacks, Session hijacking, etc)
All things learned along the way to making myself a better programmer. There's very few languages that can't run any site on the internet without the proper hardware behind it. Much of your job as a programmer is finding out the best way to take advantage of that hardware.
There are things however that PHP isn't all that good at by it's nature. Such things would include:
Desktop software (although possible)
Daemons (again, possible)
Large scale string manipulation (such as scraping sites) in a timely manner
PHP is Turing-complete, so it technically doesn't have any limits than any other language doesn't have. However, there are things that I find easier to do in other languages.
I am new in the website scalability realm. Can you suggest to me some the techniques for making a website scalable to a large number of users?
Test your website under heavy load.
Monitor all statistics
Find bottleneck
Fix bottleneck
Go back to 1
good luck
If you expect your site to scale beyond the capabilities of a single server you will need to plan carefully. Design so the following will be possible:-
Make it so your database can be on a separate server. This isn't normally too hard.
Ensure all your static content can be moved to a CDN, as this will normally pull a lot of load off your servers.
Be prepared to spend a lot of money on hardware. More RAM and faster disks help a LOT.
It gets a lot harder when you need to split either the database or the php from a single server to multiple servers, so optimise everything, from your code, your database schema, your server config and anything else you can think of to put this final step off for as long as possible.
Other than that, all you can do is stress test your site, figure out where the bottlenecks are and try and design them away.
Check out this talk by Rasmus Lerdorf (creator of PHP)
Specially Page 8 and beyond.
You might want to look at this resource- highscalability.com.
A number of people have mentioned tools for identifying bottlenecks, and that is of course necessary. You can't spend productive time speeding something up without knowing where it's slow. But the other thing you need to know is where your target scalability lies. Is it value for money to spend a couple of months making your site scale to the same number of users as Twitter if it's going to be used by three people in HR? Do you have a known rate of transactions, or response latency, or number of users, in the requirements of the product? If so, target those numbers with your optimisation strategy. If not, find those out before chasing the performance rat down the hole.
Very similar: How Is PHP Done the Right Way?
Scalability is no small subject and certainly more material than can be reasonably covered in a single question.
For instance, with some kinds of applications, joins (in SQL) don't scale, which brings up all sorts of caching and sharding strategies.
Beanstalk is another scalability and performance tool in high-performance PHP sites. As is memcache (different kind).
The biggest problem for scalability is usually shared resources like DBMS's. The problem arises because DBMS's usually have no way to relax consistency guarantees.
If you want to increase scalability when you use something like MySQL you have to change your schema design to relax consistency.
For instance, you can separate your database schema to have your normalized data model for writes, and a replicated read only denormalized part for the 90% of read operations. The read only data can be spread over several servers.
Another way to increase scalability of a database is to partition the data, e.g. separate the data into a database for every department and aggregate them either in the ORM or in the DBMS.
In order of importance:
If you run PHP, use an opcode cache like APC. (This is important enough to be built-in in the next generation of PHP.)
Use YSlow or Google Page Speed to identify bottlenecks. (This will reveal structural problems with your website that affect both client and server performance.)
Ensure that your web server sends a proper Expires header for static content (images, Javascript, CSS), such that the browser can cache it properly. (YSlow will warn you about this, too.)
Use an HTTP accelerator, such as Varnish. (This picture says it all – and they already had an HTTP accelerator in place.)
Develop your site using solid OOP techniques. You will need your site to be modular as not all performance bottlenecks are obvious at the start. Be ready to refactor parts of your site as traffic increases. The first sentence I wrote will help you do it more easily and safely. Also, use test driven development, As refactor means new introduced bugs, and good TDD is good in catching them before they go into production.
Separate as much as possible client side code from server side code, as they will likely to be served from different servers, if your site traffic justify this.
Read articles (read the YSlow tips for instance).
GL
In addition to the other suggestions, look into splitting your sites into tiers, as in multitier architecture. If done right, you can then use one server per tier.
A site I built with Kohana was slammed with an enormous amount of traffic yesterday, causing me to take a step back and evaluate some of the design. I'm curious what are some standard techniques for optimizing Kohana-based applications?
I'm interested in benchmarking as well. Do I need to setup Benchmark::start() and Benchmark::stop() for each controller-method in order to see execution times for all pages, or am I able to apply benchmarking globally and quickly?
I will be using the Cache-library more in time to come, but I am open to more suggestions as I'm sure there's a lot I can do that I'm simply not aware of at the moment.
What I will say in this answer is not specific to Kohana, and can probably apply to lots of PHP projects.
Here are some points that come to my mind when talking about performance, scalability, PHP, ...
I've used many of those ideas while working on several projects -- and they helped; so they could probably help here too.
First of all, when it comes to performances, there are many aspects/questions that are to consider:
configuration of the server (both Apache, PHP, MySQL, other possible daemons, and system); you might get more help about that on ServerFault, I suppose,
PHP code,
Database queries,
Using or not your webserver?
Can you use any kind of caching mechanism? Or do you need always more that up to date data on the website?
Using a reverse proxy
The first thing that could be really useful is using a reverse proxy, like varnish, in front of your webserver: let it cache as many things as possible, so only requests that really need PHP/MySQL calculations (and, of course, some other requests, when they are not in the cache of the proxy) make it to Apache/PHP/MySQL.
First of all, your CSS/Javascript/Images -- well, everything that is static -- probably don't need to be always served by Apache
So, you can have the reverse proxy cache all those.
Serving those static files is no big deal for Apache, but the less it has to work for those, the more it will be able to do with PHP.
Remember: Apache can only server a finite, limited, number of requests at a time.
Then, have the reverse proxy serve as many PHP-pages as possible from cache: there are probably some pages that don't change that often, and could be served from cache. Instead of using some PHP-based cache, why not let another, lighter, server serve those (and fetch them from the PHP server from time to time, so they are always almost up to date)?
For instance, if you have some RSS feeds (We generally tend to forget those, when trying to optimize for performances) that are requested very often, having them in cache for a couple of minutes could save hundreds/thousands of request to Apache+PHP+MySQL!
Same for the most visited pages of your site, if they don't change for at least a couple of minutes (example: homepage?), then, no need to waste CPU re-generating them each time a user requests them.
Maybe there is a difference between pages served for anonymous users (the same page for all anonymous users) and pages served for identified users ("Hello Mr X, you have new messages", for instance)?
If so, you can probably configure the reverse proxy to cache the page that is served for anonymous users (based on a cookie, like the session cookie, typically)
It'll mean that Apache+PHP has less to deal with: only identified users -- which might be only a small part of your users.
About using a reverse-proxy as cache, for a PHP application, you can, for instance, take a look at Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
(Yep, they are using Squid, and I was talking about varnish -- that's just another possibility ^^ Varnish being more recent, but more dedicated to caching)
If you do that well enough, and manage to stop re-generating too many pages again and again, maybe you won't even have to optimize any of your code ;-)
At least, maybe not in any kind of rush... And it's always better to perform optimizations when you are not under too much presure...
As a sidenote: you are saying in the OP:
A site I built with Kohana was slammed with
an enormous amount of traffic yesterday,
This is the kind of sudden situation where a reverse-proxy can literally save the day, if your website can deal with not being up to date by the second:
install it, configure it, let it always -- every normal day -- run:
Configure it to not keep PHP pages in cache; or only for a short duration; this way, you always have up to date data displayed
And, the day you take a slashdot or digg effect:
Configure the reverse proxy to keep PHP pages in cache; or for a longer period of time; maybe your pages will not be up to date by the second, but it will allow your website to survive the digg-effect!
About that, How can I detect and survive being “Slashdotted”? might be an interesting read.
On the PHP side of things:
First of all: are you using a recent version of PHP? There are regularly improvements in speed, with new versions ;-)
For instance, take a look at Benchmark of PHP Branches 3.0 through 5.3-CVS.
Note that performances is quite a good reason to use PHP 5.3 (I've made some benchmarks (in French), and results are great)...
Another pretty good reason being, of course, that PHP 5.2 has reached its end of life, and is not maintained anymore!
Are you using any opcode cache?
I'm thinking about APC - Alternative PHP Cache, for instance (pecl, manual), which is the solution I've seen used the most -- and that is used on all servers on which I've worked.
See also: Slides APC Facebook,
Or Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
It can really lower the CPU-load of a server a lot, in some cases (I've seen CPU-load on some servers go from 80% to 40%, just by installing APC and activating it's opcode-cache functionality!)
Basically, execution of a PHP script goes in two steps:
Compilation of the PHP source-code to opcodes (kind of an equivalent of JAVA's bytecode)
Execution of those opcodes
APC keeps those in memory, so there is less work to be done each time a PHP script/file is executed: only fetch the opcodes from RAM, and execute them.
You might need to take a look at APC's configuration options, by the way
there are quite a few of those, and some can have a great impact on both speed / CPU-load / ease of use for you
For instance, disabling [apc.stat](https://php.net/manual/en/apc.configuration.php#ini.apc.stat) can be good for system-load; but it means modifications made to PHP files won't be take into account unless you flush the whole opcode-cache; about that, for more details, see for instance To stat() Or Not To stat()?
Using cache for data
As much as possible, it is better to avoid doing the same thing over and over again.
The main thing I'm thinking about is, of course, SQL Queries: many of your pages probably do the same queries, and the results of some of those is probably almost always the same... Which means lots of "useless" queries made to the database, which has to spend time serving the same data over and over again.
Of course, this is true for other stuff, like Web Services calls, fetching information from other websites, heavy calculations, ...
It might be very interesting for you to identify:
Which queries are run lots of times, always returning the same data
Which other (heavy) calculations are done lots of time, always returning the same result
And store these data/results in some kind of cache, so they are easier to get -- faster -- and you don't have to go to your SQL server for "nothing".
Great caching mechanisms are, for instance:
APC: in addition to the opcode-cache I talked about earlier, it allows you to store data in memory,
And/or memcached (see also), which is very useful if you literally have lots of data and/or are using multiple servers, as it is distributed.
of course, you can think about files; and probably many other ideas.
I'm pretty sure your framework comes with some cache-related stuff; you probably already know that, as you said "I will be using the Cache-library more in time to come" in the OP ;-)
Profiling
Now, a nice thing to do would be to use the Xdebug extension to profile your application: it often allows to find a couple of weak-spots quite easily -- at least, if there is any function that takes lots of time.
Configured properly, it will generate profiling files that can be analysed with some graphic tools, such as:
KCachegrind: my favorite, but works only on Linux/KDE
Wincachegrind for windows; it does a bit less stuff than KCacheGrind, unfortunately -- it doesn't display callgraphs, typically.
Webgrind which runs on a PHP webserver, so works anywhere -- but probably has less features.
For instance, here are a couple screenshots of KCacheGrind:
(source: pascal-martin.fr)
(source: pascal-martin.fr)
(BTW, the callgraph presented on the second screenshot is typically something neither WinCacheGrind nor Webgrind can do, if I remember correctly ^^ )
(Thanks #Mikushi for the comment) Another possibility that I haven't used much is the the xhprof extension : it also helps with profiling, can generate callgraphs -- but is lighter than Xdebug, which mean you should be able to install it on a production server.
You should be able to use it alonside XHGui, which will help for the visualisation of data.
On the SQL side of things:
Now that we've spoken a bit about PHP, note that it is more than possible that your bottleneck isn't the PHP-side of things, but the database one...
At least two or three things, here:
You should determine:
What are the most frequent queries your application is doing
Whether those are optimized (using the right indexes, mainly?), using the EXPLAIN instruction, if you are using MySQL
See also: Optimizing SELECT and Other Statements
You can, for instance, activate log_slow_queries to get a list of the requests that take "too much" time, and start your optimization by those.
whether you could cache some of these queries (see what I said earlier)
Is your MySQL well configured? I don't know much about that, but there are some configuration options that might have some impact.
Optimizing the MySQL Server might give you some interesting informations about that.
Still, the two most important things are:
Don't go to the DB if you don't need to: cache as much as you can!
When you have to go to the DB, use efficient queries: use indexes; and profile!
And what now?
If you are still reading, what else could be optimized?
Well, there is still room for improvements... A couple of architecture-oriented ideas might be:
Switch to an n-tier architecture:
Put MySQL on another server (2-tier: one for PHP; the other for MySQL)
Use several PHP servers (and load-balance the users between those)
Use another machines for static files, with a lighter webserver, like:
lighttpd
or nginx -- this one is becoming more and more popular, btw.
Use several servers for MySQL, several servers for PHP, and several reverse-proxies in front of those
Of course: install memcached daemons on whatever server has any amount of free RAM, and use them to cache as much as you can / makes sense.
Use something "more efficient" that Apache?
I hear more and more often about nginx, which is supposed to be great when it comes to PHP and high-volume websites; I've never used it myself, but you might find some interesting articles about it on the net;
for instance, PHP performance III -- Running nginx.
See also: PHP-FPM - FastCGI Process Manager, which is bundled with PHP >= 5.3.3, and does wonders with nginx.
Well, maybe some of those ideas are a bit overkill in your situation ^^
But, still... Why not study them a bit, just in case ? ;-)
And what about Kohana?
Your initial question was about optimizing an application that uses Kohana... Well, I've posted some ideas that are true for any PHP application... Which means they are true for Kohana too ;-)
(Even if not specific to it ^^)
I said: use cache; Kohana seems to support some caching stuff (You talked about it yourself, so nothing new here...)
If there is anything that can be done quickly, try it ;-)
I also said you shouldn't do anything that's not necessary; is there anything enabled by default in Kohana that you don't need?
Browsing the net, it seems there is at least something about XSS filtering; do you need that?
Still, here's a couple of links that might be useful:
Kohana General Discussion: Caching?
Community Support: Web Site Optimization: Maximum Website Performance using Kohana
Conclusion?
And, to conclude, a simple thought:
How much will it cost your company to pay you 5 days? -- considering it is a reasonable amount of time to do some great optimizations
How much will it cost your company to buy (pay for?) a second server, and its maintenance?
What if you have to scale larger?
How much will it cost to spend 10 days? more? optimizing every possible bit of your application?
And how much for a couple more servers?
I'm not saying you shouldn't optimize: you definitely should!
But go for "quick" optimizations that will get you big rewards first: using some opcode cache might help you get between 10 and 50 percent off your server's CPU-load... And it takes only a couple of minutes to set up ;-) On the other side, spending 3 days for 2 percent...
Oh, and, btw: before doing anything: put some monitoring stuff in place, so you know what improvements have been made, and how!
Without monitoring, you will have no idea of the effect of what you did... Not even if it's a real optimization or not!
For instance, you could use something like RRDtool + cacti.
And showing your boss some nice graphics with a 40% CPU-load drop is always great ;-)
Anyway, and to really conclude: have fun!
(Yes, optimizing is fun!)
(Ergh, I didn't think I would write that much... Hope at least some parts of this are useful... And I should remember this answer: might be useful some other times...)
Use XDebug and WinCacheGrind or WebCacheGrind to profile and analyze slow code execution.
(source: jokke.dk)
Profile code with XDebug.
Use a lot of caching. If your pages are relatively static, then reverse proxy might be the best way to do it.
Kohana is out of the box very very fast, except for the use of database objects. To quote Zombor "You can reduce memory usage by ensuring you are using the database result object instead of result arrays." This makes a HUGEE performance difference on a site that is being slammed. Not only does it use more memory, it slows down execution of scripts.
Also - you must use caching. I prefer memcache and use it in my models like this:
public function get($e_id)
{
$event_data = $this->cache->get('event_get_'.$e_id.Kohana::config('config.site_domain'));
if ($event_data === NULL)
{
$this->db_slave
->select('e_id,e_name')
->from('Events')
->where('e_id', $e_id);
$result = $this->db_slave->get();
$event_data = ($result->count() ==1)? $result->current() : FALSE;
$this->cache->set('event_get_'.$e_id.Kohana::config('config.site_domain'), $event_data, NULL, 300); // 5 minutes
}
return $event_data;
}
This will also dramatically increase performance. The above two techniques improved a sites performance by 80%.
If you gave some more information about where you think the bottleneck is, I'm sure we could give some better ideas.
Also check out yslow (google it) for some other performance tips.
Strictly related to Kohana (you probably already have done this, or not):
In production mode:
Enable internal caching (this will only cache the Kohana::find_file results, but this actually can help a lot.
Disable profiler
Just my 2 cents :)
I totally agree with the XDebug and caching answers. Don't look into the Kohana layer for optimization until you've identified your biggest speed and scale bottlenecks.
XDebug will tell you were you spend the most of your time and identify 'hotspots' in your code. Keep this profiling information so you can baseline and measure performance improvements.
Example problem and solution:
If you find that you're building up expensive objects from the database each time, that don't really change often, then you can look at caching them with memcached or another mechanism. All of these performance fixes take time and add complexity to your system, so be sure of your bottlenecks before you start fixing them.
Before you answer this I have never developed anything popular enough to attain high server loads. Treat me as (sigh) an alien that has just landed on the planet, albeit one that knows PHP and a few optimisation techniques.
I'm developing a tool in PHP that could attain quite a lot of users, if it works out right. However while I'm fully capable of developing the program I'm pretty much clueless when it comes to making something that can deal with huge traffic. So here's a few questions on it (feel free to turn this question into a resource thread as well).
Databases
At the moment I plan to use the MySQLi features in PHP5. However how should I setup the databases in relation to users and content? Do I actually need multiple databases? At the moment everything's jumbled into one database - although I've been considering spreading user data to one, actual content to another and finally core site content (template masters etc.) to another. My reasoning behind this is that sending queries to different databases will ease up the load on them as one database = 3 load sources. Also would this still be effective if they were all on the same server?
Caching
I have a template system that is used to build the pages and swap out variables. Master templates are stored in the database and each time a template is called it's cached copy (a html document) is called. At the moment I have two types of variable in these templates - a static var and a dynamic var. Static vars are usually things like page names, the name of the site - things that don't change often; dynamic vars are things that change on each page load.
My question on this:
Say I have comments on different articles. Which is a better solution: store the simple comment template and render comments (from a DB call) each time the page is loaded or store a cached copy of the comments page as a html page - each time a comment is added/edited/deleted the page is recached.
Finally
Does anyone have any tips/pointers for running a high load site on PHP. I'm pretty sure it's a workable language to use - Facebook and Yahoo! give it great precedence - but are there any experiences I should watch out for?
No two sites are alike. You really need to get a tool like jmeter and benchmark to see where your problem points will be. You can spend a lot of time guessing and improving, but you won't see real results until you measure and compare your changes.
For example, for many years, the MySQL query cache was the solution to all of our performance problems. If your site was slow, MySQL experts suggested turning the query cache on. It turns out that if you have a high write load, the cache is actually crippling. If you turned it on without testing, you'd never know.
And don't forget that you are never done scaling. A site that handles 10req/s will need changes to support 1000req/s. And if you're lucking enough to need to support 10,000req/s, your architecture will probably look completely different as well.
Databases
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
You probably don't want to break your database up at this point. If you do find that one database isn't cutting, there are several techniques to scale up, depending on your app. Replicating to additional servers typically works well if you have more reads than writes. Sharding is a technique to split your data over many machines.
Caching
You probably don't want to cache in your database. The database is typically your bottleneck, so adding more IO's to it is typically a bad thing. There are several PHP caches out there that accomplish similar things like APC and Zend.
Measure your system with caching on and off. I bet your cache is heavier than serving the pages straight.
If it takes a long time to build your comments and article data from the db, integrate memcache into your system. You can cache the query results and store them in a memcached instance. It's important to remember that retrieving the data from memcache must be faster than assembling it from the database to see any benefit.
If your articles aren't dynamic, or you have simple dynamic changes after it's generated, consider writing out html or php to the disk. You could have an index.php page that looks on disk for the article, if it's there, it streams it to the client. If it isn't, it generates the article, writes it to the disk and sends it to the client. Deleting files from the disk would cause pages to be re-written. If a comment is added to an article, delete the cached copy -- it would be regenerated.
I'm a lead developer on a site with over 15M users. We have had very little scaling problems because we planned for it EARLY and scaled thoughtfully. Here are some of the strategies I can suggest from my experience.
SCHEMA
First off, denormalize your schemas. This means that rather than to have multiple relational tables, you should instead opt to have one big table. In general, joins are a waste of precious DB resources because doing multiple prepares and collation burns disk I/O's. Avoid them when you can.
The trade-off here is that you will be storing/pulling redundant data, but this is acceptable because data and intra-cage bandwidth is very cheap (bigger disks) whereas multiple prepare I/O's are orders of magnitude more expensive (more servers).
INDEXING
Make sure that your queries utilize at least one index. Beware though, that indexes will cost you if you write or update frequently. There are some experimental tricks to avoid this.
You can try adding additional columns that aren't indexed which run parallel to your columns that are indexed. Then you can have an offline process that writes the non-indexed columns over the indexed columns in batches. This way, you can control better when mySQL will need to recompute the index.
Avoid computed queries like a plague. If you must compute a query, try to do this once at write time.
CACHING
I highly recommend Memcached. It has been proven by the biggest players on the PHP stack (Facebook) and is very flexible. There are two methods to doing this, one is caching in your DB layer, the other is caching in your business logic layer.
The DB layer option would require caching the result of queries retrieved from the DB. You can hash your SQL query using md5() and use that as a lookup key before going to database. The upside to this is that it is pretty easy to implement. The downside (depending on implementation) is that you lose flexibility because you're treating all caching the same with regard to cache expiration.
In the shop I work in, we use business layer caching, which means each concrete class in our system controls its own caching schema and cache timeouts. This has worked pretty well for us, but be aware that items retrieved from DB may not be the same as items from cache, so you will have to update cache and DB together.
DATA SHARDING
Replication only gets you so far. Sooner than you expect, your writes will become a bottleneck. To compensate, make sure to support data sharding early as possible. You will likely want to shoot yourself later if you don't.
It is pretty simple to implement. Basically, you want to separate the key authority from the data storage. Use a global DB to store a mapping between primary keys and cluster ids. You query this mapping to get a cluster, and then query the cluster to get the data. You can cache the hell out of this lookup operation which will make it a negligible operation.
The downside to this is that it may be difficult to piece together data from multiple shards. But, you can engineer your way around that as well.
OFFLINE PROCESSING
Don't make the user wait for your backend if they don't have to. Build a job queue and move any processing that you can offline, doing it separate from the user's request.
I've worked on a few sites that get millions/hits/month backed by PHP & MySQL. Here are some basics:
Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your webserver and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
Reduce overall # of request to your webservers. This entails caching JS, CSS and images using expires headers. You can also move your static content to a CDN, which will speed up your user's experience.
Measure & benchmark. Run Nagios on your production machines and load test on your dev/qa server. You need to know when your server will catch on fire so you can prevent it.
I'd recommend reading Building Scalable Websites, it was written by one of the Flickr engineers and is a great reference.
Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms:
http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/
Re: PDO / MySQLi / MySQLND
#gary
You cannot just say "don't use MySQLi" as they have different goals. PDO is almost like an abstraction layer (although it is not actually) and is designed to make it easy to use multiple database products whereas MySQLi is specific to MySQL conections. It is wrong to say that PDO is the modern access layer in the context of comparing it to MySQLi because your statement implies that the progression has been mysql -> mysqli -> PDO which is not the case.
The choice between MySQLi and PDO is simple - if you need to support multiple database products then you use PDO. If you're just using MySQL then you can choose between PDO and MySQLi.
So why would you choose MySQLi over PDO? See below...
#ross
You are correct about MySQLnd which is the newest MySQL core language level library, however it is not a replacement for MySQLi. MySQLi (as with PDO) remains the way you would interact with MySQL through your PHP code. Both of these use libmysql as the C client behind the PHP code. The problem is that libmysql is outside of the core PHP engine and that is where mysqlnd comes in i.e. it is a Native Driver which makes use of the core PHP internals to maximise efficiency, specifically where memory usage is concerned.
MySQLnd is being developed by MySQL themselves and has recently landed onto the PHP 5.3 branch which is in RC testing, ready for a release later this year. You will then be able to use MySQLnd with MySQLi...but not with PDO. This will give MySQLi a performance boost in many areas (not all) and will make it the best choice for MySQL interaction if you do not need the abstraction like capabilities of PDO.
That said, MySQLnd is now available in PHP 5.3 for PDO and so you can get the advantages of the performance enhancements from ND into PDO, however, PDO is still a generic database layer and so will be unlikely to be able to benefit as much from the enhancements in ND as MySQLi can.
Some useful benchmarks can be found here although they are from 2006. You also need to be aware of things like this option.
There are a lot of considerations that need to be taken into account when deciding between MySQLi and PDO. It reality it is not going to matter until you get to rediculously high request numbers and in that case, it makes more sense to be using an extension that has been specifically designed for MySQL rather than one which abstracts things and happens to provide a MySQL driver.
It is not a simple matter of which is best because each has advantages and disadvantages. You need to read the links I've provided and come up with your own decision, then test it and find out. I have used PDO in past projects and it is a good extension but my choice for pure performance would be MySQLi with the new MySQLND option compiled (when PHP 5.3 is released).
General
Do not try to optimize before you start to see real world load. You might guess right, but if you don't, you've wasted your time.
Use jmeter, xdebug or another tool to benchmark the site.
If load starts to be an issue, either object or data caching will likely be involved, so generally read up on caching options (memcached, MySQL caching options)
Code
Profile your code so that you know where the bottleneck is, and whether it's in code or the database
Databases
Use MYSQLi if portability to other databases is not vital, PDO otherwise
If benchmarks reveal the database is the issue, check the queries before you start caching. Use EXPLAIN to see where your queries are slowing down.
After the queries are optimized and the database is cached in some way, you may want to use multiple databases. Either replicating to multiple servers or sharding (splitting the data over multiple databases/servers) may be appropriate, depending on the data, the queries, and the kind of read/write behavior.
Caching
Plenty of writing has been done on caching code, objects, and data. Look up articles on APC, Zend Optimizer, memcached, QuickCache, JPCache. Do some of this before you really need to, and you'll be less concerned about starting off unoptimized.
APC and Zend Optimizer are opcode caches, they speed up PHP code by avoiding reparsing and recompilation of code. Generally simple to install, worth doing early.
Memcached is a generic cache, that you can use to cache queries, PHP functions or objects, or entire pages. Code must be specifically written to use it, which can be an involved process if there are no central points to handle creation, update and deletion of cached objects.
QuickCache and JPCache are file caches, otherwise similar to Memcached. The basic concept is simple, but also requires code and is easier with central points of creation, update and deletion.
Miscellaneous
Consider alternative web servers for high load. Servers like lighthttp and nginx can handle large amounts of traffic in much less memory than Apache, if you can sacrifice Apache's power and flexibility (or if you just don't need those things, which often, you don't).
Remember that hardware is surprisingly cheap these days, so be sure to cost out the effort to optimize a large block of code versus "let's buy a monster server."
Consider adding the "MySQL" and "scaling" tags to this question
APC is an absolute must. Not only does it make for a great caching system, but the gain from the auto-cached PHP files is a godsend. As for the multiple database idea, I don't think you would get much out of having different databases on the same server. It may give you a bit of a gain in speed during query time, but I doubt the effort it would take to deploy and maintain the code for all three while making sure they are in sync would be worth it.
I also highly recommend running Xdebug to find bottlenecks in your program. It made optimization a breeze for me.
Firstly, as I think Knuth said, "Premature optimization is the root of all evil". If you don't have to deal with these issues right now then don't, focus on delivering something that works correctly first. That being said, if the optimizations can't wait.
Try profiling your database queries, figure out what's slow and what happens alot and come up with an optimization strategy from that.
I would investigate Memcached as it's what a lot of the higher load sites use for efficiently caching content of all types, and the PHP object interface to it is quite nice.
Splitting up databases among servers and using some sort of load balancing technique (e.g. generate a random number between 1 and # redundant databases with necessary data - and use that number to determine which database server to connect to) can also be an excellent way to increase efficiency.
These have all worked out pretty well in the past for some fairly high load sites. Hope this helps to get you started :-)
Profiling your app with something like Xdebug (like tj9991 recommended) is definitely going to be a must. It doesn't make a whole lot of sense to just go around optimizing things blindly. Xdebug will help you find the real bottlenecks in your code so you can spend your optimization time wisely and fix chunks of code that are actually causing slow downs.
If you're using Apache, another utility that can help in testing is Siege. It will help you anticipate how your server and application will react to high loads by really putting it through its paces.
Any kind of opcode cache for PHP (like APC or one of the many others) will help a lot as well.
I run a website with 7-8 million page views a month. Not terribly much, but enough that our server felt the load. The solution we chose was simple: Memcache at the database level. This solution works well if the database load is your main problem.
We started out using Memcache to cache entire objects and the database results that were most frequently used. It did work, but it also introduced bugs (we might have avoided some of those if we had been more careful).
So we changed our approach. We built a database wrapper (with the exact same methods as our old database, so it was easy to switch), and then we subclassed it to provide memcached database access methods.
Now all you have to do is decide whether a query can use cached (and possibly out of date) results or not. Most of the queries run by the users are now fetched directly from Memcache. The exceptions are updates and inserts, which for the main website only happens because of logging. This rather simple measure reduced our server load by about 80%.
For what it's worth, caching is DIRT SIMPLE in PHP even without an extension/helper package like memcached.
All you need to do is create an output buffer using ob_start().
Create a global cache function. Call ob_start, pass the function as a callback. In the function, look for a cached version of the page. If exists, serve it and end.
If it doesn't exist, the script will continue processing. When it reaches the matching ob_end() it will call the function you specified. At that time, you just get the contents of the output buffer, drop them in a file, save the file, and end.
Add in some expiration/garbage collection.
And many people don't realize you can nest ob_start()/ob_end() calls. So if you're already using an output buffer to, say, parse in advertisements or do syntax highlighting or whatever, you can just nest another ob_start/ob_end call.
Thanks for the advice on PHP's caching extensions - could you explain reasons for using one over another? I've heard great things about memcached through IRC but have never heard of APC - what are your opinions on them? I assume using multiple caching systems is pretty counter-effective.
Actually, many do use APC and memcached together...
It looks like I was wrong. MySQLi is still being developed. But according to the article, PDO_MySQL is now being contributed to by the MySQL team. From the article:
The MySQL Improved Extension - mysqli
- is the flagship. It supports all features of the MySQL Server including
Charsets, Prepared Statements and
Stored Procedures. The driver offers a
hybrid API: you can use a procedural
or object-oriented programming style
based on your preference. mysqli comes
with PHP 5 and up. Note that the End
of life for PHP 4 is 2008-08-08.
The PHP Data Objects (PDO) are a
database access abstraction layer. PDO
allows you to use the same API calls
for various databases. PDO does not
offer any degree of SQL abstraction.
PDO_MYSQL is a MySQL driver for PDO.
PDO_MYSQL comes with PHP 5. As of PHP
5.3 MySQL developers actively contribute to it. The PDO benefit of a
unified API comes at the price that
MySQL specific features, for example
multiple statements, are not fully
supported through the unified API.
Please stop using the first MySQL
driver for PHP ever published:
ext/mysql. Since the introduction of
the MySQL Improved Extension - mysqli
- in 2004 with PHP 5 there is no reason to still use the oldest driver
around. ext/mysql does not support
Charsets, Prepared Statements and
Stored Procedures. It is limited to
the feature set of MySQL 4.0. Note
that the Extended Support for MySQL
4.0 ends at 2008-12-31. Don't limit yourself to the feature set of such
old software! Upgrade to mysqli, see
also Converting_to_MySQLi. mysql is in
maintenance only mode from our point
of view.
To me, it seems the article is biased towards MySQLi. I suppose I'm biased towards PDO.
I really like PDO over MySQLi. It's straight forward to me. The API is a lot closer to other languages I've programmed in. OO Database interfaces seem to work better.
I haven't come across any specific MySQL features that weren't available through PDO. I would be surprised if I ever did.
PDO is also very slow and its API is pretty complicated. No one in their sane mind should use it if portability is not a concern. And let's face it, in 99% of all webapps it is not. You just stick with MySQL or PostrgreSQL, or whatever it is you are working with.
As for the PHP question and what to take into account. I think premature optimization is the root of all evil. ;) Get your application done first, try to keep it clean when it comes to programming, do a little documentation and write unit tests. With all of the above you will have no issues refactoring code when the time comes. But first you want to be done and push it out to see how people react to it.
Sure pdo is nice, but there has been some controversy about it's performance versus mysql and mysqli, although it seems fixed now.
You should use pdo if you envision portability, but if not, mysqli should be the way. It has an OO interface, prepared statements, and most of what pdo offers (except, well, portability).
Plus, if performance is really needed, prepare for the (native mysql) MysqLnd driver in PHP 5.3, who will be much more tightly integrated with php, with better performance and improved memory usage (and statistics for performance tuning).
Memcache is nice if you have clustered servers (and YouTube-like load), but i'd try out APC first too.
A lot of good answers were given already, but I would like to point you to an alternate opcode cache called XCache. It is created by a lighty contributor.
Also, if you may need load balancing your database server in future, MySQL Proxy could very well help you to achieve this.
Both of those tools should plug into an existing application quite easily, so this optimization can be done when you need it, without too much hassle.
First question is how big do you really expect it to be? And how much do you plan on investing in your infrastructure. Since you feel the need to ask the question here, I'm guessing that you expect to start small on a limited budget.
Performance is irrelevant if the site is not available. And for availability you need horizontal scaling. The minimum you can sensibly get away with is 2 servers, both running apache, php and mysql. Set up one DBMS as a slave to the other. Do all the writes on the master, and all the reads on the local database (whatever that is) - unless for some reason you need to read back the data you've just read (use master). Make sure you've got the machinery in place to automatically promote the slave and fence the master. Use round-robin DNS for the webserver addresses to give more affinity for the slave node.
Partitioning your data across different database nodes at this stage is a very bad idea - however you might want to consider splitting it across different databases on the same server (which will facilitate partitioning across nodes when you overtake facebook).
Do make sure you've got the monitoring and data analysis tools in place to measure your sites performance and identify bottlenecks. Most performance problems can be fixed by writing better SQL / fixing the database schema.
Keeping your template cache on the database is a dumb idea - the database should be a central common repository for structured data. Keep your template cache on the local filesystem of your webservers - it will be available faster and won't slow down your database access.
Do use a op-code cache.
Spend plenty of time studying your site and its logs to understand why its going so slow.
Push as much caching as possible onto the client.
Use mod_gzip to compress everything you can.
C.
My first piece of advice is to think about this issue and keep it in mind when designing the site but don't go overboard. It's often difficult to predict the success of a new site and I your time will be better spent getting up finished early and optimising it later.
In general, Simple is fast.
Templates slow you down. Databases slow you down. Complex libraries slow you down. Layering templates over each other retrieving them from databases and parsing it in a complex library --> the time delays multiply with each other.
Once you have the basic site up and running do tests to show you where to spend your efforts. It's difficult to see where to target. Often to speed things up you will have to unravel the complexity of the code, this makes it larger and harder to maintain, so you only want to do it where necessary.
In my experience establishing the database connection was relatively expensive. If you can get away with it, don't connect to the database for general visitors on the most trafficed pages like the front page to the site. Creating multiple database connections is madness with very little benefit.
#Gary
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
I'm loking over PDO at the moment and it looks like you're right - however I know that MySQL are developing the MySQLd extension for PHP - I think to succeed either MySQL or MySQLi - what do you think about that?
#Ryan, Eric, tj9991
Thanks for the advice on PHP's caching extensions - could you explain reasons for using one over another? I've heard great things about memcached through IRC but have never heard of APC - what are your opinions on them? I assume using multiple caching systems is pretty counter-effective.
I will definitely be sorting out some profiling testers - thank you very much for your recommendations on those.
I don't see myself switching from MySQL anytime soon - so I guess I don't need the abstraction capabilities of PDO. Thanks for those articles DavidM, they've helped me a lot.
Look into mod_cache, an output cache for the Apache web server, simillar to the output caching in ASP.NET.
Yes, I can see that it's still experimental but it will be final someday.
I can't believe no-one has already mentioned this: Modularisation and Abstraction. If you think your site is going to have to grow to lots of machines, you must design it so it can! That means stupid things like don't assume the database is on localhost. It also means things that are going to be a bother at first, like writing a database abstraction layer (like PDO, but much much lighter because it only does what you need it to do).
And it means things like working with a framework. You will need layers to your code so that you can later gain performance by refactoring the data-abstraction layer, for example, by teaching it that some objects are in a different database -- and the code doesn't have to know or care.
Finally, be careful of memory-intensive operations, for example, unnecessary string copying. If you can keep PHP's memory usage down, then you will get more performance out of your webserver and this is something that will scale when you go to a load-balanced solution.
If you are working with large amounts of data, and caching isn't cutting it, look into Sphinx. We've had great results with using SphinxSearch not only for better text searching, but also as a data retrieval replacement for MySQL when dealing larger tables. If you use SphinxSE (MySQL plugin), it surpassed our performance gains we had from caching several times over, and application-implementation is a sinch.
The points made about cache are spot-on; it is the least complicated and most important part of building an efficient application. I'd like to add that while memcached is great, APC is about five times faster if your application lives on a single server.
The "Cache Performance Comparison" post at the MySQL performance blog has some interesting benchmarks on the subject - http://www.mysqlperformanceblog.com/2006/08/09/cache-performance-comparison/.