Say we want to develop a photo site.
Would it be faster to upload and download images to and from MongoDB than to store and retrieve them from disk, given that MongoDB can save images and files in chunks along with their metadata?
So for a photo-sharing website, would it be better (faster) to store the images in MongoDB or on a typical server hard disk?
I'm thinking of using PHP with CodeIgniter, by the way, if that changes the performance issues regarding the question.
Lightweight web servers (lighttpd, nginx) do a pretty good job of serving content from the filesystem. Since the OS acts as a caching layer they typically serve content from memory which is very fast.
If you want to serve images from MongoDB, the web server has to run some sort of script (Python, PHP, Ruby... via FastCGI of course - you can't start a new process for each image), which has to fetch data from MongoDB each time the image is requested. So it's going to be slower. The benefits are automatic replication and failover if you use replica sets. If you need that and you're clever enough to know how to achieve it on the filesystem, then go with that option. If you need a quick implementation that's reliable, then MongoDB might be the faster way to get there. But if your site is going to be popular, sooner or later you will have to switch to the FS implementation.
BTW: you can mix these two approaches, store the image in mongodb to get instant reliability and then replicate it to the FS of a couple of servers to gain speed.
Some test results.
Oh, one more thing: coupling the metadata with the image seems nice until you realize the generated HTML and the image download are going to be two separate HTTP requests, so you have to query Mongo twice - once for the metadata and once for the image.
When to use GridFS for storing files with MongoDB - the documentation suggests you should. It also sounds fast and reliable, and it's great for backups and replication. Hope that helps.
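For concreteness, here is a minimal sketch of storing and serving an image through GridFS using the legacy PECL mongo extension (the driver current when this was written); the database name, file path and metadata fields are invented for illustration:

```php
<?php
// Hedged sketch, not production code: store an image in GridFS with
// metadata, then stream it back. Names here ('photos', the path, the
// metadata keys) are made up for the example.
$m      = new MongoClient();                    // 'new Mongo()' on older drivers
$gridfs = $m->selectDB('photos')->getGridFS();

// Store the file in chunks together with arbitrary metadata.
$id = $gridfs->storeFile('/tmp/cat.jpg', array(
    'metadata' => array('owner' => 'alice', 'tags' => array('cat')),
));

// Serve it back: one query for the file document, then the chunks.
$file = $gridfs->findOne(array('_id' => $id));
header('Content-Type: image/jpeg');
echo $file->getBytes();
```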
Several benchmarks have shown MongoDB is approximately 6 times slower for file storage (via GridFS) versus using the regular old filesystem. (One compared apache, nginx, and mongo)
However, there are strong reasons to use MongoDB for file storage despite it being slower -- #1 free backup from Mongo's built-in sharding/replication. This is a HUGE time saver. #2 ease of admin, storing metadata, not having to worry about directories, permissions, etc. Also a HUGE time saver.
Our photo back-end was realized years ago in a huge gob of spaghetti code that did all kinds of stuff (check or create user dir, check or create date dirs, check for name collision, set perms), and a whole other mess did backups.
We've recently changed everything over to Mongo. In our experience, Mongo is a bit slower (it may be 6 times slower but it doesn't feel like 6 times slower), and anyway- so what? All that spaghetti is out the window, and the new Mongo+photo code is much smaller, tighter and logic simpler. Never going back to file system.
http://www.lightcubesolutions.com/blog/?p=209
You definitely do not want to download images directly from MongoDB. Even going through GridFS will be (slightly) slower than from a simple file on disk. You shouldn't want to do it from disk either. Neither option is appropriate for delivering image content with high throughput. You'll always need a server-side caching layer for static content between your origin/source (be it mongo or the filesystem) and your users.
So with that in mind you are free to pick whatever works best for you, and MongoDB's GridFS provides quite a few features for free that you'd otherwise have to build yourself when working directly with files.
I'm working on a PHP content management system and, in testing, have noticed that quite a few of the system's MySQL tables are queried on almost every page but are very rarely written to. What I'm wondering is will this start to weigh heavily on the database as site traffic increases, and how can I solve/prevent this?
My initial thoughts were to start storing some of the more static data in files (using PHP serialization) but does this actually reduce server load? What I'm worried about is that I'd be simply transferring the high load from the database to the file system!
If somebody could clue me in on the better approach, that would be great. In case the volume of data itself has a large effect, I've detailed some of the data I'll be storing below:
Full list of Countries (including ISO country codes)
Site options (skin, admin email, support URLs etc.)
Usergroups (including permissions)
You have to remember that reading a table from a database on a powerful server and on a fast connection is likely to be faster than reading it from disk on your local machine. The database will cache the entirety of these small, regularly accessed tables in memory.
By implementing the same functionality yourself in the file system, there is only a small possible speed up, but a huge chance to mess it up and make it slower.
It's probably best to stick with using the database.
Optimize your queries using the MySQL slow query log and the EXPLAIN statement.
If tables are really rarely written to, you can use MySQL's native query cache. There is nothing to change in your code - just enable query caching in my.cnf (a sample fragment follows this list).
Try a template engine like Smarty (smarty.net). It has its own caching system that works pretty well and will REALLY reduce server load.
You can also use memcached, but it is only really worth it for very high-load websites. (I think Smarty will be enough.)
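For reference, enabling the query cache is a my.cnf change along these lines (values are illustrative, and note the query cache only exists in MySQL 5.x and earlier):

```ini
# my.cnf -- illustrative query cache settings
[mysqld]
query_cache_type  = 1      # cache eligible SELECT results
query_cache_size  = 32M    # total memory reserved for cached result sets
query_cache_limit = 1M     # skip results larger than this
```

The cache is invalidated whenever the underlying table is written to, which is why it suits rarely-updated tables such as country lists.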
Databases are much better at handling large data volumes than the native file system.
Don't worry about optimizing your site to reduce server load, until you actually have a server load problem. :-)
The tables you mentioned (countries and users) will normally be cached in memory by MySQL directly, unless you are expecting several million records in these tables.
If these tables will not fit in memory, you may want to consider a general-purpose distributed memory caching system, such as memcached.
If your database is properly indexed, it will be much faster to query data from the database. If you want to speed that up, look into memcached or similar.
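A minimal cache-aside sketch with the pecl/memcached extension, assuming a countries table like the one described in the question (DSN, credentials and key name are invented):

```php
<?php
// Hedged sketch: check memcached first, fall back to MySQL on a miss.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$countries = $mc->get('countries_all');
if ($countries === false) {                        // cache miss
    $pdo  = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass');
    $stmt = $pdo->query('SELECT iso_code, name FROM countries');
    $countries = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mc->set('countries_all', $countries, 3600);   // cache for an hour
}
```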
Databases are exactly for this purpose: to store and provide data. The filesystem is for scripts and programs.
If you encounter load problems, consider using memcached or another caching utility in front of the database.
You may also consider caching whole rendered sections of your page directly in the database (e.g. a sidebar that doesn't change much, a generated header section, ...).
You could cache output (flush(), ob_flush(), etc.) to a file and include that instead of doing multiple MySQL reads. Caching is definitely faster than hitting MySQL multiple times.
Reading a static file is much faster than adding overhead via PHP and MySQL processing.
You need to evaluate the performance via load testing to avoid prematurely optimising.
Storing data in files with serialization would be foolish and could quite possibly increase overall load; databases are really good at retrieving data.
If after analysis there is a true performance hit (which I doubt unless you are talking about massive loading), then caching is a better solution.
It's more important to have a well designed system that facilitates changes as needs arise.
Here's a link to a couple of scripts that will essentially do what dusoft is talking about and cache the output buffer to a file:
http://www.addedbytes.com/articles/caching-output-in-php/
Used this way, it's more of a bolt-on-after-the-fact type of solution, but this same behavior can certainly be implemented in a more integrated fashion if considered earlier in the process. Many frameworks also have this kind of thing built in.
I have a friendly argument going on with a co-worker about this, and my personal opinion is that a ASP.NET-MVC compiled web application would run more efficiently/faster than the same project that would be written in PHP. My friend disagrees.
Unfortunately I do not have any solid data that I can use to back up my argument. (neither does he)
So I tried to Google for answers to find evidence to prove him wrong, but most of the time the debate turned into which platform is better to develop on, cost, security features, etc. For the sake of this argument I really don't care about any of that.
I would like to know what the Stack Overflow community thinks about the raw speed/efficiency of websites in general that are developed in ASP.NET with MVC versus exactly the same website developed with PHP.
Does anyone have any practical examples in real-world scenarios comparing the performance of the two technologies?
(I realize for some of you this may very well be an irrelevant and maybe stupid argument, but it is an argument, and I would still like to hear the answers of the fine people here at S.O.)
It's a hard comparison to make because differences in the respective stacks mean you end up doing the same thing differently and if you do them the same for the purpose of comparison it's not a very realistic test.
PHP, which I like, is in its most basic form loaded with every request, interpreted and then discarded. It is very much like CGI in this respect (which is no surprise considering it is roughly 15 years old).
Now over the years various optimisations have been made to improve the performance, most notably opcode caching with APC, for example (so much so that APC will be a standard part of PHP 6 and not an optional module like it is now).
But still PHP scripts are basically transient. Session information is (normally) file based and mutually exclusive (session_start() blocks other scripts accessing the same user session until session_commit() or the script finishes) whereas that's not the case in ASP.NET. Aside from session data, it's fairly easy (and normal) to have objects that live within the application context in ASP.NET (or Java for that matter, which ASP.NET is much more similar to).
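As an aside, the locking half of that can be worked around; a minimal sketch (the session key is invented):

```php
<?php
// Read what you need from the session, then release the lock so other
// requests from the same user aren't serialized behind this script.
session_start();
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;
session_write_close();   // alias of session_commit(); releases the lock

// ...long-running work continues without blocking the user's other requests...
```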
This is a key difference. For example, database access in PHP (using mysql, mysqli, PDO, etc) is transient (persistent connections notwithstanding) whereas .Net/Java will nearly always use persistent connection pools and build on top of this to create ORM frameworks and the like, the caches for which are beyond any particular request.
As a bytecode interpreted platform, ASP.NET is theoretically faster but the limits to what PHP can do are so high as to be irrelevant for most people. 4 of the top 20 visited sites on the internet are PHP for example. Speed of development, robustness, cost of running the environment, etc... tend to be far more important when you start to scale than any theoretical speed difference.
Bear in mind that .Net has primitive types, type safety and these sorts of things that will make code faster than PHP can run it. If you want to do a somewhat unfair test, sort an array of one million random 64 bit integers in both platforms. ASP.NET will kill it because they are primitive types and simple arrays will be more efficient than PHP's associative arrays (and all arrays in PHP are associative ultimately). Plus PHP on a 32 bit OS won't have a native 64 bit integer so will suffer hugely for that.
It should also be pointed out that ASP.NET is pre-compiled whereas PHP is interpreted on-the-fly (excluding opcode caching), which can make a difference but the flexibility of PHP in this regard is a good thing. Being able to deploy a script without bouncing your server is great. Just drop it in and it works. Brilliant. But it is less performant ultimately.
Ultimately though I think you're arguing what's really an irrelevant detail.
ASP.NET runs faster. ASP.NET Development is faster.
Buy a fast computer, and enjoy it, if you do serious business web applications.
ASP.NET code executes a lot faster than PHP when it's built in Release mode, optimized, cached, etc. But for websites (except big players like Facebook), this matters less - most of the page rendering time is spent accessing and querying the database.
For database access ASP.NET is a lot better: in ASP.NET we typically use LINQ, which translates our object queries into SQL for the SQL Server database. The database connection is also persistent - one pool per website - so there is no need to reconnect.
PHP, in comparison, can't hold a SQL server connection between requests; it connects, grabs data from the DB and disconnects, and reconnecting to the database is often 20-30% of page rendering time.
The whole web application config is also reloaded in PHP on each request, whereas in ASP.NET it persists in memory. You can easily see this in big enterprise frameworks like Symfony/Symfony2: a lot of rendering time goes to Symfony's internal processes, whereas ASP.NET loads its configuration once and doesn't waste your server on useless work.
ASP.NET can hold objects in an in-memory application cache - in PHP you have to write them to files or use something like memcache. Using memcache means a lot of work around concurrency and race conditions (storing cache in files has its own concurrency problems - every request starts a new web server thread, and many requests can run at once; you have to think about concurrency between those threads, which takes a lot of development time and doesn't always work, because PHP has no mutex mechanism in the language, so you can't build a critical section at all).
Now something about development speed:
ASP.NET has two main frameworks designed for it (WebForms and MVC), installed with the environment, whereas in PHP you must pick an open-source framework. There is no standard framework in PHP like there is in ASP.NET.
The C# language is rich, and the standard library has solutions for very many common problems, whereas the PHP standard library is... bare... and can't even keep to one naming convention.
.NET has types, whereas PHP is dynamic, so there is no checking of your source code until you run it or write unit tests.
.NET has a great IDE, whereas PHP IDEs are average to average-good (PhpStorm is still a lot worse than VS+ReSharper, or even VS without it).
PHP scaffolding in Symfony is run from the command line, whereas ASP.NET scaffolding is integrated into the environment.
If you have a slow computer like mine (one core, 2.2 GHz), developing ASP.NET pages can be painful because you have to recompile the project on every source change, whereas PHP code refreshes immediately.
PHP's language syntax feels unfinished and bare compared to C# syntax.
Strong types in C# and many flexible language features can speed up your development and make your code less buggy.
In my (non-hardbenchmarked) experience Asp.Net can certainly compete (and in some areas surpass) PHP in terms of raw speed. But similar with a lot of other language-choice related questions the following statement is (in this case) valid (in my opinion):
There are slow, buggy sites in language x (be it PHP or Asp.Net)
There are great, fast sites in language x (be it PHP or Asp.Net)
What I'm trying to say is that the (talents of the) developer will influence the overall speed more than a choice between two (roughly equivalent, to some abstracted extent) technologies.
Really, an 'overall speed' comparison does not make a lot of sense, as both can catch up to each other in some way or another, unless you're in a very specific specialist niche (which you have not told us about).
I have done a performance test.
Program: sum of 10,000,000 numbers.
The output shows that PHP is slower than C#.
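For what it's worth, the PHP half of such a micro-benchmark looks roughly like this (timings will vary by machine, and this says little about real page-serving performance):

```php
<?php
// Sum of 10,000,000 numbers, timed. Illustrative micro-benchmark only.
$start = microtime(true);
$sum = 0;
for ($i = 1; $i <= 10000000; $i++) {
    $sum += $i;
}
printf("sum=%.0f in %.3f s\n", $sum, microtime(true) - $start);
```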
I'd say ASP.net
Things to consider:
ASP.net is pre-compiled
ASP.net is usually written in C#, which should execute faster than PHP
Granted, the differences are very minor. There's advantages to both, I think PHP is much easier to deploy and can run on any server not just IIS. I am quite fond of ASP.net MVC though.
I am an expert developer in both technologies (ASP.NET/C# and PHP5).
After years and years of working with and comparing them in real production environments, these are my impressions:
First of all, you can't compare them with a loop that adds a value 1,000,000 times; that is not a real-world case.
Comparing them in my development environment is not the same as in a real production environment. E.g. in development, ASP.NET does not use IIS by default; it uses an internal development server that has different optimizations. In dev, there is no concurrency.
So my opinion is the following:
Looping 1,000,000 times, C# is going to be faster (which proves nothing).
Serving a real page that accesses the DB, shows images, has forms, etc.:
ASP.NET is slower than PHP.
ASPX pages weigh roughly 10x more than PHP pages, so the end user waits longer to get the page.
ASPX is slower to develop in than PHP, which matters because in the end it's money. We develop about 35% faster in PHP than in ASP.NET, because in ASP.NET you have to compile and restart every time you want to check something.
In big projects, ASP.NET is better in the long term for avoiding errors and managing a complex architecture.
Because of Windows Server, IIS, etc., in the end you need a more powerful server to hold the same number of users on ASP.NET as on PHP. E.g. with ASP.NET we serve around 20,000 concurrent users, and on the same server PHP can handle around 30,000.
The important thing is not which one loops faster. What matters when the website is real and in production is how many users it can hold and how heavy the page is (heavier == more waiting time for users, more network load on the server, more disk load, more memory load).
Try measuring response times under concurrency and you will see.
Hope it helps.
Without any optimizations, a .net compiled app would of course run "faster" than php. But you are correct that it's a stupid and irrelevant argument because it has no bearing on the real world beyond bragging rights.
Generally ASP.NET will perform better than PHP on given hardware. ASP.NET MVC can do better still ("can" being the operative word). Most of the platform is designed with enterprise development in mind: testable code, separation of concerns, etc. A lot of the bloat in ASP.NET comes from the object stack within the page (nested controls). Pre-compiling makes this more performant, but it can be a key issue. MVC tends to allow for less nesting when using the WebForms-based view engine (others are available).
The biggest slowdowns in web applications tend to happen in remote services, especially database persistence. PHP is programmed without the benefit of connection pooling or in-memory session state. This can be overcome with memcached and other, more performant service layers (also available to .NET).
It really comes down to the specifics of a site/application. This site happens to run MVC on fairly modest hardware quite well. A similar site under PHP would likely fall under its own weight. Other things to consider: IIS vs. Apache vs. lighttpd, etc. Honestly, PHP vs. ASP.NET is about much more than raw performance differences. PHP doesn't lend itself to large, complex applications nearly as well as ASP.NET MVC, it's that simple... and that has more to do with VS+SCC than anything else.
I'd tend to agree with you (that ASP.NET MVC is faster), but why not make a friendly wager with your friend and share the results? Create a really simple DYNAMIC page, derived from a MySQL database, and load the page many times.
For example, create a table with 1,000,000 rows containing a sequential primary key, and then a random # in the second column. Each of your sites can accept the primary key in a GET, retrieve the random # based on the passed in key, and display the random # in some type of dynamically generated html.
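The PHP side of that wager could be as small as this (table and column names are invented; the C#/ASP.NET side would be the equivalent page):

```php
<?php
// Fetch one row by primary key from the 1,000,000-row table and render it.
$pdo  = new PDO('mysql:host=localhost;dbname=bench', 'user', 'pass');
$id   = isset($_GET['id']) ? (int) $_GET['id'] : 1;
$stmt = $pdo->prepare('SELECT rand_value FROM numbers WHERE id = ?');
$stmt->execute(array($id));
$row  = $stmt->fetch(PDO::FETCH_ASSOC);
echo '<html><body><p>Value for ', $id, ': ',
     htmlspecialchars($row['rand_value']), '</p></body></html>';
```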
I'd love to know the results ... and if you have a blog or similar, the rest of the world would too (this question gets asked ALL the time).
It would be even better if you could build this simple little app in regular ASP too. Heck, I'd even pay you for these results if the test was well designed. Seriously - just express your interest here and I'll send you my e-mail.
It needs to be noted that the question is .NET MVC vs. PHP, not .NET (Web Forms) vs. PHP.
I don't have hard facts, but the general feeling is that PHP websites run faster than .NET Web Forms sites (and I do .NET only). Despite being compiled, versus PHP being interpreted, .NET Web Forms is generally slow because of all the code auto-generated by the .NET engine to render the HTML for each <asp:control> you use in design mode. Getting a .NET Web Form to compete in speed with PHP is a complete odyssey that starts with setting EnableViewState = false and can end with using every HTML control with runat=server... crazy, huh?
Now, MVC is a different story. I have made two websites using .NET MVC2 and the feeling is good - you can feel the speed now! And the code is as clean as any PHP website. So MVC lets you write clean code the way PHP does, and MVC is compiled while PHP is interpreted, which can only lead to one thing: MVC faster than PHP. Time will prove it; when the general sense becomes "MVC websites run faster than PHP", then we will be right about what I say here today.
See you!
C++... Right now the fight is between PHP and ASP.NET. PHP will win on ease of use; ASP.NET will win on performance (in a Windows server ecosystem). A lot of the larger websites that started with PHP have graduated to C++.
We have a large management software that is producing big reports of all kinds, based on numerous loops, with database retrievals, objects creations (many), and so on.
On PHP4 it could run happily with a memory limit of 64 MB - now that we have moved it to a new server with the same database and the same code, the same reports won't come out without a memory limit of a gigabyte...
I know that PHP5 changed quite a lot under the hood, but is there a way to make it behave?
The question at the end is, what strategies do you apply when you need to have your scripts on a diet ?
A big problem we ran into was circular references between objects stopping them from freeing memory when they went out of scope.
Depending on your architecture, you may be able to use __destruct() and manually unset any references. For our problem I ended up restructuring the classes and removing the circular references.
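To make the problem concrete, here is an invented parent/child pair with exactly this kind of cycle; before PHP 5.3's cycle collector, it was never freed unless you broke the cycle by hand:

```php
<?php
// Hedged sketch: a reference cycle and one way to break it manually.
class Node {
    public $parent;
    public $children = array();

    public function addChild(Node $child) {
        $child->parent    = $this;    // this line creates the cycle
        $this->children[] = $child;
    }

    public function destroy() {       // call before dropping the last reference
        foreach ($this->children as $c) {
            $c->parent = null;
        }
        $this->children = array();
    }
}

$root = new Node();
$root->addChild(new Node());
$root->destroy();   // break the cycle so refcounting can reclaim the memory
unset($root);
```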
When I need to optimize resources in any script, I always try to analyze, profile and debug my code. I use Xdebug and the Xdebug profiler; there are other options, like APD and Benchmark Profiler.
Additionally, I recommend these articles:
Make PHP apps fast, faster, fastest..
Profiling PHP Applications (PDF)
PHP & Performance (PDF)
Since moving to the new server, have you verified that your MySQL and PHP system variables are identical to the way they were on your old server?
PHP5 introduced a lot of new functionality, but given its backwards-compatibility mantra, I don't believe the differences between PHP5 and PHP4 should cause this large an effect on the performance of an application whose code and database have not been altered.
Are you also running on the same version of Apache or IIS?
It sounds like a problem that is more likely related to your new system environment than to an upgrade from PHP4 to 5.
Bertrand,
If you are interested in refactoring the existing code then I would recommend that you first monitor your CPU and Memory usage while executing reports. Are you locking up your SQL server or are you locking up Apache (which happens if a lot of stress is being put onto the system by the PHP code)?
I worked on a project that initially bogged down MySQL so severely that we had to refactor the entire report generation process. However, when we finished the load was simply transferred to Apache (through the more complex PHP code). Our final solution was to refactor the database design to provide for better performance for reporting functions and to use PHP to pick up the slack on what we couldn't do natively in MySQL.
Depending on the nature of the reports you might consider denormalizing the data that is being used for the reports. You might even consider constructing a second database that serves as a data warehouse and is designed around OLAP principles rather than OLTP principles. You can start at Wikipedia for a general explanation of OLAP and data warehousing.
However, before you start looking at serious refactoring, have you verified that your environments are sufficiently similar by looking at phpinfo(); for PHP and SHOW VARIABLES; in MySQL?
A gig!?!
Even 64 MB is big.
Ignoring the discrepancy between environments (which does sound very peculiar), it sounds like the code may need some refactoring.
Any chance you can refactor your code so that result sets from database queries are not dumped into arrays? I would recommend constructing an iterator for your result sets (you can still treat them as arrays for most purposes); there is a big difference between handling one record at a time and handling 10,000 records at a time. A sketch follows at the end of this answer.
Secondly, have a look at whether your code is creating multiple copies of the data. Can you pass the objects by reference (use the '&')? We had to do a similar thing when using an early variant of the Horde framework: a 1 MB attachment would blow out to 50 MB from numerous calls that passed the whole dataset as a copy rather than as a reference.
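A sketch of the record-at-a-time idea, assuming PDO with MySQL (table name and handler are invented; with buffered queries turned off, rows stream from the server instead of being copied into one giant array):

```php
<?php
// Hedged sketch: iterate a large result set one row at a time.
$pdo = new PDO('mysql:host=localhost;dbname=reports', 'user', 'pass');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT * FROM big_report_table');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    process_row($row);   // hypothetical per-record handler; memory stays flat
}
```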
Before you answer this: I have never developed anything popular enough to attain high server loads. Treat me as (sigh) an alien that has just landed on the planet, albeit one that knows PHP and a few optimisation techniques.
I'm developing a tool in PHP that could attain quite a lot of users, if it works out right. However while I'm fully capable of developing the program I'm pretty much clueless when it comes to making something that can deal with huge traffic. So here's a few questions on it (feel free to turn this question into a resource thread as well).
Databases
At the moment I plan to use the MySQLi features in PHP5. However, how should I set up the databases in relation to users and content? Do I actually need multiple databases? At the moment everything's jumbled into one database, although I've been considering spreading user data to one, actual content to another, and finally core site content (template masters etc.) to a third. My reasoning behind this is that sending queries to different databases will ease the load on them, as one database = three load sources. Also, would this still be effective if they were all on the same server?
Caching
I have a template system that is used to build the pages and swap out variables. Master templates are stored in the database, and each time a template is called its cached copy (an HTML document) is used. At the moment I have two types of variable in these templates: static vars and dynamic vars. Static vars are usually things like page names or the name of the site - things that don't change often; dynamic vars change on each page load.
My question on this:
Say I have comments on different articles. Which is a better solution: store the simple comment template and render comments (from a DB call) each time the page is loaded or store a cached copy of the comments page as a html page - each time a comment is added/edited/deleted the page is recached.
Finally
Does anyone have any tips/pointers for running a high-load site on PHP? I'm pretty sure it's a workable language to use - Facebook and Yahoo! are strong precedents - but are there any experiences I should watch out for?
No two sites are alike. You really need to get a tool like jmeter and benchmark to see where your problem points will be. You can spend a lot of time guessing and improving, but you won't see real results until you measure and compare your changes.
For example, for many years, the MySQL query cache was the solution to all of our performance problems. If your site was slow, MySQL experts suggested turning the query cache on. It turns out that if you have a high write load, the cache is actually crippling. If you turned it on without testing, you'd never know.
And don't forget that you are never done scaling. A site that handles 10 req/s will need changes to support 1000 req/s. And if you're lucky enough to need to support 10,000 req/s, your architecture will probably look completely different as well.
Databases
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
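For readers who haven't used them, the placeholder style looks like this (DSN and table are invented):

```php
<?php
// Minimal PDO prepared statement with a named placeholder.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('SELECT * FROM users WHERE email = :email');
$stmt->execute(array(':email' => $email));
$user = $stmt->fetch(PDO::FETCH_ASSOC);
```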
You probably don't want to break your database up at this point. If you do find that one database isn't cutting it, there are several techniques to scale up, depending on your app. Replicating to additional servers typically works well if you have more reads than writes. Sharding is a technique to split your data over many machines.
Caching
You probably don't want to cache in your database. The database is typically your bottleneck, so adding more IO's to it is typically a bad thing. There are several PHP caches out there that accomplish similar things like APC and Zend.
Measure your system with caching on and off. I bet your cache is heavier than serving the pages straight.
If it takes a long time to build your comments and article data from the db, integrate memcache into your system. You can cache the query results and store them in a memcached instance. It's important to remember that retrieving the data from memcache must be faster than assembling it from the database to see any benefit.
If your articles aren't dynamic, or you have simple dynamic changes after it's generated, consider writing out html or php to the disk. You could have an index.php page that looks on disk for the article, if it's there, it streams it to the client. If it isn't, it generates the article, writes it to the disk and sends it to the client. Deleting files from the disk would cause pages to be re-written. If a comment is added to an article, delete the cached copy -- it would be regenerated.
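A bare-bones sketch of that disk-cache scheme (paths and the generator function are invented):

```php
<?php
// Serve the article from its cached HTML file if present; otherwise
// generate, write and serve it. Deleting the file (e.g. when a comment
// is added) forces regeneration on the next hit.
$slug  = preg_replace('/[^a-z0-9-]/', '', $_GET['article']);
$cache = "/var/cache/articles/{$slug}.html";

if (is_file($cache)) {
    readfile($cache);                      // cache hit: stream straight out
    exit;
}

$html = render_article($slug);             // hypothetical page generator
file_put_contents($cache, $html, LOCK_EX); // cache for next time
echo $html;
```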
I'm a lead developer on a site with over 15M users. We have had very little scaling problems because we planned for it EARLY and scaled thoughtfully. Here are some of the strategies I can suggest from my experience.
SCHEMA
First off, denormalize your schemas. This means that rather than have multiple relational tables, you should opt for one big table. In general, joins are a waste of precious DB resources, because doing multiple prepares and collation burns disk I/Os. Avoid them when you can.
The trade-off is that you will be storing/pulling redundant data, but this is acceptable because data and intra-cage bandwidth are very cheap (bigger disks), whereas multiple prepare I/Os are orders of magnitude more expensive (more servers).
INDEXING
Make sure that your queries utilize at least one index. Beware though, that indexes will cost you if you write or update frequently. There are some experimental tricks to avoid this.
You can try adding additional columns that aren't indexed and run parallel to your indexed columns. You can then have an offline process that writes the non-indexed columns over the indexed columns in batches, so you control when MySQL needs to recompute the index.
Avoid computed queries like the plague. If you must compute a query, try to do so once at write time.
CACHING
I highly recommend Memcached. It has been proven by the biggest players on the PHP stack (Facebook) and is very flexible. There are two methods to doing this, one is caching in your DB layer, the other is caching in your business logic layer.
The DB layer option would require caching the result of queries retrieved from the DB. You can hash your SQL query using md5() and use that as a lookup key before going to database. The upside to this is that it is pretty easy to implement. The downside (depending on implementation) is that you lose flexibility because you're treating all caching the same with regard to cache expiration.
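In code, the DB-layer variant is little more than this (server details and TTL are invented):

```php
<?php
// Hedged sketch: hash the SQL, use the hash as the memcache key.
function cached_query(Memcached $mc, PDO $pdo, $sql) {
    $key  = 'q_' . md5($sql);
    $rows = $mc->get($key);
    if ($rows === false) {                             // miss: hit the DB
        $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
        $mc->set($key, $rows, 300);  // one blanket expiry for every query
    }
    return $rows;
}
```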
In the shop I work in, we use business layer caching, which means each concrete class in our system controls its own caching schema and cache timeouts. This has worked pretty well for us, but be aware that items retrieved from DB may not be the same as items from cache, so you will have to update cache and DB together.
DATA SHARDING
Replication only gets you so far. Sooner than you expect, your writes will become a bottleneck. To compensate, make sure to support data sharding as early as possible. You will likely want to shoot yourself later if you don't.
It is pretty simple to implement. Basically, you want to separate the key authority from the data storage. Use a global DB to store a mapping between primary keys and cluster ids. You query this mapping to get a cluster, and then query the cluster to get the data. You can cache the hell out of this lookup operation which will make it a negligible operation.
The downside to this is that it may be difficult to piece together data from multiple shards. But, you can engineer your way around that as well.
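A sketch of the key-authority lookup described above (schema and key names are invented; note how cacheable the mapping is):

```php
<?php
// Map a primary key to the cluster that owns its data, caching the result.
function get_cluster_id(Memcached $mc, PDO $globalDb, $userId) {
    $cid = $mc->get("cluster_$userId");
    if ($cid === false) {
        $stmt = $globalDb->prepare('SELECT cluster_id FROM key_map WHERE user_id = ?');
        $stmt->execute(array($userId));
        $cid = $stmt->fetchColumn();
        $mc->set("cluster_$userId", $cid, 86400);   // mappings rarely change
    }
    return $cid;   // the caller then queries that cluster for the actual data
}
```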
OFFLINE PROCESSING
Don't make the user wait for your backend if they don't have to. Build a job queue and move any processing that you can offline, doing it separate from the user's request.
I've worked on a few sites that get millions/hits/month backed by PHP & MySQL. Here are some basics:
Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your webserver and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
Reduce the overall # of requests to your webservers. This entails caching JS, CSS and images using expires headers. You can also move your static content to a CDN, which will speed up your users' experience.
Measure & benchmark. Run Nagios on your production machines and load test on your dev/qa server. You need to know when your server will catch on fire so you can prevent it.
I'd recommend reading Building Scalable Websites, it was written by one of the Flickr engineers and is a great reference.
Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms:
http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/
Re: PDO / MySQLi / MySQLND
#gary
You cannot just say "don't use MySQLi", as they have different goals. PDO is almost an abstraction layer (although it is not actually one) and is designed to make it easy to use multiple database products, whereas MySQLi is specific to MySQL connections. It is wrong to say that PDO is the modern access layer in the context of comparing it to MySQLi, because your statement implies that the progression has been mysql -> mysqli -> PDO, which is not the case.
The choice between MySQLi and PDO is simple - if you need to support multiple database products then you use PDO. If you're just using MySQL then you can choose between PDO and MySQLi.
So why would you choose MySQLi over PDO? See below...
#ross
You are correct about mysqlnd, which is the newest MySQL core-language-level library; however, it is not a replacement for MySQLi. MySQLi (as with PDO) remains the way you interact with MySQL from your PHP code. Both of these use libmysql as the C client behind the PHP code. The problem is that libmysql is outside the core PHP engine, and that is where mysqlnd comes in: it is a native driver that uses the core PHP internals to maximise efficiency, specifically where memory usage is concerned.
MySQLnd is being developed by MySQL themselves and has recently landed onto the PHP 5.3 branch which is in RC testing, ready for a release later this year. You will then be able to use MySQLnd with MySQLi...but not with PDO. This will give MySQLi a performance boost in many areas (not all) and will make it the best choice for MySQL interaction if you do not need the abstraction like capabilities of PDO.
That said, MySQLnd is now available in PHP 5.3 for PDO and so you can get the advantages of the performance enhancements from ND into PDO, however, PDO is still a generic database layer and so will be unlikely to be able to benefit as much from the enhancements in ND as MySQLi can.
Some useful benchmarks can be found here although they are from 2006. You also need to be aware of things like this option.
There are a lot of considerations to take into account when deciding between MySQLi and PDO. In reality it is not going to matter until you get to ridiculously high request numbers, and in that case it makes more sense to use an extension that has been specifically designed for MySQL rather than one that abstracts things and happens to provide a MySQL driver.
It is not a simple matter of which is best, because each has advantages and disadvantages. You need to read the links I've provided, come to your own decision, then test it and find out. I have used PDO in past projects and it is a good extension, but my choice for pure performance would be MySQLi with the new mysqlnd option compiled in (when PHP 5.3 is released).
General
Do not try to optimize before you start to see real world load. You might guess right, but if you don't, you've wasted your time.
Use jmeter, xdebug or another tool to benchmark the site.
If load starts to be an issue, either object or data caching will likely be involved, so generally read up on caching options (memcached, MySQL caching options)
Code
Profile your code so that you know where the bottleneck is, and whether it's in code or the database
Databases
Use MySQLi if portability to other databases is not vital, PDO otherwise
If benchmarks reveal the database is the issue, check the queries before you start caching. Use EXPLAIN to see where your queries are slowing down.
After the queries are optimized and the database is cached in some way, you may want to use multiple databases. Either replicating to multiple servers or sharding (splitting the data over multiple databases/servers) may be appropriate, depending on the data, the queries, and the kind of read/write behavior.
Caching
Plenty of writing has been done on caching code, objects, and data. Look up articles on APC, Zend Optimizer, memcached, QuickCache, JPCache. Do some of this before you really need to, and you'll be less concerned about starting off unoptimized.
APC and Zend Optimizer are opcode caches, they speed up PHP code by avoiding reparsing and recompilation of code. Generally simple to install, worth doing early.
Memcached is a generic cache, that you can use to cache queries, PHP functions or objects, or entire pages. Code must be specifically written to use it, which can be an involved process if there are no central points to handle creation, update and deletion of cached objects.
QuickCache and JPCache are file caches, otherwise similar to Memcached. The basic concept is simple, but also requires code and is easier with central points of creation, update and deletion.
Miscellaneous
Consider alternative web servers for high load. Servers like lighttpd and nginx can handle large amounts of traffic in much less memory than Apache, if you can sacrifice Apache's power and flexibility (or if you just don't need those things, which often you don't).
Remember that hardware is surprisingly cheap these days, so be sure to cost out the effort to optimize a large block of code versus "let's buy a monster server."
Consider adding the "MySQL" and "scaling" tags to this question
APC is an absolute must. Not only does it make for a great caching system, but the gain from the auto-cached PHP files is a godsend. As for the multiple database idea, I don't think you would get much out of having different databases on the same server. It may give you a bit of a gain in speed during query time, but I doubt the effort it would take to deploy and maintain the code for all three while making sure they are in sync would be worth it.
I also highly recommend running Xdebug to find bottlenecks in your program. It made optimization a breeze for me.
Firstly, as I think Knuth said, "Premature optimization is the root of all evil". If you don't have to deal with these issues right now then don't, focus on delivering something that works correctly first. That being said, if the optimizations can't wait.
Try profiling your database queries: figure out what's slow and what happens a lot, and come up with an optimization strategy from that.
I would investigate Memcached as it's what a lot of the higher load sites use for efficiently caching content of all types, and the PHP object interface to it is quite nice.
Splitting up databases among servers and using some sort of load balancing technique (e.g. generate a random number between 1 and # redundant databases with necessary data - and use that number to determine which database server to connect to) can also be an excellent way to increase efficiency.
These have all worked out pretty well in the past for some fairly high load sites. Hope this helps to get you started :-)
Profiling your app with something like Xdebug (like tj9991 recommended) is definitely going to be a must. It doesn't make a whole lot of sense to just go around optimizing things blindly. Xdebug will help you find the real bottlenecks in your code so you can spend your optimization time wisely and fix chunks of code that are actually causing slow downs.
If you're using Apache, another utility that can help in testing is Siege. It will help you anticipate how your server and application will react to high loads by really putting it through its paces.
Any kind of opcode cache for PHP (like APC or one of the many others) will help a lot as well.
I run a website with 7-8 million page views a month. Not terribly much, but enough that our server felt the load. The solution we chose was simple: Memcache at the database level. This solution works well if the database load is your main problem.
We started out using Memcache to cache entire objects and the database results that were most frequently used. It did work, but it also introduced bugs (we might have avoided some of those if we had been more careful).
So we changed our approach. We built a database wrapper (with the exact same methods as our old database, so it was easy to switch), and then we subclassed it to provide memcached database access methods.
Now all you have to do is decide whether a query can use cached (and possibly out of date) results or not. Most of the queries run by the users are now fetched directly from Memcache. The exceptions are updates and inserts, which for the main website only happens because of logging. This rather simple measure reduced our server load by about 80%.
For what it's worth, caching is DIRT SIMPLE in PHP even without an extension/helper package like memcached.
All you need to do is create an output buffer using ob_start().
Create a global cache function. Before generating the page, look for a cached version; if one exists, serve it and end. Otherwise call ob_start() and pass your save function as the callback.
If no cached copy exists, the script continues processing. When it reaches the matching ob_end_flush(), PHP calls the function you specified. At that point you just take the contents of the output buffer, drop them in a file, save the file, and end.
Add in some expiration/garbage collection.
And many people don't realize you can nest ob_start()/ob_end_flush() calls. So if you're already using an output buffer to, say, parse in advertisements or do syntax highlighting or whatever, you can just nest another ob_start()/ob_end_flush() pair.
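A bare-bones version of the whole scheme (paths are invented; no expiration logic shown):

```php
<?php
// Full-page file cache built on ob_start()'s callback.
$cacheFile = '/tmp/pagecache/' . md5($_SERVER['REQUEST_URI']) . '.html';

if (is_file($cacheFile)) {       // cached copy exists: serve it and stop
    readfile($cacheFile);
    exit;
}

function save_page_cache($html) {
    global $cacheFile;
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;                // the buffer is still sent to the client
}

ob_start('save_page_cache');
// ... generate the page as usual ...
ob_end_flush();                  // fires the callback above
```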
As for whether using multiple caching systems is counter-effective: actually, many do use APC and memcached together...
It looks like I was wrong. MySQLi is still being developed. But according to the article, PDO_MySQL is now being contributed to by the MySQL team. From the article:
The MySQL Improved Extension - mysqli - is the flagship. It supports all features of the MySQL Server including charsets, prepared statements and stored procedures. The driver offers a hybrid API: you can use a procedural or object-oriented programming style based on your preference. mysqli comes with PHP 5 and up. Note that the end of life for PHP 4 is 2008-08-08.

The PHP Data Objects (PDO) are a database access abstraction layer. PDO allows you to use the same API calls for various databases. PDO does not offer any degree of SQL abstraction. PDO_MYSQL is a MySQL driver for PDO. PDO_MYSQL comes with PHP 5. As of PHP 5.3 MySQL developers actively contribute to it. The PDO benefit of a unified API comes at the price that MySQL-specific features, for example multiple statements, are not fully supported through the unified API.

Please stop using the first MySQL driver for PHP ever published: ext/mysql. Since the introduction of the MySQL Improved Extension - mysqli - in 2004 with PHP 5, there is no reason to still use the oldest driver around. ext/mysql does not support charsets, prepared statements and stored procedures. It is limited to the feature set of MySQL 4.0. Note that the Extended Support for MySQL 4.0 ends at 2008-12-31. Don't limit yourself to the feature set of such old software! Upgrade to mysqli; see also Converting_to_MySQLi. mysql is in maintenance-only mode from our point of view.
To me, it seems the article is biased towards MySQLi. I suppose I'm biased towards PDO.
I really like PDO over MySQLi. It's straightforward to me. The API is a lot closer to other languages I've programmed in. OO database interfaces seem to work better.
I haven't come across any specific MySQL features that weren't available through PDO. I would be surprised if I ever did.
PDO is also very slow and its API is pretty complicated. No one in their right mind should use it if portability is not a concern. And let's face it, in 99% of all webapps it is not - you just stick with MySQL or PostgreSQL, or whatever it is you are working with.
As for the PHP question and what to take into account. I think premature optimization is the root of all evil. ;) Get your application done first, try to keep it clean when it comes to programming, do a little documentation and write unit tests. With all of the above you will have no issues refactoring code when the time comes. But first you want to be done and push it out to see how people react to it.
Sure, PDO is nice, but there has been some controversy about its performance versus mysql and mysqli, although that seems fixed now.
You should use PDO if you envision portability; if not, mysqli is the way. It has an OO interface, prepared statements, and most of what PDO offers (except, well, portability).
Plus, if performance is really needed, prepare for the (native MySQL) mysqlnd driver in PHP 5.3, which will be much more tightly integrated with PHP, with better performance and improved memory usage (and statistics for performance tuning).
Memcache is nice if you have clustered servers (and YouTube-like load), but I'd try out APC first too.
A lot of good answers were given already, but I would like to point you to an alternate opcode cache called XCache. It is created by a lighty contributor.
Also, if you may need load balancing your database server in future, MySQL Proxy could very well help you to achieve this.
Both of those tools should plug into an existing application quite easily, so this optimization can be done when you need it, without too much hassle.
First question: how big do you really expect it to be? And how much do you plan to invest in your infrastructure? Since you feel the need to ask the question here, I'm guessing you expect to start small on a limited budget.
Performance is irrelevant if the site is not available. And for availability you need horizontal scaling. The minimum you can sensibly get away with is two servers, both running Apache, PHP and MySQL. Set up one DBMS as a slave to the other. Do all the writes on the master, and all the reads on the local database (whichever that is) - unless for some reason you need to read back the data you've just written (in which case use the master). Make sure you've got the machinery in place to automatically promote the slave and fence the master. Use round-robin DNS for the webserver addresses to give more affinity to the slave node.
Partitioning your data across different database nodes at this stage is a very bad idea - however you might want to consider splitting it across different databases on the same server (which will facilitate partitioning across nodes when you overtake facebook).
Do make sure you've got monitoring and data analysis tools in place to measure your site's performance and identify bottlenecks. Most performance problems can be fixed by writing better SQL / fixing the database schema.
Keeping your template cache in the database is a dumb idea - the database should be a central common repository for structured data. Keep your template cache on the local filesystem of your webservers - it will be available faster and won't slow down your database access.
Do use an op-code cache.
Spend plenty of time studying your site and its logs to understand why it's going so slow.
Push as much caching as possible onto the client.
Use mod_gzip to compress everything you can.
C.
My first piece of advice is to think about this issue and keep it in mind when designing the site, but don't go overboard. It's often difficult to predict the success of a new site, and your time will be better spent getting it finished early and optimising later.
In general, Simple is fast.
Templates slow you down. Databases slow you down. Complex libraries slow you down. Layer templates over each other, retrieve them from a database, and parse them in a complex library, and the time delays multiply with each other.
Once you have the basic site up and running do tests to show you where to spend your efforts. It's difficult to see where to target. Often to speed things up you will have to unravel the complexity of the code, this makes it larger and harder to maintain, so you only want to do it where necessary.
In my experience, establishing the database connection is relatively expensive. If you can get away with it, don't connect to the database for general visitors on the most trafficked pages, like the site's front page. Creating multiple database connections is madness with very little benefit.
#Gary
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
I'm looking over PDO at the moment and it looks like you're right - however, I know that MySQL are developing the mysqlnd extension for PHP, I think to succeed either mysql or mysqli - what do you think about that?
#Ryan, Eric, tj9991
Thanks for the advice on PHP's caching extensions - could you explain reasons for using one over another? I've heard great things about memcached through IRC but have never heard of APC - what are your opinions on them? I assume using multiple caching systems is pretty counter-effective.
I will definitely be sorting out some profiling testers - thank you very much for your recommendations on those.
I don't see myself switching from MySQL anytime soon - so I guess I don't need the abstraction capabilities of PDO. Thanks for those articles DavidM, they've helped me a lot.
Look into mod_cache, an output cache for the Apache web server, similar to the output caching in ASP.NET.
Yes, I can see that it's still experimental, but it will be final someday.
I can't believe no-one has already mentioned this: Modularisation and Abstraction. If you think your site is going to have to grow to lots of machines, you must design it so it can! That means stupid things like don't assume the database is on localhost. It also means things that are going to be a bother at first, like writing a database abstraction layer (like PDO, but much much lighter because it only does what you need it to do).
And it means things like working with a framework. You will need layers to your code so that you can later gain performance by refactoring the data-abstraction layer, for example, by teaching it that some objects are in a different database -- and the code doesn't have to know or care.
Finally, be careful of memory-intensive operations, for example, unnecessary string copying. If you can keep PHP's memory usage down, then you will get more performance out of your webserver and this is something that will scale when you go to a load-balanced solution.
If you are working with large amounts of data and caching isn't cutting it, look into Sphinx. We've had great results using SphinxSearch not only for better text searching, but also as a data-retrieval replacement for MySQL when dealing with larger tables. If you use SphinxSE (the MySQL plugin), it surpassed the performance gains we had from caching several times over, and application implementation is a cinch.
The points made about cache are spot-on; it is the least complicated and most important part of building an efficient application. I'd like to add that while memcached is great, APC is about five times faster if your application lives on a single server.
The "Cache Performance Comparison" post at the MySQL performance blog has some interesting benchmarks on the subject - http://www.mysqlperformanceblog.com/2006/08/09/cache-performance-comparison/.