What performance improvements can I make to Symfony + Doctrine?

I've recently built a reasonably complex database application for a university, handling almost 200 tables. Some tables (e.g. Publications) can hold 30 or more fields and store 10 one-to-one FK relations and up to 2 or 3 many-to-many relations (using cross-reference tables). I use integer IDs throughout, and normalisation has been key every step of the way. AJAX is minimal and most pages are standard CRUD forms/processes.
I used Symfony 1.4, and Doctrine ORM 1.2, MySQL, PHP.
While the benefits to development time and ease of maintenance have been enormous (in using an MVC and an ORM), we've been having problems with speed. That is, when we have more than a few users logged in and active at any one time, the application slows considerably (up to 20 seconds to save or edit a record).
We're currently in discussions with our sysadmin, but they say we should have more than enough power. With 6 or more users engaged in activity we end up saturating all 4 CPUs of a virtual server environment, while memory usage stays low (no leaks).
Of course, we're considering multi-threading our MySQL application (if this will help), refining our code (though much of it is generated by the MVC) and refining our cache usage (this could be better, though the majority of each screen is user-login-specific and dynamic); we've installed APC and extra memory, de-fragmented our database, and I've tried unsetting all recordsets (though I understand this is now automatic within the ORM) and triggering manual garbage collection...
But the question I'm asking is whether MySQL, PHP, and the Symfony MVC were actually a poor choice for developing an application of this size in the first place? If so, what do people normally use/recommend for a web-based database interface application of this size/complexity?

I don't have any experience with Symfony or Doctrine. However, there are certainly larger sites than yours that are built on these projects. DailyMotion for instance uses both, and it serves a heck of a lot more than a few simultaneous users.
Likewise, PHP and MySQL are used on sites as large as Wikipedia, so scalability should not be a problem. (BTW, MySQL should be multithreaded by default—one thread per connection.)
Of course, reasonable performance is relative. If you're trying to run a site serving 200 concurrent requests off a beige box sitting in a closet, then you may need to optimize your code a lot more than a Fortune 500 company with their own data centers.
The obvious thing to do is actually profile your application and see where the bottleneck is. Profiling MySQL queries is pretty simple, and there are also assorted PHP profiling tools like Xdebug. From there you can figure out whether you should switch frameworks, ORMs (or ditch the ORM altogether), databases, refactor some code, or just invest in more processing power.
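If you want a concrete starting point, here is a minimal sketch, not a definitive setup: it switches on MySQL's slow query log at runtime (this works on MySQL 5.1+ and needs a privileged account; the DSN, credentials and thresholds are placeholders) and wraps a suspect block of PHP in a crude timer.

<?php
// Enable the slow query log so slow statements get recorded (MySQL 5.1+).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->exec("SET GLOBAL slow_query_log = 'ON'");
$pdo->exec("SET GLOBAL long_query_time = 1"); // log anything slower than 1 second

// Crude wall-clock timer around a block of code under suspicion.
$start = microtime(true);
// ... code under suspicion ...
error_log(sprintf('Block took %.3f s', microtime(true) - $start));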

The best way to speed up complex database operations is not calling them :).
So analyze which parts of your application you can cache.
Symfony has a pretty good caching system (even better in Symfony2) that can be used quite granularly.
On the database side, another option is to use views or nested sets to store aggregated data.
Try to find parts where this is appropriate.
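For the Doctrine 1.2 stack mentioned in the question, one hedged sketch of "caching parts of the application" is the ORM's result cache; the method names follow the Doctrine 1.x API, while the query and lifetime are purely illustrative:

<?php
// Register APC as the result-cache driver once, at bootstrap.
$manager = Doctrine_Manager::getInstance();
$manager->setAttribute(Doctrine_Core::ATTR_RESULT_CACHE, new Doctrine_Cache_Apc());

// Opt individual read-heavy queries into the cache.
$publications = Doctrine_Query::create()
    ->from('Publication p')
    ->leftJoin('p.Authors a')
    ->useResultCache(true, 3600) // serve this result from APC for an hour
    ->execute();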

Related

Can Laravel 5 handle 1000 concurrent users without lagging badly?

I wanted to know: if 1000 users are concurrently using a website built with Laravel 5 and also querying the database regularly, how does Laravel 5 perform?
I know it would be slow, but would it be so slow as to be unbearable?
Note that I am also going to use AJAX a lot.
And let's assume I am using the DigitalOcean cloud service with the following configuration:
2GB memory
2 vCPU
40GB SSD
I don't expect completely exact figures, as that is impossible, but please at least give some indication of whether I should go with Laravel and expect acceptable performance.
Please also provide some tools through which I can check the speed of my Laravel 5 application and how it will perform under actual load, as well as other tools through which I can test speed and performance.
And it would be great if someone has real experience of using Laravel, especially Laravel 5.
And what about Lumen: does that really make applications faster than Laravel, and by how much?
In short, yes. At least newer versions of Laravel are capable (Laravel 7.*).
That being said, this is really a three-part conundrum.
1. Laravel (Php)
High Concurrency Architecture And Laravel Performance Tuning
Honestly, I wouldn't be able to provide half the detail this amazing article does. It's got everything in there, from the definition of concurrency all the way to pre-optimization vs. post-optimization timings.
2. Reading, Writing, & Partitioning Persisted Data (Databases)
MySQL vs. MongoDB
I'd be curious whether the real concern is PHP's Laravel, or more of a database read/write speed bottleneck. Non-relational databases are an incredible technology that benefits big data much more than traditional relational databases.
Non-relational databases (Mongo) have a much faster read speed than MySQL (something like 60% faster, if I'm remembering correctly)
Non-relational databases (Mongo) do have a slower write speed, but this is usually not an inhibitor to the user experience
Unlike relational databases (MySQL), MongoDB can truly be partitioned, spread out across multiple servers
MongoDB has collections of documents; collections are pretty synonymous with tables, and documents are pretty synonymous with rows
The difference being, MongoDB has a very JSON-like feel to it (collections of documents, where each document looks like a JSON object).
The huge difference, and benefit, is that each document (AKA row) does not have to have the same keys. When using MongoDB on a Fortune 500 project, my mentor and lead at the time, Logan, had a phenomenal quote.
"Mongo Don't Care"
This means that you can shape the data how you're wanting to retrieve it, so not only is your read speed faster, you're usually not being slowed by having to retrieve data from multiple tables.
Here's a package, personally tested and loved, for setting up MongoDB within Laravel:
Jenssegers ~ MongoDB In Laravel
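As a taste of that package, here is a minimal, hedged sketch of a Mongo-backed Eloquent model; the class path follows recent versions of jenssegers/laravel-mongodb, while the News model and its fields are invented for illustration:

<?php
use Jenssegers\Mongodb\Eloquent\Model;

// A model stored in a MongoDB collection instead of a MySQL table.
class News extends Model
{
    protected $connection = 'mongodb'; // connection defined in config/database.php
    protected $collection = 'news';    // documents here need not share the same keys
}

// Queried like any other Eloquent model:
$latest = News::where('published', true)->orderBy('created_at', 'desc')->take(20)->get();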
If you are concerned about immense amounts of users and transferring data, MongoDB may be what you're looking for. With that, let's move on to the 3rd, and most important point.
3. Serverless Architecture (Aka Horizontal scaling)
AWS, Google Cloud, Microsoft Azure, etc... I'm sure you've heard of The Cloud.
This, ultimately, is what you're looking for if you're having concurrency issues and want to stay within the realm of Laravel.
It's a whole new world of incredible tools one can hammer away at; they're awesome. It's also a whole new, quite large, world of tools and thought to learn.
First, let's dive into a few serverless concepts.
Infrastructure as Code Terraform
"Use Infrastructure as Code to provision and manage any cloud, infrastructure, or service"
Horizontal Scaling Example via The Cloud
"Create a Laravel application. It's a single application, monolithic. Then you dive Cloud. You discover Terraform. Ahaha, first you use terraform to define how many instances of your app will run at once. You decide you want 8 instances of your application. Next, you of course define a Load Balancer. The Load Balancer simply balances the traffic load between your 8 application instances. Each application is connected to the same database, ultimately sharing the same data source. You're simply spreading out the traffic across multiples instances of the same application."
We can of course top that, very simplified answer of cloud, and dive into lambdas, the what Not to do's of serverless, setting up your internal virtual cloud network...
Or...we can thank the Laravel team in advance for simplifying Serverless Architecture
Yep, Laravel Simplified Serverless (Shout out Laravel team)
Serverless Via Laravel Vapor
Laravel Vapor Opening Paragraph
"Laravel Vapor is an auto-scaling, serverless deployment platform for Laravel, powered by AWS Lambda. Manage your Laravel infrastructure on Vapor and fall in love with the scalability and simplicity of serverless."
Coming to a close, let's summarize.
Original Concern
Ability to handle a certain amount of traffic in a set amount of time
Potential bottlenecks, with potential solutions:
Laravel & PHP: High Concurrency Architecture And Laravel Performance Tuning
Database & persisting/retrieving data efficiently: Jenssegers ~ MongoDB In Laravel
Serverless architecture for horizontal scaling: Serverless Via Laravel Vapor
I'll try to answer this based on my experience as a software developer. To be honest, I would definitely ask for an upgrade once it hits 1000 concurrent users, because I won't take a risk with server failure or data loss.
But let's break down how to engineer this.
It could handle those users if the data fetched is not complex and there are not many operations in the Laravel code. If data is just passing through from the database with almost no modification, it'll be fast.
It also helps if the data fetched by the users is not unique. Say you have a news site without user personalization: the news that users fetch will mostly be the same. If you cache that data in memory (Redis, which I recommend) or at the web server (Nginx; should be avoided), your Laravel program will run fast enough (see the sketch below, after these points).
Querying the database directly is faster than going through the Laravel ORM. You might consider it if needed, but I myself will always try to use the ORM, because it helps keep the code readable and secure.
Splitting the database, web server, CDN, and cache server obviously makes it easier to monitor server usage.
Try to upgrade to the latest version. I used to work at a company that was using version 5, and it's not really good on performance.
Opcode caching: cache the compiled PHP code. I myself have never used this.
Split the app into a backend and a frontend. Use state management in the frontend app to reduce the data requests made to the server.
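To make the caching point above concrete, a hedged sketch using Laravel's cache facade with a Redis backend; Cache::remember is real Laravel API, but the key, TTL and News model are invented for illustration (also note the TTL argument is minutes before Laravel 5.8 and seconds from 5.8 on):

<?php
use Illuminate\Support\Facades\Cache;

// Serve the front-page news from the cache; only hit the database on a miss.
$news = Cache::remember('news.frontpage', 600, function () {
    return \App\News::latest()->take(20)->get();
});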
Now let's answer your questions.
Are there any tools for checking performance? You can check Laravel Debugbar; it provides simple performance reports. I myself encourage you to write a test for each feature you create; you can generate reports from those unit tests to find which features take the longest to finish.
Is Lumen faster than Laravel? Yes, Lumen is faster, because it disables some of Laravel's features. But please be aware that Taylor seems likely to stop developing Lumen; you should consider this for the future.
If performance is your main concern, you might not choose Laravel for development at all, for two reasons:
There is a delay between servers while opening connections. Whenever a user makes a request, they open a connection to the server, and the server opens connections to the cache server, database server, SMTP server, and probably other third parties as well. This is the real bottleneck in Laravel. You can use keep-alive connections to the database and cache server to reduce connection delay, but you won't find that in Laravel, because it disposes of the connections whenever the request finishes.
PHP is an interpreted, dynamically typed language, and compiled languages are mostly faster.
In the end, you might not be able to use Laravel with those server resources unless you're creating a really simple application.
A question like this needs an answer with real numbers:
luckily, this guy has already done it, under conditions similar to those you want to try, with Laravel Forge.
With this PHP config:
opcache.enable=1
opcache.memory_consumption=512
opcache.interned_strings_buffer=64
opcache.max_accelerated_files=20000
opcache.validate_timestamps=0
opcache.save_comments=1
opcache.fast_shutdown=1
the results:
Without Sessions:
Laravel: 609.03 requests per second
With Sessions:
Laravel: 521.64 requests per second
So, answering your question: with that amount of memory you would be in trouble handling 1000 users making requests; use 4 GB of memory and you will be in better shape.
I don't think you can do that with Laravel.
I tried benchmarking Laravel on an 8-core CPU, 8 GB RAM and a 120 GB HDD, and I only got 200-400 requests per second.

Why is Symfony2 performing so bad in benchmarks and does it matter?

My colleagues and I are in the process of choosing a web framework to develop a high-traffic web site. We are really good with Node.js + Express and PHP + Symfony2. Both are great frameworks, but we are a bit concerned about Symfony2, because it seems to be outperformed by most web frameworks out there.
Here are the benchmarks that show it:
http://www.techempower.com/benchmarks/
For this reason we will probably use Node.js + Express, but I still wonder why Symfony2 performs so badly in benchmarks.
In the end it all comes down to correct cache handling...
Symfony, and PHP in general, IS slower than many other languages and frameworks, but in exchange it provides you with the tools to create rich, secure and testable web applications really fast.
If you use a reverse proxy like Varnish with ESI (edge side includes), and end up serving through Symfony only the parts of your templates that really need to be updated, you will have a blazingly fast experience.
Furthermore, if you use an opcode cache like APC and an optimized database, a human user won't actually notice the difference of a few ms in a real-world application.
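To make that concrete, here is a hedged sketch of marking a Symfony response cacheable for a reverse proxy like Varnish; setPublic() and setSharedMaxAge() are HttpFoundation methods, while the TTL is illustrative:

<?php
use Symfony\Component\HttpFoundation\Response;

$response = new Response($html);
$response->setPublic();          // shared caches (e.g. Varnish) may store it
$response->setSharedMaxAge(600); // and may serve it for 10 minutes without hitting PHP
return $response;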
As requested, I will dive a bit deeper and give you a few more things to think about.
Caching & Performance
With cloud services (S3, EC2, GAE, ...) available at almost no cost, paired with load balancers, easy provisioning (Chef, Puppet, ...) and all this funky stuff, it has become easy and affordable even for small companies to run and manage large-data and/or high-traffic applications.
More storage means more space for cache; more computing power means quicker cache warming.
Things you will often hear when people talk about PHP or framework performance:
facebook runs with php
youp**n was developed with symfony
...
So why do these sites not break down completely? Because their caching routines are clever.
facebook
Do you know, for instance, what facebook does when you write a status update?
It does not simply save it to a database table of all your status updates, so that when a friend visits his stream all the statuses from all his friends have to be fetched from the database before being served.
Instead, facebook writes your status into each of your friends' news streams and starts warming their caches. Now all those streams are prepared for serving, and whenever one of your friends visits his stream he is served a cached version, instantly, with almost no code execution involved. The stream will only show your newly created status once the cache warming has finished; we are talking about ms here...
What does this tell us? In modern, highly frequented applications almost everything is served from cache, and the user will not notice whether the actual computing of the page took 1 ms or 5 seconds.
In a "real-world" scenario the end user will notice no difference in req/sec between frameworks. Even with simple stuff like micro-caching, your VPS-hosted blog won't go down instantly once you make it onto Hacker News's landing page.
In the end, the more important thing is: does my framework provide the tools, the documentation, the tutorials and the examples to get this whole thing up and running quickly and easily? Symfony does for me!
If you're stuck ... how many people are there willing and able to answer your performance-related questions?
How many real-world applications have already been or will in the near future be created with this framework?
you choose a community by choosing a framework !
... okay, that's it for the "does it matter" part ... now back to these benchmarks :)
Benchmarks & Setups
With all these shiny colors and fancy graphs in the benchmark, you easily miss the fact that only one setup (web server, database, ...) is tested for each of these frameworks, while you can have a wide variety of configurations for each of them.
Example: instead of using symfony2+doctrineORM+mysql you could also use symfony+doctrineODM+MongoDB.
MySQL ... MongoDB ... Relational Databases ... NoSQL Databases ... ORM ... micro ORMs ... raw SQL ... all mixed up in these configurations ------> apples and oranges.
Benchmarks & Optimization
A common problem with almost all benchmarks found around the web, even those comparing only PHP frameworks, and also with these "TechEmpower Web Framework Benchmarks", is unequal optimization.
These benchmarks don't make use of possible optimizations on those frameworks (well known to experienced developers) ... at least for Symfony2 and their tests, this is a fact.
A few examples regarding the symfony2 setup used in their latest tests:
"composer install" is not being called with the -o flag to dump an optimized classmap autoloader (code)
Symfony2 will not use APC cache for Doctrine metadata annotations without apc_cli = 1 ( issue )
the whole DI container is injected in the controller instead of only the few necessary services
hereby setter injection is used -> creates object then call setContainer() method instead of injecting the container directly into the constructor (see: BenchController extends Controller extends ContainerAware)
an alias ( $this->get('service_name') ) is used to retrieve services from the container instead of accessing it directly ($this->container->get('service_name') ). ( code )
...
the list continues ... but I guess you can see where this is leading; see the sketch below for points 3 and 4. 90 open issues by now ... an endless story.
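As promised, a hedged sketch of points 3 and 4 from the list above: a controller that receives only the one service it needs through its constructor, instead of having the whole container set on it. The service and template names are invented for illustration:

<?php
class BenchController
{
    private $templating;

    // Constructor injection: the container wires in exactly one service.
    public function __construct($templating)
    {
        $this->templating = $templating;
    }

    public function indexAction()
    {
        // No $this->get('...') alias, no container lookup at runtime.
        return $this->templating->renderResponse('bench.html.twig');
    }
}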
Development & Resources
Resources like servers and storage are cheap. Really cheap ... compared to development time.
I am a freelancer charging fairly common rates. You can either get 2-3 days of my time ... or a sh**load of computing power and storage!
When choosing a framework, you are also choosing a toolkit for rapid development - a weapon for your fight against the never completely satisfied, feature-creeping customer ... who will pay you well for his wishes.
As an agency (or a freelancer) you want to build feature-rich applications in short time. You will face points where you are stuck with something ... maybe a performance related issue. But you are facing development costs and time as well.
What will be more expensive? An additional server or an additional developer?
This blog answers the second part of your question:
http://symfony.com/blog/is-symfony-too-slow-for-real-world-usage
Dismissing symfony because the speed of a "hello, world" test is not as good as with FooBar framework is a mistake. Raw speed is not the key factor for professionals. Cost is the key factor. And the cost of development, hosting and maintenance of an application with symfony is less than what it is for other solutions.
When choosing a framework, one should consider the total costs of development. That means looking at the code quality of the framework (unit tests, documentation, etc.), the performance (and hosting costs), the quantity and quality of features it has out of the box, the size of the community, usage by organizations like yours, scalability, etc.
As a Symfony developer, I passionately hate WordPress from a technical point of view. But I'll still recommend (and even use!) it for a simple website. Not just because of its popularity, but because of the size of its community: it's very easy to hire a WordPress designer/developer. Looking at a performance comparison between WordPress and Symfony wouldn't make any sense in this case.

How to optimize this website. Need general advices

This is my first question here, and it is about optimizing a specific website.
A few months ago we launched [site] for one of our clients; it is a kind of community website.
Everything works great, but now this website is getting bigger and it shows some slowness when the pages are loading.
The server specs:
PHP 5.2.1 (I think we need to upgrade to 5.3 to make use of the new garbage collector)
Apache 2.2
Quad-core Xeon processor @ 2.8 GHz and 4 GB DDR3 RAM
XCache 1.3 (we added this a few months ago)
MySQL 5.1 (we are using InnoDB as the engine)
CodeIgniter framework
Here is what we have done so far and what we intend to do next:
Besides XCache, we don't really use a caching mechanism, because most of the content comes live, and besides, we didn't want to optimize prematurely, since we didn't know what to expect in terms of traffic.
On the other hand, we have installed memcached and we want to implement a cache system based on it.
Regarding the database structure, we have reached 3NF with most of our tables, and yes, we have some slow queries (which we plan to optimize), but I think that's because the tables producing the slow queries are the ones for blog comments (~44,408 rows), user log tracking (~725,837 rows), user comments (~698,964 rows) etc., which are quite big tables. The entire database is 697.4 MB in size for now.
Also, here are some stats for January 2011:
Monthly unique visitors: 127,124
Monthly unique views: 4,829,252
Monthly unique visits: 242,708
Daily average:
Unique new visitors: 7,533
Unique new views: 179,680
Just let me know if you need more details.
Any advice is highly appreciated.
Thank you.
When it comes to performance issues, there is no golden rule or labelled sticky note that says the problem is in the database. What I could suggest is to do performance profiling; there are many free and paid tools on the Internet that let you do so.
First, start off with the web server layer: make sure everything there is done correctly and optimized as far as possible.
Then move on to the next layer (which I assume is your database). Normally, whenever someone mentions InnoDB MySQL, we assume indexes have been created to optimize search operations. How indexes are used is also quite important, because you don't want to index the wrong thing and make matters worse. My advice is to get a DBA-equivalent person to troubleshoot in a staging environment.
Another thing you could look at is the content, from the web page content to the database data: make sure you show/keep data only where it's needed, don't store unnecessary information in the database, and use a smart layout on the web page. Cutting a second or two can make a big difference in usability and response time.
It is very hard to go into detail here unless we have in-depth information about your application, its architecture and your environment, but the above are some commonly used directions people take to troubleshoot such incidents.
Good luck!
This site has excellent resources http://www.websiteoptimization.com/
The books that are mentioned are excellent. There are just too many techniques to list here and we do not know what you have tried so far.
Sorry for the delay, guys; I have been very busy tracking down the issue, and I found it.
Well, the problem was mostly Apache: I had an access log of almost 300 GB, which at midnight was parsed to generate Webalizer stats. While this was happening, the website was very, very slow. I disabled Webalizer for the domain, cleared the logs, and what do you know, it is very fast again, no matter the hour you access it.
I now have just a few slow queries left, which I intend to fix today.
I also updated to CI 2.0 Reactor as suggested and started using the memcached driver.
Who would have known that Apache logs could be so problematic...
Based on the stats, I don't think you are hitting load problems... on a hunch, I would look to the database first. Database partitioning might be a good place to start.
But you should really do some profiling of your application first. How much time is spent in the application versus database. Are there application methods that are using lots of time and just need some tweaking? Are database queries not written efficiently? Do you need more or better database indices?
Everything looks pretty good. If upgrading CodeIgniter is an option, the new CodeIgniter 2.0 (Reactor) adds support for memcache (a new Cache driver with file system, APC and memcache support). Granted, you're already using XCache, but these new additions may be worth looking at.
When cached objects weren't enough for our multi-domain platform that saw huge traffic, we went the route of throwing more hardware at it: RAM, servers/databases. Then we moved to database clustering to handle forecasted heavy load on single accounts. And now we're switching from Apache to Nginx... It's a never-ending battle, but what worked for us was being smart about what we cached and increasing server memory, then distributing the load across servers...
Cache as many database calls as you can. In my CI application I have a settings table that rarely changes, so I cache all calls made to it, as I am constantly querying the settings table.
Cache your views and even your controllers as well. I tend to cache basically as much as I can in my CI applications and then refresh the cache when a file changes (see the sketch after this list).
Only autoload important libraries, models and helpers. I've seen people autoload up to 10 libraries and, on top of that, a few helpers and then a model. You only really need to autoload the database and session libraries if you are using them.
Regarding point number 3: are you autoloading many things in your config/autoload.php file by any chance? It might help speed things up to load only what you need in your controllers as you need it, with the exception, of course, of the session and database libraries.
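For reference, a hedged sketch of the two CodeIgniter 2.x caching calls mentioned above; both are part of CI's documented API, while the durations and table name are illustrative:

<?php
// Full-page output cache: CI serves this controller's output from disk for 10 minutes.
$this->output->cache(10);

// Query cache around calls to the rarely-changing settings table.
$this->db->cache_on();
$settings = $this->db->get('settings')->result();
$this->db->cache_off();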

What is optimal hardware configuration for heavy load LAMP application

I need to run a Linux-Apache-PHP-MySQL application (the Moodle e-learning platform) for a large number of concurrent users: I am aiming at 5000. By concurrent I mean that 5000 people should be able to work with the application at the same time. "Work" means not only database reads but writes as well.
The application is not very typical, since it does a lot of inserts/updates on the database, so caching techniques don't help much. We are using the InnoDB storage engine. In addition, the application is not written with performance in mind. For instance, one Apache thread usually occupies about 30-50 MB of RAM.
I would be grateful for information on what hardware is needed to build a scalable configuration that can handle this kind of load.
We are currently using two HP DL380s, each with two 4-core processors, which can handle a much lower load (typically 300-500 concurrent users). Is it reasonable to invest in this kind of box and build a cluster out of them, or is it better to go with more high-end hardware?
I am particularly curious:
how many and how powerful servers are needed (number of processors/cores, size of RAM)
what network equipment should be used (what kind of switches, network cards)
any other hardware, like particular disk storage solutions, etc., that is needed
Another thing is how to put everything together, that is, what the most optimal architecture is. Clustering with MySQL is rather hard (people complain about MySQL Cluster, even here on Stack Overflow).
Once you get past the point where a couple of physical machines aren't giving you the peak load you need, you probably want to start virtualising.
EC2 is probably the most flexible solution at the moment for the LAMP stack. You can set up their VMs as if they were physical machines, cluster them, spin them up as you need more compute-time, switch them off during off-peak times, create machine images so it's easy to system test...
There are various solutions available for load-balancing and automated spin-up.
If you can make your app fit, you can get use out of their non-relational database engine as well. At very high loads, relational databases (and MySQL in particular) don't scale effectively. The peak load of SimpleDB, BigTable and similar non-relational databases can scale almost linearly as you add hardware.
Moving away from a relational database is a huge step though, I can't say I've ever needed to do it myself.
I'm not so sure about hardware, but from a software point-of-view:
With an efficient data layer that caches objects and collections returned from the database, I'd say a standard master-slave configuration would work fine. Route all writes to a beefy master and all reads to slaves, adding more slaves as required.
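A hedged sketch of that read/write split at the connection level; hostnames and credentials are placeholders, and a real data layer would hide this routing behind its own API:

<?php
$master = new PDO('mysql:host=db-master;dbname=moodle', 'user', 'pass');
$slave  = new PDO('mysql:host=db-slave1;dbname=moodle', 'user', 'pass');

// Crude routing: SELECTs go to a slave, everything else to the master.
function connectionFor($sql, PDO $master, PDO $slave)
{
    return stripos(ltrim($sql), 'SELECT') === 0 ? $slave : $master;
}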
Cache data as objects returned from your data mapper/ORM, not as HTML, and use Memcached as your caching layer. If you update an object, write to the DB and update it in Memcached; it's best to use the IdentityMap pattern for this. You'll probably need quite a few Memcached instances, although you could get away with running them on your web servers.
We could never get MySQL clustering to work properly.
Be careful with the SQL queries you write and you should be fine.
Piotr, have you tried asking this question on moodle.org yet? There are a couple of similarly scoped installations whose staff members currently answer there.
Also, depending on your deployment timeframe, you might want to check out the Moodle 2.0 line rather than the Moodle 1.9 line; it looks like it has a bunch of good fixes for some of the issues in Moodle's architecture.
Also: memcached rocks for this. PHP acceleration rocks for this. Server Fault is probably the better Stack Exchange site for this question, though.

Tactics for using PHP in a high-load site

Before you answer this I have never developed anything popular enough to attain high server loads. Treat me as (sigh) an alien that has just landed on the planet, albeit one that knows PHP and a few optimisation techniques.
I'm developing a tool in PHP that could attain quite a lot of users, if it works out right. However while I'm fully capable of developing the program I'm pretty much clueless when it comes to making something that can deal with huge traffic. So here's a few questions on it (feel free to turn this question into a resource thread as well).
Databases
At the moment I plan to use the MySQLi features in PHP5. However, how should I set up the databases in relation to users and content? Do I actually need multiple databases? At the moment everything's jumbled into one database, although I've been considering spreading user data into one, actual content into another, and finally core site content (template masters etc.) into a third. My reasoning behind this is that sending queries to different databases will ease the load on them, as one database = 3 load sources. Also, would this still be effective if they were all on the same server?
Caching
I have a template system that is used to build the pages and swap out variables. Master templates are stored in the database, and each time a template is called its cached copy (an HTML document) is used. At the moment I have two types of variable in these templates: a static var and a dynamic var. Static vars are usually things like page names or the name of the site - things that don't change often; dynamic vars are things that change on each page load.
My question on this:
Say I have comments on different articles. Which is the better solution: store the simple comment template and render comments (from a DB call) each time the page is loaded, or store a cached copy of the comments page as an HTML page and re-cache it each time a comment is added/edited/deleted?
Finally
Does anyone have any tips/pointers for running a high load site on PHP. I'm pretty sure it's a workable language to use - Facebook and Yahoo! give it great precedence - but are there any experiences I should watch out for?
No two sites are alike. You really need to get a tool like jmeter and benchmark to see where your problem points will be. You can spend a lot of time guessing and improving, but you won't see real results until you measure and compare your changes.
For example, for many years, the MySQL query cache was the solution to all of our performance problems. If your site was slow, MySQL experts suggested turning the query cache on. It turns out that if you have a high write load, the cache is actually crippling. If you turned it on without testing, you'd never know.
And don't forget that you are never done scaling. A site that handles 10 req/s will need changes to support 1000 req/s. And if you're lucky enough to need to support 10,000 req/s, your architecture will probably look completely different as well.
Databases
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
You probably don't want to break your database up at this point. If you do find that one database isn't cutting it, there are several techniques to scale up, depending on your app. Replicating to additional servers typically works well if you have more reads than writes. Sharding is a technique to split your data over many machines.
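A minimal sketch of the placeholder style being recommended; the DSN, credentials, table and $authorId are placeholders themselves:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

// Values never get concatenated into the SQL string.
$stmt = $pdo->prepare('SELECT id, title FROM articles WHERE author_id = ?');
$stmt->execute([$authorId]);
$articles = $stmt->fetchAll(PDO::FETCH_ASSOC);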
Caching
You probably don't want to cache in your database. The database is typically your bottleneck, so adding more I/O to it is typically a bad thing. There are several PHP caches out there that accomplish similar things, like APC and Zend.
Measure your system with caching on and off. I bet your cache is heavier than serving the pages straight.
If it takes a long time to build your comments and article data from the db, integrate memcache into your system. You can cache the query results and store them in a memcached instance. It's important to remember that retrieving the data from memcache must be faster than assembling it from the database to see any benefit.
If your articles aren't dynamic, or have only simple dynamic changes after they're generated, consider writing out HTML or PHP to the disk. You could have an index.php page that looks on disk for the article; if it's there, it streams it to the client. If it isn't, it generates the article, writes it to the disk and sends it to the client. Deleting files from the disk would cause pages to be re-written. If a comment is added to an article, delete the cached copy; it will be regenerated.
I'm a lead developer on a site with over 15M users. We have had very little scaling problems because we planned for it EARLY and scaled thoughtfully. Here are some of the strategies I can suggest from my experience.
SCHEMA
First off, denormalize your schemas. This means that rather than having multiple relational tables, you should instead opt for one big table. In general, joins are a waste of precious DB resources, because doing multiple prepares and collation burns disk I/Os. Avoid them when you can.
The trade-off is that you will be storing/pulling redundant data, but this is acceptable because data and intra-cage bandwidth are very cheap (bigger disks), whereas multiple prepare I/Os are orders of magnitude more expensive (more servers).
INDEXING
Make sure that your queries utilize at least one index. Beware, though, that indexes will cost you if you write or update frequently. There are some experimental tricks to avoid this.
You can try adding additional columns that aren't indexed and run parallel to your indexed columns. Then you can have an offline process that writes the non-indexed columns over the indexed columns in batches. This way, you can better control when MySQL needs to recompute the index.
Avoid computed queries like the plague. If you must compute a query, try to do it once at write time.
CACHING
I highly recommend Memcached. It has been proven by the biggest players on the PHP stack (Facebook) and is very flexible. There are two methods of doing this: one is caching in your DB layer, the other is caching in your business-logic layer.
The DB-layer option requires caching the results of queries retrieved from the DB. You can hash your SQL query using md5() and use that as a lookup key before going to the database. The upside is that it is pretty easy to implement. The downside (depending on the implementation) is that you lose flexibility, because you're treating all caching the same with regard to cache expiration.
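A hedged sketch of that DB-layer cache, using the pecl/memcached extension; the function name, key scheme and TTL are invented for illustration:

<?php
function cachedQuery(Memcached $mc, PDO $db, $sql, array $params = [])
{
    // One cache entry per distinct query + parameter set.
    $key = 'q:' . md5($sql . serialize($params));
    $rows = $mc->get($key);
    if ($rows === false) {
        $stmt = $db->prepare($sql);
        $stmt->execute($params);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        $mc->set($key, $rows, 60); // note: one expiration policy for everything
    }
    return $rows;
}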
In the shop I work in, we use business layer caching, which means each concrete class in our system controls its own caching schema and cache timeouts. This has worked pretty well for us, but be aware that items retrieved from DB may not be the same as items from cache, so you will have to update cache and DB together.
DATA SHARDING
Replication only gets you so far. Sooner than you expect, your writes will become a bottleneck. To compensate, make sure to support data sharding as early as possible. You will likely want to shoot yourself later if you don't.
It is pretty simple to implement. Basically, you want to separate the key authority from the data storage. Use a global DB to store a mapping between primary keys and cluster IDs. You query this mapping to get a cluster, and then query the cluster to get the data. You can cache the hell out of this lookup operation, which makes it a negligible operation.
The downside is that it may be difficult to piece together data from multiple shards. But you can engineer your way around that as well.
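A hedged sketch of the key-authority lookup just described; the table, column and variable names are invented for illustration:

<?php
// $clusters maps cluster ids to already-open PDO connections.
function connectionForUser(PDO $globalDb, array $clusters, $userId)
{
    $stmt = $globalDb->prepare('SELECT cluster_id FROM key_map WHERE user_id = ?');
    $stmt->execute([$userId]);
    $clusterId = $stmt->fetchColumn(); // cache the hell out of this lookup
    return $clusters[$clusterId];      // then query the shard for the actual data
}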
OFFLINE PROCESSING
Don't make the user wait for your backend if they don't have to. Build a job queue and move any processing that you can offline, doing it separate from the user's request.
I've worked on a few sites that get millions of hits per month, backed by PHP & MySQL. Here are some basics:
Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your web server and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
Reduce the overall number of requests to your web servers. This entails caching JS, CSS and images using Expires headers. You can also move your static content to a CDN, which will speed up your users' experience.
Measure & benchmark. Run Nagios on your production machines and load-test on your dev/QA server. You need to know when your server will catch fire so you can prevent it.
I'd recommend reading Building Scalable Websites, it was written by one of the Flickr engineers and is a great reference.
Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms:
http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/
Re: PDO / MySQLi / MySQLND
#gary
You cannot just say "don't use MySQLi", as they have different goals. PDO is almost like an abstraction layer (although it is not actually one) and is designed to make it easy to use multiple database products, whereas MySQLi is specific to MySQL connections. It is wrong to say that PDO is the modern access layer in the context of comparing it to MySQLi, because your statement implies that the progression has been mysql -> mysqli -> PDO, which is not the case.
The choice between MySQLi and PDO is simple - if you need to support multiple database products then you use PDO. If you're just using MySQL then you can choose between PDO and MySQLi.
So why would you choose MySQLi over PDO? See below...
#ross
You are correct about MySQLnd, which is the newest MySQL core language-level library; however, it is not a replacement for MySQLi. MySQLi (as with PDO) remains the way you interact with MySQL from your PHP code. Both of these use libmysql as the C client behind the PHP code. The problem is that libmysql is outside the core PHP engine, and that is where mysqlnd comes in: it is a Native Driver which makes use of the core PHP internals to maximise efficiency, specifically where memory usage is concerned.
MySQLnd is being developed by MySQL themselves and has recently landed onto the PHP 5.3 branch which is in RC testing, ready for a release later this year. You will then be able to use MySQLnd with MySQLi...but not with PDO. This will give MySQLi a performance boost in many areas (not all) and will make it the best choice for MySQL interaction if you do not need the abstraction like capabilities of PDO.
That said, MySQLnd is now available in PHP 5.3 for PDO and so you can get the advantages of the performance enhancements from ND into PDO, however, PDO is still a generic database layer and so will be unlikely to be able to benefit as much from the enhancements in ND as MySQLi can.
Some useful benchmarks can be found here although they are from 2006. You also need to be aware of things like this option.
There are a lot of considerations that need to be taken into account when deciding between MySQLi and PDO. In reality it is not going to matter until you get to ridiculously high request numbers, and in that case it makes more sense to use an extension that has been specifically designed for MySQL, rather than one which abstracts things and happens to provide a MySQL driver.
It is not a simple matter of which is best, because each has advantages and disadvantages. You need to read the links I've provided, come to your own decision, then test it and find out. I have used PDO in past projects and it is a good extension, but my choice for pure performance would be MySQLi with the new MySQLnd option compiled in (when PHP 5.3 is released).
General
Do not try to optimize before you start to see real world load. You might guess right, but if you don't, you've wasted your time.
Use jmeter, xdebug or another tool to benchmark the site.
If load starts to be an issue, either object or data caching will likely be involved, so generally read up on caching options (memcached, MySQL caching options)
Code
Profile your code so that you know where the bottleneck is, and whether it's in code or the database
Databases
Use MySQLi if portability to other databases is not vital, PDO otherwise
If benchmarks reveal the database is the issue, check the queries before you start caching. Use EXPLAIN to see where your queries are slowing down.
After the queries are optimized and the database is cached in some way, you may want to use multiple databases. Either replicating to multiple servers or sharding (splitting the data over multiple databases/servers) may be appropriate, depending on the data, the queries, and the kind of read/write behavior.
Caching
Plenty of writing has been done on caching code, objects, and data. Look up articles on APC, Zend Optimizer, memcached, QuickCache, JPCache. Do some of this before you really need to, and you'll be less concerned about starting off unoptimized.
APC and Zend Optimizer are opcode caches, they speed up PHP code by avoiding reparsing and recompilation of code. Generally simple to install, worth doing early.
Memcached is a generic cache, that you can use to cache queries, PHP functions or objects, or entire pages. Code must be specifically written to use it, which can be an involved process if there are no central points to handle creation, update and deletion of cached objects.
QuickCache and JPCache are file caches, otherwise similar to Memcached. The basic concept is simple, but also requires code and is easier with central points of creation, update and deletion.
Miscellaneous
Consider alternative web servers for high load. Servers like lighttpd and nginx can handle large amounts of traffic in much less memory than Apache, if you can sacrifice Apache's power and flexibility (or if you just don't need those things, which often you don't).
Remember that hardware is surprisingly cheap these days, so be sure to cost out the effort to optimize a large block of code versus "let's buy a monster server."
Consider adding the "MySQL" and "scaling" tags to this question
APC is an absolute must. Not only does it make for a great caching system, but the gain from the auto-cached PHP files is a godsend. As for the multiple-database idea: I don't think you would get much out of having different databases on the same server. It might give you a bit of a gain in speed during query time, but I doubt the effort it would take to deploy and maintain the code for all three, while making sure they stay in sync, would be worth it.
I also highly recommend running Xdebug to find bottlenecks in your program. It made optimization a breeze for me.
Firstly, as I think Knuth said, "Premature optimization is the root of all evil". If you don't have to deal with these issues right now, then don't; focus on delivering something that works correctly first. That being said, if the optimizations can't wait...
Try profiling your database queries, figure out what's slow and what happens a lot, and come up with an optimization strategy from that.
I would investigate Memcached, as it's what a lot of the higher-load sites use for efficiently caching content of all types, and the PHP object interface to it is quite nice.
Splitting up databases among servers and using some sort of load-balancing technique (e.g. generate a random number between 1 and the number of redundant databases holding the necessary data, and use that number to decide which database server to connect to) can also be an excellent way to increase efficiency.
These have all worked out pretty well in the past for some fairly high-load sites. Hope this helps to get you started :-)
Profiling your app with something like Xdebug (like tj9991 recommended) is definitely going to be a must. It doesn't make a whole lot of sense to just go around optimizing things blindly. Xdebug will help you find the real bottlenecks in your code so you can spend your optimization time wisely and fix chunks of code that are actually causing slow downs.
If you're using Apache, another utility that can help in testing is Siege. It will help you anticipate how your server and application will react to high loads by really putting it through its paces.
Any kind of opcode cache for PHP (like APC or one of the many others) will help a lot as well.
I run a website with 7-8 million page views a month. Not terribly much, but enough that our server felt the load. The solution we chose was simple: Memcache at the database level. This solution works well if the database load is your main problem.
We started out using Memcache to cache entire objects and the database results that were most frequently used. It did work, but it also introduced bugs (we might have avoided some of those if we had been more careful).
So we changed our approach. We built a database wrapper (with the exact same methods as our old database class, so it was easy to switch), and then we subclassed it to provide memcached database access methods.
Now all you have to do is decide whether a query can use cached (and possibly out-of-date) results or not. Most of the queries run by the users are now fetched directly from Memcache. The exceptions are updates and inserts, which for the main website happen only because of logging. This rather simple measure reduced our server load by about 80%.
For what it's worth, caching is DIRT SIMPLE in PHP even without an extension/helper package like memcached.
All you need to do is create an output buffer using ob_start().
Create a global cache function. Before calling ob_start(), look for a cached version of the page; if it exists, serve it and end.
If it doesn't exist, the script continues processing, with your cache function passed to ob_start() as a callback. When the script reaches the matching ob_end_flush(), the callback is invoked with the contents of the output buffer; at that point you just drop the contents into a file, save the file, and end.
Add in some expiration/garbage collection.
And many people don't realize you can nest ob_start()/ob_end_flush() calls. So if you're already using an output buffer to, say, parse in advertisements or do syntax highlighting or whatever, you can just nest another ob_start()/ob_end_flush() pair.
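Putting the whole recipe together, a hedged sketch; the cache path and TTL are illustrative, and expiration/garbage collection is left out as noted above:

<?php
$cacheFile = '/tmp/cache_' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl = 300; // five minutes

// Serve a fresh-enough cached copy and stop.
if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
    readfile($cacheFile);
    exit;
}

// Otherwise buffer the page; the callback runs when the buffer is flushed.
ob_start(function ($html) use ($cacheFile) {
    file_put_contents($cacheFile, $html, LOCK_EX); // save for next time
    return $html;                                  // still send to this client
});

// ... normal page generation happens here ...

ob_end_flush(); // flush the buffer and trigger the callback above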
Thanks for the advice on PHP's caching extensions - could you explain reasons for using one over another? I've heard great things about memcached through IRC but have never heard of APC - what are your opinions on them? I assume using multiple caching systems is pretty counter-effective.
Actually, many do use APC and memcached together...
It looks like I was wrong. MySQLi is still being developed. But according to the article, PDO_MySQL is now being contributed to by the MySQL team. From the article:
The MySQL Improved Extension - mysqli - is the flagship. It supports all features of the MySQL Server including Charsets, Prepared Statements and Stored Procedures. The driver offers a hybrid API: you can use a procedural or object-oriented programming style based on your preference. mysqli comes with PHP 5 and up. Note that the End of life for PHP 4 is 2008-08-08.
The PHP Data Objects (PDO) are a database access abstraction layer. PDO allows you to use the same API calls for various databases. PDO does not offer any degree of SQL abstraction. PDO_MYSQL is a MySQL driver for PDO. PDO_MYSQL comes with PHP 5. As of PHP 5.3 MySQL developers actively contribute to it. The PDO benefit of a unified API comes at the price that MySQL specific features, for example multiple statements, are not fully supported through the unified API.
Please stop using the first MySQL driver for PHP ever published: ext/mysql. Since the introduction of the MySQL Improved Extension - mysqli - in 2004 with PHP 5 there is no reason to still use the oldest driver around. ext/mysql does not support Charsets, Prepared Statements and Stored Procedures. It is limited to the feature set of MySQL 4.0. Note that the Extended Support for MySQL 4.0 ends at 2008-12-31. Don't limit yourself to the feature set of such old software! Upgrade to mysqli, see also Converting_to_MySQLi. mysql is in maintenance only mode from our point of view.
To me, it seems the article is biased towards MySQLi. I suppose I'm biased towards PDO.
I really like PDO over MySQLi. It's straightforward to me. The API is a lot closer to other languages I've programmed in. OO database interfaces seem to work better.
I haven't come across any specific MySQL features that weren't available through PDO. I would be surprised if I ever did.
PDO is also very slow and its API is pretty complicated. No one in their sane mind should use it if portability is not a concern. And let's face it, in 99% of all webapps it is not. You just stick with MySQL or PostgreSQL, or whatever it is you are working with.
As for the PHP question and what to take into account: I think premature optimization is the root of all evil. ;) Get your application done first, try to keep it clean when it comes to programming, do a little documentation and write unit tests. With all of the above, you will have no issues refactoring code when the time comes. But first you want to be done and push it out to see how people react to it.
Sure, PDO is nice, but there has been some controversy about its performance versus mysql and mysqli, although it seems fixed now.
You should use PDO if you envision portability, but if not, MySQLi should be the way. It has an OO interface, prepared statements, and most of what PDO offers (except, well, portability).
Plus, if performance is really needed, prepare for the (native MySQL) mysqlnd driver in PHP 5.3, which will be much more tightly integrated with PHP, with better performance and improved memory usage (and statistics for performance tuning).
Memcache is nice if you have clustered servers (and YouTube-like load), but I'd try out APC first too.
A lot of good answers have been given already, but I would like to point you to an alternative opcode cache called XCache. It is created by a lighttpd contributor.
Also, if you may need load balancing your database server in future, MySQL Proxy could very well help you to achieve this.
Both of those tools should plug into an existing application quite easily, so this optimization can be done when you need it, without too much hassle.
First question is how big do you really expect it to be? And how much do you plan on investing in your infrastructure. Since you feel the need to ask the question here, I'm guessing that you expect to start small on a limited budget.
Performance is irrelevant if the site is not available. And for availability you need horizontal scaling. The minimum you can sensibly get away with is 2 servers, both running Apache, PHP and MySQL. Set up one DBMS as a slave to the other. Do all the writes on the master, and all the reads on the local database (whatever that is), unless for some reason you need to read back the data you've just written (in which case use the master). Make sure you've got the machinery in place to automatically promote the slave and fence the master. Use round-robin DNS for the webserver addresses to give more affinity to the slave node.
Partitioning your data across different database nodes at this stage is a very bad idea; however, you might want to consider splitting it across different databases on the same server (which will facilitate partitioning across nodes when you overtake Facebook).
Do make sure you've got the monitoring and data-analysis tools in place to measure your site's performance and identify bottlenecks. Most performance problems can be fixed by writing better SQL / fixing the database schema.
Keeping your template cache in the database is a dumb idea; the database should be a central common repository for structured data. Keep your template cache on the local filesystem of your web servers: it will be available faster and won't slow down your database access.
Do use an op-code cache.
Spend plenty of time studying your site and its logs to understand why it's going so slow.
Push as much caching as possible onto the client.
Use mod_gzip to compress everything you can.
C.
My first piece of advice is to think about this issue and keep it in mind when designing the site, but don't go overboard. It's often difficult to predict the success of a new site, and your time will be better spent getting it finished early and optimising it later.
In general, Simple is fast.
Templates slow you down. Databases slow you down. Complex libraries slow you down. Layer templates over each other, retrieve them from databases and parse them in a complex library, and the time delays multiply with each other.
Once you have the basic site up and running, do tests to show you where to spend your efforts. It's difficult to see where to target. Often, to speed things up, you will have to unravel the complexity of the code; this makes it larger and harder to maintain, so you only want to do it where necessary.
In my experience, establishing the database connection was relatively expensive. If you can get away with it, don't connect to the database for general visitors on the most trafficked pages, like the site's front page. Creating multiple database connections is madness with very little benefit.
#Gary
Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server side prepares and other optimizations for you as well.
I'm looking over PDO at the moment and it looks like you're right. However, I know that MySQL is developing the MySQLnd extension for PHP - I think to succeed either MySQL or MySQLi - what do you think about that?
#Ryan, Eric, tj9991
Thanks for the advice on PHP's caching extensions - could you explain reasons for using one over another? I've heard great things about memcached through IRC but have never heard of APC - what are your opinions on them? I assume using multiple caching systems is pretty counter-effective.
I will definitely be sorting out some profiling testers - thank you very much for your recommendations on those.
I don't see myself switching from MySQL anytime soon - so I guess I don't need the abstraction capabilities of PDO. Thanks for those articles DavidM, they've helped me a lot.
Look into mod_cache, an output cache for the Apache web server, similar to the output caching in ASP.NET.
Yes, I can see that it's still experimental, but it will be final someday.
I can't believe no one has mentioned this already: modularisation and abstraction. If you think your site is going to have to grow to lots of machines, you must design it so it can! That means avoiding silly assumptions like the database being on localhost. It also means things that are going to be a bother at first, like writing a database abstraction layer (like PDO, but much, much lighter, because it only does what you need it to do).
And it means things like working with a framework. You will need layers in your code so that you can later gain performance by refactoring the data-abstraction layer, for example by teaching it that some objects are in a different database, and the rest of the code doesn't have to know or care.
Finally, be careful of memory-intensive operations, for example, unnecessary string copying. If you can keep PHP's memory usage down, then you will get more performance out of your webserver and this is something that will scale when you go to a load-balanced solution.
If you are working with large amounts of data and caching isn't cutting it, look into Sphinx. We've had great results using SphinxSearch, not only for better text searching, but also as a data-retrieval replacement for MySQL when dealing with larger tables. If you use SphinxSE (the MySQL plugin), it surpassed the performance gains we had from caching several times over, and application implementation is a cinch.
The points made about caching are spot-on; it is the least complicated and most important part of building an efficient application. I'd like to add that while memcached is great, APC is about five times faster if your application lives on a single server.
The "Cache Performance Comparison" post at the MySQL performance blog has some interesting benchmarks on the subject - http://www.mysqlperformanceblog.com/2006/08/09/cache-performance-comparison/.
