I'm currently creating a website for a social project in Switzerland.
Before there is an overflow of users, I want to prepare the application to scale.
I have answered many questions myself, but some remain.
Let me explain what I want to do.
First
At the beginning, the application will have only one server (for a short time) with DNS, PHP, MySQL, data, and memcache.
Second
Then I will split them in two:
DNS, MySQL, memcache
Data, PHP
Third
Here is the problem: I don't know exactly how to do this step while keeping the application running well.
I could do:
Front: load balancer, memcache, DNS
Web 1: PHP, data
Web 2: PHP, data
MySQL
This would be the scheme; all PHP sessions are kept in the DB.
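In CodeIgniter that roughly means the following (a sketch assuming CodeIgniter 2.x; 'ci_sessions' is the table name used in the framework docs):

// application/config/config.php
$config['sess_use_database'] = TRUE;            // store sessions in MySQL, not only in the cookie
$config['sess_table_name']   = 'ci_sessions';
$config['sess_expiration']   = 7200;            // idle sessions expire after two hours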
BUT, how do I sync the data?
Do I run rsync to keep them up to date?
Do I put the data on a separate (network) disk to be sure?
But in that case, how do I handle user uploads?
And if the website becomes more successful and we have to move to a bigger structure, wouldn't that create some latency on updates?
Or would it be better to go directly to Amazon Web Services?
Some info:
I use CodeIgniter as the framework.
I use Linux as the web server OS (distribution not chosen yet, but it should be Debian).
Thanks in advance for your answers.
According to Wikipedia, Switzerland has 4.6 million German speakers, 1.5 million French speakers, and .5 million speakers of Italian, Romansch and other languages. So I suspect you'll find that a single server will fit your needs. Guess what percentage of the population will visit your site every month or every day to get a sense of how big you can get before running into scaling issues.
So, I don't think you need to worry about scaling yet! Bonus: The time you don't spend worrying about this problem, you can use to solve other problems for your users.
There are a few common paths to scaling web services up, in order of what sites like Flickr and Facebook seem to use:
Split servers based on concepts (API, login, media files, ads, static pages, dynamic pages)
Split databases based on concepts that don't need to be JOINed (logins, long term reporting, page data, etc.)
Compile/optimize your PHP and other resources (sprites, compiled css, zend)
Add caching (front end, back end) - see the sketch just after this list
Add delegation (round robin, etc.)
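To make the caching item above concrete, here is a minimal sketch of a front-end (full page) cache written to local files; the cache path, the TTL and the render_page() function are illustrative placeholders:

<?php
// Minimal sketch of front-end (full page) caching to local files.
$cacheFile = '/tmp/pagecache_' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl       = 60;                                  // serve the cached copy for up to 60 seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    readfile($cacheFile);                         // cache hit: no PHP rendering, no DB queries
    exit;
}

ob_start();                                       // cache miss: render the page as usual...
render_page();                                    // hypothetical function that builds the page
file_put_contents($cacheFile, ob_get_contents()); // ...then keep a copy of the output
ob_end_flush();                                   // and send it to the client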
But, before scaling, measure. Set up tests, calculate your capacity, and don't optimize before you need to.
I see some questionable things:
You have one SQL server, and you are storing sessions in a database on a site where you expect extremely high volume. How many queries does that take to produce a single page if someone is logged in and what is the expected slow down when you eventually employ MySQL replication?
If you are using a cluster FS, everything is 'just kept' in sync. You won't end up with build A on web server 1 while build B on web server 2 breaks. If you are really expecting that much traffic, then in the time it takes to upload a change and sync all the nodes, you have just pissed off a thousand people.
I've deployed apps running on clusters using OCFS2 with over 40 nodes without issue, and OCFS2 is not exactly the 'best' cluster FS available. Check out Lustre and consider keeping sessions on disk.
Remember you can mount/share folders.
What data would you be syncing?
You might consider putting the data on the database machine or on another machine. The DB machine is usually a good choice at first since it is likely to have greater IO capacity than a regular web server.
It is probably a good idea to set up a SAN or similar so your data stays in one place. Multiple copies of data are a pain to deal with. Going this route also means you can put the DB files there.
Related
This is something I am really curious about, and I do not really understand how it is possible.
So let's say I am the owner of Facebook (haha) and I have millions of people visiting my website every day, plus thousands and thousands of images, videos, logs, etc.
How do I store all this data?
Do I have multiple databases on different servers around the world, and do I then connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example, I know that Facebook has a lot of data centers around the world and hundreds of servers.
How do they connect to these servers? Are the profiles stored in different locations, and when I connect to my profile, will I be using that specific server? Or is there one main server that is supported by hundreds of other servers around the world?
Is there a way to use PHP so that I connect to different servers and to different MySQL databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I could one day end up working on a successful website, I really want to know what I will have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view, since their architecture is already fairly well known.
The first thing you have to know is that you will have to distribute the workload of your web application. The question is how, so in order to determine what's going to be slow, you have to divide your app into segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache creates a thread per connected user, it requires a certain amount of memory for that operation. Eventually it will run out of memory, and then shit hits the fan: stuff stops working and your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new machines in order to scale out. You solved your first problem.
The next part is the actual layer that does the work: it accepts input from the user and saves it somewhere (MySQL), and that's where the biggest problem will be. Why?
Because of the database.
Databases store their data on media such as hard drives. Hard drives, whether SSD or mechanical, are limited in how fast they can write or retrieve data. RAM, by contrast, offers transfer rates on the order of gigabytes per second, and its access latency is orders of magnitude lower than a hard drive's seek time.
Therefore, if you have X users asking for a piece of information and you can only deliver it at a certain rate, your app crashes or becomes unresponsive, and the layer handling database queries becomes slow because the hardware cannot match the speed at which you need the data.
What are the options here? There are many; I won't mention all of them:
Split reads and writes. Set up your database layer so that dedicated machines handle the writes and completely different ones handle the reads. You have to use replication, and replication has its own quirks - it never works without breaking. (See the sketch just after this list.)
Optimize handling of your data set by sharding your data. Great for read/write performance, painful when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data your users request probably doesn't change often enough that you have to query the db every single time (say you're viewing someone's profile - what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively: a system that stores small pieces of data in RAM, is easily scalable, and, most importantly, is damn quick!
Use different solutions alongside MySQL. MySQL (and some other relational databases) isn't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still comparatively immature; they don't do as much as relational databases do. They delay disk writes (keeping a cached copy of the data in RAM) so that they can achieve fast insert rates, which is why losing data is a real risk if a node dies before the data is flushed.
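To make the read/write split from the first option concrete, here is a minimal sketch at the application level, assuming one master and one read-only replica; the host names, credentials and schema are placeholders:

<?php
// Writes go to the master, reads that tolerate replication lag go to a replica.
$master  = new PDO('mysql:host=db-master.example.com;dbname=app',  'user', 'secret');
$replica = new PDO('mysql:host=db-replica.example.com;dbname=app', 'user', 'secret');

// Writes always hit the master.
$stmt = $master->prepare('INSERT INTO comments (article_id, body) VALUES (?, ?)');
$stmt->execute(array(42, 'Nice article!'));

// Reads that can live with a little replication lag go to the replica.
$stmt = $replica->prepare('SELECT body FROM comments WHERE article_id = ? ORDER BY id DESC LIMIT 20');
$stmt->execute(array(42));
$comments = $stmt->fetchAll(PDO::FETCH_COLUMN);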
The topic of MySQL vs. "insert database here" is broad and I don't want to go into it, but remember: every single data store out there eventually saves data to the hard drive. The difference (a physical one, of course) is how each optimizes its flushing to disk.
I also didn't mention the various reports you can run by gathering the data (how many men between 19 and 21 clicked advert X between 01:15 and 13:37 CET, and so on), which is the kind of thing Facebook is actually gathering (scary stuff!).
Third up: the language gluing the data store (MySQL) and the output (HTTP server) together - PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization at the PHP level is small; even Facebook got modest results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively; it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get comes from replacing Apache with their own server built into HipHop. Some people claim "language X is better than language Y", and sometimes they're right, but it's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widespread but slow for certain operations (implementing a trie with over 1 billion entries, for example). It's great for things like echoing some HTML after parsing the output from the db. It's quick at inserting and retrieving data from the database, and that's about 90% of PHP usage: talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know PHP than people who know C++, and it's MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between a millisecond and a microsecond?
This is more of an informative blog post than an answer, and I know it's not filled with resources to back up my claims, but anyone who has done work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some of what I said won't fit every situation, but in a nutshell this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
By the time you have a website with that many users you'll already have enough experience to know the answer to this question; you'll also have a lot of money to pay people to find the optimal architecture for your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
For lots of users, you need a server with plenty of memory and CPU. Configure php.ini to allow more memory usage; a server with this many users should have 4-12 GB of RAM available. Also, save resources by not running a desktop environment on the server. With this many users you might also want to consider a CDN and a database request queue.
I have a 1-on-1 live chat. Two possible solutions:
1) I store every message in the database and, with jQuery's help, check every second whether there is a new message in the database. Of course I also use a cache. If there is a new message, we deliver it.
2) I store every message in one HTML file, and every second jQuery reloads and displays that file.
Which is better? Or is there a third option? And in general, which is better for this kind of project: MySQL or files?
Thank you very much.
P.S. The most important question is: which is more efficient and which approach will use fewer resources?
Edit: And is it, nowadays, very bad for many chats (let's say 2,500 chats, i.e. 5,000 users) to use long polling and check every second through JavaScript when the file was last edited? I use methods very similar to this chat: http://css-tricks.com/jquery-php-chat/ Will it kill my hosting?
Everyone has given a wide range of opinions but I don't think anyone has really hit the nail on the head.
When it comes down to storing data, the amount of data, the rate it is to be accessed, and several other factors all determine what's the best storage platform.
Some people have suggested using memcached. Now although this is a valid answer (you can use it), I don't think that this is a good idea, solely based on the fact that memcached stores data within your server's memory.
Your memory is not for data storage; it's for the actual applications, the operating system, shared libraries, etc.
Storing data in memory can cause a lot of issues for other applications currently running. If you store too much data in RAM, your applications will not be able to complete the operations assigned to them.
Although this is faster than a disk-based storage platform such as MySQL, it's not as reliable.
I would personally use MySQL as your storage engine server-side. This would reduce the amount of problems you would come across and also makes the data very manageable.
To speed up the responses to your clients I would look at running Node.js on your server.
This is because it's event driven and non-blocking.
What does that mean?
Well, when client A requests some data that is stored on the hard drive, traditionally PHP might say to the underlying C layer, 'fetch me this chunk of data stored on this sector of the hard drive.' The C layer would say 'OK, no problem', and while it goes off to get the information, PHP sits and waits for the data to be read and returned before it continues its operations, blocking all other clients in the meantime.
With Node, it's slightly different. Node says to the kernel, 'fetch me this chunk of information and when you're done, give me a call', and then it continues to take requests from other clients that may not need disk access.
So because we have registered a callback with the kernel, we do not have to wait :) - happy days.
This really could be the answer you're looking for. Please see the following for more detailed information on how Node could be the right choice for you:
http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
A fourth option - probably not what you want if you already have PHP code you want to reuse, but maybe the most efficient - is to use a JavaScript-based server instead of PHP.
Node.js is easily capable of being a chat server and can keep all the recent messages in a JavaScript variable.
You can use long polling or other Comet techniques so that you do not have to wait a second for messages to update.
Also, the event-based architecture of a JavaScript server means that there is no overhead from idling around waiting for messages.
It depends on the number of simultaneous chats. If it's for support and you expect an average load of 1 to 5 chat sessions at a time, you don't need to worry too much. Just make sure that when there is no activity for some time you stop refreshing and show a message the user can click to resume the chat session.
If visitors will chat with each other and you expect a bigger number of sessions - 10-50 at the same time - you can still use PHP + a database. Just make sure you don't make redundant queries and that your queries are cached correctly. To reduce load you can also exclude the chat script from the web server's access log:
SetEnvIf Request_URI "^/chat.php$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog
Edit:
You can use a delay scheme. For example, if you query 2 times with a 1-second delay and get no data, you can increase the delay to 2 seconds; if you reach 10 queries with no response, increase the delay to 5 seconds. After 10 minutes you can pause the conversation, requiring the users to click a button to resume the chat. That, combined with the advice above, will keep the load low enough to handle many concurrent chats.
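As a small sketch, the polling endpoint could tell the client how long to wait before the next poll; the thresholds simply mirror the numbers above and are of course tunable:

<?php
// Map the number of consecutive empty polls to the next polling delay (seconds).
function next_poll_delay($emptyPolls)
{
    if ($emptyPolls >= 10) {
        return 5;        // long quiet period: slow down to one poll every 5 seconds
    }
    if ($emptyPolls >= 2) {
        return 2;        // a couple of empty responses: poll every 2 seconds
    }
    return 1;            // active conversation: keep polling every second
}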
Edit2:
I suggest you find a Flash or Java solution and buy it. With 5,000-10,000 users you have to be a genius to make it work on a VPS, especially if RAM is limited. Not that it's impossible, but you could rent a cheaper VPS and with the rest of the money buy a solution in Java or Flash (I don't know whether Flash supports two-way connections; I'm not a Flash expert).
A note about the number of users: if you have 10,000 users, my guess is that you'll have no more than 100 chats at the same time. Look at dating sites - they have no more than 10% of their users online, and most of those are probably doing something other than chatting.
A third option: use memcached. Much faster reads/writes - perfect for your application.
Store the chat messages in the database but use Memcached as a caching layer for the database reads. So the most popular reads (e.g. the last 20 messages in the chat room) will always be served straight out of memory.
This gives you the benefit of speed for the most frequent operations and persistent storage for all of the messages.
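A minimal cache-aside sketch of that idea, assuming the Memcached PECL extension and PDO; the host names, credentials and schema are illustrative:

<?php
// Serve the "last 20 messages" read from memcached, falling back to MySQL on a miss.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$db = new PDO('mysql:host=localhost;dbname=chat', 'user', 'secret');

$chatId = (int) $_GET['chat_id'];
$key    = 'chat:' . $chatId . ':last20';

$messages = $mc->get($key);
if ($messages === false) {                        // cache miss: fall back to MySQL
    $stmt = $db->prepare('SELECT user, body, created_at FROM messages WHERE chat_id = ? ORDER BY id DESC LIMIT 20');
    $stmt->execute(array($chatId));
    $messages = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mc->set($key, $messages, 30);                // keep the hot list cached for 30 seconds
}
echo json_encode($messages);

// On every new message, delete (or rebuild) the cached list so readers see it:
// $mc->delete('chat:' . $chatId . ':last20');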
Just to throw in another option... flat files could provide a less resource-hungry alternative.
Every chat is assigned a unique ID and a flat file is created for it. Every chat message adds a line to this file. Each client machine then uses jQuery to check ONLY the modified date of the file, to see if the chat has been updated.
While I would never normally recommend flat files over a database, I have a sneaky feeling that checking the modified date on a flat file would scale up better than the MySQL alternative.
I was intrigued so I did some tests and here are the results:
With an existing db connection, the number of "SELECT field FROM table LIMIT 0,1" that could be run in 1 second: ~ 4,000
Opening and closing a db connection, but running the same query: ~ 1,800
Checking the modified date on various different files: ~225,000
So to check if a conversation has been updated, storing the conversations in flat files and checking for the last modified date would easily be faster than doing anything with a database.
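For example, the server side of that modified-date check could look roughly like this (the file path and request parameters are illustrative):

<?php
// Cheap poll: compare the file's mtime with the timestamp the client saw last.
$chatId = (int) $_GET['chat_id'];
$since  = (int) $_GET['since'];            // mtime the client saw last
$file   = '/var/chat/' . $chatId . '.log';

clearstatcache();                          // make sure filemtime() is not served from a stale cache
$lastModified = filemtime($file);          // a cheap stat() call; no file contents are read

if ($lastModified !== false && $lastModified > $since) {
    echo file_get_contents($file);         // something changed: send the updated log
} else {
    header('HTTP/1.0 304 Not Modified');   // nothing new; the client keeps polling
}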
In general, HTTP connections are not very useful when it comes to pushing data to the client. Polling every x seconds tends to be a resource hog on any server once you have significant traffic.
You should try XMPP combined with BOSH. Luckily, most of the heavy lifting has already been done for you; you can implement a pure jQuery (or other JS framework) based solution very quickly. Read this tutorial - it will help you a lot, not only in solving your specific problem but also in giving you a broader view of how to implement push technologies over good old HTTP.
Unless it's a small-audience script, then between a database and the file system it's better to use the database.
P.S.: Flash also makes a great platform for chat servers; you might want to look into that as well.
If you define a conversation as only two people, then a request every second is going to look like one read request per second per user, and one write request every time somebody writes something (say every 10 seconds). So every 10 seconds you will have about 2.2 requests per second, per conversation.
For 50 conversations, that's 100 users and roughly 110 requests per second. That's a lot of load on a server for such a small number of conversations. Writing the conversation to JSON or XML would probably provide a more scalable solution.
This article discusses the architecture of Meebo - long-polling, comet.
As an afterthought, have you considered installing an IM server like Jabber rather than starting from scratch?
You could always get the right tool for the job: an XMPP-compliant piece of software. For as poor as its documentation is, ejabberd is pretty alright, and because it follows the XMPP standard closely you can use any XMPP client (for example: http://code.google.com/p/ijab/). You can store everything in an RDBMS if you like and provide functionality similar to what Gmail / Google Talk offers.
$0.02
A really fast alternative could be a NoSQL database like MongoDB:
MongoDB homepage
Some benchmarks
MongoDB's extension homepage on php.net
I don't use it myself, but you could try Photon, a very high-speed framework based on Mongrel.
On the author's blog (in French) there is an example: 30 lines of code for a real-time chat server, with a video demonstration.
I think storing the data in the database is better. Please refer to the following link:
Script Tutorials Chat
Is there any difference between a CMS and high-traffic websites (like news portals) in terms of logic, database design, and optimization (PHP and MySQL)?
I have searched for PHP site scalability on Stack Overflow, and memcached comes up in most answers.
Are there techniques for MySQL optimization? (I'm looking for a book on this topic. I have searched on Amazon but I don't know which is the best choice.)
Thanks in advance
This isn't so easy to answer.
There are different approaches and a variety of opinions, but I'll try to cover some common scenarios. First, some basics.
Most web applications can be separated into application and database.
Database usage can be separated into transactional (OLTP) and analytical (OLAP).
In the best case you can just start a number of application servers and distribute traffic among them. They all connect to the same database server and can work independently.
This can, however, be difficult if you have other shared data, sessions, etc.
You can accomplish this by simply adding multiple IP addresses to your domain name in DNS.
Or you can use load-balancing techniques to forward clients to different servers.
Application scaling is generally very easy; the database is much more complex.
The first thing to do is usually to set up one or more replication servers which hold the same data as the main database. They can be cascaded, but they have one serious disadvantage: their data is not always up to date. In general it is no more than a few seconds old, but it can lag more under load. For many use cases this is fine.
Big sites that mostly display information can simply replicate their database to some slave servers and set up some application servers (it's good practice to run one slave and one application server on the same machine and let that application server access that database slave), and everything is fine.
Every OLAP query can be directed to a slave. OLAP queries are those that don't modify anything and don't need data that is 100% up to date.
Everything still has to be written to the very same source database server from which every other server gets its copy - for example, every comment on an article.
If this bottleneck gets too tight you can go in two directions:
Sharding
Master-master replication
Sharding means the application server decides where to store and where to fetch your data.
For example, every comment that starts with 'a' goes to server A, 'b' to server B, and so on.
That's a simplistic example, but it's basically how it works; usually some internal IDs are involved.
If possible, it's good to shard data so that it can be pulled back completely from a single server.
In the example above, if I wanted all comments for an article I would have to ask every server A-Z and merge the results. This is inefficient but possible, because those servers can themselves be replicated. This is called mapping (see Google's famous MapReduce algorithm, which basically does just this).
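A toy sketch in the spirit of that example, routing by a modulo on an internal ID rather than by first letter; the host names and schema are illustrative:

<?php
// The application decides which shard holds a given article's comments.
$shards = array(
    new PDO('mysql:host=shard0.example.com;dbname=app', 'user', 'secret'),
    new PDO('mysql:host=shard1.example.com;dbname=app', 'user', 'secret'),
);

function shard_for(array $shards, $articleId)
{
    return $shards[$articleId % count($shards)];   // simple modulo routing
}

// Writing and reading a comment go through the same routing decision.
$articleId = 42;
$db   = shard_for($shards, $articleId);
$stmt = $db->prepare('INSERT INTO comments (article_id, body) VALUES (?, ?)');
$stmt->execute(array($articleId, 'first!'));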
Master-master replication means that you write your data to different master servers and they synchronize with each other; the data isn't partitioned as it is with sharding.
This has to be done if your application is not able to decide on its own where to store and fetch data.
You just write to any master server, every server gets everything, and everybody is happy?
No... because this involves another serious problem:
Conflicts! Imagine two users each enter a comment. Comment A gets stored on server A, comment B gets stored on server B. Which ID should we use? Which one comes first?
The best approach is to design the application so that it avoids these cases, for example by using separate key ranges per server.
What usually happens instead is conflict resolution and prioritization. Oracle has a lot of features at this level and MySQL is still behind, but the trend is toward much more complex distributed structures ('clouds') anyway...
I don't think I explained everything well, but you should at least get some keywords from this text that you can investigate further.
Sure, there are all sorts of things you can do to optimize your PHP/MySQL web applications for high traffic websites. However, most of them depend on your specific situation, which you haven't given in your question.
Your database should be well structured regardless of whether you have a high-traffic site or not. If you use an off-the-shelf CMS, this is typically fine. Aside from good application architecture, there is no one-size-fits-all solution.
I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced when I'm simply performing PHP read activities?
Would something like Nginx as the front door, routing traffic to multiple nodes running my same code to handle example.com/?page_id=X, be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, but for simplicity let's consider that out of scope for this question.
These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottleneck will be your DB, so cache recent pages to reduce DB activity, perhaps in something like memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions, and store each partition in a separate mysql DB. Craigslist, for example, partitions data by city, and in some cases, by section within that. In your case, you could partition by Id quite simply.
Manage PHP sessions
Putting nginx in front of several PHP nodes will not work out of the box if you use sessions. Load balancing PHP has issues because sessions are persisted on local storage by default. Therefore you need to do session management explicitly; the traditional solution is to use memcached to store and look up the session identified by a cookie.
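A minimal sketch of one common way to do that, assuming the memcached PECL extension is installed on every web node (the host name is a placeholder):

<?php
// Keep PHP sessions in memcached so any web node can serve any user.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'cache1.example.com:11211');
session_start();

$_SESSION['user_id'] = 42;   // visible to whichever node handles the user's next request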
Don't optimize prematurely.
Focus on getting your application out so that the next magnitude of current users gets the optimal experience.
Note: Your main potential pain points are discussed here on SO
No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes, it is relatively easy to scale read workloads, because you can simply perform reads against read-only database replicas. The challenge is to scale write workloads.
A lot of sites have few writes, even if they're really busy.
The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a given user's session to the same server every time (sticky sessions), so you don't have to worry about sessions and where they are stored at all. What you do have to worry about is how to share the filesystem if the two servers are running on different machines, especially if you make heavy use of it. Hope the article above helps...
I'm developing a web app that will access and work with large amounts of data in a MySQL database, something like a dictionary/thesaurus. I need to test the performance of the DB as its size increases, so I know how slow each request will be in the future.
Any ideas? Like are there specific tools to check DB performance for a particular query, etc?
Do you know what, specifically, you're testing? Measuring "performance" is almost always useless unless you know exactly what you want.
For example, are you looking for low latency on query result retrieval? Perhaps high throughput on data retrieval? Perhaps you care more about fast insertions into the database and less about fast query results? Perhaps you care about different things on different tables (in fact, that's almost always the case).
My advice will probably be ignored, but I'll say it anyway:
Don't optimise before you know what you want.
Don't optimise as you write the code.
When you do get around to optimising your database, make sure you optimise for the right things. Use realistic data - if you're testing dictionary-sized hunks of text, don't test with binary data (for example).
Anyway, I realise you were probably looking for a more technical answer, but hey...
You can use Maatkit's query profiler to measure the impact of data volume on MySQL performance.
And generatedata.com to generate the data you need to test your app.
You can also test your application's responsiveness using HTTP testing tools like:
Apache's bundled 'ab' tool (Apache Bench)
JMeter
Selenium
A good tool to use is Apache's ab, which comes standard with the Apache httpd server. It can open multiple connections to a web server and benchmark its performance. While Firebug is a good way to see in what order things load, how long each item takes, and so on, you're only seeing one user's experience against an unloaded test server, and that information can only take you so far. ab simulates multiple users connecting and will give a more realistic picture of how a particular page handles concurrent users.
Which leads me to a limitation of ab: it only tests one URL. I often get around this by whipping up a simple test web page that makes a random selection from a list of predefined URLs I want to test - for example, the login page, a search result, posting a comment, and so on. ab hits the test page, and the test page simply calls one of the test URLs (possibly with a randomized parameter) and returns that page. In this manner, you get a better idea of how your whole site handles concurrent users.
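Such a test page can be as simple as the following sketch (the URL list and host are illustrative, and allow_url_fopen must be enabled for file_get_contents() to fetch URLs):

<?php
// ab hits this script; each request exercises one of a few representative URLs.
$urls = array(
    'http://test.example.com/login.php',
    'http://test.example.com/search.php?q=' . rand(1, 1000),
    'http://test.example.com/?page_id=' . rand(1, 500),
);

$target = $urls[array_rand($urls)];
echo file_get_contents($target);     // fetch the chosen page and return its output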
P.S.: The question of which OS to use is unanswerable here; you'll have to figure that out yourself based on how your application is written, the layout of your data, the configuration of the web server and the database server, etc.