Is there any difference between a CMS and high-traffic websites (like news portals) in logic, database design, and optimization (PHP and MySQL)?
I have searched Stack Overflow for PHP site scalability, and memcached comes up in the majority of answers.
Are there techniques for MySQL optimization? (I'm looking for a book on this subject. I have searched on Amazon but I don't know which is the best choice.)
Thanks in advance
This isn't so easy to answer.
There are different approaches and a variety of opinions, but I'll try to cover some common scenarios. But first, some basics.
Most web applications can be separated into application and database.
Database usage can be separated into transactional (OLTP) and analytical (OLAP).
In the best case you can just start a number of application servers and distribute traffic among them. They all connect to the same database server and can work independently.
This can be difficult, however, if you have other shared data, such as sessions.
You can distribute the traffic by simply adding multiple IP addresses to your domain name in DNS.
Or you can use load balancing techniques to forward the clients to different servers.
Application scaling is generally very easy. Database scaling is much more complex.
The first thing to do is usually to set up one or more replication servers which hold the same data as the main database. They can be cascaded, but they have one serious disadvantage: their data is not always up to date. In general it is no more than a few seconds old, but it can be more under load. For many use cases this is fine.
Big sites that just display information could simply replicate their database to some slave servers, set up some application servers (it's good practice to run one slave and one application server on the same machine and let that application server access that database slave), and everything is fine.
Every OLAP query can be directed to a slave. OLAP queries here are those that don't modify anything and don't need 100% up-to-date data.
Writes are different: everything needs to be written to the very same database source server from which every other server gets its copy, for example every comment on an article.
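To make the read/write split concrete, here is a minimal sketch in Python (used only for illustration; the same idea carries over to PHP). The server names and the ReadWriteRouter class are made up for this example, and the "starts with SELECT" test is a deliberate simplification of what a real data access layer would check.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the source server and spread reads across replicas."""

    def __init__(self, source, replicas):
        self.source = source
        # Fall back to the source if no replicas are configured.
        self._replicas = itertools.cycle(replicas or [source])

    def server_for(self, sql):
        # Simplification: treat plain SELECTs as reads that tolerate
        # slightly stale data; everything else goes to the source.
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._replicas) if is_read else self.source


# Hypothetical server names.
router = ReadWriteRouter("db-source", ["db-replica-1", "db-replica-2"])
read_target = router.server_for("SELECT title FROM articles")
write_target = router.server_for("INSERT INTO comments (body) VALUES ('hi')")
```

A read that must see data written a moment ago (for example, showing a user their own new comment) would still have to go to the source, because the replicas may lag behind.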
If this bottleneck gets too tight you can go in two directions:
sharding
master-master replication
Sharding means the application server decides where to store and where to fetch your data.
For example, every comment that starts with "a" goes to server a, "b" to server b, and so on.
That's a silly example, but it's basically how it works. Usually some internal IDs are involved.
If possible, it's good to shard data so that everything needed for a request can be pulled from one server again.
In the example above, if I wanted all comments for an article I would have to ask every server a-z and merge the results. This is inefficient but possible, because those servers can be replicated. This is called mapping (you could check the famous Google MapReduce algorithm, which basically does just this).
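A rough sketch of both access patterns, in Python for illustration. It assumes comments are sharded by article ID (so one article's comments stay together) and uses plain dictionaries as stand-ins for three database servers; the shard names and helper functions are invented for the example.

```python
SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical server names

# In-memory stand-in for three database servers.
storage = {name: {} for name in SHARDS}


def shard_for(article_id):
    # Sharding on the article id keeps every comment for one article on
    # a single server, so reading them back needs only one query.
    return SHARDS[article_id % len(SHARDS)]


def add_comment(article_id, text, posted_at):
    shard = storage[shard_for(article_id)]
    shard.setdefault(article_id, []).append(
        {"text": text, "posted_at": posted_at}
    )


def comments_for_article(article_id):
    # Single-shard read: ask only the server that owns this article.
    return storage[shard_for(article_id)].get(article_id, [])


def latest_comments(limit):
    # Scatter-gather: ask every shard, then merge the partial results.
    merged = []
    for name in SHARDS:
        for comments in storage[name].values():
            merged.extend(comments)
    merged.sort(key=lambda c: c["posted_at"], reverse=True)
    return merged[:limit]


add_comment(7, "first", 100)
add_comment(7, "second", 105)
add_comment(8, "elsewhere", 103)
```

Here comments_for_article touches one server, while latest_comments has to visit all of them and merge, which is the expensive case described above.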
Master-master replication means that you write your data to different master servers and they synchronize with each other, so the data isn't stored separately as it is with sharding.
This has to be done if your application is not able to decide on its own where to store and fetch data.
You just store to any master server, every server gets everything, and everybody is happy?
No... because this involves another serious problem.
Conflicts! Imagine two users enter a comment. commentA gets stored on serverA, commentB gets stored on serverB. Which ID should we use? Which one comes first?
The best approach is to design an application that avoids these cases, for example by using keys that cannot collide across servers.
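One common way to get different keys on each master is to give every server its own arithmetic sequence of IDs. MySQL exposes this through the auto_increment_increment and auto_increment_offset settings; the Python sketch below only imitates the idea, and the id_sequence helper is made up for illustration.

```python
import itertools


def id_sequence(server_index, server_count):
    """Yield server_index, server_index + server_count, ... forever."""
    # server_index starts at 1, so with two servers the first one
    # produces 1, 3, 5, ... and the second one 2, 4, 6, ...
    return itertools.count(start=server_index, step=server_count)


ids_on_server_a = id_sequence(1, 2)
ids_on_server_b = id_sequence(2, 2)

comment_a_id = next(ids_on_server_a)
comment_b_id = next(ids_on_server_b)
```

This only guarantees that the two comments never get the same ID. It does not answer which one came first; for that you would still need something like a timestamp column or an explicit conflict rule.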
But what usually happens is conflict resolution, prioritizing, and so on. Oracle has a lot of features on this level and MySQL is still behind. But trends are moving toward much more complex data structures like clouds anyway...
Well, I don't think I explained this well, but you should at least get some keywords from the text that you can investigate further.
Sure, there are all sorts of things you can do to optimize your PHP/MySQL web applications for high traffic websites. However, most of them depend on your specific situation, which you haven't given in your question.
Your database should be well structured regardless of whether you have a high-traffic site or not. If you use an off-the-shelf CMS, this is typically fine. Aside from good application architecture, there is no one-size-fits-all solution.
Related
I'm going to try to make this as brief as possible while covering all points - I work as a PHP/MySQL developer currently. I have a mobile app idea with a friend and we're going to start developing it.
I'm not saying it's going to be fantastic, but if it catches on, we're going to have a LOT of data.
For example, we'd have "clients," for lack of a better term, who would have anywhere from 100-250,000 "products" listed. Assuming the best, we could have hundreds of clients.
The client would edit data through a web interface, the mobile interface would just make calls to the web server and return JSON (probably).
I'm a lowly cms-developing kinda guy, so I'm not sure how to handle this. My question is more or less about performance; the most I've ever seen in a MySQL table was 340k, and it was already sort of slow (granted it wasn't the best server either).
I just can't fathom a table with 40 million rows (and potential to continually grow) running well.
My plan was to have a "core" database that held the name of the "real" database, so the user would come in and try to access a client's data, it would go to the core database and figure out which database to get the information from.
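The lookup step of that plan could look roughly like this. The sketch uses Python with an in-memory SQLite database as a stand-in for the MySQL "core" database; the table, column, and database names are all hypothetical.

```python
import sqlite3

# The "core" database: one small table mapping each client to the
# database that holds its products. All names here are hypothetical.
core = sqlite3.connect(":memory:")
core.execute(
    "CREATE TABLE clients (client_id INTEGER PRIMARY KEY, db_name TEXT NOT NULL)"
)
core.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [(1, "client_acme"), (2, "client_globex")],
)


def database_for(client_id):
    """Return the name of the database holding this client's data."""
    row = core.execute(
        "SELECT db_name FROM clients WHERE client_id = ?", (client_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"unknown client {client_id}")
    return row[0]


target_db = database_for(2)
```

Every request pays for this extra lookup, so in practice the mapping would probably be cached in the application rather than queried each time.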
I'm not concerned with data separation or data security (it's not private information)
Yes, it's possible and my company does it. I'm certainly not going to say it's smart, though. We have a SaaS marketing automation system. Some clients' databases have 1 million+ records. We deal with a second "common" database that has a "fulfillment" table tracking emails, letters, phone calls, etc. with over 4 million records, plus numerous other very large shared tables. With proper indexing, optimizing, maintaining a separate DB-only server, and possibly clustering (which we don't yet have to do) you can handle a LOT of data... In many cases, those who think it can only handle a few hundred thousand records work on a competing product for a living. If you still doubt whether it's valid, consider that per MySQL's clustering metrics, an 8-server cluster can handle 2.5 million updates PER SECOND. Not too shabby at all...
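To see what "proper indexing" buys you, it helps to look at the query plan before and after adding an index. The sketch below uses Python's bundled SQLite and its EXPLAIN QUERY PLAN statement purely as a stand-in; MySQL has its own EXPLAIN for the same purpose. The products table and index name are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, client_id INTEGER, name TEXT)"
)
db.executemany(
    "INSERT INTO products (client_id, name) VALUES (?, ?)",
    [(n % 100, f"product {n}") for n in range(10_000)],
)


def query_plan(sql):
    # The last column of each plan row describes one step in words.
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]


lookup = "SELECT name FROM products WHERE client_id = 42"

# Without an index the database has to scan the whole table.
plan_without_index = query_plan(lookup)

db.execute("CREATE INDEX idx_products_client ON products (client_id)")

# With the index it can jump straight to the matching rows.
plan_with_index = query_plan(lookup)
```

The first plan reports a scan of the products table and the second one a search using idx_products_client. On a table with millions of rows that is the difference between reading everything and reading only the rows you asked for.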
The problem with using two databases is juggling multiple connections. Is it tough? No, not really. You create different objects and reference your connection classes based on which database you want. In our case, we hit the main database's company class to deduce the client DB name and then build the second connection based on that. But when you're juggling those connections back and forth, you can run into errors that require extra debugging. It's not just "Is my query valid?" but "Am I actually getting the correct database connection?" In our case, a dropped session can cause all sorts of PDO errors to fire because the system can no longer keep track of which client database to access. Plus, from a maintainability standpoint, it's a scary process trying to push table structure updates to 100 different live databases. Yes, it can be automated. But one slip-up and you've knocked a LOT of people down and made a ton of extra work for yourself. Now, calculate the extra development and testing required to juggle connections and push updates... that will be your measure of whether it's worthwhile.
My recommendation? Find a host that allows you to put two machines on the same local network. We chose Linode, but who you use is irrelevant. Start out with your dedicated database server, plan ahead to do clustering when it's necessary. Keep all your content in one DB, index and optimize religiously. Finally, find a REALLY good DB guy and treat him well. With that much data, a great DBA would be a must.
This is something I am really curious about and I do not really understand how is that possible.
So let's say I am the owner of Facebook (haha) and I have millions of people visiting my website every day, thousands and thousands of images, videos, logs, etc.
How do I store all this data?
Do I have more databases in different servers around the world and then I connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example I know that Facebook has a lot of data centers around the world and hundreds of servers..
How do they connect to these servers? Are the profiles stored in different locations and when I connect to my profile, I will then be using that specific server? Or is there one main server that has the support of other hundreds of servers around the world?
Is there a way to use PHP so that I can connect to different servers and to different MySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I might one day work on a successful website, I really want to know what I would have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view, since their architecture is pretty much known.
The first thing you have to know is that you have to distribute the workload of your web application. The question is how, so in order to determine what's going to be slow, you have to divide your app into segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache dedicates a thread or process to each connection (depending on how it is configured), it requires a certain amount of memory for that operation. Eventually, it will run out of memory and then shit hits the fan, stuff stops working, your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new computers in order to scale-out. You solved your first problem.
The next part is the actual layer that does the work. It accepts input from the user and saves it somewhere (MySQL), and that's the biggest problem you'll have. Why?
Due to the database.
Databases store their data on media such as hard drives. Hard drives, be it an SSD or a mechanical one, are limited in how fast they can write or retrieve data. RAM is far faster: if I'm not mistaken, it operates at transfer rates of around 6GB/sec, and its seek time is also much, much lower than an HDD's.
Therefore, if you have X users asking for a piece of information and you can only deliver it at a certain rate, your app crashes, or it becomes unresponsive and the layer handling database queries becomes slow, since the hardware cannot match the speed at which you need the data.
What are the options here? There are many; I won't mention all of them:
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile, what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM, it's easily scalable and what not. Most important, it's damn quick!
Use different solutions next to MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk write (they keep cached copy of data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
The topic of MySQL vs "insert database or whatever here" is broad and I don't want to go into it, but remember: every single data store out there saves data to the hard drive eventually. The difference (the physical one, of course) is how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
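To illustrate the caching option from the list above, here is a minimal "look in the cache first, fall back to the database" sketch in Python. TTLCache is a tiny in-process stand-in written for this example, not the Memcached client API, and load_profile is a hypothetical database read.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache such as Memcached."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (value, expiry time)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._items.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # cache hit: no database query
        value = loader(key)          # miss or expired: ask the database
        self._items[key] = (value, now + self.ttl)
        return value


db_queries = []


def load_profile(user_id):
    # Hypothetical slow database read; here it only records that it ran.
    db_queries.append(user_id)
    return {"id": user_id, "name": f"user{user_id}"}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load(1, load_profile)
second = cache.get_or_load(1, load_profile)   # served from memory
```

After these two calls db_queries holds a single entry, so the second profile view never reached the database. The price is that a changed profile can stay stale for up to ttl_seconds unless the application also deletes the cache entry when it writes.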
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. The gains from optimizing at the PHP level are small; even Facebook got small results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively, and it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get comes from replacing Apache with their own server built into HipHop. Some people claim "language X is better than language Y" and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widely-spread but it's slow for certain operations (implementing a Trie with over 1 billion entries for example). It's great for things like echo some HTML after parsing the output from the db. It's quick to insert and retrieve data from the database, and that's about 90% of the PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post is more like an informative blog post, I know it's not filled with resources to back up my claims but anyone who did any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things that I said probably won't fit with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
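The round robin behaviour described above can be modelled in a few lines of Python. This is only a toy model of the rotation, not a DNS server: real DNS servers commonly return the whole address list in rotated order rather than a single address. The addresses come from the 192.0.2.0/24 documentation range and are placeholders.

```python
import itertools


class RoundRobinRecord:
    """Hand out a record's addresses in rotation, one per lookup."""

    def __init__(self, addresses):
        self._cycle = itertools.cycle(addresses)

    def resolve(self):
        return next(self._cycle)


# Placeholder addresses for three web servers behind one host name.
www = RoundRobinRecord(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
answers = [www.resolve() for _ in range(4)]
```

The fourth lookup wraps around to the first server again. Note that nothing here checks whether a server is actually alive, which is one reason a load balancer is often preferred over plain round robin DNS.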
By the time you have a website with that many users, you'll already have enough experience to know the answer to the question, and you'll also have a lot of money to pay people to find the optimal architecture for your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
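One possible reading of "name servers which know the location of servers and some rules about the data" is a small directory that maps ranges of IDs to data servers. The Python sketch below assumes users are split by ID range; the boundaries and server names are invented for the example.

```python
import bisect

# Hypothetical directory kept by a "name server": each bound says that
# user ids up to and including it live on the matching data server.
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SERVERS = ["data-1", "data-2", "data-3"]


def server_for_user(user_id):
    index = bisect.bisect_left(UPPER_BOUNDS, user_id)
    if user_id < 1 or index == len(SERVERS):
        raise LookupError(f"no server holds user {user_id}")
    return SERVERS[index]


def servers_for_range(first_id, last_id):
    # A query spanning several ranges must go to every server involved.
    low = bisect.bisect_left(UPPER_BOUNDS, first_id)
    high = bisect.bisect_left(UPPER_BOUNDS, last_id)
    return SERVERS[low:high + 1]


home_server = server_for_user(1_500_000)
```

Unlike a fixed formula such as "ID modulo number of servers", a directory like this can be edited to move a range to a new machine without touching the application code.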
For lots of users, you should have a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with lots of users should have 4-12GB available. Also, save resources by closing the desktop environment. If you have this many users, you might want to consider a CDN and also make a database request queue.
I'm the webmaster for a major US university. We have a great deal of requests on our website, which I've built and been in charge of for the last 7 years or so. I've been building ever-more-complex features into our website and it's always been my practice to put as much of the programming burden on our multi-processor Microsoft SQL server as possible - using stored procedures, views, etc, and fill-in what can't be done with PHP, ASP, or Perl from the IIS web server. Both servers are very powerful and capable machines. Since I've been doing this alone for so long without anyone else to brainstorm with, I'm curious if my approach is ideal for even higher load situations we'll have in the future.
My question is: Is it better practice to place more of the load burden on the SQL server using nested SELECT statements, views, stored procedures and aggregate functions, or should I be pulling multiple simpler queries and processing through them using server-side compile-time scripts like PHP? Keep on keepin' on or come up with a better way?
I've recently become more interested in performance after I did some load traces and learned just how much I've been putting on the shoulders of the SQL server. Both the web server and the SQL server are fast and responsive throughout the day, almost regardless of how much I put on them, but I'd like to be ready, having trained myself and upgraded my existing code with best practices in mind, by the time it becomes important.
Thanks for your advice and input.
You put each layer in your stack to use in the domain it fits best.
There is no use in having your database server send 1000 rows and using PHP to filter them if a WHERE clause or GROUP BY clause would suffice. Conversely, it's not optimal to call the database to add two integers (SELECT 5+9 works fine, but PHP can do it itself, and you save the round trip).
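A small sketch of both halves of that advice, using Python and an in-memory SQLite table as a stand-in for the real database server. The visits table and its contents are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (page TEXT, visitor TEXT)")
db.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("/home", "ann"), ("/home", "bob"), ("/news", "ann"), ("/home", "cat")],
)

# Wasteful: ship every row to the application and count there.
counts_in_app = {}
for page, _visitor in db.execute("SELECT page, visitor FROM visits"):
    counts_in_app[page] = counts_in_app.get(page, 0) + 1

# Better: let the database aggregate and return one row per page.
counts_in_db = dict(
    db.execute("SELECT page, COUNT(*) FROM visits GROUP BY page")
)

# The reverse case: plain arithmetic needs no database round trip.
total = 5 + 9
```

Both approaches produce the same counts, but the GROUP BY version transfers two rows instead of four, and the gap grows with the size of the table.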
You will probably want to look into scalability: what parts of your application can be divided into multiple processes? If you're still just using 2 layers (script & db), there is a lot of room for scaling there. But always start with the bottleneck first.
Some examples: host static content on a CDN, use caching for your pages, read about nginx and memcached, use NoSQL (MongoDB), consider sharding, consider replication.
My opinion is that it's generally (mostly) best to favor letting the web servers do the processing. Two points:
First is scalability. Once your application gets enough usage, you'll need to start worrying about load balancing. And it's a lot easier to drop in a couple of extra web servers pointing to a common database than it is to set up a distributed database cluster. So best to take as much strain away from the Database as you can and keep it on a single machine for as long as possible.
The second point I'd like to make is about optimizing the queries. This will depend a lot on the queries you are using and the database backend. When I first started working with databases, I fell into the trap of making elaborate SQL queries with multiple JOINs that fetched exactly the data I wanted, even if it was from four or five different tables. I reasoned that "That's what the database is there for - let's get it to do the hard work."
I quickly found that these queries took way too long to execute, and often ended up blocking the database from other requests. While it may seem inefficient to split your query into multiple requests (for example in a for loop), you'll often find that executing multiple small queries with fast indexes will make your application run far more smoothly than trying to pass all the hard work to the database.
Firstly, you might want to check if there is any load which can be removed entirely by client side caching (.js, .css, static HTML and images), and use of technologies such as AJAX to do partial updates of screens - this will remove load on both web and sql servers.
Secondly, see if there is sql load which can be reduced by web server caching - e.g. static or low refresh data - if you have a lot of 'content' pages on your systems, have a look at common CMS caching techniques which will scale to allow many more users to view the same data without rebuilding the page or hitting the database.
I tend to do as much as possible outside the db, viewing db calls as expensive/time-intensive.
For example, when performing a select on a user table with fields name_given and name_family, I could fatten the query to return a column called full_name built by concatenation. But that kind of thing can be easily done in a model on your server-side scripting language (PHP, Ruby, etc).
Of course, there are cases when the db is the more "natural" place to perform an operation. But, in general, I incline more towards putting the load on the web server and optimize there with many of the techniques noted in other answers.
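The full_name example could be handled in the model roughly like this. The sketch is in Python, and the User class is hypothetical; only the name_given and name_family fields come from the description above.

```python
from dataclasses import dataclass


@dataclass
class User:
    name_given: str
    name_family: str

    @property
    def full_name(self):
        # Built in application code, so the query only has to select
        # the two raw columns.
        return f"{self.name_given} {self.name_family}"


# A row as it might come back from "SELECT name_given, name_family ...".
row = ("Ada", "Lovelace")
user = User(*row)
display_name = user.full_name
```

The query stays a plain two-column SELECT, and the formatting rule lives in one place in the application.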
Many database libraries come set up for multiple database connections, but I've never actually known of a scripting application that needed to connect to two databases during its run. (Compiled, daemon-running languages are a different matter.)
I understand having database slaves so that you can spread the load out, but usually only one of them is chosen on startup to handle that script's needs.
So why would a PHP or Ruby application need to connect to more than one database? Or rather, why would you split your data up among several databases?
The only thing I can think of is bad design from a slowly evolving system that started off in multiple separate parts.
Are you talking about different physical database servers or different databases in the "schema" sense?
Regarding physical servers, if you're using MySQL replication you might write to a master and always read from a slave. This helps split the load among the databases.
The simple answer is "scalability".
The ready availability of replication and clustering in a number of database products makes multiple database use a definite 'this must be possible'. Any decent ORM should know how to connect to multiple databases as required.
But even when the main application doesn't connect to more than one, there will often be other needs that do. Report generation, either scripted or ad hoc, often involves queries that run for a long time. These are best run on database replicas dedicated to (and configured for) these queries so they don't disrupt the main application.
Another good use is a type of scripted processing. Many apps will have a regular process that needs to rummage through a large part of the database. Whilst updates obviously have to go to the master, the big read queries can be run off a replica.
Of course, the obvious need is simple performance. I oversaw a webapp and database that grew from surviving comfortably on one MySQL database on a 32-bit dual-core machine with 3GB to needing two 8-core 64-bit servers with 8GB. Once it reached this stage, it relied on the database handler directing traffic to both servers. We had a window of about 50 minutes in a day where it could survive on just one database.
I have a Ruby application that connects to multiple databases. One database contains user login credentials (which is shared between several other projects). Another database contains archived data that my application tracks and compares (that only my application accesses). Another database contains data regarding physical machine resources which my application uses to generate new data (these resources are used by several different applications). By splitting the data into multiple databases, different applications only access the data that they need to be accessing.
It is all too frequently the case that some of the data you need is stored in The Wrong Database. Sometimes it's personnel records in a PeopleSoft (Oracle) database. Maybe it's Enterprise CRM data on Informix. Or some departmental database stored in MS SQL Server. Whatever it is, it's in a different database, but you still need access (hopefully read-only).
Unless your primary database is magic-based, it isn't going to be able to provide you with remote table access for every other database out there. (Most will only provide remote access to other databases of the same type, eg: MySQL->MySQL.) When that all too frequent situation occurs, you'll have no other option but to have multiple database connections, and be glad that your framework supports it.
I have a site that connects to two databases. One powers the website content (CMS DB); the other drives a web application that runs within the site (large amounts of non-CMS data). In fact, the latter uses replication.
I don't feel that's bad design. If one set of data has no relation to the other, then it makes sense even from a pure organization perspective to house it in a separate DB. Otherwise, people would just put all their tables in one DB.
For added security, I always create two accounts for every database: a read-only account (good for SELECT) and a read-write account (for SELECT, UPDATE, INSERT, DELETE and whatever else I might need). On some pages, I may need to use both accounts, thus I will consume two connections for only one database.
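As a rough illustration of keeping a read-only and a read-write connection to the same database, here is a Python sketch using SQLite, whose URI syntax supports opening a file with mode=ro. This is an analogy only: in MySQL the equivalent would be two accounts with different privileges. The file and table names are invented for the example.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "site.db")

# Read-write handle: used only by pages that actually change data.
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE pages (title TEXT)")
rw.execute("INSERT INTO pages VALUES ('Home')")
rw.commit()

# Read-only handle: opening the file with mode=ro makes SQLite
# refuse any write attempted through this connection.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
titles = [row[0] for row in ro.execute("SELECT title FROM pages")]

try:
    ro.execute("INSERT INTO pages VALUES ('Defaced')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```

A page that only displays data would use the read-only handle, so even a successful injection through that page could not modify anything.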
Well, reading from one and writing to another is a very common use case. It's easy and fun to write a data access layer that reads from one connection (reading from the slave), and writes to another (the master). A single script might make multiple reads before writing -- perhaps some lookups are necessary for validation, for instance.
Scripting languages are also frequently used for integration. You might have two off-the-shelf codebases, both of which want to maintain their own database. Your integration code might want to talk to both of them.
You can usually design your way out of using more than one connection, but in general I don't see anything fundamentally wrong with using connections to more than one database.
There are other reasons to have multiple databases. We have one application that everyone can access. We also have client databases that are very different from client to client. It is easier to maintain the application that all clients use (and which is maintained by a different team) if the client-specific data is separated out into their own databases. It is also easier to move a client to a new server when they become a large enterprise client, rather than one of the smaller clients who run on a server with many other clients.
Further, there are types of data that are transactional and need to be in databases that are set to full recovery mode with full transaction logging. Other data is only populated from imports and does not need transactional logging, which might slow down the system as the log grew enough to handle a 10,000,000 record import. These are often split out to a separate database so they can be in simple recovery mode: it is not necessary to recover data from the transaction log if there is a problem, since it can easily be recovered by re-running the import.
Then data is split out into data warehouses, which are optimized for reporting rather than transactions. Again, these reporting databases are usually separate databases (often on separate servers).
Then you have the databases for multiple different COTS applications (we have accounting databases, credit card transaction processing databases, HR databases, our project management database). A particular website might need to access more than one of these or transfer information from one to the other. Believe me, vendors won't let you copy their database structure into one database to rule them all.
We have several hundred databases here on many different servers.
I'm currently creating a website for a social project in Switzerland.
Before there is an overflow of users, I want to prepare the application to scale.
I have answered many questions by myself, but some are left.
Let me explain what I want to do.
First
At the beginning, the application will have only one server (for a short time) with DNS, PHP, MySQL, data, and memcache.
Second
Then I will split them in two
DNS, Mysql, memcache
Data, PHP
Third
Here is the problem: I don't know exactly how to set this step up so that the application keeps running well.
I could do :
Front : Load Balancer, memcache, DNS
Web 1 : PHP, DATA
Web 2 : PHP, DATA
Mysql
This would be the scheme, all PHP sessions are kept in the DB.
BUT, how do I sync the data?
Do I run rsync to keep them up to date?
Do I put them on a separate disk (network disk) to be sure?
But in that case, how do I handle user uploads?
And if the website becomes more successful and we have to move to a larger setup, wouldn't that create some latency on updates?
Or would it be a good idea to go directly to Amazon Web Services?
Some info:
I use CodeIgniter as the framework.
I use Linux on the web server (distribution not chosen yet, but it should be Debian).
Thanks in advance for your answers.
According to Wikipedia, Switzerland has 4.6 million German speakers, 1.5 million French speakers, and .5 million speakers of Italian, Romansch and other languages. So I suspect you'll find that a single server will fit your needs. Guess what percentage of the population will visit your site every month or every day to get a sense of how big you can get before running into scaling issues.
So, I don't think you need to worry about scaling yet! Bonus: The time you don't spend worrying about this problem, you can use to solve other problems for your users.
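A back-of-the-envelope version of that estimate, written as a short Python calculation. Only the population figures come from the numbers quoted above; the visitor share, pages per visit, and peak factor are pure guesses chosen for illustration and should be replaced with your own.

```python
# Speakers quoted above: German + French + Italian/Romansch/other.
population = 4_600_000 + 1_500_000 + 500_000

# Assumptions for illustration only: 1% of the population visits on a
# given day and each visitor loads 10 pages.
daily_visitor_share = 0.01
pages_per_visit = 10

daily_visitors = population * daily_visitor_share
page_views_per_day = daily_visitors * pages_per_visit
average_per_second = page_views_per_day / 86_400   # seconds in a day

# Traffic is bursty; assume the peak is 10x the average (also a guess).
peak_per_second = average_per_second * 10
```

With these guesses the site would see about 660,000 page views a day, which is under 8 per second on average and under 80 per second at the assumed peak.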
There are a few common paths to scaling web services up, in order of what sites like Flickr and Facebook seem to use:
Split servers based on concepts (API, login, media files, ads, static pages, dynamic pages)
Split databases based on concepts that don't need to be JOINed (logins, long term reporting, page data, etc.)
Compile/optimize your PHP and other resources (sprites, compiled css, zend)
Add caching (front end, back end)
Add delegation (round robin, etc.)
But, before scaling, measure. Set up tests, calculate your capacity, and don't optimize before you need to.
I see some questionable things:
You have one SQL server, and you are storing sessions in a database on a site where you expect extremely high volume. How many queries does that take to produce a single page if someone is logged in and what is the expected slow down when you eventually employ MySQL replication?
If using a cluster FS, everything is 'just kept' in sync. You won't end up with build A on webserver 1 while build B on webserver 2 breaks. If you are really expecting that much traffic, in the time it takes to upload a change, then sync all nodes, you just pissed off a thousand people.
I've deployed apps running on clusters using OCFS2 with over 40 nodes without issue, and OCFS2 is not exactly the 'best' cluster FS available. Check out Lustre and consider keeping sessions on disk.
Remember you can mount/share folders.
What data would you be syncing?
You might consider putting data on the database machine or other machine. The db machine is usually a good idea at first since it is likely to have greater IO than a regular web server.
It is probably a good idea to set up a SAN or similar so your data stays in one place. Multiple copies of data are a pain to deal with. Going this route means you can put the db files there too.