I am working on a project that has to work both online and offline because the Internet connection is unstable. One possible solution I have come up with is to create two similar databases, one online and one offline, and sync the two. Is this a good method, or are there better options?
I have researched online on the subject but I haven't come across anything substantive. One useful link I found was on database Replication. But I want the offline version to detect Internet presence and sync accordingly.
Please can you help me find solutions or clues to solve my problem?
I'd suggest you have an online storage for syncing and a local database (browser IndexedDB, SQLite in a program, or something similar). Log all your changes in the local database, and keep a record of which data was entered after the last sync.
When you have a connection, sync all new data with the online storage at set intervals (say once every 5 minutes, or as a constant stream if you have the bandwidth/CPU capacity).
When the user logs in from a "fresh" location, the online database pushes all data to the client, which fills the local database with it and then resumes normal syncing.
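A minimal sketch of the "log locally, sync what is new" idea, using Python's built-in sqlite3 as the local database. The `notes` table, the `synced` flag and the `upload` callable are assumptions made for illustration; `upload` stands in for whatever call sends one row to the online storage and is expected to raise ConnectionError while offline.

```python
import sqlite3

def open_local_db(path=":memory:"):
    """Open the local database and create the (hypothetical) notes table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " synced INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def add_note(db, body):
    """Store a change locally; new rows start out unsynced (synced = 0)."""
    cur = db.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    db.commit()
    return cur.lastrowid

def sync_pending(db, upload):
    """Send every unsynced row through `upload` and mark it synced on success."""
    rows = db.execute(
        "SELECT id, body FROM notes WHERE synced = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, body in rows:
        try:
            upload({"id": row_id, "body": body})
        except ConnectionError:
            break  # still offline: stop and retry on the next interval
        db.execute("UPDATE notes SET synced = 1 WHERE id = ?", (row_id,))
        sent += 1
    db.commit()
    return sent
```

Calling `sync_pending` from a timer every few minutes gives the interval-based sync described above; rows that could not be sent simply stay flagged as unsynced.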
Plan A: Primary-Primary replication (formerly called Master-Master). You do need to be careful with PRIMARY KEYs and UNIQUE keys. While the "other" machine is offline, you could write conflicting values to a table. Later, when they try to sync up, replication will freeze, requiring manual intervention. (Not a pretty sight.)
Plan B: Write changes to some storage other than the db. This suffers the same drawbacks as Plan A, plus there is a bunch of coding on your part to implement it.
Plan C: Galera cluster with 3 nodes. When all 3 nodes are up, all can take writes. If one node goes down, or network problems make it seem offline to the other two, it will automatically become read-only. After things get fixed, the sync is done automatically.
Plan D: Only write to a reliable Primary; let the other be a readonly Replica. (But this violates your requirement about an "unstable Internet".)
None of these perfectly fits the requirements. Plan A seems to be the only one that has a chance. Let's look at that.
If you have any UNIQUE key in any table and you might insert new rows into it, the problem exists. Even something as innocuous as a 'normalization table' wherein you insert a name and get back an id for use in other tables has the problem. You might do that on both servers with the same name and get different ids. Now you have a mess that is virtually impossible to fix.
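To make the normalization-table problem concrete, here is a small sketch. It uses two in-memory SQLite databases as stand-ins for the two MySQL servers (an assumption purely for the demo); the `names` table and `get_or_create_id` helper are hypothetical.

```python
import sqlite3

def make_server():
    """One independent 'server' with a normalization table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE names ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT UNIQUE NOT NULL)"
    )
    return db

def get_or_create_id(db, name):
    """Insert the name if it is new, then return its id."""
    db.execute("INSERT OR IGNORE INTO names (name) VALUES (?)", (name,))
    return db.execute(
        "SELECT id FROM names WHERE name = ?", (name,)
    ).fetchone()[0]

server_a = make_server()
server_b = make_server()

# While disconnected, the same names arrive in a different order on each side.
for name in ("alice", "bob"):
    get_or_create_id(server_a, name)
for name in ("bob", "alice"):
    get_or_create_id(server_b, name)

id_on_a = get_or_create_id(server_a, "bob")  # second insert on server A
id_on_b = get_or_create_id(server_b, "bob")  # first insert on server B
```

The same name now has a different id on each server, and every other table that stored those ids disagrees with its counterpart once the two sides try to sync.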
Not sure if it's outside the scope of the project, but you can try these:
https://pouchdb.com/
https://couchdb.apache.org/
" PouchDB is an open-source JavaScript database inspired by Apache CouchDB that is designed to run well within the browser.
PouchDB was created to help web developers build applications that work as well offline as they do online.
It enables applications to store data locally while offline, then synchronize it with CouchDB and compatible servers when the application is back online, keeping the user's data in sync no matter where they next login. "
I have a master database on a cloud server that holds the data for a number of different schools.
It has a dashboard with the details of each school, where their information and other data can be edited.
Each of those schools is also deployed at its own location, on a local server.
The dashboard there can only edit the specific school deployed on that local server: its information and other data.
What I want is to synchronize changes from the cloud to the corresponding school's local server whenever something is changed, and likewise from the local server to the cloud.
Note: If you have ever tried Evernote, you can edit a note on whatever device you're using and it will still synchronize when you have internet or when you manually click synchronize.
When the local server has no internet connection and some school data is edited, the data on the local and cloud servers should be synchronized once the internet is back up.
That's the logic that I'm pursuing to have.
Could anyone shed some light on where to start? I couldn't think of any solution that fits my problem.
I also thought of using PHP to foreach-loop over every table and the rows that correspond to the current date and time, but I know that would be very bad.
Edited: I deleted references / posts of other SO questions regarding this matter.
The application pegs that I found are
Evernote
Todoist
Servers:
Local Server Computer: Windows 10 (Deployed in Schools)
Cloud Server: Probably some dedicated hosting that uses phpMyAdmin
Not to be picky, but please keep in mind that you're talking to a newbie at the master-to-slave database process. I don't have experience with this.
When we used to do this we would:
Make sure every table we wanted to sync had datetime columns for Created; Modified; & Deleted. They would also have a boolean isDeleted column (so rather than physically delete records we would flag it to true and ignore it in queries). This means we could query for any records that have been deleted since a certain time and return an array of these deleted IDs.
In each DB (Master and slave) create a table that stores the last successful sync datetime. In the master this table stores multiple records: 1 for each school, but in the slave it just needs 1 record - the last time it synced with the master.
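As a rough sketch of those two points, here is how the "what changed since the last sync" query could look, with SQLite standing in for the real database. The `students` table is hypothetical, and timestamps are assumed to be ISO 8601 UTC strings so that plain string comparison orders them correctly.

```python
import sqlite3

SCHEMA = """
CREATE TABLE students (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    created   TEXT NOT NULL,   -- ISO 8601 UTC
    modified  TEXT NOT NULL,
    deleted   TEXT,            -- set when the row is flagged as deleted
    isDeleted INTEGER NOT NULL DEFAULT 0
)
"""

def changes_since(db, last_sync):
    """Return ids of rows added, updated and soft-deleted after last_sync."""
    added = [r[0] for r in db.execute(
        "SELECT id FROM students"
        " WHERE created > ? AND isDeleted = 0 ORDER BY id",
        (last_sync,))]
    updated = [r[0] for r in db.execute(
        "SELECT id FROM students"
        " WHERE modified > ? AND created <= ? AND isDeleted = 0 ORDER BY id",
        (last_sync, last_sync))]
    deleted = [r[0] for r in db.execute(
        "SELECT id FROM students"
        " WHERE isDeleted = 1 AND deleted > ? ORDER BY id",
        (last_sync,))]
    return {"added": added, "updated": updated, "deleted": deleted}
```

Because rows are only flagged rather than physically removed, the deleted ids are still there to be reported to the other side.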
In your case every so often each of the slaves would:
1. Call a webservice (a URL) on the master, let's say called 'helloMaster'. It would pass in the school name (or some specific identifier), the last time it successfully synced with the master, and authentication details (for security), and expect a response from the master saying whether the master has any updates for the school since the datetime provided. Really the point here is just to get an acknowledgement that the master is available and listening (i.e. the internet is still up).
2. Then the slave would call another webservice, let's say called 'sendUpdates'. It would again pass in the school name, the last successful sync (plus security authentication details) and three arrays of any added, updated and deleted records since the last sync. The master just acknowledges receipt. If receipt is acknowledged then the slave moves to step 3; otherwise the slave tries step 1 again after a pause of some duration. So now the master has the updates from the slave. Note: it is up to the master to decide how to merge any records if there are conflicts with its pending slave updates.
3. The slave then calls a webservice, let's say 'getUpdates'. It passes in the school name, the last successful sync and security authentication details, and the master returns three arrays of any added, updated and deleted records it has, which the slave is expected to apply to its database.
4. Finally, once the slave has tried to update its records, it notifies the master of success/failure through another webservice, say 'updateStatus'. If successful, the master returns a new sync date for the slave to store (this will exactly match the date the master stores in its table). If it fails, the error is logged in the master and we go back to step 1 after a pause.
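The four calls above could be wired together roughly like this. It is only a sketch: the `Master` class is an in-memory stand-in for the real webservices, its method names mirror the made-up service names, the integer `clock` replaces real UTC timestamps, and authentication is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Master:
    """In-memory stand-in for the master's four webservices (hypothetical)."""
    clock: int = 100                                # fake "now"
    last_sync: dict = field(default_factory=dict)   # school -> last good sync
    inbox: list = field(default_factory=list)       # updates sent by slaves
    outbox: dict = field(default_factory=dict)      # school -> pending updates

    def hello_master(self, school, last_sync):
        return True  # getting an answer at all means the master is reachable

    def send_updates(self, school, last_sync, changes):
        self.inbox.append((school, changes))  # conflict merging would go here
        return True                            # acknowledge receipt

    def get_updates(self, school, last_sync):
        empty = {"added": [], "updated": [], "deleted": []}
        return self.outbox.get(school, empty)

    def update_status(self, school, ok):
        if not ok:
            return None                        # a real master would log this
        self.outbox.pop(school, None)          # delivered, so stop resending
        self.last_sync[school] = self.clock
        return self.clock


def sync_once(master, school, last_sync, local_changes, apply_updates):
    """Run steps 1-4 once. Returns the new sync time, or None to retry later."""
    try:
        if not master.hello_master(school, last_sync):                # step 1
            return None
        if not master.send_updates(school, last_sync, local_changes):  # step 2
            return None
        updates = master.get_updates(school, last_sync)               # step 3
        ok = apply_updates(updates)
        return master.update_status(school, ok)                       # step 4
    except ConnectionError:
        return None
```

The slave stores the returned value as its new last-sync time only when it is not None; otherwise it keeps the old value and tries again after a pause.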
I have left out some detail about error handling, getting the times accurate across all devices (there might be different time zones involved), and some other bits and pieces, but that's the gist of it.
I may make refinements after thinking on it more (or others might edit my post).
Hope that helps at least.
I suggest you go with the trivial solution, which in my view is:
Create a SQLite database (or MySQL, or any database of your choice) on the local server.
Keep an always-running thread that pings your master database (makes an API call) every 5 minutes (depending on how much delay is acceptable).
With that thread you can detect whether you're connected to the internet or not.
If connected to the internet:
a) Send local changes with the request to the master server. This master server is an application server that applies the changes from the school's local machine (received through the API call) to the master database, after whatever validation your application requires.
b) Receive updated changes from the server in the API response. These changes are served after conflicts are resolved (for example, if data on the school server was updated earlier than the same data in the master database, which one you accept depends on your requirements).
If not connected to the internet, keep storing changes in the local database and reflect them in the application running at the school. When you get connected again, push those changes to the master server and pull the applicable changes from it.
This is complicated to do on your own, but if the scale is small I would prefer implementing your own APIs for the database applications that connect in this manner.
A better solution would be Google Firebase, a real-time database that is updated asynchronously whenever there is a change on any machine, though it can cost more than is justified if you don't really need it. It will, however, give you Evernote-style real-time editing features for your database systems.
This is not a problem that can be solved by database replication.
Generally speaking, database replication can operate in one of two modes:
Master/slave replication, which is what MySQL uses. In this mode, all writes must be routed to a single "master" server, and all of the replica databases receive a feed of changes from the master.
This doesn't suit your needs, as writes can only be made to the master. (Modifying one of the replicas directly would result in it becoming permanently out of sync with the master.)
Quorum-based replication, which is used by some newer databases. All database replicas connect to each other. So long as at least half of all replicas are connected (that is, the cluster has reached "quorum"), writes can be made to any of the active databases, and will be propagated to all of the other databases. A database that is not connected will be brought up to date when it joins the quorum.
This doesn't suit your needs either, as a disconnected replica cannot be written to. Worse, having more than half of all replicas disconnect from the master would prevent the remaining databases from being written to either!
What you need is some sort of data synchronization solution. Any solution will require some logic -- which you will have to write! -- to resolve conflicts. (For instance, if a record is modified in the master database while a school's local replica is disconnected, and the same record is also modified there, you will need some way to reconcile those differences.)
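As one example of such conflict logic, here is a sketch of a "last write wins" merge. This is just one possible policy, not the only or the best one; the `modified` field and the dict-of-records layout are assumptions for illustration, and timestamps are assumed to be comparable ISO 8601 UTC strings.

```python
def resolve(cloud, local):
    """Pick the winning version of one record that exists on either side."""
    if cloud is None:
        return local
    if local is None:
        return cloud
    # Last write wins; on a tie the cloud copy is kept.
    return local if local["modified"] > cloud["modified"] else cloud

def merge(cloud_rows, local_rows):
    """Merge two {record_id: record} dicts into one reconciled dict."""
    merged = {}
    for key in cloud_rows.keys() | local_rows.keys():
        merged[key] = resolve(cloud_rows.get(key), local_rows.get(key))
    return merged
```

Last write wins silently throws away the older edit, so it only suits data where that is acceptable; per-field merging or asking the user are the usual alternatives.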
No need for any complicated setup or APIs. MySQL allows you to easily replicate your database. MySQL will make sure the replication is done correctly and promptly whenever internet is available (and it's fast too).
There are:
Master - slave: the master edits and the slave reads; in other words, one-way synchronization from master to slave.
Master - master: master1 edits, and master2 both reads and edits; in other words, two-way synchronization. Both servers push and pull updates.
Assuming your cloud server has a schema for each school, and each schema is accessible by its own username and password, i.e. db_school1, db_school2.
Now you have the option to replicate only a selected database schema from your cloud to the local master. In your case, school one's local master would only set "replicate-do-db=db_school1".
If you want to replicate only a specific table, MySQL also has an option for that: "replicate-do-table".
The actual replication process is very easy, but it can get very deep when you have different scenarios.
A few things to take note of: server ids, and a different auto-increment value on each server to avoid conflicts with new records, i.e. Master1 generates records on odd numbers and Master2 on even numbers, so there won't be duplicate primary key issues. Also server-down alerts/monitoring and error skipping.
I'm not sure if you are on Linux or Windows. I wrote a simple C# application which checks whether any of the masters is not replicating or has stopped for any reason, and sends an email. Monitoring is crucial!
Here are some links for master-master replication:
https://www.howtoforge.com/mysql_master_master_replication
https://www.digitalocean.com/community/tutorials/how-to-set-up-mysql-master-master-replication
Also worth reading is this info on optimised table-level replication:
https://dba.stackexchange.com/questions/37015/how-can-i-replicate-some-tables-without-transferring-the-entire-log
hope this helps.
Edit:
The original version of this answer proposed MongoDB; but with further reading MongoDB is not so reliable with dodgy internet connections. CouchDB is designed for offline documents, which is what you need - although it's harder to get going than MongoDB, unfortunately.
Original:
I'd suggest not using MySQL but deploy a document store designed for replication such as CouchDB - unless you go for the commercial MySQL clustering services.
Being a lover of the power of MySQL I find it hard to suggest you use something else, but in this case, you really should.
Here is why -
Problems using MySQL replication
While MySQL has good replication (and that's most likely what you should be using if you're synchronizing a MySQL database - as recommended by others) there are some things to watch out for.
"Unique Key" clashes will give you a massive headache; the most likely cause of this is "Auto Incrementing" IDs that are common in MySQL applications (don't use them for syncing operations unless there is a clear "read+write"->"read-only" relationship - which there isn't in your case.)
Primary keys must be generated by each server but unique across all servers. Possibly by adding a mix of a server identifier and a unique ID for that server (Server1_1, Server1_2, Server1_3 etc will not clash with Server2_1)
MySQL sync only supports one-way unless you look at their clustering solutions (https://www.mysql.com/products/cluster/).
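A tiny sketch of the server-prefixed key idea from the list above. The `KeyGenerator` class is hypothetical, and in a real system the counter would have to be persisted (for example in the database) so that it survives restarts.

```python
import itertools

class KeyGenerator:
    """Hands out keys like 'Server1_1', 'Server1_2', ... for one server."""

    def __init__(self, server_id):
        self.server_id = server_id
        self._counter = itertools.count(1)  # in-memory only in this sketch

    def next_key(self):
        return f"{self.server_id}_{next(self._counter)}"

server1 = KeyGenerator("Server1")
server2 = KeyGenerator("Server2")
keys_1 = [server1.next_key() for _ in range(3)]
keys_2 = [server2.next_key() for _ in range(3)]
```

Since the server identifier is part of every key, two servers can keep creating records while disconnected without ever producing the same primary key.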
Problems doing it "manually" with time stamping the record.
Another answer recommends keeping "Time Updated" records. While I've done this approach there are some big gotchas to be careful of.
"Unique Key" clashes (as mentioned above; same problems - don't use them except primary keys, and generate primary keys unique to the server)
Multiple updates on multiple servers need to be precisely time-synced and clashes handled according to rules. This can be a headache.
What happens when updates are received way out-of-order; which fields have been updated, which weren't? You probably don't need to update the whole record, but how do you know?
If you must, try one of the commercial solutions as mentioned in answers https://serverfault.com/questions/58829/how-to-keep-multiple-read-write-db-servers-in-sync and https://community.spiceworks.com/topic/352125-how-to-synchronize-multiple-mysql-servers and Strategy on synchronizing database from multiple locations to a central database and vice versa (etc - Google for more)
Problems doing it "manually" with journalling.
Journalling is keeping a separate record of what has changed and when. "Database X, Table Y, Field Z was updated to value A at time B" or "Table A had new record added with these details [...]". This allows you much finer control of what to update.
if you look at database sync techniques, this is actually what is going on in the background; in MySQL's case it keeps a binary log of the updates
you only ever share the journal, never the original record.
When another server receives a journal entry, it has a much clearer picture of what has happened before/after and can replay updates and ensure you get the correct details.
problems arise when the journalling/database get out of Sync (MySQL is actually a pain when this happens!). You need to have a "refresh" script ready to roll that sits outside the journalling that will sync the DB to the master.
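Here is a rough sketch of what journalling and replaying could look like. The `JournalEntry` fields and the nested-dict `state` are assumptions made for illustration; a real journal would also need to deal with deletes and with clocks that disagree between servers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JournalEntry:
    seq: int        # position in the originating server's journal
    table: str
    row_id: str
    column: str
    value: object
    at: str         # ISO 8601 UTC timestamp of the change

def replay(state, entries):
    """Apply entries in time order; the latest write to a column wins."""
    for e in sorted(entries, key=lambda e: (e.at, e.seq)):
        row = state.setdefault(e.table, {}).setdefault(e.row_id, {})
        row[e.column] = e.value
    return state
```

Because each entry names the exact column that changed, entries that arrive out of order can be sorted and replayed, and edits to different columns of the same record do not overwrite each other.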
It's complicated. So...
Solution: Using a document store designed for replication, e.g. CouchDB
Bearing all that in mind, why not use a document store that already does all that for you? CouchDB has support and handles all the journalling and syncing (http://docs.couchdb.org/en/master/replication/protocol.html).
There are others out there, but I believe you'll end up with fewer headaches and errors than with the other solutions.
Master to master replication in MySQL can be accomplished without key violations while using auto_increment. Here is a link that explains how.
If you have tables without primary keys I'm not sure what will happen (I always include auto_increment primary keys on tables)
http://brendanschwartz.com/post/12702901390/mysql-master-master-replication
The auto-increment-offset and auto-increment-increment settings affect the auto_increment values as shown in the config samples from the article...
# Server 1
server_id = 1
log_bin = /var/log/mysql/mysql-bin.log
log_bin_index = /var/log/mysql/mysql-bin.log.index
relay_log = /var/log/mysql/mysql-relay-bin
relay_log_index = /var/log/mysql/mysql-relay-bin.index
expire_logs_days = 10
max_binlog_size = 100M
log_slave_updates = 1
auto-increment-increment = 2
auto-increment-offset = 1

# Server 2
server_id = 2
log_bin = /var/log/mysql/mysql-bin.log
log_bin_index = /var/log/mysql/mysql-bin.log.index
relay_log = /var/log/mysql/mysql-relay-bin
relay_log_index = /var/log/mysql/mysql-relay-bin.index
expire_logs_days = 10
max_binlog_size = 100M
log_slave_updates = 1
auto-increment-increment = 2
auto-increment-offset = 2
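To see why those two settings keep the ids apart, here is the arithmetic as a short sketch. It assumes empty tables and an offset no larger than the increment, in which case the generated values are offset, offset + increment, offset + 2 * increment, and so on; the function name is made up.

```python
from itertools import count, islice

def auto_increment_values(offset, increment):
    """Ids one server would hand out: offset, offset + increment, ..."""
    return count(start=offset, step=increment)

# Values matching the two config samples (increment 2, offsets 1 and 2).
server1_ids = list(islice(auto_increment_values(1, 2), 5))
server2_ids = list(islice(auto_increment_values(2, 2), 5))
```

Server 1 only ever produces odd ids and server 2 only even ones, so inserts made on both sides while disconnected cannot collide on the auto_increment column.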
I am currently thinking up a system to allow online voting for my old high school (a mock award ceremony really). Due to a restrictive school board I can guarantee that MySQL will not be an option to store votes. I am also under the assumption that, should votes be stored in local files, data will be overwritten when the file is written to multiple times at the same time (which is a large possibility).
Does anyone have any suggestions as to how I might go about this? Preferably a PHP-based solution, given the school board's restrictions. Please note the data will only need to be accessible for a few hours on a continuously running web server, so if the data is RAM-like (for lack of a better term) that would be fine.
While I am tempted to reject the premise of the question like some commenters have, here's an answer (I'm shamelessly trying to earn 200 reputation to try to help get a new site launched):
Write a recordVote function that stores each vote in its own file in a directory using a unique id in the file name (PHP doesn't have one guaranteed to yield truly unique GUIDs on all platforms, so use something like https://gist.github.com/dahnielson/508447).
When the polls close, run a tallyVote routine to compile the count of votes by reading all files in the directory.
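The one-file-per-vote idea looks roughly like this. The sketch is in Python rather than PHP purely to show the logic; the `.vote` extension and the function names are assumptions, and Python's uuid4 is used for the unique file names.

```python
import tempfile
import uuid
from collections import Counter
from pathlib import Path

def record_vote(directory, candidate):
    """Write one vote to its own uniquely named file, so writers never collide."""
    path = Path(directory) / f"{uuid.uuid4().hex}.vote"
    # Mode "x" fails rather than overwrites if the name somehow already exists.
    with open(path, "x", encoding="utf-8") as f:
        f.write(candidate)
    return path

def tally_votes(directory):
    """Count the votes by reading every vote file in the directory."""
    return Counter(
        p.read_text(encoding="utf-8") for p in Path(directory).glob("*.vote")
    )

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as polls:
        for name in ["alice", "bob", "alice"]:
            record_vote(polls, name)
        print(tally_votes(polls))
```

Since no two requests ever write to the same file, there is nothing to lock and nothing to overwrite; the cost is one small file per vote, which is fine for a poll that only runs for a few hours.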
I'm going to try to make this as brief as possible while covering all points - I work as a PHP/MySQL developer currently. I have a mobile app idea with a friend and we're going to start developing it.
I'm not saying it's going to be fantastic, but if it catches on, we're going to have a LOT of data.
For example, we'd have "clients," for lack of a better term, who would have anywhere from 100-250,000 "products" listed. Assuming the best, we could have hundreds of clients.
The client would edit data through a web interface, the mobile interface would just make calls to the web server and return JSON (probably).
I'm a lowly cms-developing kinda guy, so I'm not sure how to handle this. My question is more or less about performance; the most I've ever seen in a MySQL table was 340k, and it was already sort of slow (granted it wasn't the best server either).
I just can't fathom a table with 40 million rows (and potential to continually grow) running well.
My plan was to have a "core" database that held the name of the "real" database, so the user would come in and try to access a client's data, it would go to the core database and figure out which database to get the information from.
I'm not concerned with data separation or data security (it's not private information)
Yes, it's possible and my company does it. I'm certainly not going to say it's smart, though. We have a SAAS marketing automation system. Some clients' databases have 1 million+ records. We deal with a second "common" database that has a "fulfillment" table tracking emails, letters, phone calls, etc with over 4 million records, plus numerous other very large shared tables. With proper indexing, optimizing, maintaining a separate DB-only server, and possibly clustering (which we don't yet have to do) you can handle a LOT of data......in many cases, those who think it can only handle a few hundred thousand records work on a competing product for a living. If you still doubt whether it's valid, consider that per MySQL's clustering metrics, an 8 server cluster can handle 2.5 million updates PER SECOND. Not too shabby at all.....
The problem with using two databases is juggling multiple connections. Is it tough? No, not really. You create different objects and reference your connection classes based on which database you want. In our case, we hit the main database's company class to deduce the client db name and then build the second connection based on that. But, when you're juggling those connections back and forth you can run into errors that require extra debugging. It's not just "Is my query valid?" but "Am I actually getting the correct database connection?" In our case, a dropped session can cause all sorts of PDO errors to fire because the system can no longer keep track of which client database to access. Plus, from a maintainability standpoint, it's a scary process trying to push table structure updates to 100 different live databases. Yes, it can be automated. But one slip-up and you've knocked a LOT of people down and made a ton of extra work for yourself. Now, calculate the extra development and testing required to juggle connections and push updates....that will be your measure of whether it's worthwhile.
My recommendation? Find a host that allows you to put two machines on the same local network. We chose Linode, but who you use is irrelevant. Start out with your dedicated database server, plan ahead to do clustering when it's necessary. Keep all your content in one DB, index and optimize religiously. Finally, find a REALLY good DB guy and treat him well. With that much data, a great DBA would be a must.
This is something I am really curious about and I do not really understand how is that possible.
So let's say I am the owner of Facebook (ahah) and I have millions of people visiting my website every day, thousands and thousands of images, videos, logs etc..
How do I store all this data?
Do I have more databases in different servers around the world and then I connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example I know that Facebook has a lot of data centers around the world and hundreds of servers..
How do they connect to these servers? Are the profiles stored in different locations and when I connect to my profile, I will then be using that specific server? Or is there one main server that has the support of other hundreds of servers around the world?
Is there a way to use PHP in a way that I will connect to different servers and to different mySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I might one day work on a successful website, I really want to know what I will have to do, and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question but not from Facebook point of view since their architecture is pretty much known.
First thing you have to know is that you would have to distribute the workload of your web application. Question is how, so in order to determine what's going to be slow, you have to divide your app in segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have an HTTP server software, let's say Apache and it handles incoming connections. Since Apache creates a thread per connected user, it requires certain amount of memory for that operation. Eventually, it will run out of memory and then shit hits the fan, stuff stops working, your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new computers in order to scale-out. You solved your first problem.
Next part is the actual layer that does the work. Accepts input from the user and saves it somewhere (MySQL) and that's the biggest problem you'll have - why?
Due to the database.
Databases store their data on mediums such as hard drives. Hard drives, be it an SSD or a mechanical one, are limited by their ability to write or retrieve data. RAM, by comparison, operates at a transfer rate of around 6GB/sec if I'm not mistaken, and its seek time is also much, much lower than an HDD's.
Therefore, if you have an X amount of users asking for a piece of information and you can only deliver it at a certain rate - your app crashes, or it becomes unresponsive and the layer handling database queries becomes slow since the hardware cannot match the speed at which you need the data.
What are the options here? There are many, I won't mention all of them
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile, what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM, it's easily scalable and what not. Most important, it's damn quick!
Use different solutions next to MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk write (they keep cached copy of data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
Topic about MySQL vs "insert database or whatever here" is broad, I don't want to go into that but remember - every single one of data stores out there saves data on the hard drive eventually. The difference (physical of course) is how they optimize their flushing to the disk itself.
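To illustrate the caching option from the list above, here is a minimal cache-aside sketch. It keeps values in a plain in-process dict with a time-to-live, which is only a stand-in for something like Memcached; the `TTLCache` class and its `load` callback are made up for the example.

```python
import time

class TTLCache:
    """Tiny in-process cache: serve from RAM, reload after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._items = {}            # key -> (expires_at, value)

    def get_or_load(self, key, load):
        now = self.clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]           # cache hit: the database is not touched
        value = load(key)           # cache miss: run the real query
        self._items[key] = (now + self.ttl, value)
        return value
```

With a 60 second TTL, a profile that is viewed a thousand times a minute costs one database query per minute instead of a thousand, at the price of showing data that may be up to a minute old.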
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization on PHP level is small, even Facebook got small results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively, it is not as fast as it claims to be. Naturally, Facebook guys mentioned that already, so it's no wonder. The advantage they get is because they replaced Apache with their own server built into HipHop. Some people claim "language X is better than language Y" and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widely-spread but it's slow for certain operations (implementing a Trie with over 1 billion entries for example). It's great for things like echo some HTML after parsing the output from the db. It's quick to insert and retrieve data from the database, and that's about 90% of the PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post is more like an informative blog post, I know it's not filled with resources to back up my claims but anyone who did any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things that I said probably won't fit with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
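The effect of round robin DNS can be pictured with a few lines of code. This is only a simulation of the rotation, not a DNS server; the addresses come from the 192.0.2.0/24 documentation range and the `round_robin` helper is made up.

```python
from itertools import cycle

def round_robin(addresses):
    """Return a function that hands out the addresses in rotation."""
    pool = cycle(addresses)
    return lambda: next(pool)

next_server = round_robin(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
first_four = [next_server() for _ in range(4)]
```

Each lookup gets the next address in the list, so requests spread evenly across the servers; note that plain rotation knows nothing about whether a server is overloaded or down.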
When you have a website with so many users you'll already have enough experience to know the answer of the question, you'll also have a lot of money to pay people to find the optimal architecture of your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
For lots of users, you should have a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with lots of users should have 4-12GB available. Also, save resources by closing the desktop environment. If you have this many users, you might want to consider a CDN and also make a database request queue.
Is there any difference between a CMS and high-traffic websites (like news portals) in logic and database design and optimization (PHP and MySQL)?
I have searched for PHP site scalability on Stack Overflow, and memcached comes up in the majority of answers.
Are there techniques for MySQL optimization? (I'm looking for a book on this issue. I have searched on Amazon but I don't know what the best choice is.)
Thanks in advance
this isn't so easy to answer.
there are different approaches and a variety of opinions, but i'll try to cover some common scenarios. but first some basics.
most web applications can be separated into application and database.
database usage can be separated into transactional (oltp) and analytical (olap).
in the best case you can just start a number of application servers and distribute traffic among them. they all have a connection to the same database server and can work independently.
this can be however difficult if you have other shared data, sessions etc.
you can accomplish this by simply adding multiple ip addresses to your domain name in dns.
or you use load balancing techniques to forward the clients to different servers.
application scaling is generally very easy. database is much more complex.
the first thing to do is usually set up one or more replication servers which have the same data as the main database. they can be cascaded but have 1 serious disadvantage: their data is not always up to date. in general it is not more than a few seconds old, but it can be more under load. for many use cases this is fine.
big sites that just display information could just replicate their database to some slave servers, set up some application servers (it's a good practice to run one slave and one application server on the same server and let this application server access this database slave) and everything is fine.
every olap query can be directed to a slave. olap queries are those that don't modify anything and don't need 100% up-to-date data.
writes are different: everything needs to be written to the very same source database server from which every other server gets its copy. for example every comment for an article.
if this bottleneck gets too tight you can go in two directions.
sharding
master-master replication
sharding means you decide on the application server where to store and where to fetch your data.
for example every comment that starts with a gets to server a, b-> b and so on.
that's a stupid example but it's basically how it is. mostly some internal ids are involved.
if possible it's good to shard data so that it can be completely pulled from that server again.
in the example above, if i wanted to have all comments for an article i would have to ask every server a-z and merge the results. this is inefficient but possible, because those servers can be replicated. this is called mapping (you could check the famous google map-reduce algorithm which basically does just this).
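a rough sketch of both cases, sharding by an internal id and asking every shard. python lists stand in for the database servers here, and the `ShardedComments` class is made up for illustration.

```python
class ShardedComments:
    """Comments spread over several 'servers' (plain lists in this sketch)."""

    def __init__(self, shard_count):
        self.shards = [[] for _ in range(shard_count)]

    def _shard_for(self, article_id):
        # Shard by article id so one article's comments share a server.
        return self.shards[article_id % len(self.shards)]

    def add(self, article_id, text):
        self._shard_for(article_id).append((article_id, text))

    def for_article(self, article_id):
        # Single-shard read: only one server has to be asked.
        return [t for a, t in self._shard_for(article_id) if a == article_id]

    def search(self, word):
        # Scatter-gather: ask every shard, then merge the results.
        return [t for shard in self.shards for _, t in shard if word in t]
```

`for_article` is the cheap case where the shard key matches the query; `search` is the expensive case where every server has to answer and the application merges the results.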
master-master replication means that you write your data to different master servers and they synchronize with each other; the data isn't stored separately like it is with sharding.
this has to be done if your application is not able to decide on its own where to store and fetch data.
you just store to any master server, every server gets everything and everybody is happy?
no... because this involves another serious problem.
conflicts! imagine two users enter a comment. commentA gets stored on serverA, commentB gets stored on serverB. which id should we use. which one comes first?
the best is to design an application that avoids these cases and has different keys and stuff.
but what usually happens is conflict resolving, prioritizing and stuff. oracle has a lot of features on this level and mysql is still behind. but trends are going into much more complex data structures like clouds anyway...
well i don't think i explained it well but you should at least get some keywords from the text that you can investigate further.
Sure, there are all sorts of things you can do to optimize your PHP/MySQL web applications for high traffic websites. However, most of them depend on your specific situation, which you haven't given in your question.
Your database should be well structured regardless of whether you have a high-traffic site or not. If you use an off-the-shelf CMS, this is typically fine. Aside from good application architecture, there is no one-size-fits-all solution.