Let's assume I'm developing an AJAX/PHP chess game.
During the game, each move a player makes is notified to the other player, but we are not saving that information. Normally we would store every move in MySQL and show the updated position to the other player.
What I want is to reduce the MySQL load as much as possible; the server is not interested in the moves exchanged between the two players. The server will only save the final result, i.e. who wins.
So what should I do?
I assume you are going to need some sort of persistent storage to save the moves, so the alternatives would be memcache or storing the data in files. The former is lightning fast but needs to be set up on the server. The latter has no real advantage over using a database IMO.
Any form of storage that lives only on the users' computers (e.g. storing the info in cookies or some other kind of client-side persistent storage, see for example here or here) I would find too shaky: the playing session would be destroyed if both clients had to restart their computers, which is not totally out of the question.
Without any hard data showing that the MySQL server is overloaded by this kind of traffic, I would stick with the database. Otherwise, look into memcache.
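If you do go the memcache route, here is a minimal sketch, assuming the PECL Memcached extension, a memcached daemon on localhost, and a key-per-game naming scheme of my own invention:

    <?php
    // Sketch: keep the move list for a game in memcache instead of MySQL.
    // The "game:{id}:moves" key format is illustrative only.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);

    function appendMove(Memcached $cache, int $gameId, string $move): void
    {
        $key     = "game:{$gameId}:moves";
        $moves   = $cache->get($key) ?: [];
        $moves[] = $move;
        $cache->set($key, $moves, 86400); // expire abandoned games after 24h
    }

    function latestMoves(Memcached $cache, int $gameId): array
    {
        return $cache->get("game:{$gameId}:moves") ?: [];
    }

    appendMove($cache, 42, 'e2e4'); // player 1 moves
    appendMove($cache, 42, 'e7e5'); // player 2 responds
    var_dump(latestMoves($cache, 42));

Note that a get-then-set like this is not atomic, so if both players moved at exactly the same moment you would want Memcached's cas() or a per-game lock; for a turn-based game that is rarely an issue.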
We are building a social website using PHP (Zend Framework) and MySQL, with the server running Apache.
There is a requirement that the dashboard fetch data for different events (there are about 12 of them) and update itself for the user. We expect the total number of users to be around 500k to 700k, with about 20% online at any one time on average (at peak times we expect 50% of users to be online).
So the problem is that, in our current design, the event data will be placed in a MySQL database. I don't think running a few hundred thousand queries concurrently on MySQL would be a good idea, even on Amazon RDS. So we are considering using DynamoDB (or Redis or some other NoSQL option) alongside MySQL.
So the question is: would having data in both MySQL and a NoSQL database give us the scalability we need for our web application, or should we consider some other solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache that Amazon provides to give yourself in-memory caching. This will get rid of your database calls and, in a sense, remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events in the browser if possible and display them instead of making another request to the servers.
If it has to be real time then look at ElastiCache and tweak the scaling to work out how many nodes you need for your estimated amount of traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there; if you have some relational information that you need and also a variable-schema system, then you can use both databases, but not to load balance them against each other.
I would also start thinking about the bottlenecks in your architecture and about how well your application will/can scale in the event that you reach your estimated numbers.
I agree with #sean, there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So, I’d say that unless you need to duplicate your data on both MySQL and NoSQL DB’s for reasons other than scalability-related issues, go for a single DB with auto-scaling.
When is it best to use a database as the session store in PHP? I know one case is when I am sharing request load across multiple servers and need to maintain session state across those servers.
An article I am currently reading, http://onlamp.com/pub/a/php/excerpt/webdbapps_8/index.html?page=2, says
Using files as the session store is adequate for most applications in which the numbers of concurrent sessions are limited
I do not get why the number of concurrent sessions should affect the preferred session store.
Are there other reasons why I should choose to store session data in a database?
Storing a large amount of data in files can get slow, as you have to search through large amounts of data that may be unrelated to the page you are currently serving up.
Interfaces for interacting with data in files are also not as streamlined as interfaces to, say, MySQL - MySQL is built for this sort of thing, while flat files aren't really.
Concurrency is another issue - a database will handle it for you when different users of your website try to read and write data simultaneously, which happens more often as the user base grows. Doing these tasks simultaneously is possible with a database, but much more difficult with files; you would end up having to write your own synchronization code, and that can get messy.
If you expect your user base to grow over time, it may be best to design your program around a database from the start, so that you don't have to switch later, once users start waiting on other requests being filled before their page gets loaded.
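To make that concrete, a bare-bones database-backed session handler might look like the sketch below. It assumes PHP 8.1+, PDO, and a sessions table with id, data and last_access columns - the schema is my own choice, not a standard one.

    <?php
    // Sketch: store PHP sessions in MySQL instead of files.
    // Assumed table: CREATE TABLE sessions (
    //     id VARCHAR(128) PRIMARY KEY, data TEXT, last_access INT);
    class DbSessionHandler implements SessionHandlerInterface
    {
        public function __construct(private PDO $db) {}

        public function open(string $path, string $name): bool { return true; }
        public function close(): bool { return true; }

        public function read(string $id): string
        {
            $stmt = $this->db->prepare('SELECT data FROM sessions WHERE id = ?');
            $stmt->execute([$id]);
            return (string) $stmt->fetchColumn();
        }

        public function write(string $id, string $data): bool
        {
            $stmt = $this->db->prepare(
                'REPLACE INTO sessions (id, data, last_access) VALUES (?, ?, ?)');
            return $stmt->execute([$id, $data, time()]);
        }

        public function destroy(string $id): bool
        {
            return $this->db->prepare('DELETE FROM sessions WHERE id = ?')
                            ->execute([$id]);
        }

        public function gc(int $max_lifetime): int|false
        {
            $stmt = $this->db->prepare('DELETE FROM sessions WHERE last_access < ?');
            $stmt->execute([time() - $max_lifetime]);
            return $stmt->rowCount();
        }
    }

    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    session_set_save_handler(new DbSessionHandler($pdo), true);
    session_start();

Because every request now reads and writes one row keyed by the session id, the same session works no matter which web server in the farm handles the request.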
This is something I am really curious about, and I do not really understand how it is possible.
So let's say I am the owner of Facebook (ahah) and I have millions of people visiting my website every day, plus thousands and thousands of images, videos, logs, etc.
How do I store all this data?
Do I have multiple databases on different servers around the world, and connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example, I know that Facebook has a lot of data centers around the world and hundreds of servers.
How do they connect to those servers? Are profiles stored in different locations, and when I log into my profile, am I routed to that specific server? Or is there one main server backed by hundreds of other servers around the world?
Is there a way to use PHP so that I can connect to different servers and different MySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but one day I might end up working on a successful website, so I really want to know what I would have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view since their architecture is pretty well known.
The first thing you have to know is that you will have to distribute the workload of your web application. The question is how, so in order to determine what's going to be slow, you have to divide your app into segments.
First up is the HTTP server, the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP, but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache creates a thread per connected user, it requires a certain amount of memory for that operation. Eventually it will run out of memory, and then shit hits the fan: stuff stops working and your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled Apache and you have a cluster of machines that can take on new ones in order to scale out. You've solved your first problem.
The next part is the actual layer that does the work: it accepts input from the user and saves it somewhere (MySQL), and that's the biggest problem you'll have - why?
Due to the database.
Databases store their data on media such as hard drives. Hard drives, whether SSD or mechanical, are limited in how fast they can write or retrieve data. If I'm not mistaken, RAM operates at transfer rates of around 6 GB/s, and its seek time is also much, much lower than a hard drive's.
Therefore, if you have X users asking for a piece of information and you can only deliver it at a certain rate, your app crashes or becomes unresponsive, and the layer handling database queries slows down because the hardware cannot match the speed at which you need the data.
What are the options here? There are many; I won't mention all of them:
Split reads and writes. Set up your database layer so that you have dedicated machines that write the data and completely different ones that read it. You have to use replication, and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile - what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM; it's easily scalable and whatnot. Most important, it's damn quick! (See the sketch right after this list.)
Use different solutions alongside MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk writes (they keep a cached copy of the data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
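To make the caching option concrete, a minimal read-through cache in PHP might look like this. It assumes the Memcached extension and a hypothetical profiles table; the key names and the five-minute TTL are arbitrary.

    <?php
    // Sketch: check memcache first, fall back to MySQL, then cache the result.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    function getProfile(Memcached $cache, PDO $db, int $userId): array
    {
        $key = "profile:{$userId}";

        $profile = $cache->get($key);
        if ($profile !== false) {
            return $profile;              // cache hit - no database query at all
        }

        $stmt = $db->prepare('SELECT * FROM profiles WHERE user_id = ?');
        $stmt->execute([$userId]);
        $profile = $stmt->fetch(PDO::FETCH_ASSOC) ?: [];

        $cache->set($key, $profile, 300); // keep it for 5 minutes
        return $profile;
    }

Every profile view after the first one within those 5 minutes never touches MySQL, which is exactly the load you are trying to shed.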
The topic of MySQL vs "insert database or whatever here" is broad and I don't want to go into it, but remember - every single data store out there saves data to the hard drive eventually. The difference (a physical one, of course) is how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) to the output (the HTTP server): PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization at the PHP level is small; even Facebook got modest results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively; it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get comes from replacing Apache with their own server built into HipHop. Some people claim "language X is better than language Y", and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widespread but slow for certain operations (implementing a trie with over 1 billion entries, for example). It's great for things like echoing some HTML after parsing the output from the db. It's quick at inserting and retrieving data from the database, and that's about 90% of PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post reads more like an informative blog post; I know it's not filled with resources to back up my claims, but anyone who has done any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things I said probably won't sit well with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
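As an illustration of the application side of scale-out, a very crude read/write split in PHP might look like the sketch below. The host names are placeholders, and in a real setup you would also have to account for replication lag.

    <?php
    // Sketch: send writes to the master, spread reads across replicas.
    $writeDsn = 'mysql:host=db-master.internal;dbname=app';
    $readDsns = [
        'mysql:host=db-replica-1.internal;dbname=app',
        'mysql:host=db-replica-2.internal;dbname=app',
    ];

    function writeConnection(string $dsn): PDO
    {
        return new PDO($dsn, 'user', 'pass');
    }

    function readConnection(array $dsns): PDO
    {
        // Pick a replica at random - crude, but it spreads the read load.
        return new PDO($dsns[array_rand($dsns)], 'user', 'pass');
    }

    writeConnection($writeDsn)
        ->prepare('INSERT INTO posts (user_id, body) VALUES (?, ?)')
        ->execute([7, 'Hello world']);

    $posts = readConnection($readDsns)
        ->query('SELECT body FROM posts ORDER BY id DESC LIMIT 10')
        ->fetchAll(PDO::FETCH_COLUMN);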
When you have a website with that many users you'll already have enough experience to know the answer to the question; you'll also have a lot of money to pay people to find the optimal architecture for your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
For lots of users, you should have a server with plenty of memory and speed. Configure php.ini to allow more memory usage. A server serving that many users should have 4-12 GB available. Also, save resources by not running a desktop environment on the server. With this many users you might also want to consider a CDN and a database request queue.
I'm creating a web service that often scrapes data from remote web pages. After scraping this data, I have a simple multidimensional array of information to use. The scraping process is fairly taxing on my server, and the page load takes a while. I was considering adding a simple cache system using a MySQL database, where I create one row per remote web page with the array of information pulled from it stored as a JSON-encoded string. Is this a good enough system? Or would something like one text file per web page be a better idea?
Since you're scraping multiple web pages and you want your data to be persistently cached, you have a few options - the best of which would be to use memcache or a database such as MySQL. Using text files is not a good idea, because you would have to serialize/deserialize your data and read from your filesystem. Querying a database or memcache is many times more efficient.
Since you're probably looking for your cache to be somewhat persistent, I would suggest going with MySQL. You would simply create a table that has an auto-incrementing primary key, with a column for each element in your parsed JSON object. (Note that MySQL currently does not support arrays. In order to emulate them, you will need to use relational tables, or serialize your array data and store it in a text field. The former method is preferred.)
Every time you scrape a page, you would run an UPDATE statement (or an INSERT ... ON DUPLICATE KEY UPDATE, if the row might not exist yet) to refresh that page's information in the database. If you put a unique index on whatever uniquely identifies the page (its URL, etc.), you will get optimal look-up performance.
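For example, assuming a page_cache table with a unique index on the URL (my own schema, not something prescribed), the write and read sides might look roughly like this:

    <?php
    // Sketch: upsert the scraped data keyed by URL, then read it back.
    // Assumed table: CREATE TABLE page_cache (
    //     id INT AUTO_INCREMENT PRIMARY KEY,
    //     url VARCHAR(255) NOT NULL UNIQUE,
    //     payload TEXT NOT NULL,
    //     scraped_at DATETIME NOT NULL);
    $db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

    function saveScrape(PDO $db, string $url, array $data): void
    {
        $stmt = $db->prepare(
            'INSERT INTO page_cache (url, payload, scraped_at)
             VALUES (:url, :payload, NOW())
             ON DUPLICATE KEY UPDATE payload = VALUES(payload), scraped_at = NOW()');
        $stmt->execute(['url' => $url, 'payload' => json_encode($data)]);
    }

    function loadScrape(PDO $db, string $url): ?array
    {
        $stmt = $db->prepare('SELECT payload FROM page_cache WHERE url = ?');
        $stmt->execute([$url]);
        $payload = $stmt->fetchColumn();
        return $payload === false ? null : json_decode($payload, true);
    }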
If you're looking to store the cache locally on one server (e.g. if your MySQL server and HTTP server are on the same box), you might be better off using APC, an in-memory cache available as a PHP extension.
If you're looking to store the data remotely (e.g. on a dedicated cache box) then I would go with Memcache instead of MySQL.
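With APC (APCu on current PHP versions) the same get-or-compute pattern is just a couple of calls. A rough sketch, where scrapePage() stands in for your existing scraper:

    <?php
    // Sketch: cache the parsed scrape result in APCu on the web server itself.
    function cachedScrape(string $url, int $ttl = 3600): array
    {
        $key  = 'scrape:' . md5($url);
        $data = apcu_fetch($key, $hit);

        if (!$hit) {
            $data = scrapePage($url);     // your existing (expensive) scraper
            apcu_store($key, $data, $ttl);
        }
        return $data;
    }

Keep in mind that APCu is per-server and per-process-pool, so the cache is lost on a restart and is not shared across boxes.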
"When all you have is a hammer ..."
I don't tend to have particularly large APC configs, 64-128 MB max. Memcache can go to a couple of gigabytes or more (far more if you run multiple instances). Both are also transient - a restart of Apache, or of Memcache (the latter being slightly less likely or frequent), will lose the data.
It depends, then, on how often you are willing to process the data to produce the cache, and how long that cache would otherwise be useful for. If it stays good for weeks before you re-scrape the pages, MySQL is an entirely suitable backing store.
Potential other options, depending on how many items are being cached and how big the data is, are, as you suggest, a file-based cache, SQLite, or other systems.
Background
We are currently working on a strategy web-browser game based on PHP, HTML and JavaScript.
The plan is to have 10,000+ users playing within the same world.
Currently we are using memcached to:
store static JSON data and language files
store changeable serialized PHP class objects (such as armies, inventories, unit containers, buildings, etc.)
In the back we also have a MySQL server running and holding all the game data. When an object is loaded through our ObjectLoader, it loads in this order (sketched in code after the list):
checks a static hashmap in the script for the object
checks memcache to see if it has already been loaded there
otherwise loads it from the database and saves it into memcache and the static temp hashmap
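In rough PHP, that lookup cascade might look like the sketch below. Class, table and key names are made up for illustration, not your actual ObjectLoader code.

    <?php
    // Sketch of the described load order: static map -> memcache -> MySQL.
    class ObjectLoader
    {
        private static array $localMap = [];   // per-request static hashmap

        public function __construct(
            private Memcached $cache,
            private PDO $db,
        ) {}

        public function load(string $key): mixed
        {
            // 1. per-request static hashmap
            if (isset(self::$localMap[$key])) {
                return self::$localMap[$key];
            }

            // 2. memcache
            $obj = $this->cache->get($key);
            if ($obj === false) {
                // 3. database, then warm memcache for the next request
                $stmt = $this->db->prepare('SELECT data FROM game_objects WHERE id = ?');
                $stmt->execute([$key]);
                $row = $stmt->fetchColumn();
                $obj = $row === false ? null : unserialize($row);
                if ($obj !== null) {
                    $this->cache->set($key, $obj);
                }
            }

            return self::$localMap[$key] = $obj;
        }
    }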
We have built the whole game using a class/object-oriented approach where functionality is always implemented between objects. Because of this we think we have managed to get a nice structure, and with the help of memcached we have achieved good client-server request times when interacting with the game.
I'm aware that memcache is not synchronized and is not commonly used for holding a full game in memory. Right after a server startup, load times will be high while objects are loaded into memcache for the first time, but after the server has been online for a while and most loads come from memcache, load times will be much lower.
Currently we save changed objects into memcache and the database at the same time. Earlier we had the idea of saving objects to the db only after a certain time, or at intervals, but due to the risk of inconsistency if memcache or the server went down, we skipped it for now.
Client requests to the server often return an object's status in simple JSON format without changing the object, which is then represented visually in the browser with images and JavaScript. But from time to time, depending on when an object was last updated, the client updates it with new information (e.g. a build queue holding planned buildings has its time progress increased, and/or the array of planned queue items has changed).
Questions:
Do you see how this could work, or are we walking in blindness here?
Do you expect us to have a lot of inconsistency issues if someone loads and updates a memcache object while someone else does the same?
Is it even doable the way we have done it? It seems to be working fine atm, but so far we have only had 4 people online at the same time..
Is some other cache program a better fit for this class-object approach than memcached?
Are there any other tips you have for this situation?
UPDATE
Since it is simply a "normal webpage" (no applet, Flash, etc.), we are implementing the game so that the server is the only one holding a "real" game state.. the state of the different JavaScript objects on the client is more of an approximate version of the server's game state.
From time to time, and before certain important actions, the client's visual state is synchronized with the server's state (e.g. the client thinks it can afford a barracks and asks the server to build one; the server updates the current resources according to its own income data and then either builds the barracks or returns an error message, and then sends the current server state of resources and buildings back to the client)..
It is not a fast-paced game like a real-time strategy game. It's more of a quite slow game with 3-4 months of playtime, where buildings can take from one minute up to several days to complete.
Whether the inconsistency is actually a problem depends on how you are implementing your game. If the whole game is updated with definitive snapshots at the client end every so often, it shouldn't matter so much as the player will just be getting higher latency (and latency is unavoidable). If your game states aren't synchronised, two players' versions of the world could drift off in completely different directions after a while, which is what you want to avoid.
But are you sure you really need memcache? It just seems to add more complication and potential inconsistency to your solution. It might be a premature optimisation. You're already reliant on the database, what is performance like if you just use the database alone? If set up correctly, I'd expect your database server to be caching everything in RAM anyway.