We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement that the dashboard fetch data for different events (there are about 12 events), and each user's dashboard is updated based on those events. We expect the total number of users to be around 500k to 700k. On average about 20% of users would be online at any one time (at peak we expect 50% of users to be online).
The problem is that, per our current design, the event data will be stored in a MySQL database. I think running a few hundred thousand queries concurrently on MySQL wouldn't be a good idea even if we use Amazon RDS. So we are considering using DynamoDB (or Redis, or any other NoSQL option) alongside MySQL.
So the question is: would keeping the data in both MySQL and a NoSQL database give our web application the scalability we need? Or should we consider another solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache service that Amazon provides to give yourself in-memory caching. This will get rid of most of your database calls and in a sense remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates, then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events in the browser if possible and display them instead of making another request to the servers.
If it has to be real time, then look at ElastiCache and tune how many nodes you need to handle your estimated traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there. If you have some relational information and also some data with a variable schema, then you can use both kinds of database, but not to load balance between them.
I would also start looking for bottlenecks in your architecture and think about how well your application can scale if you reach your estimated numbers.
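To make the caching idea concrete, here is a minimal sketch of a cache that sits in front of the database and expires entries after a fixed time. It is written in Python only to keep the example short and runnable; `TTLCache`, `load_events_from_db` and the 30-second TTL are assumptions for illustration, and a real deployment would put Memcached or Redis (for example via ElastiCache) where the dictionary is.

```python
import time


class TTLCache:
    """Tiny in-process cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # still fresh: no database call
        value = loader(key)            # miss or expired: ask the database
        self._store[key] = (now + self.ttl, value)
        return value


db_calls = []


def load_events_from_db(user_id):
    # Hypothetical stand-in for the real MySQL query.
    db_calls.append(user_id)
    return ["event-for-%d" % user_id]


cache = TTLCache(ttl_seconds=30)
first = cache.get_or_load(42, load_events_from_db)
second = cache.get_or_load(42, load_events_from_db)  # served from the cache
```

The trade-off is the one described above: the longer the TTL, the fewer queries reach MySQL, and the staler the dashboard can be.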
I agree with @sean; there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So I'd say that unless you need to duplicate your data in both MySQL and a NoSQL DB for reasons other than scalability, go for a single DB with auto-scaling.
I currently have a MySQL database which deals a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such, that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on an event) for each of - the user, for the sub-affiliate and the affiliate. Currently I do it very simply - once I receive the event, I do sequential queries in PHP - I read the balance of user, increment by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real time, as possible. Other metrics on the sub-aff and affiliate level are less important, but the closer they are to real-time, the better, however I think 5 minute delay might be ok.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives for how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, so I wrap each cycle of changes to the click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database; maybe PostgreSQL would be better suited for the job? I tried to find out whether there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like hadoop with parquet (or Apache Kudu?) and just add more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize the table to do mass UPDATEs of the counters. When there is a burst of traffic, it becomes more efficient, thereby not keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
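A rough sketch of the summarize-then-update idea, in Python for brevity. The tuple layout `(user_id, sub_affiliate_id, affiliate_id)` and the dictionary standing in for the balances table are assumptions for illustration; in MySQL the flip would be a table swap and each delta would become one `UPDATE ... SET clicks = clicks + delta`.

```python
from collections import Counter


class PingPong:
    """Writers append to `active`; the processor flips and drains a batch."""

    def __init__(self):
        self.active = []

    def flip(self):
        batch, self.active = self.active, []
        return batch


def summarize(raw_events):
    """Collapse raw click rows into one delta per (level, id)."""
    deltas = Counter()
    for user_id, sub_aff_id, aff_id in raw_events:
        deltas[("user", user_id)] += 1
        deltas[("sub_affiliate", sub_aff_id)] += 1
        deltas[("affiliate", aff_id)] += 1
    return deltas


def apply_deltas(balances, deltas):
    # One write per distinct key, no matter how many raw events arrived.
    for key, delta in deltas.items():
        balances[key] = balances.get(key, 0) + delta


staging = PingPong()
staging.active.extend([(1, 10, 100), (2, 10, 100), (1, 10, 100)])

balances = {}
deltas = summarize(staging.flip())
apply_deltas(balances, deltas)
```

Here three events would have cost nine read-increment-write cycles, and the batch needs four updates. The bigger the burst, the better that ratio gets, which is why this approach becomes more efficient under load.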
If I were you, I would use Redis in-memory storage and increment your metrics there. It's very fast and reliable. You can also read from this DB. Also create a cron job that saves the data into the MySQL DB.
Is your web tier doing the number crunching as it receives & processes the HTTP request? If so, the very first thing you will want to do is move this to work queue and process these events asynchronously. I believe you hint at this in your Item 3.
There are many solutions, and choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
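Whichever package you pick, the shape is the same. The sketch below uses Python's standard-library `queue` and a thread purely as a stand-in for a real broker and worker process; `handle_request` and `worker` are hypothetical names.

```python
import queue
import threading

events = queue.Queue()
processed = []


def worker():
    """Background consumer: does the slow work outside the request cycle."""
    while True:
        event = events.get()
        if event is None:          # sentinel: shut down
            events.task_done()
            break
        processed.append(event["user_id"])  # stand-in for the balance updates
        events.task_done()


def handle_request(user_id):
    """What the web tier does: enqueue the event and return immediately."""
    events.put({"user_id": user_id})
    return "202 Accepted"


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

responses = [handle_request(uid) for uid in (1, 2, 3)]

events.put(None)   # stop the worker once the backlog is drained
events.join()
consumer.join()
```

With a durable broker in place of the in-memory queue, events survive a worker crash, which matters if you cannot afford to lose a single one.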
In terms of storage it really depends on what you're trying to achieve, fast reads, fast writes, bulk reads, sharding/distribution, high-availability... the answer to each points you in different directions
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you. And it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.
I'm currently designing and developing a web application that has the potential to grow very large at a fast rate. I will give some general information and move on to my question(s). I would say I am a mid-level web programmer.
Here are some specifications:
MySQL - Database Backend
PHP - Used in front/backend. Also used for SOAP Client
HTML, CSS, JS, jQuery - Front end widgets (highcharts, datatables, jquery-ui, etc.)
I can't get into too many fine details as it is a company project, but the main objective is to construct a dashboard that thousands of users will be accessing from various devices.
The data for this project is projected to grow by 50,000 items per year ( ~1000 items per week ).
1 item = 1 row in database
An item will also record a daily history starting on the day it was inserted.
1 day of history per item = 1 record
365 records per 1 year per device
365 * 50,000 = 18,250,000 [first year]
multiply 18,250,000 records by x for each year after.
(My formula is a bit off since items will be added periodically throughout that year)
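As a rough correction for items arriving gradually, the estimate can be worked out as below. The assumption (mine, not stated above) is that the 50,000 items arrive evenly over the year, so an item added on day d only logs the remaining days of that year; 365 * 50,000 = 18,250,000 is then an upper bound for year one, not the expected figure.

```python
ITEMS_PER_YEAR = 50_000
DAYS = 365


def history_rows_in_year(year):
    """History rows written during `year` (1-based), items arriving evenly."""
    # Items from earlier years log one row on every day of this year.
    carried_over = (year - 1) * ITEMS_PER_YEAR * DAYS
    # Items added this year log (DAYS + 1) / 2 rows each on average.
    new_items = ITEMS_PER_YEAR * (DAYS + 1) // 2
    return carried_over + new_items


def total_history_rows(years):
    """Cumulative history rows after the given number of years."""
    return sum(history_rows_in_year(y) for y in range(1, years + 1))


upper_bound_year_one = DAYS * ITEMS_PER_YEAR   # 18,250,000
year_one = history_rows_in_year(1)             # 9,150,000
after_three_years = total_history_rows(3)      # 82,200,000
after_four_years = total_history_rows(4)       # 146,100,000
```

Under these assumptions the table passes 100 million history rows during the fourth year.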
All items and history are accessed through a SOAP Client that connects to an API service, then writes the record to the database.
Majority of this data will be read and remain static (read only). But some item data may be updated or changed. The data will also be updated each day and need to write another x amount of history.
Questions:
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
3) Are there any other requirements I should be thinking about?
The difficulty involved in scaling is almost always a function of users times data. If you have a lot of users, but not much data, it's not hard to scale. A typical example is a popular blog. Likewise, if you have a lot of data but not very many users, you're also going to be fine. This represents things like accounting systems or data-warehouse situations.
The first step towards any solution is to rough in a schema and test it at scale. You will have no idea how your application is going to perform until you run it through the paces. No two applications ever have exactly the same problems. Most of the time you'll need to adjust your schema, de-normalize some data, or cache things more aggressively, but these are just techniques and there's no standard cookbook for scaling.
In your specific case you won't have many problems if the rate of INSERT activity is low and your indexes aren't too complicated. What you'll probably end up doing is splitting out those hundreds of millions of rows into several identical tables each with a much smaller set of records in them.
If you're having trouble getting your queries to execute, consider the standard approach: index, optimize, then denormalize, then cache.
Where PHP can't cut it, consider using something like Python, Ruby, Java/Scala or even NodeJS to help facilitate your database calls. If you're writing a SOAP interface, you have many options.
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
Absolutely. Make sure you've got everything indexed properly, and if you hit a storage or query-per-second limit, you've got plenty of options that apply to most/all DBMS's. You can get beefier hardware, start sharding data across servers, clustering, etc..
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
PHP 5+ allows you to execute multiple requests in parallel with cURL. Refer to the curl_multi_* functions for this, such as curl_multi_exec(). As far as I know, this requires you to handle the SOAP/XML processing separately from the requests.
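The general pattern (issue all the requests, then collect the responses) looks like this. The sketch is in Python with a thread pool, not PHP's curl_multi_* functions, and `fetch_item` is a hypothetical stand-in that sleeps instead of making a real SOAP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_item(item_id):
    # Hypothetical stand-in for one blocking SOAP request.
    time.sleep(0.05)
    return {"id": item_id, "status": "ok"}


def fetch_all(item_ids, max_workers=8):
    """Run the blocking calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_item, item_ids))


results = fetch_all(range(1, 9))
```

Eight calls that each wait 50 ms take roughly the time of one call, because the waiting overlaps. Parsing the responses is a separate step afterwards, as noted above.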
3) Are there any other requirements I should be thinking about?
Probably. But, you're usually on the right track if you start with a properly indexed, normalized database, for which you've thought about your objects at least mostly correctly. Start denormalizing if/when you find instances wherein denormalization solves an existing or obvious near-future efficiency problem. But, don't optimize for things that could become problems if the moons of Saturn align. Only optimize for problems that users will notice somewhat regularly.
When talking about a large-scale app, the effort and credit should not go to the database alone. It is the core part, since data is the main thing in any web application, but your application also depends on code optimization, which includes your backend and frontend scripts, your images, and above all the server. There are many factors that affect the application.
We have a webapp that caters to hundreds of simultaneously logged in users (about 10K-30K users at any given time). The app collects analytics, specifically on certain user actions that may occur a few times a second.
So far our app design has been pretty decoupled (a lot of memcache/redis with delayed DB writes) and we avoided locks pretty well to make sure nothing is "centralized".
Management finally decided to build a real-time analytics panel that should aggregate these actions in global counters (down to 1-second granularity). What's the best way to have these "global" counters? We could increment some memcache key, but we have a cluster of memcaches (EC2), so iterating over all of them to count up the keys would delay this metric.
The DB is out of the question since we were bottlenecking a lot in that regard, so all DB writes are delayed through a message queue (beanstalkd).
Any tips would be highly appreciated.
This would appear suited to a NoSQL dump of the actions, with periodic aggregation. And being on EC2, you're in the right place to have access to the tools you need.
You could avoid your existing webserver infrastructure entirely by setting up a secondary webserver to record all the actions, pumping into a separate database server. Or if not appropriate, share the webserver but still offload to a separate NoSQL server.
Then, if "real time" can be delayed by a small period (seconds or a few minutes), you can have a sweeper function that aggregates the NoSQL table into a format that better suits the analytics system, pumps it into your "live" database, and clears out the NoSQL data that has been processed.
Alternatively, you may be able to get your stats directly from the NoSQL?
NoSQL may be as fast as using Memcached (various benchmarks report various results, depending on who wrote the report), but it'll certainly be faster at pulling the data together when you need to aggregate.
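A minimal sketch of such a sweeper, in Python, assuming the raw dump is a list of `(timestamp, action)` records (the record layout and the name `sweep` are illustrative, not taken from any particular NoSQL store):

```python
from collections import Counter


def sweep(raw_actions):
    """Aggregate raw (timestamp, action) records into per-second counters
    and clear the records that were processed."""
    per_second = Counter()
    for ts, action in raw_actions:
        per_second[(int(ts), action)] += 1   # truncate to 1-second buckets
    raw_actions.clear()
    return per_second


raw = [
    (1000.1, "click"),
    (1000.7, "click"),
    (1000.9, "view"),
    (1001.2, "click"),
]
counters = sweep(raw)
```

Each run produces one row per second and action type for the analytics panel, so the "live" database sees a handful of writes per second however many actions came in.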
This is something I am really curious about and I do not really understand how is that possible.
So let's say I am the owner of Facebook (ahah) and I have millions of people visiting my website every day, thousands and thousands of images, videos, logs etc..
How do I store all this data?
Do I have more databases in different servers around the world and then I connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example I know that Facebook has a lot of data centers around the world and hundreds of servers..
How do they connect to these servers? Are the profiles stored in different locations and when I connect to my profile, I will then be using that specific server? Or is there one main server that has the support of other hundreds of servers around the world?
Is there a way to use PHP in a way that I will connect to different servers and to different mySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I may one day work on a successful website, I really want to know what I will have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question but not from Facebook point of view since their architecture is pretty much known.
First thing you have to know is that you would have to distribute the workload of your web application. Question is how, so in order to determine what's going to be slow, you have to divide your app in segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache dedicates a process or thread to each connected user (depending on how it is configured), it requires a certain amount of memory for each connection. Eventually, it will run out of memory and then shit hits the fan, stuff stops working, your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new computers in order to scale-out. You solved your first problem.
Next part is the actual layer that does the work. Accepts input from the user and saves it somewhere (MySQL) and that's the biggest problem you'll have - why?
Due to the database.
Databases store their data on media such as hard drives. Hard drives, be it an SSD or a mechanical one, are limited by how fast they can write or retrieve data. By comparison, if I'm not mistaken, RAM operates at a transfer rate of around 6GB/sec. Not to mention that its seek time is also much, much lower than an HDD's.
Therefore, if you have an X amount of users asking for a piece of information and you can only deliver it at a certain rate - your app crashes, or it becomes unresponsive and the layer handling database queries becomes slow since the hardware cannot match the speed at which you need the data.
What are the options here? There are many, I won't mention all of them
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile, what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM, it's easily scalable and what not. Most important, it's damn quick!
Use different solutions next to MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk write (they keep cached copy of data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
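To illustrate the sharding option from the list above, here is a minimal routing sketch in Python. The host names and the choice of CRC32 modulo the shard count are assumptions for illustration; real setups often use a lookup table or consistent hashing so that shards can be added without moving most of the data.

```python
import zlib

# Hypothetical shard hosts.
SHARDS = [
    "db0.example.internal",
    "db1.example.internal",
    "db2.example.internal",
]


def shard_for(user_id):
    """Map a user to one shard. CRC32 gives the same answer in every process."""
    return SHARDS[zlib.crc32(str(user_id).encode()) % len(SHARDS)]


def group_by_shard(user_ids):
    """A query touching many users must visit every shard that owns one."""
    groups = {}
    for uid in user_ids:
        groups.setdefault(shard_for(uid), []).append(uid)
    return groups


home = shard_for(42)
fan_out = group_by_shard(range(1, 11))
```

A single-user lookup goes to exactly one server, which is why sharding helps read and write performance. `group_by_shard` shows the downside mentioned above: a query over many users turns into one query per shard plus a merge step.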
Topic about MySQL vs "insert database or whatever here" is broad, I don't want to go into that but remember - every single one of data stores out there saves data on the hard drive eventually. The difference (physical of course) is how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization at the PHP level is small; even Facebook got small results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively, and it is not as fast as it claims to be. Naturally, the Facebook guys mentioned that already, so it's no wonder. The advantage they get is because they replaced Apache with their own server built into HipHop. Some people claim "language X is better than language Y" and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widely-spread but it's slow for certain operations (implementing a Trie with over 1 billion entries for example). It's great for things like echo some HTML after parsing the output from the db. It's quick to insert and retrieve data from the database, and that's about 90% of the PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post is more like an informative blog post, I know it's not filled with resources to back up my claims but anyone who did any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things that I said probably won't fit with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
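Round robin DNS can be pictured with a toy resolver like the one below. This is only a model of the behaviour in Python, with documentation-range IP addresses as placeholders; a real DNS server usually returns the whole address list in rotated order, and client-side caching makes the distribution less even than the model suggests.

```python
from itertools import cycle


class RoundRobinRecord:
    """Toy model of one DNS name backed by several web server addresses."""

    def __init__(self, addresses):
        self._next = cycle(addresses)

    def resolve(self):
        # Each lookup gets the next address in turn.
        return next(self._next)


record = RoundRobinRecord(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
answers = [record.resolve() for _ in range(5)]
```

After the third lookup the rotation wraps around to the first server again.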
When you have a website with so many users you'll already have enough experience to know the answer of the question, you'll also have a lot of money to pay people to find the optimal architecture of your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.
For lots of users, you should have a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with lots of users should have 4-12GB available. Also, save resources by closing the desktop environment. If you have this many users, you might want to consider a CDN and also make a database request queue.
Replication
I have an app that Is polling data from a large number of data feeds. It processes thousands of records per day and this number is ever increasing. The data is stored in Mysql.
I then have a website that utilises this data.
I'm trying to build my environment with future in mind.
I thought of MySQL replication so that the website can use its own database on a different server and not get bogged down by the thousands of write commands that are happening on the main database.
I am having difficulty getting this set up, despite MySQL reporting it's all working fine.
I then started to think: is there not a better way?
From what I understand, MySQL sends the same write commands to the slave database as it runs on the master.
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer thousands of writes
I am a one man band, doing this venture with my own money so I need to do this a cheapest way. I am getting a bit lost !
I have a dedicated server,
A vps
Using Php5, mysql 5 in a lamp stack.
I cannot begin to tell you how much I would appreciate some guidance!
If the slaves are a 1:1 clone of the master, then all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
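The same back-of-the-envelope arithmetic, as a small Python sketch you can rerun with your own volumes (it assumes the writes are spread evenly over the day, which real traffic is not):

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400


def seconds_per_record(records_per_day):
    """Average time budget available for each write."""
    return SECONDS_PER_DAY / records_per_day


def writes_per_second(records_per_day):
    """Average sustained write rate."""
    return records_per_day / SECONDS_PER_DAY


budget_at_5k = seconds_per_record(5_000)      # 17.28 seconds per record
rate_at_5m = writes_per_second(5_000_000)     # about 57.9 writes per second
```

Even at five million records a day the average is only a few dozen writes per second, so it is the peaks, not the daily total, that decide when writes become a bottleneck.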
If you were doing millions of records a day, THEN you'd have a write bottleneck.
I would split this in three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. This layer has a temporary queue that serves also as a temporary storage, a buffer to allow all data feed to post its data. I'd use a Message Queue System. It's fast and reliable.
Data Store layer. This layer reads from the queue, maybe processes someway the data read, and stores the data in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically does ETL (extract, transform and load) data from the Data Store layer to this secondary database.
This layered approach allows you to isolate concerns (speed, reliability, security) and implementation details, and it allows for future scalability.
Replication is literally what the word suggests: replicating queries on another machine.
MySQL creates a log that's filled with queries that were used to create the dataset on the original machine (master) and sends it to the slave(s) that read the log and re-execute those queries.
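The log-and-replay mechanism can be mimicked in a few lines. This sketch uses two in-memory SQLite databases from Python's standard library as stand-ins for the master and the slave, and a plain list as the log; it models statement-based replication only (MySQL can also log row changes), and the names `binlog`, `execute_on_master` and `replay_on_slave` are made up for the example.

```python
import sqlite3

master = sqlite3.connect(":memory:")
slave = sqlite3.connect(":memory:")
binlog = []   # write statements in the order the master committed them


def execute_on_master(sql, params=()):
    """Run a write on the master and record it in the log."""
    master.execute(sql, params)
    master.commit()
    binlog.append((sql, params))


def replay_on_slave(position):
    """Re-execute everything after `position`; return the new position."""
    for sql, params in binlog[position:]:
        slave.execute(sql, params)
    slave.commit()
    return len(binlog)


execute_on_master("CREATE TABLE feed (id INTEGER PRIMARY KEY, title TEXT)")
execute_on_master("INSERT INTO feed (title) VALUES (?)", ("first",))
execute_on_master("INSERT INTO feed (title) VALUES (?)", ("second",))

position = replay_on_slave(0)
slave_rows = slave.execute("SELECT id, title FROM feed ORDER BY id").fetchall()
```

Every write the master executed is executed again on the slave, so the slave does carry the full write load. What it is spared is the feed processing itself, and it can serve the website's reads.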
Basically, what you want is to increase your write throughput. That's achievable through using different storage engines; TokuDB is one of them (it isn't free, but you are allowed to store 50GB of user data for free).
What you want (for the moment) is a fast HDD subsystem more than a monolithic write-scalable storage system. InnoDB is capable of handling a lot of queries per second on a properly configured machine with sufficient hardware. I am not sure about pricing, but an SSD and 4-8 gigs of RAM shouldn't be that expensive. As Marc B. said, until you reach millions of records per day, you don't have to worry about scaling reads and writes through replication.
You say you have an app "polling" your data from data feeds. Does that mean you are doing full-text searches? I'm making an assumption here that you are batch processing data feeds and then querying the result. If that is the case, I'd offload all your full-text queries to something like Solr. It actually isn't too time consuming to set up, and depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated server. Best of all, the difference in search speed is incredible. I've had full-text MySQL queries that would take 20 minutes to run be done in Solr in under a second.
Just make sure you use a try statement in the event your solr instance goes down.
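The shape of that guard, sketched in Python with stand-in functions, since the actual Solr client and its exception types depend on the library you use (`SearchUnavailable`, `solr_down` and `mysql_fulltext` are all hypothetical):

```python
class SearchUnavailable(Exception):
    """Raised by the (hypothetical) Solr client wrapper when Solr is down."""


def search(query, solr_search, mysql_search):
    """Prefer Solr; fall back to the slower MySQL query if Solr is down."""
    try:
        return solr_search(query)
    except SearchUnavailable:
        return mysql_search(query)


def solr_down(query):
    # Stand-in that simulates an outage.
    raise SearchUnavailable("connection refused")


def mysql_fulltext(query):
    # Stand-in for a slower full-text query against MySQL.
    return ["mysql:" + query]


fallback_hits = search("feeds", solr_down, mysql_fulltext)
normal_hits = search("feeds", lambda q: ["solr:" + q], mysql_fulltext)
```

Catching only the "service unavailable" error keeps genuine bugs, such as a malformed query, from being silently hidden by the fallback.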