Mysql replication - is it worth it? - php

Replication
I have an app that Is polling data from a large number of data feeds. It processes thousands of records per day and this number is ever increasing. The data is stored in Mysql. 
I then have a website that utilises this data.
I'm trying to build my environment with future in mind. 
 I thought of mysql replication so that the website can use it's own database on a different server and get bogged down by the thousands of write commands that are happening on the main database. 
I am having difficulty getting this setup, despite mysql reporting it's all working fine. 
I then started think - is there not a better way ?
From what I understand mysql sends the write command to the slave database as the master. 
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer thousands of writes 
I am a one man band, doing this venture with my own money so I need to do this a cheapest way. I am getting a bit lost !
I have a dedicated server,
A vps
Using Php5, mysql 5 in a lamp stack.
I cannot begin to tell you how much I would appreciate some guidance!

If the slaves are a 1:1 clone of the master, than all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
If you were doing millions of records a day, THEN you'd have a write bottleneck.

I would split this in three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. This layer has a temporary queue that serves also as a temporary storage, a buffer to allow all data feed to post its data. I'd use a Message Queue System. It's fast and reliable.
Data Store layer. This layer reads from the queue, maybe processes someway the data read, and stores the data in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically does ETL (extract, transform and load) data from the Data Store layer to this secondary database.
This layeread approach allows you isolate concerns (speed, reliability, security) and implementation details; and allows for future scalability.

Replication is literally what the word suggest - replicating queries on another machine.
MySQL creates a log that's filled with queries that were used to create the dataset on the original machine (master) and sends it to the slave(s) that read the log and re-execute those queries.
Basically, what you want is to increase your write ratio. That's achievable trough using different engines, for example TokuDB is one of them (however it isn't free, but you are allowed to store 50gb of user data for free and use it).
What you want (for the moment) is fast HDD subsystem more than a monolithic write-scalable storage system. InnoDB is capable of achieving a lot of queries per second on properly configured machine with sufficient hardware. I am not sure about pricing, but SSD and 4-8 gigs of ram shouldn't be that expensive. As Marc. B said - until you reach millions of records per day, you don't have to worry about scaling reads and writes trough replication.

You say you have an app "polling" your data from datafeeds. Does that mean you are doing full text searches? I'm making an assumption here in that you are batch processing date feeds and then querying that. If that is the case I'd offload all your fulltext queries to something like Solr. It actually isn't too time consuming to setup, depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated, and best yet the difference is search speed is incredible. I've had full text mysql queries that would take 20 minutes to run be done in solr in under a second.
Just make sure you use a try statement in the event your solr instance goes down.

Related

Best practice for high-volume transactions with real time balance updates

I currently have a MySQL database which deals a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such, that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on an event) for each of - the user, for the sub-affiliate and the affiliate. Currently I do it very simply - once I receive the event, I do sequential queries in PHP - I read the balance of user, increment by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real time, as possible. Other metrics on the sub-aff and affiliate level are less important, but the closer they are to real-time, the better, however I think 5 minute delay might be ok.
As the project grows, it is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million of events per day. It is also important for me not to lose a single event and I actually wrap each cycle of changes to click balances in an sql transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database, maybe Postgress will be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like hadoop with parquet (or Apache Kudu?) and just add more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize the table to do mass UPDATEs of the counters. When there is a burst of traffic, it become more efficient, thereby not keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
If I were you, I would implement Redis in-memory storage, and increase there your metrics. It's very fast and reliable. You can also read from this DB. Create also cron job, which will save those data into MySQL DB.
Is your web tier doing the number crunching as it receives & processes the HTTP request? If so, the very first thing you will want to do is move this to work queue and process these events asynchronously. I believe you hint at this in your Item 3.
There are many solutions and the scope of choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage it really depends on what you're trying to achieve, fast reads, fast writes, bulk reads, sharding/distribution, high-availability... the answer to each points you in different directions
This sounds like an excellent candidate for Clustrix which is a drop in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you. And it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OTLP DB that happens to be able to absorb mad amounts of analytical processing on real time data as well.

Designing a global action counter in high traffic web development

We have a webapp that caters to hundreds of simultaneously logged in users (about 10K-30K users at any given time). The app collects analytics, specifically on certain user actions that may occur a few times a second.
So far our app design has been pretty decoupled (a lot of memcache/redis with delayed DB writes) and we avoided locks pretty well to make sure nothing is "centralized".
Management finally decided to build a real time analytics panel that should aggregate these actions in global counters (down to 1 second granularity). Whats the best way to have these "global" counters? We could increment some memcache key but we have a cluster of memcaches (EC2) so iterating over all of them to count up the keys would delay this metric.
DB is out of the question since we were bottlenecking alot in that regard so all DB writes are delayed thru a message queue (beanstalkd)
Any tips would be highly appreciated.
This would appear suited to a NoSQL dump of the actions, with periodic agregation. And being on EC2, you're in the right place to have access to the tools you need.
You could avoid your existing webserver infrastructure entirely by setting up a secondary webserver to record all the actions, pumping into a separate database server. Or if not appropriate, share the webserver but still offload to a separate NoSQL server.
Then, if "real time" can be delayed by a small period (seconds or a few minutes), you can have a sweeper function that agregates the NoSQL table into a format that more suits the analytics system, and pumps into your "live" database and clears out NoSQL data that has been processed.
Alternatively, you may be able to get your stats directly from the NoSQL?
NoSQL may be as fast as using Memcached (various benchmarks report various results, depending on who wrote the report) but it'll certainly be faster in pulling the data together when you need to agregate.

Scalable web application

We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement where in dashboard the application will fetch data for different events (there are about 12 events) on which this dashboard for user will be updated. We expect the total no of users to be around 500k to 700k. While at one time on average about 20% users would be online (for peak time we expect 50% users to be online).
So the problem is the event data as per our current design will be placed in a MySQL database. I think running a few hundred thousands queries concurrently on MySQL wouldn't be a good idea even if we use Amazon RDS. So we are considering to use both DynamoDB (or Redis or any NoSQL db option) along with MySQL.
So the question is: Having data both in MySQL and any NoSQL database would give us this benefit to have this power of scalability for our web application? Or we should consider any other solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache that amazon provides to give your self in memory caching. This will get rid of your database calls and in a sense remove that bottleneck, but this can be very expensive. If you can sacrifice rela time updates then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events if possible on the browser and display them instead of making another request to the servers.
If it has to be real time then look at the ElastiCache and then tweak with the scaling of how many of them you require to handle your estimated amount of traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there, IE you have some relational information that you need and then also have a variable schema system then you can use both databases, but not to load balance them together.
I would also start to think of some bottle necks in your architecture and think of how well your application will/can scale in the event that you reach your estimated numbers.
I agree with #sean, there’s no need to duplicate the database. Have you thought about a something with auto-scalability, like Xeround. A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don’t have to commit to a larger, more expansive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So, I’d say that unless you need to duplicate your data on both MySQL and NoSQL DB’s for reasons other than scalability-related issues, go for a single DB with auto-scaling.

SQL Insert at 15 minute intervals, big MySQL table

I'm fairly familiar with most aspects of web development and I consider myself a junior level programmer. I'm always anxious when I think about application scaling and would like to learn a little more about it. Let's have a hypothetical situation.
I'm working on a web application that polls a device and fetches about 2kb of XML data at 15 minute intervals. This data must be stored for A Very Long Time (at least a couple years?). Now imagine that this web application has 100 users that each have this device.
After 10 years we're talking tens of millions of table rows. With 100 users we have a cron job that is querying each users device, getting 2kb of XML, and inserting it into the SQL database every 15 minutes.
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
Inserting doesn't generally get slower as a table gets larger, but index updates may take longer. At some point you may want to split the table into two parts. One for archival storage, optimized for data retrieval (basically index the heck out of it), and a second table to handle the newer data, optimized more for insertion (fewer indexes).
But as always, the only way to tell for sure is to benchmark things. Set up some cloned tables with a few thousand rows, and some with multi-millions of rows, and see what happens.
You could always consider using partitioning to automagically split your data files by date, and age older records off to an slower, high-capacity disk array while keeping the newer records (and the INSERTs) on a high-speed array. Then, your index builds will only have to work on a subset of the data rather than the whole deal, and should go quickly (disk I/O is typically the slowest part of a database system).
Assuming my queries are relatively simple, only collecting the columns
necessary, using joins, and avoiding subqueries, is there any reason
this should not scale?
When you get large you should put you active dataset in a in-memory database(faster than disc) just like Facebook, Twitter, etc do. Twitter became very slow when they did not put active dataset in memory/scale up => A lot of people called this fail whale. Both use memcached for this, but you could also use Redis(I like this) or APC if you are just a single box. You should always install APC if want performance because APC is used for caching the compiled bytecode.
Most PHP accelerators work by caching the compiled bytecode of PHP
scripts to avoid the overhead of parsing and compiling source code on
each request (some or all of which may never even be executed). To
further improve performance, the cached code is stored in shared
memory and directly executed from there, minimizing the amount of slow
disk reads and memory copying at runtime.

Best practice to record large amount of hits into MySQL database

Well, this is the thing. Let's say that my future PHP CMS need to drive 500k visitors daily and I need to record them all in MySQL database (referrer, ip address, time etc.). This way I need to insert 300-500 rows per minute and update 50 more. The main problem is that script would call database every time I want to insert new row, which is every time someone hits a page.
My question, is there any way to locally cache incoming hits first (and what is the best solution for that apc, csv...?) and periodically send them to database every 10 minutes for example? Is this good solution and what is the best practice for this situation?
500k daily it's just 5-7 queries per second. If each request will be served for 0.2 sec, then you will have almost 0 simultaneous queries, so there is nothing to worry about.
Even if you will have 5 times more users - all should work fine.
You can just use INSERT DELAYED and tune your mysql.
About tuning: http://www.day32.com/MySQL/ - there is very useful script (will change nothing, just show you the tips how to optimize settings).
You can use memcache or APC to write log there first, but with using INSERT DELAYED MySQL will do almost same work, and will do it better :)
Do not use files for this. DB will serve locks much better, than PHP. It's not so trivial to write effective mutexes, so let DB (or memcache, APC) do this work.
A frequently used solution:
You could implement an counter in memcached which you increment on an visit, and push an update to the database for every 100 (or 1000) hits.
We do this by storing locally on each server to CSV, then having a minutely cron job to push the entries into the database. This is to avoid needing a highly available MySQL database more than anything - the database should be able to cope with that volume of inserts without a problem.
Save them to a directory-based database (or flat file, depends) somewhere and at a certain time, use a PHP code to insert/update them into your MySQL database. Your php code can be executed periodically using Cron, so check if your server has Cron so that you can set the schedule for that, say every 10 minutes.
Have a look at this page: http://damonparker.org/blog/2006/05/10/php-cron-script-to-run-automated-jobs/. Some codes have been written in the cloud and are ready for you to use :)
One way would be to use Apache access.log. You can get a quite fine logging by using cronolog utility with apache . Cronolog will handle the storage of a very big number of rows in files, and can rotate it based on volume day, year, etc. Using this utility will prevent your Apache from suffering of log writes.
Then as said by others, use a cron-based job to analyse these log and push whatever summarized or raw data you want in MySQL.
You may think of using a dedicated database (or even database server) for write-intensive jobs, with specific settings. For example you may not need InnoDB storage and keep a simple MyIsam. And you could even think of another database storage (as said by #Riccardo Galli)
If you absolutely HAVE to log directly to MySQL, consider using two databases. One optimized for quick inserts, which means no keys other than possibly an auto_increment primary key. And another with keys on everything you'd be querying for, optimized for fast searches. A timed job would copy hits from the insert-only to the read-only database on a regular basis, and you end up with the best of both worlds. The only drawback is that your available statistics will only be as fresh as the previous "copy" run.
I have also previously seen a system which records the data into a flat file on the local disc on each web server (be careful to do only atomic appends if using multiple proceses), and periodically asynchronously write them into the database using a daemon process or cron job.
This appears to be the prevailing optimium solution; your web app remains available if the audit database is down and users don't suffer poor performance if the database is slow for any reason.
The only thing I can say, is be sure that you have monitoring on these locally-generated files - a build-up definitely indicates a problem and your Ops engineers might not otherwise notice.
For an high number of write operations and this kind of data you might find more suitable mongodb or couchdb
Because INSERT DELAYED is only supported by MyISAM, it is not an option for many users.
We use MySQL Proxy to defer the execution of queries matching a certain signature.
This will require a custom Lua script; example scripts are here, and some tutorials are here.
The script will implement a Queue data structure for storage of query strings, and pattern matching to determine what queries to defer. Once the queue reaches a certain size, or a certain amount of time has elapsed, or whatever event X occurs, the query queue is emptied as each query is sent to the server.
you can use a Queue strategy using beanstalk or IronQ

Categories