Background
I run (read: inherited) a network that is set up very similarly to a shared hosting provider's. There are between 300 and 400 sites running on the infrastructure. Over the years the database topology has become extremely fragmented, in that it's a one-to-one relationship from web server to database.
Problems
Nine times out of ten, the applications were built by third-party design firms that have implemented WordPress/Joomla/Drupal etc.
The databases are spread somewhat haphazardly across 6 database servers. They are not replicated anywhere.
The applications have no concept of separate database handles, i.e. sending an INSERT to a master and a SELECT to a slave.
Using single-master built-in MySQL replication creates a huge bottleneck: the volume of inserts would bring the master DB down very quickly.
Question
My question becomes, how can I make my database topology as flat as possible while leaving room for future scalability?
In the future I'd like to add more geographic locations to my network that can replicate the same databases across a 'backnet'.
In the past I've looked into multi-master replication but saw a lot of issues with things like auto_increment column collisions.
I'm open to enterprise solutions. Something similar to the Shareplex product for Oracle replication.
Whatever the solution is, it's not reasonable to expect the applications to change to accommodate this new design. So things like auto_increment columns need to keep working as they do now and stay consistent across the entire cluster.
Goal
My goal is to have an internally load-balanced hostname for each cluster that I can point all the applications at.
This would also afford me fault tolerance that I don't presently have. At present, removing a database from rotation is not possible.
Applications like Cassandra and Hadoop look amazingly similar to what I want to achieve but NoSQL isn't an option for these applications.
Any tips/pointers/tutorials/documentation/product recommendations are greatly appreciated. Thank you.
In the past I've looked into multi-master replication but saw a lot of issues with things like auto_increment column collisions.
We use multi-master in production at work. The auto-increment conundrum was fixed a while ago with auto_increment_increment and auto_increment_offset, which allow each server to have its own pattern of increment IDs. As long as the application isn't designed blindly assuming that all the IDs will be sequential, it should work fine.
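As a minimal sketch (placeholder hostnames and credentials), the two variables can even be set at runtime; in practice you'd also put the same values in each server's my.cnf so they persist across restarts:

```php
<?php
// Give each of two masters a non-overlapping auto-increment pattern:
// master A generates 1, 3, 5, ... and master B generates 2, 4, 6, ...
// Hostnames and credentials below are placeholders.
$masters = [
    'master-a.example.com' => 1,  // auto_increment_offset for this server
    'master-b.example.com' => 2,
];

foreach ($masters as $host => $offset) {
    $pdo = new PDO("mysql:host=$host", 'admin_user', 'admin_pass');
    // Both servers skip by 2, so their ID sequences interleave without colliding.
    $pdo->exec('SET GLOBAL auto_increment_increment = 2');
    $pdo->exec("SET GLOBAL auto_increment_offset = $offset");
}
```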
The real problem with multi-master is that MySQL still occasionally corrupts the binary log. This is mainly a problem over unreliable connections, so it won't be a problem if all the instances are local.
Another problem with multi-master is that it simply doesn't scale with writes, as you've already experienced or assumed, given a point in your question. All of the writes on one master have to be replicated by the others. Even if you spread out read load correctly, you will eventually hit an I/O bottleneck that can only be resolved by more hardware, an application redesign, or sharding (read: application redesign). It's slightly better now that MySQL has row-based replication available.
If you need geographic diversity, multi-master could work.
Also look into DRBD, a disk-block-level replication system that's now built into modern Linux kernels. It has been used by others to replicate MySQL and PostgreSQL before, though I don't have any personal experience with it.
I can't tell from your question whether you're looking for high availability or simply for (manual or automatic) fail-over. If you just need fail-over and can tolerate a bit of downtime, then traditional master/slave replication might be enough for you. The trouble is turning a slave that became a master back into a slave.
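For what it's worth, the SQL for re-pointing that old master is short; the hard part is getting its data back in sync first, usually by restoring a fresh backup of the new master taken at known binary log coordinates. A rough sketch with placeholder hostnames, credentials and coordinates:

```php
<?php
// Rough sketch of turning the old master back into a slave after a fail-over.
// It assumes the old master's data was first restored from a backup of the new
// master, and that the binary log file/position recorded at backup time are
// known. Hostnames, credentials and coordinates are placeholders.
$oldMaster = new PDO('mysql:host=db-old-master.example.com', 'repl_admin', 'secret');

$oldMaster->exec("CHANGE MASTER TO
    MASTER_HOST     = 'db-new-master.example.com',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = 'repl_pass',
    MASTER_LOG_FILE = 'mysql-bin.000123',
    MASTER_LOG_POS  = 4567");
$oldMaster->exec('START SLAVE');

// Then check SHOW SLAVE STATUS until Slave_IO_Running and Slave_SQL_Running
// both report 'Yes'.
```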
Related
A couple of days ago I was asked by my company to gather requirements to start a project: an e-book store. That sounds simple, but the total amount of data is about 4 TB and the number of files is around 500,000.
As my team members use PHP and MySQL, I tried to look around the Apache projects for big data. I quickly came across Apache Hadoop and MySQL Cluster for big data. But after several days of digging on Google, I'm now just completely confused! I now have these questions:
Is this amount of data (4-5 TB) even considered big data? (Some sources said you should use Hadoop only with at least 5 TB of data; others said big data for Hadoop means zettabytes and petabytes.)
Does Hadoop ship with its own special database, or should it be used with MySQL etc.?
Does Hadoop work only on a cluster, or does it work just as well on a single-node server?
As I've encountered these terms only very recently, I realize that some or all of my questions may be really silly... But I'd be really grateful for any other suggestions for this type of project.
Here are my short answers
Is this amount of data (4-5 TB) even considered big data? (Some sources said you should use Hadoop only with at least 5 TB of data; others said big data for Hadoop means zettabytes and petabytes.)
Yes and no. For certain use cases this is not big data, while for others it is. Questions that should be asked and answered:
Is this data growing? What is the rate of growth?
Are you going to run analytics on this data from time to time?
Does Hadoop ship with its own special database, or should it be used with MySQL etc.?
Yes, Hadoop has the HDFS file system, which can store flat files and can be treated as a data repository. But that may not be the best solution. You may want to look at NoSQL DBs like Cassandra, HBase, or MongoDB.
Does Hadoop work only on a cluster, or does it work just as well on a single-node server?
Technically, yes, Hadoop can run on a single node in pseudo-distributed or standalone mode. But that is only used for learning, testing, or development purposes. For any production environment you should think of Hadoop clusters spanning multiple VMs. The minimum I've seen in production was 6 VMs.
As such, 5 TB is not a very big volume for a relational DB (one that supports clustering). But the cost of supporting a relational DB goes up steeply with capacity, while with Hadoop and plain HDFS the cost is very low; add Cassandra or HBase and there's not much difference. But remember, simply by using Hadoop you are looking at a high-latency system. If your expectation is that Hadoop will answer your queries in real time, please look for other solutions. (For example, for queries like "list all books checked out to Xyz", just get it from the DB; don't use Hadoop for that query.)
Overall, my suggestion would be: take a crash course on Hadoop (YouTube, Cloudera), try to gain some understanding of what Hadoop is and is not, and then decide. Your questions give the impression that you have a long learning curve ahead, and it is worth taking on that challenge.
This should be a comment, but it is too long.
Hadoop is a framework for writing parallel software, originally written at Yahoo. It is loosely based on a framework developed at Google in the early 2000s, which in turn was a parallel implementation of the map and reduce primitives from the Lisp language. You can think of Hadoop as a bunch of libraries that run either on hardware you own or on hardware in the cloud. These libraries provide a programming interface to Java and to other languages. It allows you to take advantage of a cluster of processors and disks (with HDFS). Its major features are scalability and fault tolerance, both very important for large data problems.
Hadoop implements a programming methodology built around a parallel implementation of map-reduce. That was the original application. Nowadays, lots of things are built on Hadoop. You should start with the Apache project description and Wikipedia page to learn more.
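As a toy, single-process illustration of the pattern (Hadoop's contribution is running the map step in parallel across many nodes over HDFS blocks and merging the partial results in the reduce step):

```php
<?php
// Toy word count using the map/reduce pattern in one process.
$lines = ['the quick brown fox', 'the lazy dog'];

// Map: turn each input line into a list of words.
$mapped = array_map(function ($line) {
    return explode(' ', $line);
}, $lines);

// Reduce: fold the per-line word lists into one global word count.
$counts = array_reduce($mapped, function ($carry, $words) {
    foreach ($words as $word) {
        $carry[$word] = isset($carry[$word]) ? $carry[$word] + 1 : 1;
    }
    return $carry;
}, array());

print_r($counts); // ['the' => 2, 'quick' => 1, 'brown' => 1, ...]
```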
Several databases support interfaces to Hadoop (Asterdata comes to mind). Often when one thinks of "databases" and "Hadoop", one is thinking of Pig or Hive or some related open-source project.
As for your question: if your data conforms naturally to a relational database (tables with columns connected by keys), then use a relational database. If you need fast performance on web applications with hierarchical data, then learn about NoSQL solutions, such as MongoDB. If your data has a complex structure and requires scalability, and you have programming skills on your team, then think about a Hadoop-based component to the solution. And, for a large project, multiple technologies are often needed for different components -- real-time operations using NoSQL, reporting using SQL, ad hoc querying using a combination of SQL and Hadoop (for instance).
I've recently built a reasonably complex database application for a university, handling almost 200 tables. Some tables (e.g. Publications) can hold 30 or more fields and store 10 one-to-one FK relations and up to 2 or 3 many-to-many FK relations (using cross-reference tables). I use integer IDs throughout, and normalisation has been key every step of the way. AJAX is minimal and most pages are standard CRUD forms/processes.
I used Symfony 1.4, Doctrine ORM 1.2, MySQL, and PHP.
While the benefits to development time and ease of maintenance have been enormous (from using an MVC framework and an ORM), we've been having problems with speed. That is, when we have more than a few users logged in and active at any one time, the application slows considerably (up to 20 seconds to save or edit a record).
We're currently in discussions with our SysAdmin, but they say that we should have more than enough power. With 6 or more users engaged in activity, we end up with CPU queuing on 4 CPUs in a virtual server environment while memory usage is low (no leaks).
Of course, we're considering multi-threading our MySQL usage (if that would help), refining our code (though much of it is generated by the MVC), and refining our cache usage (this could be better, though the majority of each screen is user-login-specific and dynamic). We've installed APC and extra memory, defragmented our database, tried unsetting all recordsets (though I understand this is now automatic within the ORM), and triggered manual garbage collection...
But the question I'm asking is whether MySQL, PHP, and the Symfony MVC were actually a poor choice for developing an application of this size in the first place. If so, what do people normally use/recommend for a web-based database interface application of this size/complexity?
I don't have any experience with Symfony or Doctrine. However, there are certainly larger sites than yours that are built on these projects. DailyMotion for instance uses both, and it serves a heck of a lot more than a few simultaneous users.
Likewise, PHP and MySQL are used on sites as large as Wikipedia, so scalability should not be a problem. (BTW, MySQL should be multithreaded by default—one thread per connection.)
Of course, reasonable performance is relative. If you're trying to run a site serving 200 concurrent requests off a beige box sitting in a closet, then you may need to optimize your code a lot more than a Fortune 500 company with their own data centers.
The obvious thing to do is actually profile your application and see where the bottleneck is. Profiling MySQL queries is pretty simple, and there are also assorted PHP profiling tools like Xdebug. From there you can figure out whether you should switch frameworks, ORMs (or ditch ORM altogether), databases, or refactor some code, or just invest in more processing power.
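As a starting point, the MySQL side can be profiled without touching the application at all; this is only a sketch with a placeholder host and credentials, and it needs a user with the SUPER privilege:

```php
<?php
// Turn on the MySQL slow query log at runtime to see which queries hurt.
$pdo = new PDO('mysql:host=db.example.com', 'admin_user', 'admin_pass');
$pdo->exec('SET GLOBAL slow_query_log = 1');   // start logging
$pdo->exec('SET GLOBAL long_query_time = 1');  // log anything slower than 1 second

// Exercise the slow pages, then review the log (mysqldumpslow helps) and
// run EXPLAIN on the worst offenders to find missing indexes.
```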
The best way to speed up complex database operations is not calling them at all :).
So analyse which parts of your application you can cache.
Symfony has a pretty good caching system (even better in Symfony2) that can be used at a fairly granular level.
On the database side, another approach would be to use views or nested sets to store aggregated data.
Try to find parts where this is appropriate.
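One caveat: MySQL views are not materialised, so a plain view still re-runs the underlying join on every read. A periodically refreshed summary table is one way to actually store the aggregate; the table and column names below are purely illustrative:

```php
<?php
// A summary table refreshed periodically (e.g. from cron) that stores an
// expensive aggregate, so pages read one indexed row instead of re-running
// the join every time.
$pdo = new PDO('mysql:host=localhost;dbname=research', 'app_user', 'secret');

$pdo->exec('CREATE TABLE IF NOT EXISTS publication_summary (
    publication_id INT PRIMARY KEY,
    author_count   INT NOT NULL
)');

$pdo->exec('REPLACE INTO publication_summary (publication_id, author_count)
    SELECT p.id, COUNT(a.person_id)
    FROM publication p
    LEFT JOIN publication_author a ON a.publication_id = p.id
    GROUP BY p.id');

// Pages that only need the headline numbers can now do:
//   SELECT author_count FROM publication_summary WHERE publication_id = ?
```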
I need to run a Linux-Apache-PHP-MySQL application (the Moodle e-learning platform) for a large number of concurrent users; I am aiming for 5000 users. By concurrent I mean that 5000 people should be able to work with the application at the same time. "Work" means not only database reads but writes as well.
The application is not very typical, since it does a lot of inserts/updates on the database, so caching techniques don't help much. We are using the InnoDB storage engine. In addition, the application is not written with performance in mind; for instance, one Apache thread usually occupies about 30-50 MB of RAM.
I would be grateful for information on what hardware is needed to build a scalable configuration that is able to handle this kind of load.
We are currently using two HP DL380s with two 4-core processors each, which can handle a much lower load (typically 300-500 concurrent users). Is it reasonable to invest in this kind of box and build a cluster using them, or is it better to go with more high-end hardware?
I am particularly curious about:
how many and how powerful servers are needed (number of processors/cores, size of RAM)
what network equipment should be used (what kind of switches, network cards)
any other hardware, like particular disk storage solutions, etc., that is needed
Another thing is how to put everything together, that is, what the most optimal architecture is. Clustering with MySQL is rather hard (people complain about MySQL Cluster, even here on Stack Overflow).
Once you get past the point where a couple of physical machines can't handle the peak load you need, you probably want to start virtualising.
EC2 is probably the most flexible solution at the moment for the LAMP stack. You can set up their VMs as if they were physical machines, cluster them, spin them up as you need more compute-time, switch them off during off-peak times, create machine images so it's easy to system test...
There are various solutions available for load-balancing and automated spin-up.
If you can make your app fit, you can get use out of their non-relational database engine as well. At very high loads, relational databases (and MySQL in particular) don't scale effectively. The peak load of SimpleDB, BigTable and similar non-relational databases can scale almost linearly as you add hardware.
Moving away from a relational database is a huge step, though; I can't say I've ever needed to do it myself.
I'm not so sure about hardware, but from a software point-of-view:
With an efficient data layer that caches objects and collections returned from the database, I'd say a standard master-slave configuration would work fine. Route all writes to a beefy master and all reads to slaves, adding more slaves as required.
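Here's a minimal sketch of such a read/write split in the data layer, assuming one master and a small pool of slaves; hostnames and credentials are placeholders, and bear in mind that replication lag means a read immediately after a write may still need to go to the master:

```php
<?php
// Writes go to the master, reads are spread across the slaves.
$master = new PDO('mysql:host=db-master.example.com;dbname=app', 'app_user', 'secret');

$slaveHosts = ['db-slave1.example.com', 'db-slave2.example.com'];
$slaveHost  = $slaveHosts[array_rand($slaveHosts)];
$slave      = new PDO("mysql:host=$slaveHost;dbname=app", 'app_user', 'secret');

// Writes hit the master...
$master->prepare('INSERT INTO comments (post_id, body) VALUES (?, ?)')
       ->execute([42, 'Nice post!']);

// ...reads hit a randomly chosen slave.
$stmt = $slave->prepare('SELECT body FROM comments WHERE post_id = ?');
$stmt->execute([42]);
$comments = $stmt->fetchAll(PDO::FETCH_COLUMN);
```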
Cache data as the objects returned from your data-mapper/ORM, not as HTML, and use Memcached as your caching layer. If you update an object, write it to the DB and update it in Memcached; it's best to use the Identity Map pattern for this. You'll probably need quite a few Memcached instances, although you could get away with running these on your web servers.
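Here's a rough sketch of that read-through pattern using the PHP Memcached extension; key names, TTL and connection details are made up:

```php
<?php
// Read-through object cache in front of MySQL.
$cache = new Memcached();
$cache->addServer('cache1.example.com', 11211);

function loadUser(PDO $db, Memcached $cache, $id)
{
    $key  = "user:$id";
    $user = $cache->get($key);
    if ($user !== false) {
        return $user;                      // cache hit: no database round trip
    }

    $stmt = $db->prepare('SELECT * FROM users WHERE id = ?');
    $stmt->execute([$id]);
    $user = $stmt->fetch(PDO::FETCH_ASSOC);

    $cache->set($key, $user, 300);         // keep it for 5 minutes
    return $user;
}

// When an object is updated, write the fresh copy to both places so readers
// don't see stale data, e.g. $cache->set("user:$id", $freshUser, 300);
```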
We could never get MySQL clustering to work properly.
Be careful with the SQL queries you write and you should be fine.
Piotr, have you tried asking this question on moodle.org yet? There are a couple of similarly scoped installations whose staff members currently answer questions there.
Also, depending on your timeframe for deployment, you might want to check out the Moodle 2.0 line rather than the Moodle 1.9 line; it looks like it has a bunch of good fixes for some of the issues with Moodle's architecture.
Also: memcached rocks for this. PHP acceleration rocks for this. Server Fault is probably the better Stack Exchange site for this question, though.
I am planning to build a web app running on a single computer and to exploit the hardware resources as efficiently as possible. The logic of the app will not be complex. The following is my design:
OS: Linux (CentOS 5)
Web Server: Nginx
Web script: PHP
Database: Tokyo cabinet + Tokyo Tyrant
Index: Sphinx
I am not going to use an RDBMS such as MySQL, because I think a key-value store (Tokyo Cabinet) with an indexer (Sphinx) will meet all the needs of a normal web app, and with better performance than MySQL.
My question is: is this an efficient architecture for a single computer? And how could it be improved?
(I know this question might be subjective, but I really need your help.)
Thank you very much~
EDIT:
The computer I am going to host my app on is a normal PC, with something like 8-16 GB of memory, a 500 GB-1 TB hard disk, etc. I don't think I need to consider "scalability" yet. Every web app takes its first step on one machine, and that will always be the beginning.
Choice of DB
I think that the choice of the type of database depends less on how many computers the system is hosted on and more on the nature of the data that you want/need to preserve.
For example, if you need to store the shipping addresses for a customer, you will need to account for that in your storage structure. A name-value pair may seem an easy enough structure to begin with, but if you foresee any of the following, you should consider moving to a standard database system:
keeping track of changes
reporting activity / reports
concurrent users
Performance
This is dependent on your code, images, content, caching, etc just as much as it is on your database.
Well, one way to find out is to load test it:
http://grinder.sourceforge.net/
I've never worked with Tokyo Cabinet, but if it's functionally sufficient, then it will probably be significantly faster than a DB.
In the long run, though, any savings you realize by tuning your app to work on one box will quickly be lost when you start to scale beyond that box. Adding a lot of caching and hacks to make the app faster will only go so far. More importantly, you should think about how easily you can decouple the various layers.
I am new to the website scalability realm. Can you suggest some techniques for making a website scalable to a large number of users?
Test your website under heavy load.
Monitor all statistics
Find bottleneck
Fix bottleneck
Go back to 1
good luck
If you expect your site to scale beyond the capabilities of a single server, you will need to plan carefully. Design so that the following will be possible:
Make it so your database can be on a separate server. This isn't normally too hard.
Ensure all your static content can be moved to a CDN, as this will normally pull a lot of load off your servers.
Be prepared to spend a lot of money on hardware. More RAM and faster disks help a LOT.
It gets a lot harder when you need to split either the database or the PHP from a single server to multiple servers, so optimise everything (your code, your database schema, your server config and anything else you can think of) to put this final step off for as long as possible.
Other than that, all you can do is stress test your site, figure out where the bottlenecks are and try and design them away.
Check out this talk by Rasmus Lerdorf (creator of PHP)
Especially page 8 and beyond.
You might want to look at this resource: highscalability.com.
A number of people have mentioned tools for identifying bottlenecks, and that is of course necessary. You can't spend productive time speeding something up without knowing where it's slow. But the other thing you need to know is where your target scalability lies. Is it value for money to spend a couple of months making your site scale to the same number of users as Twitter if it's going to be used by three people in HR? Do you have a known rate of transactions, or response latency, or number of users, in the requirements of the product? If so, target those numbers with your optimisation strategy. If not, find those out before chasing the performance rat down the hole.
Very similar: How Is PHP Done the Right Way?
Scalability is no small subject and certainly more material than can be reasonably covered in a single question.
For instance, with some kinds of applications, joins (in SQL) don't scale, which brings up all sorts of caching and sharding strategies.
Beanstalk is another scalability and performance tool used in high-performance PHP sites, as is memcache (a different kind of tool).
The biggest problem for scalability is usually shared resources like DBMS's. The problem arises because DBMS's usually have no way to relax consistency guarantees.
If you want to increase scalability when you use something like MySQL you have to change your schema design to relax consistency.
For instance, you can separate your database schema to have your normalized data model for writes, and a replicated read only denormalized part for the 90% of read operations. The read only data can be spread over several servers.
Another way to increase scalability of a database is to partition the data, e.g. separate the data into a database for every department and aggregate them either in the ORM or in the DBMS.
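A sketch of that aggregate-in-the-application approach, assuming one database per department; hostnames, department names and the query are illustrative only:

```php
<?php
// One database per department, aggregated in application code.
$departmentDsns = [
    'sales'       => 'mysql:host=db-sales.example.com;dbname=orders',
    'engineering' => 'mysql:host=db-eng.example.com;dbname=orders',
];

$totals = [];
foreach ($departmentDsns as $dept => $dsn) {
    $pdo = new PDO($dsn, 'report_user', 'secret');
    // Each partition answers only for its own slice of the data...
    $totals[$dept] = (int) $pdo->query('SELECT SUM(amount) FROM orders')->fetchColumn();
}

// ...and the application (or the ORM layer) aggregates the slices.
$grandTotal = array_sum($totals);
```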
In order of importance:
If you run PHP, use an opcode cache like APC. (This is important enough to be built into the next generation of PHP.)
Use YSlow or Google Page Speed to identify bottlenecks. (This will reveal structural problems with your website that affect both client and server performance.)
Ensure that your web server sends a proper Expires header for static content (images, Javascript, CSS), such that the browser can cache it properly. (YSlow will warn you about this, too.)
Use an HTTP accelerator, such as Varnish. (This picture says it all – and they already had an HTTP accelerator in place.)
Develop your site using solid OOP techniques. You will need your site to be modular, as not all performance bottlenecks are obvious at the start. Be ready to refactor parts of your site as traffic increases; the first sentence I wrote will help you do that more easily and safely. Also, use test-driven development: refactoring means newly introduced bugs, and good TDD is good at catching them before they go into production.
Separate client-side code from server-side code as much as possible, as they will likely be served from different servers if your site traffic justifies it.
Read articles (read the YSlow tips for instance).
GL
In addition to the other suggestions, look into splitting your sites into tiers, as in multitier architecture. If done right, you can then use one server per tier.