I am currently in a debate with a coworker about the best practices concerning the database design of a PHP web application we're creating. The application is designed for businesses, and each company that signs up will have multiple users using the application.
My design methodology is to create a new database for every company that signs up. This way everything is sandboxed, modular, and small. My coworker's philosophy is to put everyone into one database. His argument is that if we have 1000+ companies sign up, we wind up with 1000+ databases to deal with, not to mention the mess that doing Business Intelligence becomes.
For the sake of example, assume that the application is an order entry system. With separate databases, table size can remain manageable even if each company is doing 100+ orders a day. In a single-bucket application, tables can get very big very quickly.
Is there a best practice for this? I tried hunting around the web, but haven't had much success. Links, whitepapers, and presentations welcome.
Thanks in advance,
The1Rob
I talked to the database architect from wordpress.com, the hosting service for WordPress. He said that they started out with one database, hosting all customers together. The content of a single blog site really isn't that much, after all. It stands to reason that a single database is more manageable.
This worked well for them until they had hundreds and thousands of customers. At that point they realized that they needed to scale out, running multiple physical servers and hosting a subset of their customers on each server. With a separate database per customer, it is easy to migrate individual customers to a newly added server; it is much harder to pull the data belonging to one customer's blog out of a single shared database.
As customers come and go, and some customers' blogs have high-volume activity while others go stale, the rebalancing over multiple servers becomes an even more complex maintenance job. Monitoring size and activity per individual database is easier too.
Likewise, doing a backup or restore of a single database containing terabytes of data, versus individual backups and restores of a few megabytes each, is an important factor. Consider: a customer calls and says their data got SNAFU'd due to some bad data entry, and could you please restore the data from yesterday's backup? How would you restore one customer's data if all your customers share a single database?
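To make the restore scenario concrete, here is a minimal Python sketch of per-customer backup and restore. It assumes SQLite (via the standard `sqlite3` module) as a stand-in for whatever database server you actually run; the function names are made up for illustration, and the connection being restored into should have no open transaction.

```python
import sqlite3


def backup_database(source: sqlite3.Connection, backup_path: str) -> None:
    """Copy one customer's entire database into a backup file."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # SQLite online backup API (Python 3.7+)
    finally:
        target.close()


def restore_database(backup_path: str, live: sqlite3.Connection) -> None:
    """Overwrite one customer's live database with the backup contents."""
    source = sqlite3.connect(backup_path)
    try:
        source.backup(live)
    finally:
        source.close()
```

Because each customer has their own database, the restore touches nobody else's data. In a shared database you would instead have to extract and re-import only the rows belonging to that customer.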
Eventually they decided that splitting into a separate database per customer, though complex to manage, offered them greater flexibility and they re-architected their hosting service to this model.
So, while from a data modeling perspective it seems like the right thing to do to keep everything in a single database, some database administration tasks become easier as you pass a certain breakpoint of data volume.
I would never create a new database for each company. If you want a modular design, you can get it with tables and properly connected primary and foreign keys. This is where I learned about database normalization, and I'm sure it will help you out here.
This is the method I would use. SQL Article
I'd have to agree with your co-worker. Relational databases are designed to handle large amounts of data, and the numbers you're talking about (1000+ companies, multiple users per company, 100+ orders/day) are well within the expected bounds. Separate databases means:
multiple database connections in each script (memory and speed penalty)
maintenance is harder (DB systems generally do not provide tools for acting on databases as a group) so schema changes, backups, and similar tasks will be more difficult
harder to run queries on data from multiple companies
If your site becomes huge, you may eventually need to distribute your data across multiple servers. Deal with that when it happens. To start out that way for performance reasons sounds like premature optimization.
I haven't personally dealt with this situation, but I would think that if you want to do business intelligence, you should aggregate the data into an offline database that you can then run any analysis you want on.
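A rough sketch of that aggregation step, in Python with SQLite standing in for the real database. The `orders` table with `id` and `total` columns and the `all_orders` reporting table are assumptions made up for the example, not anything from the question.

```python
import sqlite3
from typing import Dict


def build_reporting_db(tenant_dbs: Dict[int, sqlite3.Connection]) -> sqlite3.Connection:
    """Copy every company's orders into one offline reporting database."""
    reporting = sqlite3.connect(":memory:")
    reporting.execute(
        "CREATE TABLE all_orders (company_id INTEGER, order_id INTEGER, total REAL)"
    )
    for company_id, conn in tenant_dbs.items():
        rows = conn.execute("SELECT id, total FROM orders").fetchall()
        # Tag each copied row with the company it came from.
        reporting.executemany(
            "INSERT INTO all_orders (company_id, order_id, total) VALUES (?, ?, ?)",
            [(company_id, order_id, total) for order_id, total in rows],
        )
    reporting.commit()
    return reporting
```

Cross-company queries then run against the reporting copy, so the per-company databases stay isolated and are not loaded by analysis work.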
Also, keeping them in separate databases makes it easier to partition across servers (which you will likely have to do if you have 1000+ customers) without resorting to messy replication technologies.
I had a similar question a while back and came to the conclusion that a single database is drastically more manageable. Right now, we have multiple databases (around 10) and it is already becoming a pain to manage especially when we upgrade the code. We have to migrate every single database.
The upside is that the data is segregated cleanly. Due to the sensitivity of our data, this is a good thing, but it does make it quite a bit more difficult to keep up with.
The separate database methodology has very big advantages over the other:
+ You can break it up into smaller groups, so this architecture scales much better.
+ You can easily set up stand-alone servers.
That depends on how likely your schemas are to change. If they ever have to change, will you be able to safely make those changes to 1000 separate databases? If a scalability problem is found with your design, how are you going to fix it for 1000 databases?
We run a SaaS (Software-as-a-Service) business with a large number of customers and have elected to keep all customers in the same database. Managing thousands of separate databases is an operational nightmare.
You do have to be very diligent creating your data model and the business objects / reporting queries that access them. One approach you may want to consider is to carry the company ID in every table and ensure that every WHERE clause includes the company ID for the currently logged-in user. If you use a data access layer, you can enforce that condition there.
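Here is a minimal sketch of what enforcing the company ID in a data access layer could look like. It is written in Python against SQLite purely for illustration; the `orders` table, its columns, and the class name are assumptions, not a description of our actual code.

```python
import sqlite3
from typing import List, Optional


def create_schema(conn: sqlite3.Connection) -> None:
    """Every tenant-owned table carries a company_id column."""
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, company_id INTEGER NOT NULL, item TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_orders_company ON orders (company_id)")


class TenantScopedOrders:
    """Data access object that adds the company filter to every query."""

    def __init__(self, conn: sqlite3.Connection, company_id: int) -> None:
        self._conn = conn
        self._company_id = company_id  # taken from the logged-in user's session

    def add(self, item: str) -> int:
        cur = self._conn.execute(
            "INSERT INTO orders (company_id, item) VALUES (?, ?)",
            (self._company_id, item),
        )
        self._conn.commit()
        return cur.lastrowid

    def get(self, order_id: int) -> Optional[str]:
        # The company_id condition is always present, so another
        # company's order ID simply returns nothing.
        row = self._conn.execute(
            "SELECT item FROM orders WHERE id = ? AND company_id = ?",
            (order_id, self._company_id),
        ).fetchone()
        return row[0] if row else None

    def list_items(self) -> List[str]:
        rows = self._conn.execute(
            "SELECT item FROM orders WHERE company_id = ? ORDER BY id",
            (self._company_id,),
        )
        return [item for (item,) in rows]
```

Application code only ever talks to the scoped object, so there is no way to forget the WHERE clause in an individual query.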
As you grow large, you can still partition horizontally by placing groups of companies on each physical server, e.g. the first 100 companies on Server A, the next 100 companies on Server B.
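The "first 100 on Server A, next 100 on Server B" rule can be sketched as a small routing function. The host names and the group size of 100 below are placeholders chosen for the example, and it assumes company IDs are assigned sequentially starting at 1.

```python
COMPANIES_PER_SERVER = 100

# Hypothetical host names; replace with your real database servers.
SERVERS = [
    "server-a.example.internal",
    "server-b.example.internal",
    "server-c.example.internal",
]


def server_for_company(company_id: int) -> str:
    """Map company IDs 1-100 to the first server, 101-200 to the second, and so on."""
    if company_id < 1:
        raise ValueError("company IDs start at 1")
    index = (company_id - 1) // COMPANIES_PER_SERVER
    if index >= len(SERVERS):
        raise LookupError(f"no server provisioned for company {company_id}")
    return SERVERS[index]
```

A fixed range rule like this is the simplest option. If you expect to move individual companies between servers later, a lookup table from company ID to server is more flexible.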
I have a question regarding databases and performances, so let me explain the situation.
The application - to be built - has the following set-up:
A group, with under that group, users.
Data / file-locations, (which is used to search through), estimated that one group can easily reach one million "search" terms.
Now, groups can never look at each other's data, and users can only look at the data which belongs to their group.
The only thing they should have in common is, some place to send error logs to (maybe, not even necessary).
Now in this situation, would you create a new database per group, or always limit your search results with a query that takes the user's group ID into account?
My idea was to just create a new database per group, because then you do not need to limit your query every single time and it keeps the set of results to search through smaller (?). But is that really necessary, or is a "where groupid = 1" fast enough, even on over a million records, that you would not notice a decrease in performance?
This is the regular multi-tenant SaaS Architecture problem, which has been discussed at length, and the solution always varies according to your own situation. Here is one example of this discussion that I will just link to instead of copy-paste since all of it is worth a read: Multi-tenant PHP SaaS - Separate DB's for each client, or group them?
In addition to that I would like to add some more high level considerations:
Are there any legal requirements regarding the storage of your user's data? Some businesses operate in a regulatory environment where they are not allowed to store their data in a shared environment, quite common in the financial and medical industries.
Will you offer the same security (login method, data storage encryption), backup/restore service, geolocation redundancy and up-time guarantee to all users?
Are there any users who are willing to pay extra to have their data stored in a separate environment?
Are there any users who will potentially have requirements that are not compatible with the standard product that you will be offering? If so will you try to accommodate them? Note that occasionally there is some big customer that comes along and offers a lot of cash for a special treatment.
What is a separate environment? Is it a separate database, a separate virtual machine, a separate physical machine, a machine managed by the customer?
What parts of your application are part of each environment (hardware configuration, network config, database, source code, binaries, encryption certificates, etc.)?
Will there be some heavy users that may produce loads on your application that will negatively impact the performance for the smaller users?
If you go for all users in one environment, is there a possibility that you will create a separate environment for some customer in the future? If so, this will affect where you put shared data, e.g. configuration data such as tax rates and exchange rates.
I hope this helps.
Performance isn't really your problem; maintenance and data security are. If you have a lot of databases, you will have more to maintain: not only backups but connection strings, patches, schema updates on release and so on. Multiple databases also suggest that you will have multiple PHP sites too. That will gradually get more expensive as the number of groups grows.
If you have one database then you need to ensure that every query contains the group id before it can run.
Database tables can be very, very large if you choose your indexes and constraints carefully. If you are performing joins against very large tables then it will be slow but a simple lookup, where you have an index on the group column should be fast enough.
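If you want to check that a lookup on the group column really uses the index, you can ask the database for its query plan. Below is a small Python sketch using SQLite's `EXPLAIN QUERY PLAN` (MySQL has `EXPLAIN` for the same purpose); the `search_terms` table and index name are invented for the example.

```python
import sqlite3
from typing import List


def create_search_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE search_terms ("
        "id INTEGER PRIMARY KEY, group_id INTEGER NOT NULL, term TEXT NOT NULL)"
    )
    # The index on the group column is what keeps per-group lookups cheap.
    conn.execute("CREATE INDEX idx_search_terms_group ON search_terms (group_id)")


def query_plan(conn: sqlite3.Connection, group_id: int) -> List[str]:
    """Return the planner's description of a per-group lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM search_terms WHERE group_id = ?",
        (group_id,),
    ).fetchall()
    return [row[-1] for row in rows]  # the last column holds the plan text
```

If the plan text names `idx_search_terms_group`, the database is searching the index instead of scanning the whole table.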
If you were to share a single database, would you ever move a group out of it? If that's a possibility then split the databases now. If you are going to have one PHP site then I would recommend a single database with a group column.
I've recently taken over a project linking to a large MySQL DB that was originally designed many years ago and need some help.
Currently the DB has 5 tables per client that store their users' information, transaction history, logs etc. However, we currently have ~900 clients that have applied to use our services, with an average of 5 new clients applying weekly. So the DB has grown to nearly 5000 tables and keeps growing. Many of our clients do not end up using our services, so their tables are all empty but still in the DB.
The original DB designer says it was created this way so if a table was ever compromised it would not reveal information on any other client.
As I'm redesigning the project in PHP, I'm thinking of redesigning the DB to have single overall user, transaction history, log etc. tables, using the client's unique ID to reference them.
Would this approach be correct or should the DB stay as is?
Can you see any possible security / performance concerns?
Thanks for all your help
You should redesign the system to have just five tables, with a separate column identifying which client the row pertains to. SQL handles large tables well, so you shouldn't have to worry about performance. In fact, having many, many tables can be a hindrance to performance in many cases.
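As a rough illustration of the consolidation, here is a Python sketch using SQLite. It assumes the legacy per-client tables are named `users_<client id>` and have a `name` column; your real table names and columns will differ, so treat this as the shape of the migration and nothing more.

```python
import sqlite3
from typing import Iterable


def consolidate_user_tables(conn: sqlite3.Connection, client_ids: Iterable[int]) -> None:
    """Merge hypothetical per-client tables users_<id> into one users table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, name TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_users_client ON users (client_id)")
    for client_id in client_ids:
        # int() guarantees the interpolated table name contains only digits.
        legacy = f"users_{int(client_id)}"
        conn.execute(
            f"INSERT INTO users (client_id, name) SELECT ?, name FROM {legacy}",
            (client_id,),
        )
        conn.execute(f"DROP TABLE {legacy}")
    conn.commit()
```

The same pattern would be repeated for each of the five table types, after which every query filters on `client_id` instead of picking a table by name.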
This has many advantages. You will be able to optimize the table structures for all clients at once. No more trying to add an index to 300 tables to meet some performance objective. Managing the database, managing the tables, backing things up -- all of these should be easier with a single table.
You may find that the database even gets smaller in size. This is because, on average, each of those thousands of tables has a half-filled page at the end. This will go from thousands of half-filled pages to just one per table.
The one downside is security. It is easier to put security on tables than on rows in tables. If this is a concern, you may need to think about these requirements.
This may just be a matter of taste, but I would find it far more natural - and thus maintainable - to store this information in as few tables as possible. Also most if not all database ORMs will be expecting a structure like this, and there is no reason to reinvent that wheel.
From the perspective of security, it sounds like this project could be described as a web app. Obviously I don't know the realities of the business logic you're dealing with, but it seems like regardless of the table permissions all access to the database would be via the code base, in which case the app itself needs full permissions for all tables - nullifying any advantage of keeping the tables separated.
If there is a compelling reason for the security measures - say, different services that feed data into the DB independently of the web app - I would still explore ways to handle that authentication at the application layer instead of at the database layer. It will be much easier to handle your security rules in that way. Instead of having rules set in 5000+ different places, a single security rule of "only let a user view a row of data if their user id equals the user_id column" is far simpler, easier to understand, and therefore far more maintainable (and possibly more secure).
Different people approach databases in different ways. I am a web developer, so I view databases as the place to store my data and nothing more, as it's always a dedicated and generally single-purpose DB installation, and I handle all other logic at the application level. There are people who view databases as the application itself, who make far more extensive use of built-in security features for their massive, distributed, multi-user systems - but I honestly don't know enough about those scenarios to comment on exactly where that line should be drawn.
Today we had a long conversation in our company between the project leader and the programmers: should we run one DB with all tables in it for our new project, or run multiple databases, with each DB storing the tables for one module?
The project is a shop, and we have separate parts (let's call them modules) such as users, payment methods, products, and statistics.
One side said that we should place all of it inside one DB (it's standard procedure) because it will be faster: with one query you can get all the data. The other side said that we should split it between multiple databases so that it's more secure: if someone breaches the products DB, they won't see the users tables, as that DB will be on a different server or virtual machine.
So my question is, what are the pros and cons of having a single DB for all data vs. having multiple databases? I read a few questions on Stack Overflow, but none of them were precisely about pros and cons. And if having multiple databases is slower, how can we speed it up?
Thanks!
Rather than providing you generic pros/cons (because everything depends on the use case...), I would say that we tend to prematurely optimize systems while optimization shouldn't be a problem in the future, if the whole system is architected with refactoring in mind.
I had the same discussion some time ago, and my conclusion (IMHO) is to start with a single database. It simplifies a lot of details:
Single database to back up, less maintenance.
You don't need to manage multiple connections.
Multiple databases can break the chance to perform atomic transactions, a feature I would never throw away.
You avoid synchronizing two or more databases to avoid integrity problems.
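To illustrate the atomic transaction point from the list above: when payments and products live in the same database, one transaction can cover both. The sketch below uses Python with SQLite, and the table layout is invented for the example. A purchase that would drive stock below zero fails the CHECK constraint, and the payment row written a moment earlier is rolled back with it.

```python
import sqlite3


def create_shop_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE products ("
        "id INTEGER PRIMARY KEY, stock INTEGER NOT NULL CHECK (stock >= 0))"
    )
    conn.execute(
        "CREATE TABLE payments ("
        "id INTEGER PRIMARY KEY, product_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )


def purchase(conn: sqlite3.Connection, product_id: int, amount: float) -> bool:
    """Record a payment and decrement stock as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO payments (product_id, amount) VALUES (?, ?)",
                (product_id, amount),
            )
            conn.execute(
                "UPDATE products SET stock = stock - 1 WHERE id = ?",
                (product_id,),
            )
    except sqlite3.IntegrityError:
        return False  # out of stock: the payment insert was rolled back too
    return True
```

With payments and products in two separate databases, you would need some form of distributed transaction or compensation logic to get the same guarantee.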
Also, since we're in the cloud computing era, infrastructures should scale horizontally. That is, if you need more power, add a replication node and distribute your load across multiple servers instead of scaling in the application level. This ensures your software will be still easy to maintain and develop, and good solutions should scale out easily if your code has quality and, of course, you've budget to support an increased load!
I'm new to PHP/MySQL, and I'm coding a simple CMS. In this case I will host multiple companies (each company with its own multiple users) that pay a fee to use the system.
So my question is about how to organize the database. In terms of security, management and performance, I would like to know your opinion on which of these cases is best:
Host all companies on a single DB and they get a company id to match with the users.
Each company has a separate DB that holds its users (and doesn't need the company id anymore).
I would start development following the first approach. But then I thought that if I suffer a hacker attack / SQL injection, every client would be harmed. With separate DBs, the damage would hit only one client. So maybe the second approach is better in terms of security, but I could not say the same about management and performance.
So, based on your experience, any help or tip would be great!
Thanks in advance, and sorry about my poor English.
I would go for separate DBs, and not only because of hacking.
Scalability:
Let's say you have a server that handles 10 websites, but 1 of those websites is growing fast in requests, content, etc. Your server is having a hard time hosting all of them.
With separate DBs it is a piece of cake to spread them over multiple servers. With a single one you would have to upgrade your current DB or cluster it, which is sometimes not possible with the hosting company, or very expensive.
Performance:
If they are all on 1 DB and the data of multiple users is in 1 table, locks might slow down other users.
Large tables mean large indices, large lookups, etc. So splitting into different DBs would actually speed that up.
You would have to deal with extra memory and CPU overhead per DB, but that normally does not have a large impact.
And yes, management for multiple DBs is more work, but having proper update scripts and keeping a good eye on the versions of the DB schema will reduce your management concerns a lot.
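As a sketch of what "proper update scripts" plus schema version tracking can look like, here is a Python example using SQLite, which stores a per-database version number in `PRAGMA user_version`. On MySQL you would typically keep the version in a small table of your own. The migration statements themselves are placeholders.

```python
import sqlite3
from typing import Dict

# Ordered list of schema changes; position N-1 upgrades a database to version N.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]


def schema_version(conn: sqlite3.Connection) -> int:
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply only the migrations this database has not seen yet."""
    current = schema_version(conn)
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        conn.execute(f"PRAGMA user_version = {version}")
        conn.commit()
    return schema_version(conn)


def migrate_all(databases: Dict[str, sqlite3.Connection]) -> Dict[str, int]:
    """Bring every client database up to the latest schema version."""
    return {name: migrate(conn) for name, conn in databases.items()}
```

Because each database records its own version, the script can be rerun safely, and databases that are already up to date are left alone.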
Update: also see this article.
http://msdn.microsoft.com/en-us/library/aa479086.aspx
Separate DBs have many advantages, including performance, security, scalability, mobility, etc. There is more risk and less reward in trying to pack everything into 1 database, especially when you are talking about separate companies' data.
You haven't provided any details, but generally speaking, I would opt for separate databases.
Using an autonomous database for every client allows a finer degree of control, as it would be possible to manage/backup/trash/etc. them individually, without affecting the others. It would also require less grooming, as each client's data is easier to distinguish, and one database cannot break the others.
Not to mention it would make the development process easier -- note that separate databases mean that you don't have to always verify the "owner" of the rows.
If you plan to have this database hosted in a cloud environment such as Azure databases, where resources are (relatively) cheap, clients are running the same code base, the database schema is the same (obviously), and there is the possibility of sharing some data between the companies, then a multi-tenant database may be the way to go. For anything else, you will probably be creating a lot of extra work by going with a multi-tenant database.
Keep in mind that if you go the separate databases route, trying to migrate to a multi-tenant cloud solution later on is a HUGE task. I only mention this because all I've been hearing for the past few years around the IT water coolers is "Cloud! Cloud! Cloud!".
I am writing a PHP application in ZF. Customers will use it to sell their products to final customers. Customers will host their application on my server or they could use their own. Most of them will host this application on my server.
I could design one database for all customers at once, so every customer will use the same database, but of course products etc. will be assigned to particular customer. Trivial.
I could use separate database for every customer, so the database structure will be simpler. I will then probably use separate subdomains and maybe even file location, but that is just a detail.
Which solution will have better performance and how big will be the difference? Which one would you choose?
I would use a separate database for each customer. It makes backup and scaling easier. If you ever get a large customer that needs some custom changes to the schema, you can do it easily.
If one customer needs you to restore their data, with a database per customer it is trivial. On a shared db, it is much harder.
And if that large customer ever gets a lot of traffic, you can easily put them on another server with minimal changes.
If one site gets compromised, you don't have all of the data for everyone in one place; the damage is limited to just the site that was hacked.
I'd definitely recommend going with 1 db per customer if possible.
Personally, I would go with multiple databases - i.e. a database for each client.
As I understand it all your clients will be using just an instance of your application so these instances should have their own databases.
If you go with a single database, you are creating a great potential security risk. One client compromising the login details to the db server would automatically compromise data of all your clients.
Also a single security vulnerability (a SQL injection attack) could destroy data of all clients (with multiple dbs you could still have time to fix the security hole and release a patch before all other sites are attacked).
You don't want to have an army of 1000000 mad clients instead of just 1 angry client.
Multiple databases also give you a greater possibility of load balancing (you can have the dbs spread across more servers).
Performance-wise you're basically starting with a 'sharding' approach. Because of this, applying a sharding performance strategy later will be a piece of cake.
The downside is that you could argue you're losing some (undefined) bit of overhead in the duplication.
One pitfall is that you might not notice performance issues in major components as quickly. This is because they are so scattered, so they might not be visible on your radar. Load testing is the way to get ahead of this.
To some extent this is a question of personal opinion. There are pros and cons of both models.
Personally, and because of the "they could use their own" comment, I would go with a separate database per customer. This gives you:
The ability to move customer data around when necessary, for example moving a single customer onto a different server/setup depending on things like load.
If something goes wrong you only impact one customer and not everybody.
You can spread DB load across multiple DB servers if necessary.
If a customer comes to you with a specific requirement, you can more easily cater for this without impacting other customers.
From a performance perspective, to be honest I don't think there is any real performance gain in either model. That said, this does of course depend on the structure of your DB and the hardware it runs on.
Don't choose the multiple-database solution if your needs can be fulfilled with one database. Multiple databases will become a big burden in the long run, and your system will become highly complicated and unmanageable as you grow.
Using proper relationships you can go a long way.
A Client model can have many Products // why multiple databases?
Performance can be achieved either way; just going with multiple DBs will NOT help in that direction.