I'm part of a team developing a dedicated SaaS service for a specific kind of organization.
I know PHP and MySQL, so I'm building it on those platforms. It will be deployed in the cloud and released in multiple countries.
I've reached the point of deciding how to separate organizations/main users in the database, and I wanted to see what you all think.
Given that the SaaS manages invoices and a lot of other sensitive information, what would be the best way to lay it out on the MySQL server? I've thought of the options below:
1) Keep everything in a single database in shared tables, separated by an organization-identifying column. This seems secure, but might it get slow once there are a few thousand users and 10,000+ rows?
2) Keep a single database, but give each organization its own set of tables prefixed with an ID, e.g. 1000_invoices. Probably faster, but it doesn't feel as secure.
3) Create a separate database on each organization's signup, with a dedicated MySQL user to access it, and keep the database name in the session/cookie(?) for that organization's users. (Rough sketch of options 1 and 3 below.)
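To make the options concrete, here's roughly what I mean for 1) and 3) in PHP/PDO; every table, column, and database name here is made up:

```php
<?php
// Option 1: shared tables, every query filtered by an org ID column.
$currentOrgId = (int) $_SESSION['org_id'];
$pdo = new PDO('mysql:host=localhost;dbname=saas', 'app_user', 'secret');
$stmt = $pdo->prepare('SELECT * FROM invoices WHERE org_id = ? AND status = ?');
$stmt->execute([$currentOrgId, 'unpaid']);
$invoices = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Option 3: one database per organization. The DB name is derived
// from the server-side session (safer than trusting a cookie value).
$dbName = 'org_' . (int) $_SESSION['org_id'];   // e.g. org_1042
$pdo = new PDO("mysql:host=localhost;dbname=$dbName", 'app_user', 'secret');
$invoices = $pdo->query("SELECT * FROM invoices WHERE status = 'unpaid'")
                ->fetchAll(PDO::FETCH_ASSOC);
```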
Anyway, I was wondering which you all think is the best option, and if none of the above, what would you recommend, and why? Anything regarding security would also be greatly appreciated; I haven't worked on large multi-organization applications before.
Thanks in advance!
I've developed numerous SaaS applications in the past, and we've found the "single application deployment, single database" setup used by large "public" SaaS services (like KashFlow, possibly Salesforce?) didn't make much sense. Here's why:
1. Client companies with confidential information will want assurances that their data is "secure", and it's easier to make those promises when their data is partitioned below the application tier.
2. Different clients sometimes want the application customized for them: their own extra database fields, a different login-screen design, or different (or custom) system "modules". Having separate application instances makes this possible.
3. It also makes scaling easier, at least in the beginning. It's easy to load-balance by provisioning a single client's application onto its own server, whereas a single shared application means spending longer developing it to make it scalable.
4. Database primary keys are easier to work with. In a shared database, auto-increment values are interleaved across all tenants, so your customers might start asking why their "CustomerID" values jump by 500 each time instead of incrementing by 1.
5. If you've got customers located around the world, it's easier to provision a deployment in another country (with its own local application and DB servers) than to deploy one giant application abroad or force users onto intercontinental (i.e. slow, high-latency) connections to servers thousands of miles away.
There are downsides, such as the extra administrative burden of managing hundreds, possibly thousands, of databases on top of the application deployments, but the vast simplification of the application code makes it worthwhile. Besides, provisioning and deployment are easy to automate with a handful of scripts.
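As a rough illustration of that automation, a per-client provisioning step might look something like this in PHP; every name and credential here is a placeholder, not a prescription:

```php
<?php
// provision_tenant.php -- sketch of automated per-client provisioning.
// Run with a privileged MySQL account; all names are placeholders.

function provisionTenant(PDO $admin, int $tenantId, string $password): void
{
    // Identifiers can't be bound as parameters, so they are built
    // strictly from an integer ID, never from raw user input.
    $db   = sprintf('tenant_%d', $tenantId);
    $user = sprintf('tenant_user_%d', $tenantId);

    $admin->exec("CREATE DATABASE `$db` CHARACTER SET utf8mb4");
    $admin->exec("CREATE USER '$user'@'%' IDENTIFIED BY " . $admin->quote($password));
    $admin->exec("GRANT SELECT, INSERT, UPDATE, DELETE ON `$db`.* TO '$user'@'%'");

    // Then load the current schema into the new database, e.g. by
    // shelling out to the mysql CLI (PDO::exec() is unreliable for
    // multi-statement dump files).
}

$admin = new PDO('mysql:host=localhost', 'root', 'admin-password',
                 [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
provisionTenant($admin, 1042, bin2hex(random_bytes(16)));
```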
Team,
We are building a web project (an IT ticketing system, roughly), and we expect to land some big clients soon after we release the product. There will be three ways to raise a ticket: 1) via the web application (forms), 2) via email, or 3) via a phone call to an agent. According to our research, 99% of tickets arrive via email, which means we will be storing a lot of long messages.
The project is scoped so that we have two interfaces: agents (the IT folks handling queries) and clients (the people asking for help).
The question is what you would suggest we do, considering the expected data and storage growth:
1) Centralize everything, so that we have one app with a single huge database (easy to back up, etc., unless we run into, e.g., data corruption or similar)...
2) Split the application in two: one centralized interface and back-end for IT agents, and another one for clients. For each client we would create a separate database along with a copy of the PHP project (code syncing is easy to automate). Multiple client instances could be hosted on one or many servers, and they would communicate via APIs. For example: an IT agent opens a dashboard and the list of outstanding tickets is displayed; if that agent works with 10 big clients, the back-end would need to contact 10 instances via API and request their outstanding tickets. We can make sure only a certain number of results is displayed...
Please feel free to add a third option as well.
I'm not quite sure I understood everything correctly, but from what I did understand, I can point out the following key points about your system requirements:
You are dealing with a lot of data, and the data will grow fast
Most of the traffic comes from the email ticketing system
You have a multi-client system
You have agents who can view data from multiple clients
The question is: can an agent also manipulate (create, update, delete) data from multiple clients? This point is quite important for the future limitations of the architecture. I will assume agents can only read data across multiple clients.
Regarding your two suggestions. For the first one (one app, one huge database): I would not recommend that approach, as many problems will arise as the database grows. For example, you will be forced to add indexes to speed up queries, which helps in the beginning, but will come back to haunt you later, especially if you have to add a lot of non-clustered indexes. You can make it a little better with read-only replicas, but even then you will hit issues at some point, because the core problem remains: one main database that keeps growing.
As for your second suggestion (splitting the app in two and creating a separate database plus a copy of the PHP project per client):
You can split it into two separate apps, as you said:
Centralized interface + back-end, which calls one or more databases
Client application + back-end (monolith or multiple services), which calls the same database as the centralized interface, but only for the current client
As far as I understood, your problem is not scaling the web servers (your back-end) but the DB? If scaling the back-end is a problem as well, you can consider either scaling out to multiple instances, or splitting your domain into micro-services and scaling that architecture at the level of each service independently.
My Suggestion:
1. Scaling your back-end:
You can keep everything in one service (the monolithic approach), deploy it on multiple servers, and scale the whole service together. There is nothing wrong with this. Like everything, it depends on your business/domain requirements and what works best for you. Although micro-services are very popular these days, they are not the best solution for every problem; I have worked with both types of architecture, and each has worked fine in different scenarios.
You can even take a middle ground between the two: extract the specific part with high scaling demand into a separate service (like a ticket-creation sub-system), and keep the rest of the application, which has low demand, as one big service.
2. Scaling your database:
Considering the points above, I would suggest using data sharding or data partitioning. You can read about data sharding here. In general, it is a way to logically and physically split your data from one database into multiple databases, based on a partitioning or shard key. This means you can take one specific concept in your domain as the shard key and split the data based on it.
In your case this could be CustomerId. This only works if business operations involving more than one customer are not a requirement for your business, i.e., all your operations are performed within a single customer's scope. The only exception would be reading/viewing multiple customers together, which is fine because it doesn't need any transactional behavior. This really depends on your business scenarios and logic.
If splitting your database into multiple databases on the shard key CustomerId is not enough, you can pick a shard key that is even more fine-grained, inside the customer scope; again, this depends on whether your domain allows it. For example, CustomerA could have a CustomerA-Europe shard, a CustomerA-USA shard, a CustomerA-Africa shard, and so on. These represent the logical shards; the physical shard is the physical database. The important point is to pick your logical shard key at the beginning, so that later, when you need to, you can easily migrate your data to different physical databases based on that key.
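A minimal sketch of what shard-key routing could look like on the PHP side, assuming CustomerId is the shard key and a hand-maintained map from logical shards to physical connection strings (all DSNs and names are illustrative):

```php
<?php
// Shard-routing sketch: the set of *logical* shards is fixed up front;
// which physical server hosts each one can change later.

const SHARD_MAP = [
    // logical shard => physical DSN
    0 => 'mysql:host=db-a.internal;dbname=tickets_shard0',
    1 => 'mysql:host=db-a.internal;dbname=tickets_shard1',
    2 => 'mysql:host=db-b.internal;dbname=tickets_shard2',
    3 => 'mysql:host=db-b.internal;dbname=tickets_shard3',
];

function shardFor(int $customerId): PDO
{
    // Stable mapping from shard key to logical shard.
    $logical = $customerId % count(SHARD_MAP);
    return new PDO(SHARD_MAP[$logical], 'app_user', 'secret');
}

// Every operation is scoped to one customer, so it hits exactly one shard.
$customerId = 7;
$db = shardFor($customerId);
$stmt = $db->prepare('SELECT * FROM tickets WHERE customer_id = ? AND status = ?');
$stmt->execute([$customerId, 'open']);
$openTickets = $stmt->fetchAll(PDO::FETCH_ASSOC);
```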
In addition, you could introduce historization for some heavy tables, to separate the up-to-date data from the historical data. You can read more about this here.
I have a question regarding databases and performance, so let me explain the situation.
The application (still to be built) has the following setup:
A group, and under that group, users.
Data / file locations (which are searched through); we estimate that one group can easily reach one million "search" terms.
Now, groups can never look at each other's data, and users can only look at the data belonging to their own group.
The only thing they should have in common is some place to send error logs to (maybe; perhaps not even necessary).
In this situation, would you create a new database per group, or always constrain your search queries with the user's group ID?
My idea was to just create a new database per group, because then you don't need to add the filter to every single query, and it keeps the result set to search through smaller(?). But is that really necessary, or is a "WHERE groupid = 1" fast enough, even on over a million records, that you wouldn't notice a decrease in performance?
This is the classic multi-tenant SaaS architecture problem, which has been discussed at length, and the solution always varies with your particular situation. Here is one example of this discussion that I will just link to instead of copy-pasting, since all of it is worth a read: Multi-tenant PHP SaaS - Separate DB's for each client, or group them?
In addition to that I would like to add some more high level considerations:
Are there any legal requirements regarding the storage of your user's data? Some businesses operate in a regulatory environment where they are not allowed to store their data in a shared environment, quite common in the financial and medical industries.
Will you offer the same security (login method, data storage encryption), backup/restore service, geolocation redundancy and up-time guarantee to all users?
Are there any users who are willing to pay extra to have their data stored in a separate environment?
Are there any users who will potentially have requirements that are not compatible with the standard product that you will be offering? If so will you try to accommodate them? Note that occasionally there is some big customer that comes along and offers a lot of cash for a special treatment.
What is a separate environment? Is it a separate database, a separate virtual machine, a separate physical machine, a machine managed by the customer?
What parts of your application are part of each environment (hardware configuration, network config, database, source code, binaries, encryption certificates, etc.)?
Will there be some heavy users that may produce loads on your application that will negatively impact the performance for the smaller users?
If you put all users in one environment, is there a possibility that you will later create a separate environment for some customer? If so, this will affect where you put shared data, e.g. configuration data like tax rates, exchange rates, etc.
I hope this helps.
Performance isn't really your problem; maintenance and data security are. If you have a lot of databases, you will have more to maintain: not only backups, but connection strings, patches, schema updates on each release, and so on. Multiple databases also suggest that you will have multiple PHP sites, which gradually gets more expensive as the number of groups grows.
If you have one database, then you need to ensure that every query includes the group ID before it can run.
Database tables can be very, very large if you choose your indexes and constraints carefully. Joins against very large tables will be slow, but a simple lookup with an index on the group column should be fast enough.
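As a sketch of both points, the enforced group filter and the index that makes it cheap, with made-up table and column names:

```php
<?php
// Sketch: a composite index with the group column first turns
// "WHERE group_id = ?" into an index seek rather than a table scan.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret');

// One-off DDL -- normally in a migration script, not application code:
$pdo->exec('ALTER TABLE search_terms
            ADD INDEX idx_group_term (group_id, term)');

// Every lookup carries the group ID, so a user can never see
// another group's rows even on a million-row table.
$groupId = (int) $_SESSION['group_id'];
$stmt = $pdo->prepare(
    'SELECT term, location FROM search_terms WHERE group_id = ? AND term LIKE ?'
);
$stmt->execute([$groupId, 'inv%']);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
```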
If you were to share a single database, would you ever move a group out of it? If that's a realistic possibility, then split the databases now. If you are going to have one PHP site, I would recommend a single database with a group column.
I'm new to PHP/MySQL, and I'm coding a simple CMS. In this case, though, it will host multiple companies (each with their own users), each paying a fee to use the system.
So my question is about how to organize the database. In terms of security, management, and performance, I just want your opinion on which of these options is best:
Host all companies in a single DB, with a company ID to match them to their users.
Give each company a separate DB that holds its users (no company ID needed anymore).
I would have started development along the first option, but then I thought: if I suffer a hacker attack / SQL injection, every client would be harmed. With separate DBs, the damage hits only one client. So maybe the second option is better in terms of security, though I couldn't say the same about management and performance.
So, based on your experience, any help or tips would be great!
Thanks in advance, and sorry about my poor English.
I would go for separate DBs, and not only because of hacking.
Scalability:
Let's say you have a server that handles 10 websites, but one of them is growing fast in requests, content, etc., and your server is having a hard time hosting them all. With separate DBs it is a piece of cake to spread them over multiple servers. With a single one, you would have to upgrade your current DB server or cluster it, and that is sometimes not possible with your hosting company, or very expensive.
Performance:
If they are all in one DB and the data of multiple users sits in one table, locks might slow down other users. Large tables mean large indexes, large lookups, etc., so splitting into different DBs would actually speed things up. You do have to accept some extra memory and CPU overhead per DB, but it normally doesn't have a huge impact.
And yes, managing multiple DBs is more work, but having proper update scripts and keeping a close eye on the versions of the DB schema will reduce your management concerns a lot.
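For instance, such an update script might look roughly like this, assuming a one-row schema_version table in every tenant database and numbered single-statement migration files (all names here are made up):

```php
<?php
// migrate_all.php -- sketch: keep every tenant DB on the same schema
// version. Assumes each DB has a one-row schema_version table and that
// migrations/<n>.sql each contain a single statement.

$tenantDbs = ['tenant_1001', 'tenant_1002', 'tenant_1003']; // from a registry

$files = glob(__DIR__ . '/migrations/*.sql');
usort($files, fn ($a, $b) =>
    (int) basename($a, '.sql') <=> (int) basename($b, '.sql')); // numeric order

foreach ($tenantDbs as $db) {
    $pdo = new PDO("mysql:host=localhost;dbname=$db", 'admin', 'secret',
                   [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
    $current = (int) $pdo->query('SELECT version FROM schema_version')
                         ->fetchColumn();

    foreach ($files as $file) {
        $version = (int) basename($file, '.sql');
        if ($version <= $current) {
            continue;                   // already applied to this tenant
        }
        $pdo->exec(file_get_contents($file));
        $pdo->exec("UPDATE schema_version SET version = $version");
        echo "$db: applied migration $version\n";
    }
}
```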
Update: also see this article: http://msdn.microsoft.com/en-us/library/aa479086.aspx
Separate DBs have many advantages, including performance, security, scalability, mobility, etc. There is more risk and less reward in trying to pack everything into one database, especially when you are talking about separate companies' data.
You haven't provided any details, but generally speaking, I would opt for separate databases.
Using an autonomous database for every client allows a finer degree of control, as it is possible to manage/back up/trash each one individually without affecting the others. It also requires less grooming, since the data is easier to tell apart and one database cannot break the others.
Not to mention it makes the development process easier: with separate databases you don't have to verify the "owner" of the rows in every query.
If you plan to host this in a cloud environment such as Azure databases, where resources are (relatively) cheap, all clients run the same code base, the database schema is the same (obviously), and there is the possibility of sharing some data between the companies, then a multi-tenant database may be the way to go. For anything else, you will probably be creating a lot of extra work by going with a multi-tenant database.
Keep in mind that if you go the separate databases route, trying to migrate to a multi-tenant cloud solution later on is a HUGE task. I only mention this because all I've been hearing for the past few years around the IT water coolers is "Cloud! Cloud! Cloud!".
Is there any difference between a CMS and high-traffic websites (like news portals) in logic, database design, and optimization (PHP and MySQL)?
I have searched Stack Overflow for PHP site scalability, and memcached comes up in the majority of answers.
Are there techniques for MySQL optimization? (I'm looking for a book on this; I searched Amazon, but I don't know what the best choice is.)
Thanks in advance
This isn't so easy to answer.
There are different approaches and a variety of opinions, but I'll try to cover some common scenarios. First, some basics.
Most web applications can be separated into an application tier and a database tier.
Database usage can be separated into transactional (OLTP) and analytical (OLAP) work.
In the best case, you can just start a number of application servers and distribute traffic among them. They all connect to the same database server and can work independently.
This can be difficult, however, if you have other shared state: sessions, etc.
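A common fix for the session part is to move PHP sessions off the local filesystem into a shared store, so that any application server can handle any request. A sketch using the memcached session handler (the hostname is a placeholder, and this would normally go in php.ini):

```php
<?php
// Shared session storage so load-balanced app servers are interchangeable.
// Requires the php-memcached extension; the host is a placeholder.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'sessions.internal:11211');
session_start();

$_SESSION['user_id'] = 42;   // now visible from every application server
```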
You can distribute the traffic by simply adding multiple IP addresses to your domain name in DNS, or by using load-balancing techniques to forward clients to different servers.
Application scaling is generally very easy; the database is much more complex.
The first thing to do is usually to set up one or more replication servers, which hold the same data as the main database. They can be cascaded, but they have one serious disadvantage: their data is not always up to date. In general it is no more than a few seconds old, but it can be more under load. For many use cases this is fine.
Big sites that mostly just display information can replicate their database to some slave servers, set up some application servers (it's good practice to run one slave and one application server on the same machine and have that application server read from its local slave), and everything is fine.
Every OLAP query can be directed to a slave: OLAP queries are those that don't modify anything and don't need 100% up-to-date data.
But everything still needs to be written to the single source database from which every other server gets its copy, for example every comment on an article.
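In PHP terms, that read/write split is just two connections; a sketch with placeholder DSNs:

```php
<?php
// Read/write split sketch: writes go to the master, reads that can
// tolerate a few seconds of staleness go to a replica.
$master  = new PDO('mysql:host=db-master.internal;dbname=site', 'app', 'secret');
$replica = new PDO('mysql:host=db-slave1.internal;dbname=site', 'app', 'secret');

// Write path: comments must land on the master.
$stmt = $master->prepare('INSERT INTO comments (article_id, body) VALUES (?, ?)');
$stmt->execute([123, 'Nice article!']);

// Read path: a slightly stale comment list is acceptable.
$stmt = $replica->prepare(
    'SELECT body FROM comments WHERE article_id = ? ORDER BY id DESC LIMIT 50'
);
$stmt->execute([123]);
$comments = $stmt->fetchAll(PDO::FETCH_COLUMN);
```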
If this bottleneck gets too tight, you can go in two directions:
sharding
master-master replication
Sharding means the application server decides where to store and where to fetch each piece of data.
For example, every comment that starts with A goes to server A, B to server B, and so on.
That's a contrived example, but it's basically how it works; in practice some internal IDs are usually involved.
If possible, it's good to shard data so that each item can be pulled back completely from a single server.
In the example above, if I wanted all the comments for an article, I would have to ask every server A-Z and merge the results. This is inefficient but possible, because those servers can themselves be replicated. This is called mapping (see Google's famous MapReduce algorithm, which basically does exactly this).
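A sketch of that scatter-and-merge step, assuming one PDO connection per shard (all DSNs are illustrative):

```php
<?php
// Scatter-gather sketch: when the shard key (first letter of the
// comment) doesn't match the access pattern (comments per article),
// ask every shard and merge the results.
$shards = [
    new PDO('mysql:host=shard-a.internal;dbname=comments', 'app', 'secret'),
    new PDO('mysql:host=shard-b.internal;dbname=comments', 'app', 'secret'),
    new PDO('mysql:host=shard-c.internal;dbname=comments', 'app', 'secret'),
];

$articleId = 123;
$all = [];
foreach ($shards as $shard) {                                   // "map" step
    $stmt = $shard->prepare(
        'SELECT id, body, created_at FROM comments WHERE article_id = ?'
    );
    $stmt->execute([$articleId]);
    $all = array_merge($all, $stmt->fetchAll(PDO::FETCH_ASSOC));
}

// Merge/"reduce" step: one chronologically ordered list.
usort($all, fn ($a, $b) => $a['created_at'] <=> $b['created_at']);
```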
Master-master replication means you write your data to several master servers and they synchronize with each other; the data isn't stored separately per server as with sharding.
This is what you do when your application cannot decide on its own where to store and fetch data.
You just write to any master server, every server ends up with everything, and everybody is happy?
No... because this involves another serious problem.
Conflicts! Imagine two users each enter a comment: commentA is stored on serverA, commentB on serverB. Which ID should each one get? Which one comes first?
The best approach is to design the application so that these cases are avoided, using distinct key ranges and so on.
What usually happens otherwise is conflict resolution and prioritization; Oracle has a lot of features at this level, and MySQL is still behind. But the trend is towards much more complex data structures like clouds anyway...
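Two common ways of avoiding those key conflicts, as an illustrative sketch (table and column names are made up):

```php
<?php
// 1) Generate the key in the application instead of relying on
//    AUTO_INCREMENT: a random 128-bit ID is collision-safe in practice,
//    no matter which master receives the write.
$pdo = new PDO('mysql:host=db-master-a.internal;dbname=site', 'app', 'secret');
$id = bin2hex(random_bytes(16));
$stmt = $pdo->prepare('INSERT INTO comments (id, article_id, body) VALUES (?, ?, ?)');
$stmt->execute([$id, 123, 'hello']);

// 2) Or interleave the AUTO_INCREMENT ranges per server (set in my.cnf):
//      server A: auto_increment_increment=2, auto_increment_offset=1 -> 1,3,5,...
//      server B: auto_increment_increment=2, auto_increment_offset=2 -> 2,4,6,...
//    so the two masters can never hand out the same ID.
```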
Well, I don't think I explained it all that well, but you should at least get some keywords out of this text that you can investigate further.
Sure, there are all sorts of things you can do to optimize your PHP/MySQL web applications for high-traffic websites. However, most of them depend on your specific situation, which you haven't described in your question.
Your database should be well structured regardless of whether you have a high-traffic site or not. If you use an off-the-shelf CMS, this is typically fine. Aside from good application architecture, there is no one-size-fits-all solution.
I am currently in a debate with a coworker about the best practices concerning the database design of a PHP web application we're creating. The application is designed for businesses, and each company that signs up will have multiple users using the application.
My design methodology is to create a new database for every company that signs up. This way everything is sandboxed, modular, and small. My coworker's philosophy is to put everyone into one database. His argument is that if we get 1000+ companies signing up, we wind up with 1000+ databases to deal with, not to mention the mess that doing business intelligence becomes.
For the sake of example, assume that the application is an order entry system. With separate databases, table size can remain manageable even if each company is doing 100+ orders a day. In a single-bucket application, tables can get very big very quickly.
Is there a best practice for this? I tried hunting around the web, but haven't had much success. Links, whitepapers, and presentations welcome.
Thanks in advance,
The1Rob
I talked to the database architect from wordpress.com, the hosting service for WordPress. He said that they started out with one database, hosting all customers together. The content of a single blog site really isn't that much, after all. It stands to reason that a single database is more manageable.
This worked well for them until they got into the hundreds and then thousands of customers and realized they needed to scale out, running multiple physical servers with a subset of customers hosted on each one. When they add a server, it's easy to migrate individual customers to it, but much harder to separate out the data belonging to an individual customer's blog from a single shared database.
As customers come and go, and some customers' blogs see high-volume activity while others go stale, rebalancing across multiple servers becomes an even more complex maintenance job. Monitoring size and activity per individual database is easier, too.
Likewise, doing a database backup or restore of a single database containing terabytes of data, versus individual backups and restores of a few megabytes each, is an important factor. Consider: a customer calls, says their data got SNAFU'd due to some bad data entry, and asks you to please restore the data from yesterday's backup. How would you restore one customer's data if all your customers share a single database?
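As a sketch of the per-customer backup/restore story (assuming one database per customer named tenant_<id>, with credentials in ~/.my.cnf; all paths are placeholders):

```php
<?php
// backup_tenants.php -- sketch: per-customer dumps stay small and can
// be restored independently. Assumes mysqldump can authenticate via
// ~/.my.cnf; database names come from a registry, not user input.
$tenantDbs = ['tenant_1001', 'tenant_1002'];
$stamp = date('Y-m-d');

foreach ($tenantDbs as $db) {
    $cmd = sprintf(
        'mysqldump --single-transaction %s | gzip > %s',
        escapeshellarg($db),
        escapeshellarg("/backups/$db-$stamp.sql.gz")
    );
    exec($cmd, $output, $status);
    echo $status === 0 ? "$db backed up\n" : "$db FAILED\n";
}

// Restoring one customer touches only that customer's file:
//   gunzip < /backups/tenant_1001-2024-01-01.sql.gz | mysql tenant_1001
```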
Eventually they decided that splitting into a separate database per customer, though complex to manage, offered them greater flexibility, and they re-architected their hosting service around this model.
So, while from a data-modeling perspective it seems like the right thing to keep everything in a single database, some database administration tasks become easier once you pass a certain breakpoint in data volume.
I would never create a new database for each company. If you want a modular design, you can achieve it with tables and properly connected primary and foreign keys. This is where I learned about database normalization, and I'm sure it will help you out here.
This is the method I would use. SQL Article
I'd have to agree with your co-worker. Relational databases are designed to handle large amounts of data, and the numbers you're talking about (1000+ companies, multiple users per company, 100+ orders/day) are well within the expected bounds. Separate databases means:
multiple database connections in each script (memory and speed penalty)
maintenance is harder (DB systems generally do not provide tools for acting on databases as a group), so schema changes, backups, and similar tasks will all be more difficult
harder to run queries on data from multiple companies
If your site becomes huge, you may eventually need to distribute your data across multiple servers. Deal with that when it happens. To start out that way for performance reasons sounds like premature optimization.
I haven't personally dealt with this situation, but I would think that if you want to do business intelligence, you should aggregate the data into an offline database that you can then run any analysis you want on.
Also, keeping them in separate databases makes it easier to partition across servers (which you will likely have to do if you have 1000+ customers) without resorting to messy replication technologies.
I had a similar question a while back and came to the conclusion that a single database is drastically more manageable. Right now we have multiple databases (around 10), and it is already becoming a pain to manage, especially when we upgrade the code: we have to migrate every single database.
The upside is that the data is segregated cleanly. Given the sensitivity of our data this is a good thing, but it does make things quite a bit more difficult to keep up with.
The separate-database methodology has one very big advantage over the other:
+ You can break the system up into smaller groups; this architecture scales much better.
+ You can set up standalone servers in an easy way.
That depends on how likely your schemas are to change. If they ever have to change, will you be able to safely make those changes to 1000 separate databases? And if a scalability problem is found in your design, how are you going to fix it across 1000 databases?
We run a SaaS (Software-as-a-Service) business with a large number of customers and have elected to keep all customers in the same database. Managing thousands of separate databases is an operational nightmare.
You do have to be very diligent in creating your data model and the business objects / reporting queries that access it. One approach you may want to consider is to carry the company ID in every table and ensure that every WHERE clause includes the company ID of the currently logged-in user. If you use a data access layer, you can enforce that condition there.
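A sketch of what that enforcement could look like in a small PHP data access layer; the class, table, and column names are all illustrative:

```php
<?php
// Sketch: append the tenant filter in one place so it can never be
// forgotten by a caller. Names are illustrative, not prescriptive.
class TenantDb
{
    private PDO $pdo;
    private int $companyId;

    public function __construct(PDO $pdo, int $companyId)
    {
        $this->pdo = $pdo;
        $this->companyId = $companyId;
    }

    // Callers pass a WHERE clause *without* the tenant filter; the
    // company_id condition is added here. Table names come from code,
    // never from user input.
    public function select(string $table, string $where, array $params): array
    {
        $sql = "SELECT * FROM `$table` WHERE company_id = ? AND ($where)";
        $stmt = $this->pdo->prepare($sql);
        $stmt->execute(array_merge([$this->companyId], $params));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}

$db = new TenantDb(
    new PDO('mysql:host=localhost;dbname=saas', 'app', 'secret'),
    (int) $_SESSION['company_id']
);
$openOrders = $db->select('orders', 'status = ?', ['open']);
```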
As you grow large, you can still partition by placing groups of companies on each physical server, e.g. the first 100 companies on server A, the next 100 companies on server B.