Duplicate tables for deploying a PHP MySQL app to multiple customers

I am taking a PHP MySQL app which was built for one customer and deploying it to be used by multiple customers. Each customer account will have many users (30-200), each user can have several classes, each class has several metrics, each metric contains several observations. Several means about 2-8.
Originally I was planning to have one instance of the application code which would connect to the appropriate table set for that customer based on a table prefix. But I am now considering using only one set of tables for all my customer accounts. This would simplify the application design, which would be best in the long run. My question is whether I would be taxing the database server by combining all the customer data into the same tables. Most queries will be SELECTs, but due to the nature of the schema there can be quite a few JOINs required. Most INSERT or UPDATE queries affect just one row in one table, and possibly one or two bridge-entity tables at most.
I know this is one of those "it depends" questions but I am hoping to get a little guidance regarding how slow/fast MySQL is with what I am trying to do.
Here is an example of the longest JOIN query I would be doing.
SELECT $m_measure_table_name.*,
       $m_metric_table_name.metric_name, $m_metric_table_name.metric_descrip, $m_metric_table_name.metric_id,
       $c_class_table_name.class_size, $c_class_table_name.class_id, $c_class_table_name.class_field, $c_class_table_name.class_number, $c_class_table_name.class_section,
       $lo_table_name.*, $lc_table_name.*,
       $user_table_name.user_name, $user_table_name.user_id,
       $department_table_name.*
FROM $m_measure_table_name
LEFT JOIN $m_metric_table_name ON $m_measure_table_name.measure_metric_id=$m_metric_table_name.metric_id
LEFT JOIN $c_class_table_name ON $m_metric_table_name.metric_class_id=$c_class_table_name.class_id
LEFT JOIN $lo_table_name ON $m_metric_table_name.metric_lo_id=$lo_table_name.lo_id
LEFT JOIN $lc_table_name ON $lo_table_name.lo_lc_id=$lc_table_name.lc_id
LEFT JOIN $class_user_table_name ON $c_class_table_name.class_id=$class_user_table_name.cu_class_id
LEFT JOIN $user_table_name ON $user_table_name.user_id=$class_user_table_name.cu_user_id
LEFT JOIN $department_class_table_name ON $c_class_table_name.class_id=$department_class_table_name.dc_class_id
LEFT JOIN $department_table_name ON $department_class_table_name.dc_department_id=$department_table_name.department_id
WHERE $c_class_table_name.class_semester=:class_semester AND $c_class_table_name.class_year=:class_year
AND $department_table_name.department_id=:id
ORDER BY $department_table_name.department_name, $lc_table_name.lc_name, $lo_table_name.lo_id
Ultimately my question is whether doing long chains of JOINs like this on primary keys is taxing to the database, and whether using one set of tables seems like the better approach to deployment.

This is too long for a comment.
SQL is designed to perform well on tables with millions of rows, assuming you have appropriate indexing and table partitioning. I wouldn't worry about data volume being an issue in this case.
However, you may have an issue with security. You probably don't want different customers to see each other's data. Row-level security is a pain in SQL. Table-level is much easier.
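Purely as an illustration of the shared-table, row-scoped approach (the table and column names below are hypothetical, not taken from your schema): each row carries the customer account it belongs to, a composite index keeps tenant-scoped lookups cheap, and every query filters on that column.
CREATE TABLE measures (
    measure_id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    account_id        INT UNSIGNED NOT NULL,                 -- hypothetical tenant column: which customer owns this row
    measure_metric_id INT UNSIGNED NOT NULL,
    measure_value     DECIMAL(10,2),
    PRIMARY KEY (measure_id),
    KEY idx_account_metric (account_id, measure_metric_id)   -- composite index so tenant-scoped lookups skip other customers' rows
);
SELECT m.*
FROM measures AS m
WHERE m.account_id = :account_id                             -- every query is scoped to the caller's account
  AND m.measure_metric_id = :metric_id;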
Another approach is to create a separate database for each customer. In addition to the security advantages, this also allows you to move different customers to different servers to meet demand.
It does come at a cost. If you have common tables, then you need to replicate them or have a "common tables" database. And, when you update the code, then you need to update all the databases. The latter may actually be an advantage as well. It allows you to move features out to customers individually, instead of requiring all to upgrade at the same time.
EDIT: (about scaling one database)
Scaling should be fine for one database, in general. Databases scale; you just have to throw more hardware at the problem, essentially within a single server. You will need judicious use of indexes for performance, and possibly partitions if the data grows quite large. With multiple databases you can throw more "physical" servers at the problem; with one database you throw "bigger" servers at the problem. (Those are in quotes because many servers nowadays are virtual anyway.)
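Building on the hypothetical measures sketch above, partitioning by customer could look something like this if the data grows very large. MySQL requires the partitioning column to appear in every unique key, so the primary key is widened first (again, a sketch, not a recommendation for your exact schema):
ALTER TABLE measures
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (measure_id, account_id);    -- the partition column must be part of every unique key
ALTER TABLE measures
    PARTITION BY HASH (account_id) PARTITIONS 8; -- spreads customers across 8 physical partitions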
As an example of the difference: if you have 100 clients, then you can back up the 100 databases at times convenient to them, and all in parallel. And, if the databases are on separate servers, the backups won't interfere with each other. With a single database, you back up once and it affects everyone at the same time. And the backup may take longer because you are not running separate jobs (backups can take advantage of parallelism).

Related

Will splitting one large MySQL query into multiple smaller queries help mitigate table locking / force a script to "yield" to other PHP scripts?

We have a large number of tables in our company's MySQL database, each representing different products (and/or history/transactions for those products) plus a "main" table for parent establishments. Almost all of these tables are using MyISAM (changing everything to InnoDB might help but it's not an option at the moment).
We have a "filter" tool in our backend for finding establishments that match certain criteria. The results are printed in tabular format with all data available for that establishment (ID, name, which products they do/don't have, how many transactions, etc. etc.) and currently this is achieved with a very large MySQL statement with many JOINs.
We had a situation last week where a particularly large filter was run during peak business hours and the resulting READ LOCKs on the dependent tables (via the aforementioned JOIN statements) caused the entire database to stop responding for almost 30 minutes, even though the filter in question only takes ~43s to run on its own (locally, anyway). Very bad.
While important, this filter tool is only used by a few people on the team and not by clients. The speed/performance of this filter tool is not critical, nor is it the goal of this question. I would prefer for this tool to "yield" to other apps that need access to these tables rather than force them to wait until the entire filter has finished.
Which brings me to my question: will splitting one large query (with multiple JOINs) into multiple smaller queries help mitigate table locking and force a script to "yield" to other, higher-priority scripts that might need access to the same tables in between the smaller queries?
Disclaimer: I have reviewed so many other questions here on StackOverflow and on other sites via Google over the last week and they're all interested in speed. That is not what I am asking. If this is a duplicate I apologize and it can be locked, but please provide a link to it so that I may use it. Thank you!
EDIT: I appreciate the comments thus far and the additional information/ideas they provide, though none have answered the question unfortunately. I'm in a position at the company where I have control over the filter's code and that's it. I cannot change the database engine, I cannot initiate replication or create data warehouses, and I'm already aware that MyISAM is the inferior choice for tables, but I don't have control over that. Thank you.
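For reference, here is a hedged sketch of what "splitting into smaller queries" could look like; the schema isn't shown above, so establishments and product_a_transactions below are hypothetical names. The idea is to resolve the driving filter into a temporary table first, then join each dependent table against that small result set in its own statement, so each MyISAM table is read-locked only for the duration of its own short query rather than for the whole ~43-second run.
CREATE TEMPORARY TABLE tmp_filter_matches AS
SELECT e.establishment_id
FROM establishments AS e
WHERE e.region = 'XYZ' AND e.active = 1;              -- hypothetical filter criteria; short lock on the main table only
SELECT t.establishment_id, COUNT(*) AS transaction_count
FROM tmp_filter_matches AS f
JOIN product_a_transactions AS t ON t.establishment_id = f.establishment_id
GROUP BY t.establishment_id;                          -- repeat one small query like this per product table
DROP TEMPORARY TABLE tmp_filter_matches;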

Multiple databases, or always limit with query

I have a question regarding databases and performance, so let me explain the situation.
The application to be built has the following set-up:
A group, with users under that group.
Data/file locations (which are what is searched through); it is estimated that one group can easily reach one million "search" terms.
Now, groups can never look at each other's data, and users can only look at the data which belongs to their group.
The only thing they should have in common is some place to send error logs to (and maybe not even that).
Now in this situation, would you create a new database per group, or always limit your search results with a query that takes someone's user-group ID into account?
My idea was to just create a new database per group, because then you do not need to limit your query every single time, and it keeps the set of results to search through smaller. But is that really necessary, or is a "WHERE groupid = 1" fast enough, even on over a million records, that you would not notice a decrease in performance?
This is the regular multi-tenant SaaS Architecture problem, which has been discussed at length, and the solution always varies according to your own situation. Here is one example of this discussion that I will just link to instead of copy-paste since all of it is worth a read: Multi-tenant PHP SaaS - Separate DB's for each client, or group them?
In addition to that I would like to add some more high level considerations:
Are there any legal requirements regarding the storage of your user's data? Some businesses operate in a regulatory environment where they are not allowed to store their data in a shared environment, quite common in the financial and medical industries.
Will you offer the same security (login method, data storage encryption), backup/restore service, geolocation redundancy and up-time guarantee to all users?
Are there any users who are willing to pay extra to have their data stored in a separate environment?
Are there any users who will potentially have requirements that are not compatible with the standard product that you will be offering? If so will you try to accommodate them? Note that occasionally there is some big customer that comes along and offers a lot of cash for a special treatment.
What is a separate environment? Is it a separate database, a separate virtual machine, a separate physical machine, a machine managed by the customer?
What parts of your application are part of each environment (hardware configuration, network config, database, source code, binaries, encryption certificates, etc.)?
Will there be some heavy users that may produce loads on your application that will negatively impact the performance for the smaller users?
If you go for all users in one environment, then is there a possibility that you will in the future create a separate environment for some customer? If so, this will impact where you put shared data, e.g. configuration data like tax rates, exchange rates, etc.
I hope this helps.
Performance isn't really your problem; maintenance and data security are. If you have a lot of databases, you will have more to maintain: not only backups, but connection strings, patches, schema updates on release, and so on. Multiple databases also suggest that you will have multiple PHP sites too. That will gradually get more expensive as the number of groups grows.
If you have one database then you need to ensure that every query contains the group id before it can run.
Database tables can be very, very large if you choose your indexes and constraints carefully. If you are performing joins against very large tables then it will be slow, but a simple lookup where you have an index on the group column should be fast enough.
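As a rough sketch of that (hypothetical table and column names), the group column is indexed together with the searched value so a "WHERE group_id = ?" narrows the scan immediately:
CREATE TABLE search_terms (
    term_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    group_id INT UNSIGNED NOT NULL,              -- which group owns this row
    term     VARCHAR(255) NOT NULL,
    PRIMARY KEY (term_id),
    KEY idx_group_term (group_id, term)          -- equality on group_id first, then the term lookup
);
SELECT term
FROM search_terms
WHERE group_id = :group_id                       -- every query must carry the caller's group
  AND term LIKE 'foo%';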
If you were to share a single database, would you ever move a group out of it? If that's a possibility then split the databases now. If you are going to have one PHP site then I would recommend a single database with a group column.

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition, only one server (the master) can handle UPDATE or INSERT (adding a workorder) requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, if they are not overwhelmed by queries, because they can keep most of the database on hard disk and only keep what was accessed recently in cache (in memory).
The other important thing to prevent any single query from running slowly is to make sure you add the right index for each query you might perform to avoid linear searches. This is to allow the database to binary search for the record(s) required.
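For example (hypothetical workorders columns), an index matched to the query lets MySQL seek straight to one boss's recent orders instead of scanning the whole table, and EXPLAIN confirms whether the index is actually used:
CREATE INDEX idx_boss_created ON workorders (boss_id, created_at);
EXPLAIN
SELECT workorder_id, created_at, status
FROM workorders
WHERE boss_id = 42
  AND created_at >= '2013-01-01'
ORDER BY created_at DESC;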
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
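A minimal counter-cache sketch (hypothetical names), kept up to date by running an upsert alongside each INSERT into workorders, so reading the total is a primary-key lookup rather than a COUNT(*) over millions of rows:
CREATE TABLE workorder_counts (
    boss_id         INT UNSIGNED NOT NULL,
    workorder_count INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (boss_id)
);
INSERT INTO workorder_counts (boss_id, workorder_count)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE workorder_count = workorder_count + 1;   -- run with every new workorder
SELECT workorder_count FROM workorder_counts WHERE boss_id = 42; -- cheap read of the cached total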
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't obsess over using smaller integer types to shave off a few bytes, or over shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type (a sketch follows after this list).
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical, delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many variables that may work for some people but might slow down your application, and they're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine (InnoDB or MyISAM), for instance.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
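The sketch promised above for the data-type rule, using hypothetical columns; each field gets the type that matches the data rather than a varchar, and only one index beyond the primary key:
CREATE TABLE workorders (
    workorder_id INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- integer key, not a VARCHAR
    boss_id      INT UNSIGNED NOT NULL,
    created_at   DATETIME NOT NULL,                     -- a real date/time type, not a 'YYYY-MM-DD' string
    amount       DECIMAL(10,2) NOT NULL,                -- exact numeric for money, not FLOAT or VARCHAR
    status       TINYINT UNSIGNED NOT NULL DEFAULT 0,   -- small status code instead of free text
    PRIMARY KEY (workorder_id),
    KEY idx_boss_created (boss_id, created_at)          -- one index that serves the common lookup; don't pile on more
);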
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
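A small sketch of that message-board example (hypothetical tables): the user's name is copied into each message row so listing a page of messages needs no join against users.
CREATE TABLE messages (
    message_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    user_name  VARCHAR(50) NOT NULL,      -- denormalized copy of users.user_name
    body       TEXT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (message_id),
    KEY idx_created (created_at)
);
SELECT message_id, user_name, body, created_at
FROM messages
ORDER BY created_at DESC
LIMIT 50;                                 -- no JOIN needed to show who wrote each message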
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is to only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to undo.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well in one of my networks; a table got so big that it used to slow the server down whenever I ran a query on it.
In my opinion, divide your data by date: decide what table size would be too big for you (say, 1 million entries), then calculate how long it will take you to reach that amount, and have a script run at that interval to either create a new table named for the date and move all current data over, or just back that table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access old data easily by referring to the dated table.
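A rough sketch of that rotation, assuming a hypothetical workorders table archived under a dated name:
CREATE TABLE workorders_new LIKE workorders;                 -- empty copy with the same structure
RENAME TABLE workorders TO workorders_2013_q1,               -- atomic swap: old data keeps its dated name...
             workorders_new TO workorders;                   -- ...and the live name points at the empty table
SELECT COUNT(*) FROM workorders_2013_q1 WHERE boss_id = 42;  -- archived data stays queryable by its dated name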
Hope that idea helps
Just create a workers table, a bosses table, a relationship table for the two, and then all of your other tables. A relationship structure like this is very flexible, because if it ever got large enough you could create another relationship table between the work orders and the bosses, or the workers.
You might want to look into BIGINTs, but I doubt you'll need them. I know the relationship table will get massive, but that's good DB design.
For reference, MySQL's BIGINT ranges from -9223372036854775808 to 9223372036854775807 signed, or 0 to 18446744073709551615 unsigned.
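A minimal sketch of that layout (hypothetical names); BIGINT UNSIGNED is shown only on the relationship table, since that is the one expected to grow the largest:
CREATE TABLE bosses (
    boss_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    boss_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (boss_id)
);
CREATE TABLE workers (
    worker_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    worker_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (worker_id)
);
CREATE TABLE boss_worker (
    relation_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- room to grow: one row per boss/worker pairing
    boss_id     INT UNSIGNED NOT NULL,
    worker_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (relation_id),
    UNIQUE KEY uq_pair (boss_id, worker_id)               -- prevents duplicate pairings
);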

Normalization or Alternative with MySQL

I'm building a site using PHP and MySQL that needs to store a lot of properties about users (for example their DOB, height, weight, etc.), which is fairly simple (a single table with lots of properties, almost all of them required).
However, the system also needs to store other information, such as their spoken languages, instrumental abilities, etc. All in all there are over a dozen such characteristics. By default I assumed creating a separate table (called maybe languages) and then a link table with a composite key (user_id, language_id).
The problem I foresee, though, is when visitors attempt to search for users using these criteria. The dataset we're looking to use will have over 15,000 users at time of launch, and the primary function will be searching and refining users. That means hundreds of queries daily, and the prospect of using queries with up to a dozen or more JOINs in them is not appealing.
So my question is, is there an alternative that's going to be more efficient? One way I was thinking of is storing the M2M values as a CSV of IDs in the user table and then running a LIKE query against it. I know LIKE isn't the best, but is it better than a join?
Any possible solutions will be much appreciated.
Do it with joins. Then, if your performance goals are not met, try something else.
Start with a normalized database (e.g. a languages table, linked to the users table by a mapping table) to make sure your data is represented cleanly and logically (a sketch appears at the end of this answer).
If you have performance problems, examine your queries and make sure you have suitable indexes.
If you dislike repeatedly coding up queries with many joins, define some views.
If views are very slow to query, consider materialized views.
If you have several thousand records and a few hundred queries per day (really, that's pretty small and low-usage), these techniques will allow your site to run at full speed, with no compromise on data integrity. If you need to scale to many millions of records and millions of queries per day, even these techniques may not be enough; in which case, investigate cacheing and denormalization.
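For illustration, a minimal sketch of the normalized layout described above, with hypothetical names, plus the kind of search it supports:
CREATE TABLE languages (
    language_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    language_name VARCHAR(50) NOT NULL,
    PRIMARY KEY (language_id)
);
CREATE TABLE user_languages (                             -- mapping table between users and languages
    user_id     INT UNSIGNED NOT NULL,
    language_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, language_id),
    KEY idx_language_user (language_id, user_id)          -- supports "who speaks language X" searches
);
SELECT u.user_id, u.user_name
FROM users AS u
JOIN user_languages AS ul ON ul.user_id = u.user_id
WHERE ul.language_id = :language_id;                      -- a cheap indexed join at ~15,000 users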

Need Help regarding Optimization

First of all, I am an autodidact, so I don't have great know-how about optimization and such. I created a social networking website.
It contains 29 tables right now. I want to extend its functionality by adding things like yellow pages, events, etc. to make it more like a portal.
Now the question is should I simply add the tables in the same database or should I use a different database?
And in case I create a new database, I also want users to be able to comment on business listings etc., just like reviews. So how will I be able to pull out entries, since the reviews will be in one database and the user details in the other?
Is it possible to join tables on 2 different databases ?
You can join tables in separate databases by fully qualifying the names, but the real question is why you want the information in separate databases. If the information you are storing all relates together, it should go in one database, unless there is a compelling (usually performance-related) reason against it.
The main reason I could see for separating your YellowPages out is if you wished to have one YellowPages accessible to several different, non-interacting websites. That said, presumably you wouldn't want cross-talk comments on the listings, so comments would need to be stored in the website databases rather than the YellowPages database. And that just sounds like a maintenance nightmare.
Don't Optimize until you need to.
If performance is ok, go for the easiest to maintain solution.
Monitor the performance of your site and if it starts to get slow, figure out exactly what is causing the slowdown and focus on performance on that section only.
You definitely can query and join tables from two different databases - you just need to specify the tables in a dbname.tablename format.
SELECT a.username, b.post_title
FROM dbOne.users a INNER JOIN dbTwo.posts b USING (user_id)
However, it might make management and maintenance a lot more complicated for you. For example, you'll have to track which table belongs in which database, and will continually need to be adding the database names into all your queries. When it comes time to back up the data, your work will increase there as well. MySQL databases can easily contain hundreds of tables so I see no benefit in splitting it up - just stick with one.
You can prove an algorithm is as fast as it can be. math.h and the C standard libraries have been heavily optimized for half a century, and Perl's data structures are another example of very advanced optimization. Just avoid putting everything on one line, to make debugging easier. There are conventions; try to keep every programmer on the team following the same convention. Which convention is "right" matters less than being consequent and consistent. Performance is the last thing you work on; security and intelligibility are the top priorities. Read up on big-O (ordo) notation: it depends on the software only, while suboptimal software can be faster than optimal software on different hardware. Even a totally bug-infested spaghetti mess with no structure can respond many times faster than the most provably optimal software, depending on the hardware it runs on.
