I am currently maintaining a rather large office web application. I recently became aware that, via the developer tools built into web browsers, a user can easily modify the values of select boxes (among other things). On the server side I validate whether the posted data is numerical (for drop-downs), but I don't actually check whether the value exists in a database table. For example, I have a dropdown box for salutation ('mr', 'ms', 'mrs', 'mr/ms', etc.) whose options correspond to numerical values.
Currently I use MySQL's MyISAM tables, which don't offer foreign key referential integrity, so I am thinking about moving to InnoDB. Yet this poses the following issue:
If I want to apply referential integrity (to ensure valid IDs are inserted), it would mean I'd have to index all the constrained columns, including ones that don't need an index for performance reasons at all (e.g. a salutation dropdown). If a very large client table has, say, 10 similar dropdowns (e.g. client group, no. of employees, country/region, etc.), it seems like overkill to index every linked table.
My questions:
1) when using referential integrity, do columns really need to be indexed also?
2) are there other practical solutions I may be overlooking? (e.g. use a separate query for every dropdown-list to see if the value exists in a table?)
3) How do other web-applications deal with such issues?
Help appreciated!
Thanks,
Patrick
You only have to index the fields used in the foreign key relationships, and recent versions of MySQL do this automatically for you anyway. It's not "overkill"; it's actually an optimization.
Consider that any time you update/delete/insert a record, the foreign tables have to be checked for matching records. Without the indexes, those checks could be glacially slow.
InnoDB automatically creates an index when you define a foreign key. If an index on that column already exists, InnoDB uses it instead of creating a new index.
As @MarcB mentioned in his answer, InnoDB uses these indexes to make referential integrity checks more efficient during some types of data changes, including updates or deletes of values in the parent table, and cascading operations.
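As a minimal sketch (all table and column names here are invented for illustration), defining the constraint is all you need; InnoDB adds the supporting index itself:

CREATE TABLE salutation (
    id TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    label VARCHAR(20) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE client (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    salutation_id TINYINT UNSIGNED NOT NULL,
    -- InnoDB creates an index on salutation_id automatically,
    -- unless a usable one already exists
    CONSTRAINT fk_client_salutation
        FOREIGN KEY (salutation_id) REFERENCES salutation (id)
) ENGINE=InnoDB;

SHOW INDEX FROM client; -- lists the automatically created index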
You could use the ENUM data type to restrict a column to a fixed set of values. But ENUM has some disadvantages too.
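For instance, a hypothetical variant of the salutation column (one of those disadvantages: changing the set of allowed values later requires an ALTER TABLE):

CREATE TABLE client_enum (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    salutation ENUM('mr', 'ms', 'mrs', 'mr/ms') NOT NULL
) ENGINE=InnoDB;

-- In strict SQL mode this insert fails; in non-strict mode it
-- silently becomes the empty string
INSERT INTO client_enum (salutation) VALUES ('dr');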
Some web developers eschew foreign keys. To provide the same data integrity assurances, they have to write application code for every such case. So if you like to write and test lots of repetitive code, unnecessarily duplicating features the RDBMS already provides more efficiently, then go ahead! :-)
Most developers who don't use foreign keys don't write those extra checks either. They just don't have data integrity enforcement. I.e. they have sacrificed quality.
PS: I do recommend switching to InnoDB, and referential integrity is just one of the reasons to do so. Basically, if you want a database that supports ACID, InnoDB supports all aspects of that and MyISAM supports none.
Related
I have a Magento shop (using a MySQL db) and just noticed that some developer introduced a custom db for capturing some structured data.
Now I noticed that the tables are not linked to each other via foreign keys; a column was just added, e.g. priceListID = 01124, holding the same id as in the price list table. So linking the data together must happen in the code by firing separate select statements, I assume.
Now I am wondering whether this needs to be fixed soon, or whether it is actually OK not to use foreign keys at the db level to link data together.
What are the downsides of doing this, and are there maybe some benefits (like flexibility)?
Hope you can help me with this! Thanks a lot!
There are a few advantages to keeping such constraints inside the database:
Performance. Most constraints, such as foreign keys, are better implemented if they live inside the database, close to the data. Want to check data integrity with an additional select? That is an extra round trip to the database, and it takes time.
Multiple applications. What if several applications work with your database? You would have to write the integrity-checking code in every one of them, which means additional expense.
Synchronization. While you are checking data integrity with an additional select, some other user may delete that data at the same moment, and you will not know about it. These checks can be implemented correctly, but that is still extra work you have to do.
To me, this all smells of a bad, non-scalable design that can bring many problems. Data integrity is what databases are built for, and these kinds of verifications should stay inside the database.
From your description, I understand that the tables are indeed functionally related, as they share a common piece of information (priceListID in the new table relates to id in the original table). On the one hand, this set-up still allows writing queries that join the tables together.
The downside of not creating a foreign key to represent that relationship, however, is that, from the database's perspective, the consistency of the relationship cannot be guaranteed. It is, for example, possible to create records in the new table whose priceListID does not exist in the original table. It would also be possible to delete records in the old table while related records exist in the new one, turning the children into orphans.
As a conclusion: by not using foreign keys, the developers rely solely on the application to maintain data integrity. There is no obvious benefit to forgoing the built-in features the RDBMS offers to protect data consistency, and chances are the developers simply forgot that critical part of the table definition. I would suggest having a talk with them and urging them to create the missing foreign key (unless they can give a clear explanation of why they did not).
This should be as simple as:
ALTER TABLE newtable
ADD CONSTRAINT fk_new_to_original_fk
FOREIGN KEY (priceListID)
REFERENCES originaltable(id);
Please note that this requires all values in the referencing column to already be present in the parent table.
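If you are unsure whether that holds, a query along these lines (using the table names from the statement above) lists the offending rows first:

-- rows in newtable whose priceListID has no match in originaltable;
-- fix or delete these before adding the foreign key
SELECT n.*
FROM newtable AS n
LEFT JOIN originaltable AS o ON o.id = n.priceListID
WHERE o.id IS NULL;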
I'm sorry if this question is stupid or has already been asked, but I couldn't find much about it.
What is the fastest / best method of storing unique values in SQL?
Option 1: create a unique index, and catch the duplicate-key error with a try/catch block in PHP?
Option 2: query first to check whether the value exists, and then act on that?
I would think option 2 is best, but option 1 needs only one query, versus the two (check, then insert) of option 2.
And since I need to minimize DB queries as much as I can, I would go for option 1, but I'm not sure whether relying on the try block to catch the failure is good practice.
Thanks
As with all optimisation-related questions, the answer is: well, it depends.
But let's get one thing straight: if you do not want duplicate values in a field or combination of fields, then use a primary key or unique index constraint, just to make sure the integrity of the data is not compromised under any circumstances.
So your question is really: shall I check, before inserting a record into a table, whether the data would violate a uniqueness constraint?
If you do not really care whether the insert is successful or not, then do not check. Moreover, use INSERT IGNORE, so you do not even get an error message if the insert violates a uniqueness constraint. Such a situation could be logging whether a user logs in at least once within a certain period, as sketched below.
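A sketch of that login example (table and column names are made up):

CREATE TABLE login_log (
    user_id INT UNSIGNED NOT NULL,
    login_date DATE NOT NULL,
    UNIQUE KEY uq_user_day (user_id, login_date)
) ENGINE=InnoDB;

-- a duplicate (same user, same day) is silently skipped, no error raised
INSERT IGNORE INTO login_log (user_id, login_date) VALUES (42, CURDATE());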
Consider how costly it is to check before each and every insert and update whether the data violates any constraint, and how often you think that would happen. If it is a costly operation, then rely on the indexes to prevent inserts with duplicate data, and find out which data violated the constraints after you know the query has failed.
Do not only consider the cost of the select; also take into account whether the insert is part of a larger transaction that may have to be rolled back if the insert fails. Checking for constraint violations before the transaction even starts may have a positive impact on your db's performance.
In my opinion
Always use the unique property for the field you want to be, well, actually unique!
If by any means you cannot do so, and you also want to know beforehand whether the desired value already exists in the table / collection, then additionally use an exists check.
Why not use just one of the two?
Maybe because an improperly chosen shard key in MongoDB allows non-unique values across shards: uniqueness is only enforced within each shard!
I cannot offer SQL-specific knowledge, but I think the methodology is the same: use unique indexing where possible.
Cost effectiveness
Both methods cost server resources and hit the db server at least twice.
So what's the big deal?
Picture this on a busy server: you sent a request to find out whether a value exists, and the response was false, telling you the value would be unique. But by the time your application server processes that answer and requests the insertion, someone else may already have inserted the same value!
Without the index, the server will never bother telling you about the discrepancy!
So from this point of view: if uniqueness is that important, then you should enable unique indexing at the table or collection level first!
Create a unique, auto-incremented primary key index, and then insert the data without supplying a value for the auto-incremented primary key column. The insert will then never produce a duplicate key.
I am in the process of creating a website where I need to store each user's activity (similar to your inbox on Stack Overflow) in SQL. Currently, my teammates and I are arguing over the most effective way to do this; so far, we have come up with two alternatives:
Create a new table for each user, named theirusername_activity. Then, when I need to get their activity (posting, being commented on, etc.), I simply query that table and read its rows...
In the end I will have a TON of tables
Possibly Faster
Have one huge table called activity, with an extra field for the username; when I want to get a user's activity I simply get the rows from that table: "...WHERE username=".$loggedInUser
Fewer tables, cleaner
(assuming I index the tables correctly, will this still be slower?)
Any alternative methods would also be appreciated.
"Create a new table for each user ... In the end I will have a TON of tables"
That is never a good way to use relational databases.
SQL databases can cope perfectly well with millions of rows (and more), even on commodity hardware. As you have already mentioned, you will obviously need usable indexes to cover all the possible queries that will be performed on this table.
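A sketch of such a table (all names are illustrative), with a composite index covering the typical "recent activity for one user" query:

CREATE TABLE activity (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    event_type VARCHAR(30) NOT NULL,
    created_at DATETIME NOT NULL,
    KEY idx_user_created (username, created_at)
) ENGINE=InnoDB;

-- an index range scan: cheap even with millions of rows
SELECT * FROM activity
WHERE username = 'patrick'
ORDER BY created_at DESC
LIMIT 20;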
Number 1 is just plain crazy. Can you imagine having to manage it, and seeing all those tables?
Can you imagine the backup! Or the dump! That many CREATE TABLE statements... that would be crazy.
Get yourself a good index, and you will have no problem sorting through the records.
We're talking about MySQL here, so why would separate tables be faster?
Query cache efficiency: each insert from one user wouldn't empty the query cache for the others.
Memory & pagination: the tables in use would fit in the buffers, and unused data would simply never be loaded there.
But as everybody here has said, it seems quite crazy in terms of management. And in terms of performance, having a lot of tables adds another problem in MySQL: you'll maybe run out of file descriptors, or simply wipe out your table cache.
It may be more important here to choose the right engine, like MyISAM instead of InnoDB, as this is an insert-only table. And as @RC said, a good partitioning policy would fix the memory & pagination problem by keeping rarely used data out of the active memory buffers. This should be combined with an intelligent application design in which you avoid loading the whole activity history by default; if you limit the default view to recent activity and restrict full history-table parsing to batch processes and advanced screens, you'll get a nice effect from the partitioning. You could even try a user-based partitioning policy.
As for query cache efficiency, you'll get a bigger gain by using an application-level cache (like memcache), storing the per-user history elements there and invalidating them on each new insert.
You want the second option, and you add the userId (and possibly a separate table for userid, username, etc.).
If you do a lookup on that id on a properly indexed field, you'd only need something like log(n) steps to find your rows, which is hardly anything at all. It will be way faster, way clearer, and way better than option 1. Option 1 is just silly.
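You can verify this with EXPLAIN (assuming an activity table with an indexed user column, such as the one sketched in another answer here): a ref access type on the index means a B-tree lookup rather than a full table scan.

EXPLAIN SELECT * FROM activity WHERE username = 'patrick';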
In some cases, the first option is, in spite of not being strictly "the relational way", slightly better, because it makes it simpler to shard your database across multiple servers as you grow. (Doing this is precisely what allows wordpress.com to scale to millions of blogs.)
The key is to only do this with tables that are entirely independent from one user to the next -- i.e. never queried together.
In your case, option 2 makes the most sense: you'll almost certainly want to query the activity across all or some users at some point.
Use option 2, and not only index the username column but partition the table on that column as well (consider a hash partition). Partitioning on username will give you some of the same benefits as the first option while letting you keep your sanity. Partitioning and indexing the column this way provides a very fast and efficient means of accessing data by username/user_key. When querying a partitioned table, the SQL engine can immediately prune the partitions it doesn't need to scan, because the queried username value determines which partition could contain records for that user (in this case, exactly one). And if you need to shard the table across multiple servers in the future, partitioning doesn't hinder that ability.
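A hash-partitioned version of such a table might look like this (all names are illustrative; note that in MySQL every unique key, including the primary key, must include the partitioning column, hence the composite primary key):

CREATE TABLE activity (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_key INT UNSIGNED NOT NULL,
    event_type VARCHAR(30) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, user_key),
    KEY idx_user (user_key)
) ENGINE=InnoDB
PARTITION BY HASH (user_key)
PARTITIONS 16;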
You will also want to normalize the table by separating the username field (and any other elements in the table related to username) into its own table with a user_key. Ensure a primary key on the user_key field in the username table.
This mainly depends on where you need to retrieve the values. If it's a page for a single user, use the first approach; if you are showing data for all users, you should use a single table. The multiple-table approach is also clean, but note that in SQL, when the number of records in a single table gets very high, data retrieval becomes very slow.
Right now I have a Members table and a Products table in a database, with a joining Favorites table that consists of the primary/foreign keys from both the Members and Products tables. I have a requirement to restrict the number of products a member can place in their favorites to 5.
Where should this restriction come from? Is it something done within the database (MySQL), and hence part of my existing schema? Or is this a programming concern that could be handled with something like PHP?
The question has been answered; however, since you are seeking understanding ...
The idea with databases is that all such limits and Constraints on data are placed in the database itself (as a self-contained unit). Data Constraints should be in the database, not only in the app. ISO/IEC/ANSI SQL provides several types of Constraints, for different purposes:
FOREIGN KEY Constraints, for Referential Integrity (as well as performance; Open Architecture compliance, etc)
CHECK Constraints, to check against data values of other columns, and disallow violations
RULE Constraints, to disallow data that is out-of-range or specify exact data value formats
Yours is a classic simple RULE or CHECK. And the correct answer for Database and Database Design is a RULE or CHECK, not code.
That is not to say that the app should not check the count and avoid attempting an invalid action; that is just good sense. And it is not a repetition: it is stopping invalid actions at a higher level, which saves resource use. But data in the Db cannot be relied upon if its integrity is managed outside, in app code written by developers; the rules implemented inside the server can be relied upon, as they are enforced for all apps and app components.
But the freeware non-SQLs do not have these basics of Standard SQL: no Checks or Rules. Therefore the integrity of the data in the database relies solely on the developers: their quality, knowledge, consistency, etc.
And so the correct answer for MySQL/PHP is code, in every location that attempts that insert.
You would do this in PHP.
Just do a SELECT COUNT(*) FROM members_products WHERE member_id = 3 before inserting.
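If you want the database to help a bit more than a separate count, one sketch (using the names from that query; the product id 7 is made up) folds the count into the insert itself, so the row is only added while the member is under the limit:

-- inserts nothing when member 3 already has 5 favorites
INSERT INTO members_products (member_id, product_id)
SELECT 3, 7
FROM DUAL
WHERE (SELECT COUNT(*) FROM members_products WHERE member_id = 3) < 5;

Under heavy concurrency you would still want to run this inside a transaction (or take a lock), since two simultaneous inserts could each see a count of 4.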
I am working on an open-source PHP/MySQL application.
I have looked at phpBB, WordPress, and other apps to see whether they specify foreign keys (to ensure referential integrity).
I cannot find that they do.
Is it a common practice in these types of applications to specify Foreign Keys in the MySQL database structure?
Past versions of MySQL use the MyISAM storage engine by default, which does not support foreign key constraints. Unless you explicitly declare your tables with the InnoDB storage engine, or change the default storage engine server-wide, no foreign keys appear, so it's no surprise that developers who design for MySQL don't bother with foreign key constraints.
MySQL 5.5 is currently in beta, and InnoDB will finally be the default storage engine, so foreign key constraints will be supported out of the box.
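Until then, you can opt in per table; a minimal sketch (table names invented, and the referenced parent table is assumed to already exist):

-- declare the engine explicitly so the table supports foreign keys
-- even on servers where the default is still MyISAM
CREATE TABLE post (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    author_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (author_id) REFERENCES author (id)
) ENGINE=InnoDB;

If you leave off ENGINE=InnoDB on a MyISAM-default server, the FOREIGN KEY clause is parsed but silently ignored.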
Yes, foreign keys are recommended. These constraints help ensure your data always satisfies referential integrity. The alternative is that your database gradually fills with "crumbs": rows that refer to a parent row that is no longer in the database. These lead to strange query results, wasted space, and inefficient queries, and you end up doing cleanup chores manually that would be unnecessary if you just had the database enforce cleanliness for you.
Re the comment from @Jacob: Good points, but be sure to read recent articles comparing InnoDB vs. MyISAM. Years ago, MyISAM was considered the "fast storage engine" and InnoDB was the engine you'd reluctantly have to use if you couldn't do without transactions.
But InnoDB has improved dramatically in the past few years, and in most cases today InnoDB performs faster than MyISAM.
Aside from MyISAM still supporting fulltext indexing as you mention, there are fewer and fewer reasons to use MyISAM. When I need fulltext indexing, I either maintain a MyISAM table as a mirror of my primary storage in InnoDB, or else I use Apache Solr.
I'm not sure how common it is, but I feel you should express the constraints of the object model fully, regardless of whether the underlying database fully supports them.
If you're generally writing ANSI SQL, go ahead and add the foreign key constraints anyway: the moment your database supports them, whether because you switch to an engine that supports them or move to another database entirely, you'll get them for "free" and won't have to go back and hunt down all the relations.
So, I would put the foreign keys in SQL anyway, but that's me and again may not be common.
MySQL used to not honor foreign keys. It still doesn't, unless you take measures.
Out of sight, out of mind, right?
In MySQL, only the InnoDB storage engine even supports foreign keys.
Edit: InnoDB will be the default storage engine in MySQL 5.5
Edit-Ignore: Referential integrity will be a "new feature" in 6.1 according to their roadmap: http://en.wikipedia.org/wiki/MySQL#Future_releases