I'm building a job recruitment site for ex-military personnel in PHP/MySQL, and I've reached a stage where I'm not sure which direction to take.
I have a "candidate" table with all the usual firstname, surname, etc. fields and a unique candidate_id, plus an address table linked by that candidate_id.
The client has asked for more data to be captured, as in driving licence type, religion, SIA Level (Security Industry Authority), languages spoken etc
My question is, with all this different data, is it worth setting up dedicated tables for each? So, for example, having a driving licence table with all the different types of driving licence, each with a unique id, linked to the candidate table by a driving_licence_id column?
Or should I just serialize all the extra data as text and store it in a single column in the candidate table?
My question is, with all this different data, is it worth setting up dedicated tables for each?
Yes.
That is what databases are for.
Dedicated tables versus serialized data is called Database Normalization and Denormalization, respectively. In some cases both options are acceptable, but you should really make an educated choice, by reading up on the subject (for example here on about.com).
Personally I usually prefer working with normalized databases, as they are much easier to query for complex aggregated data. Besides, I feel they are easier to maintain as well, since you usually don't have to refactor when adding new fields and tables.
Finally, unless you have a lot of tables, you are unlikely to run into performance problems due to the number of one-to-one joins (the kind of data that's easy to denormalize).
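To make the normalized approach concrete, here is a minimal sketch of what it might look like for the licence and languages data. All table and column names beyond `candidate` and `candidate_id` are illustrative assumptions, not from the original question; licence type is one-to-many (a lookup table), while languages spoken is many-to-many (a junction table):

```sql
-- Lookup table: each candidate has one licence type
CREATE TABLE driving_licence_type (
    driving_licence_type_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

ALTER TABLE candidate
    ADD COLUMN driving_licence_type_id INT NULL,
    ADD FOREIGN KEY (driving_licence_type_id)
        REFERENCES driving_licence_type (driving_licence_type_id);

-- Languages are many-to-many, so use a junction table
CREATE TABLE language (
    language_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

CREATE TABLE candidate_language (
    candidate_id INT NOT NULL,
    language_id  INT NOT NULL,
    PRIMARY KEY (candidate_id, language_id),
    FOREIGN KEY (candidate_id) REFERENCES candidate (candidate_id),
    FOREIGN KEY (language_id)  REFERENCES language (language_id)
);
```

With this layout, a query like "all candidates who speak German and hold an HGV licence" is a couple of indexed joins; with serialized text in one column it would mean pulling every row into PHP and unserializing it.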
It depends on whether you wish to query this data. If so, keep the data normalised (e.g. in its own logically separated table); otherwise, if it's just metadata to be pulled along for the ride, whatever is simplest seems reasonable.
Neither approach necessarily precludes the other in the future, simple migration scripts can be created to move the data from one format to the other. I would suggest doing what is simplest to enable you to work on other features of the site soonest.
You should always go for normalization, believe me.
I made the mistake of taking the easy way and storing data improperly (not just serialized, but imploded strings of multidimensional arrays). When the time came, I had to redesign the whole thing, and a lot of time was wasted.
I will never go down the wrong path again: clients can say "no" today, but they will ask for reports (queries) tomorrow.
Related
First of all, I apologize if a similar question has been asked and answered. I searched and found similar questions, but none quite close enough.
My question is basically whether it is a good idea to separate tables of virtually the same data in my particular circumstance. The tables track product licensing data for two very different groups: individual users and enterprise users. I am thinking of separating them into two tables so that the user verification process runs faster, especially for individual users, since their record count is significantly lower (roughly 500 individual records vs. 10,000 enterprise records). Lastly, there is a significant difference between the user types that isn't apparent in the table structure: individual users all have a fixed number of activations, while enterprise users may have up to unlimited activations, and the purpose of tracking there is more to gather activation stats.
The reason I think separating the tables would be a good idea is because each table would be smaller, resulting in faster queries (at least I think it would...). On the other hand, I will have to do two queries to obtain analytical data. Additionally, I may wish to change the data I am tracking from time to time and obviously, this is more of a pain with two duplicate tables.
I am guessing the query time difference is probably insignificant, even with tens of thousands of records? However, I would like to hear people's thoughts on this (mainly regarding efficiency and overall best practices) if they would be so kind as to share.
Thanks in advance!
When designing your database structure you should try to normalize your data as much as possible. So to answer your question
"whether or not it is a good idea to separate tables of virtually the same data in my particular circumstance."
If you normalize your database correctly, the answer is no, it's not a good idea to create two tables with almost identical information. With normalization you should be able to separate out similar data into mapping tables which will allow you to create more complex queries that will run faster.
A very basic example of first normal form would be: you have a table of users, and in that table you have a column for role. Instead of storing the literal word "admin" or "member", you store an id that maps to another table called roles, where 1 = admin and 2 = member. The idea is that it is more efficient to store repeated ids than repeated words like "admin" and "member".
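The roles example above can be sketched as follows (table and column names are illustrative, not from the original post):

```sql
-- Lookup table: each role word is stored exactly once
CREATE TABLE roles (
    role_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(30) NOT NULL        -- e.g. 'admin', 'member'
);

CREATE TABLE users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    role_id INT NOT NULL,
    FOREIGN KEY (role_id) REFERENCES roles (role_id)
);

-- Each user row repeats only a small integer, never the word itself
SELECT u.username, r.name AS role
FROM users u
JOIN roles r ON r.role_id = u.role_id;
```

Besides the storage saving, this also means renaming a role is a one-row UPDATE in `roles` rather than a mass update across every user.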
I'm building a very large website currently it uses around 13 tables and by the time it's done it should be about 20.
I came up with the idea of changing the preferences table to use ID, Key, Value columns instead of many separate columns; however, I have recently thought I could also store other data in that table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL Cluster will be used when the site is launched; for now I am testing on a development VPS, but everything will be moved to a dedicated server before launch. I know barely anything about NDB, so this should be fun :)
This model is called EAV (entity-attribute-value)
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
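A minimal EAV layout, and the join cost it implies, might look like this (names are illustrative assumptions):

```sql
-- One row per (entity, attribute) pair instead of one column per attribute
CREATE TABLE preference (
    entity_id INT NOT NULL,          -- e.g. a user_id
    attribute VARCHAR(50) NOT NULL,  -- 'theme', 'timezone', ...
    value     VARCHAR(255),
    PRIMARY KEY (entity_id, attribute)
);

-- Filtering on just two attributes already needs a self-join,
-- which is why EAV gets expensive as queries touch more attributes:
SELECT p1.entity_id
FROM preference p1
JOIN preference p2 ON p2.entity_id = p1.entity_id
WHERE p1.attribute = 'theme'    AND p1.value = 'dark'
  AND p2.attribute = 'timezone' AND p2.value = 'UTC';
```

Each additional attribute in a filter adds another self-join, and there is no way to build one composite index covering, say, theme and timezone together.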
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a fairly small number of tables (sometimes 20GB per table).
For me, I would rather have more info in one table, as it means the data is not littered everywhere and I don't have to perform operations on multiple tables. Though one table can also get messy (usually, for me, each object gets its own table, where an object is something from your application logic, like a User class or a BlogPost class).
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in two different tables, and don't put information about two things in one table. Stick with one table describing one kind of object (this is difficult to explain, but if you do object-oriented programming, you should understand).
Nope. Preferences should be stored as they are, in the users table.
Private messages, for example, can't be stored in the users table...
And this way you don't have to think about joining different tables...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give) the key-value model is not as efficient speed wise, though it can be more efficient space wise.
I would definitely not do this. Basically, if you have a large set of data stored in a single table, you will see performance issues pretty quickly when you are constantly querying that same table. Then think about the joins and the complexity of the queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, so the resources required per query are lower, and as an extra bonus it's easier to program!
There are some applications for this approach, but they are rare; it mostly applies when you have a large table with a ton of columns where most rows won't have a value for most of them.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any special optimization for that. So focus on a clean DB structure and normalization instead.
I recently started working for a fairly small business, which runs a small website. I overheard a co-worker mention that either our site or our MySQL databases get hit roughly 87 times a second.
I was also tasked today, with reorganizing some tables in these databases. I have been taught in school that good database design dictates that to represent a many-to-many relationship between two tables I should use a third table as a middle man of sorts. (This third table would contain the id of the two related rows in the two tables.)
Currently we use two separate databases, totalling a little less than 40 tables, with no table having more than 1,000 rows. Right now, some PHP scripts use a third table to relate certain rows; it has a third column that stores a string of comma-separated IDs when a row in one table relates to more than one row in some other table(s). So to use an ID from that third column, they have to fetch the string, split it, and pick out the proper ID.
When I mentioned that we should switch to using the third table properly like good design dictates they said that it would cause too much overhead for such small tables, because they would have to use several join statements to get the data they wanted.
Finally, my question is would creating stored procedures for these joins mitigate the impact these joins would have on the system?
Thanks a bunch, sorry for the lengthy explanation!
By the sound of things you should really try to redesign your database schema.
two separate databases, totalling to a little less than 40 tables, with no table having more than 1k rows
Sounds like it's not properly normalized - or it has been far too aggressively normalized and would benefit from some polymorphism.
comma separated ids
Oh dear - surrogate keys - not intrinsically bad but often a sign of bad design.
a third table to relate certain rows that has a third column that is used to store a string of comma separated ids
So it's a very long way from normalised - this is really bad.
they said that it would cause too much overhead for such small tables
Time to start polishing up your resume, I think. Sounds like 'they' know very little about database systems.
But if you must persevere with this - it's a lot easier to emulate a badly designed database from a well designed one (hint: use views) than vice versa. Redesign the database offline and compare the performance of tuned queries - it will run at least as fast. Add views to allow the old code to run unmodified, and compare the amount of code you need to perform key operations.
I don't understand how storing a comma-separated list of IDs in a single column, and having to parse that list to get all associated rows, is less complex than a simple table join.
Moving your queries into a stored procedure normally won't provide any benefit. But if you absolutely have to use the comma-separated list of values representing foreign key associations, then a stored procedure may improve performance. Perhaps in your stored procedure you could declare a temporary table (see "Create table variable in MySQL" for an example) and then populate it, one row for every value contained in your comma-separated string.
I'm not sure what performance gain you would get by doing that, though, considering, as you mentioned, there aren't a lot of rows in any of the tables. The whole exercise seems a bit silly. Ditching the comma-separated list of IDs would be the best way to go.
It will be both quicker and simpler to do it in the database than in PHP; that's what database engines are good at. Make indexes on the keys (InnoDB will do this by default for foreign keys) and the joins will be fast; to be honest, with tables that tiny, the joins will almost always be fast.
Stored procedures don't really come into the picture, for two reasons. Mainly, they're not going to make any difference to the impact (not that there's any real impact anyway - you will be improving app performance by doing this in the DB rather than at the PHP level).
Secondly, avoid MySQL stored procedures like the plague, they're goddamn awful to write and debug if you've ever worked in the stored procedure languages for any other DB at all.
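To make the comparison in this thread concrete, here is a sketch of the "third table" done properly, i.e. a junction table instead of a comma-separated ID column. The table and column names are illustrative, not from the original question:

```sql
-- One row per relationship, instead of 'tag_ids = "3,17,42"' in a text column
CREATE TABLE article_tag (
    article_id INT NOT NULL,
    tag_id     INT NOT NULL,
    PRIMARY KEY (article_id, tag_id),
    FOREIGN KEY (article_id) REFERENCES article (article_id),
    FOREIGN KEY (tag_id)     REFERENCES tag (tag_id)
);

-- All tags for one article: a single indexed join, no string parsing in PHP
SELECT t.name
FROM article_tag atg
JOIN tag t ON t.tag_id = atg.tag_id
WHERE atg.article_id = 42;
```

The reverse question ("which articles have tag 17?") is equally cheap here, whereas the comma-separated version forces a LIKE scan or parsing every row in PHP.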
Users can do advanced searches (there are many possible parameters):
/search/?query=toto&topic=12&minimumPrice=0&maximumPrice=1000
I would like to store the search parameters (after the /search/?) for an email alert.
I have 2 possibilities:
Storing the raw request (query=toto&topicId=12&minimumPrice=0&maximumPrice=1000) in a table with a structure like id, parameters.
Storing the request in a structured table id, query, topicId, minimumPrice, maximumPrice, etc.
Each solution has its pros and cons. Of course the solution 2 is the cleaner, but is it really worth the (over)effort?
If you already have implemented such a solution and have experienced the maintenance of it, what is the best solution?
The better solution should be the best for each dimension:
Rigidity
Fragility
Viscosity
Performance
Daniel's solution is likely to be the cleanest, but I get your point about performance. I'm not very familiar with PHP, but there should be some DB abstraction library that takes care of relations and multiple inserts so that you get the best performance, right? I only mention it because there may not be a real performance issue. Do you have load tests that point to an issue, perhaps?
Anyway, if it is between your original two solutions, I would have to select the first. Having a table with fixed column names (like your solution #2) is just asking for trouble. If you add new params, you have to modify the table's columns. And there is the ever-present issue of "what do we put to indicate not selected vs. left empty?"
So I don't agree that solution 2 is cleaner.
You could have a table consisting of three columns: search_id, key, value, with the first two forming the primary key. This way you can reconstruct a particular search if you have the ID of a saved search. It also lets you add new search keywords without having to modify your table.
If you wish, you can also have key be a foreign key to another table containing valid search terms to ensure integrity. Whether you want to do that depends on your specific needs though.
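A sketch of that three-column layout (names are illustrative; note that KEY is a reserved word in MySQL, so plainer column names are safer):

```sql
-- One row per saved search parameter
CREATE TABLE saved_search_param (
    search_id   INT NOT NULL,
    param_key   VARCHAR(50) NOT NULL,   -- 'query', 'topicId', 'minimumPrice', ...
    param_value VARCHAR(255),
    PRIMARY KEY (search_id, param_key)
);

-- Reconstructing saved search #7 for the email alert:
SELECT param_key, param_value
FROM saved_search_param
WHERE search_id = 7;
```

Adding a new search parameter later is just a new row per search, with no ALTER TABLE required.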
Well that's completely dependent on what you want to do with the data. For the PHP part, you need to process it anyway, either on insertion or selection time.
For a really large number of parameters, you may save some time with the first option on the database management/maintenance side, since you don't need to change anything about your database schema.
Daniel's answer is a generic solution, but if you consider performance an issue, you may end up doing too many inserts on the database side for a single search (one for each parameter). Many small inserts are a common source of performance problems.
You know your resources.
I'm just about to expand the functionality of our own CMS but was thinking of restructuring the database to make it simpler to add/edit data types and values.
Currently, the CMS is quite flat - the CMS requires a field in the database for every type of stored value (manually created).
The first option that comes to mind is simply a table which keeps the data types (e.g. Address 1, Suburb, Email Address, etc.) and another table which holds values for each of these data types. Just like how Wordpress keeps values in its 'options' table, PHP's serialize() would be used to store an array of values.
The second option is how Drupal works: the CMS creates tables for every data type. Unlike Wordpress, this can be a bit of overkill, but it's really useful for SQL queries when ordering and grouping by a particular value.
What's everyone's thoughts?
In my opinion, you should avoid serialization where possible. Your relational database should be relational, and thus structured as such. This points toward the 'Drupal method', i.e. one table per data type. It also keeps your database healthy in the sense that it can be indexed and easily queried.
Unless you plan to have lots of different data types added in the future that are unknown now, this is not really going to help you and would be overkill. If you have very wide tables with lots of holes in your data (i.e. lots of columns that seem to be NULL at random), then that is a pattern screaming for a separate table for the data that may only belong to certain entries.
Keep it simple and logical. Don't abstract for the sake of abstraction. Indeed, storing integers is cheaper in terms of storage space, but unless that is actually a problem, don't do it in this case.