I'm building a web-app with PHP on an Apache server.
The app contains a lot of optional data about persons. Depending on the category of the person (one person can be in many categories), they can choose whether or not to specify data: home address (== 5 fields for street, city, country, ...), work address (again 5 fields), age, telephone number, .... The app stores some additional data too, of course (created, last updated, username, password, userlevel, ...).
The current/outdated version of the app has 86 fields in the "users" table and, depending on the category of the person, is extended with an additional table of another 23 fields (1-1 relationship).
All this is stored in a PostgreSQL database.
I'm wondering if this is the best way to handle this type of data. Most records have (a lot of) empty fields, making the db larger and the queries slower. Is it worth looking into another solution like a triple store, or am I worrying too much and should I just keep the current setup? It seems odd and feels awkward to just add fields to a table for every new purpose of the site. On the other hand, I have the impression that triple stores are not that common yet. Any pointers, or suggestions on how to approach this?
I've read "Programming the Semantic Web" by Toby Segaran and others, but from that book I get the impression that the main advantage of triple stores and RDF is the exchange of information over the web (which is not the goal of my app).
Most records have (a lot of) empty fields
This implies that your data is far from normalized.
The current/outdated version of the app has 86 fields in the "users" table and, depending on the category of the person, is extended with an additional table of another 23 fields (1-1 relationship).
Indeed, yes, it's a very long way from being normalized.
If you've got a good reason to move away from where you are just now, then the first step would be to structure your data much better, even if you then choose to move to a different type of DBMS, e.g. NoSQL or an object database.
This does not just save space in your DBMS; it makes retrieving the data faster and reduces the amount of code you need to write (e.g. you can re-use the same code for maintaining a home address as for maintaining a work address if you have a single 'address' table with a field flagging the type of address).
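As a minimal sketch (table and column names are invented for the example, not taken from your schema):

-- One row per address; the type flag distinguishes home from work.
CREATE TABLE person (
    person_id  serial PRIMARY KEY,
    username   varchar(50) NOT NULL UNIQUE,
    created    timestamp NOT NULL DEFAULT now()
);

CREATE TABLE address (
    address_id   serial PRIMARY KEY,
    person_id    integer NOT NULL REFERENCES person(person_id),
    address_type varchar(10) NOT NULL CHECK (address_type IN ('home', 'work')),
    street       varchar(100),
    city         varchar(100),
    country      varchar(100)
);

-- The same query (and PHP code) serves both kinds of address:
SELECT street, city, country
FROM address
WHERE person_id = 42 AND address_type = 'home';

A person with no addresses simply has no rows in the address table, instead of ten NULL fields.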
There are lots of resources on the web (in addition to the Wikipedia link above) describing how to apply the rules of normalization (it starts getting a little involved beyond the first, second and third normal forms, but if you can master those then you're well equipped to take on most tasks).
I'm a software engineer (I finished my studies a few months ago) and for my work I'm developing a large, scalable web application. Another firm does the programming work and builds the database behind it. We defined the data and the relations between them, but didn't prescribe a hard database structure for them to use.
Now the first (internal) parts are visible. I looked at the database structure behind it and saw (in my opinion) something weird.
For users they created a users table which contains fields like id, email and password. Alongside this they created a user_meta table which contains an id, key and value.
When I have a user with:
userid
email
password
name
lastname
address
phone
the id, email and password are stored in the users table. For the other information, rows are created in the user_meta table. This means in this case 4 rows are created for 1 user (in our case it is more than 20 rows per user). This structure is used for every object that has to be saved in the database.
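To make this concrete, here is roughly what I think their structure looks like in SQL (the column names are my guesses from looking at the tables):

CREATE TABLE users (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    password VARCHAR(255) NOT NULL
);

CREATE TABLE user_meta (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    `key`   VARCHAR(64) NOT NULL,
    value   TEXT,
    FOREIGN KEY (user_id) REFERENCES users (id)
);

-- One user = 1 row in users plus one meta row per extra attribute:
INSERT INTO user_meta (user_id, `key`, value) VALUES
    (1, 'name',     'John'),
    (1, 'lastname', 'Doe'),
    (1, 'address',  'Some Street 1'),
    (1, 'phone',    '0612345678');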
I learned to create a table that contains all the data necessary for the user, so in my logic it should be one table with 7 fields...
Extra information:
Our software is built on the Laravel framework (on my advice)
Our software uses a MySQL database
Questions:
Is this a normal way to design a database for large systems? I ask because I have never seen this in my studies or in other projects.
Is this best or bad practice?
In my opinion this is bad for performance, especially as the user count grows. Is this true? Or can this be done to keep performance high, because not a whole record needs to be selected? (In our case we expect at least 100,000 users by the end of 2017, growing faster and faster as the project takes off; there is a material chance that we grow far above 1,000,000 users within a few years.)
I think the reason they do it like this is that the "objects" can be changed very easily, but in my opinion it is always possible to add fields to a table, and you should NEVER delete fields (not even when the structure of the database allows it). Am I right about this?
Sorry if these are "noob questions", but this is the first big project of my life, so I lack some experience, though I try to manage the project professionally. We will discuss this in a meeting on Wednesday, and I want to prepare myself a bit on the topic beforehand.
In my opinion this is bad for performance, especially as the user count grows. Is this true?
No.
There is a small difference in insert/update cost, which does not depend on the volume of data previously accumulated. Retrieval cost for a full record will be higher, but slightly lower for a partial record. At the end of the day, the performance differences are negligible as long as the record is still resolved in a single round trip to the DB (unlike in a lot of naive ORM implementations).
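For example, assuming a users/user_meta layout like the one sketched in the question, a full record can still be fetched in one round trip by pivoting the key/value rows:

-- Pivot the meta rows into columns in a single query.
SELECT u.id,
       u.email,
       MAX(CASE WHEN m.`key` = 'name'     THEN m.value END) AS name,
       MAX(CASE WHEN m.`key` = 'lastname' THEN m.value END) AS lastname,
       MAX(CASE WHEN m.`key` = 'phone'    THEN m.value END) AS phone
FROM users u
LEFT JOIN user_meta m ON m.user_id = u.id
WHERE u.id = 1
GROUP BY u.id, u.email;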
The biggest impact is functional. It's no effort to add, say, a title attribute to the EAV table, but the normalized table may be knocked offline while it is stretched to accommodate the wider records (or the rows are migrated). On the other hand, with EAV you can't enforce a constraint at the database tier such as "every customer must have an email address".
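Roughly, the contrast is this (again assuming the hypothetical tables above):

-- EAV: a new 'title' attribute is just data, no schema change needed.
INSERT INTO user_meta (user_id, `key`, value) VALUES (1, 'title', 'Dr.');

-- Normalized: adding the attribute is a schema change, which may lock
-- or rebuild a big table while the rows are widened...
ALTER TABLE users ADD COLUMN title VARCHAR(20) NULL;

-- ...but only a real column can carry a constraint such as
-- "every user must have an email address":
ALTER TABLE users MODIFY email VARCHAR(255) NOT NULL;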
Is this best or bad practice?
In itself, neither. Although not documenting design decisions is bad practice.
far above 1,000,000 users within a few years
A million records is not a lot (unless the database design is really bad).
The system I'm working on is structured as below, given that I'm planning to use Joomla as the base.
a (www.a.com), b (www.b.com), c (www.c.com) are search portals which allow users to search for reservations.
x (www.x.com), y (www.y.com), z (www.z.com) are hotels where bookings are made by users.
www.a.com's users can only search for the bookings in www.x.com
www.b.com's users can only search for the bookings in www.x.com and www.y.com
www.c.com's users can search for all the bookings in www.x.com, www.y.com and www.z.com
All of a, b, c, x, y, z run the same system, but they should have separate domains. So according to my findings and research, the architecture should be as above, where an API integrates all database calls.
Only 6 instances are shown here (a, b, c, x, y, z); there can be up to 100 with different search combinations.
My problems:
Should I maintain a single database for the whole system? If so, how can I unplug one instance if required (e.g. removing www.a.com or www.z.com from the system)? Since I'm using MySQL, won't the number of records become cumbersome for the system?
If I maintain a separate database for each instance, how can I do the search? How can I integrate the required records and search across them?
Is there a different database approach that should be used rather than those mentioned above?
The problem you describe is "multitenancy" - it's a fairly tricky problem to solve, but luckily, others have written up some useful approaches. (Though the link is to Microsoft, it applies to most SQL environments except in the details).
The trade-offs in your case are:
Does the data from your hotels fit into a single schema? Do their "vacancy" records have the same fields?
How many hotels will there be? 3 separate databases is kinda manageable; 30 is probably not; 300 is definitely not.
How large will the database grow? How many vacancy records?
How likely is it that the data structures will change over time? How likely is it that one hotel will need a change that the others don't?
By far the simplest to manage and develop against is the "single database" model, but only if the data is moderately homogeneous in schema, and as long as you can query the data with reasonable performance. I'd not worry about putting a lot of records in MySQL - it scales very well.
In such a design, you'd map "portal" to "hotel" in a lookup table:
PortalHotelAccess
PortalID  HotelID
-----------------
A         X
B         X
B         Y
C         X
C         Y
C         Z
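In SQL, that mapping and the typical access query might look like this (a minimal sketch; the booking table and all names are illustrative, not a prescription):

CREATE TABLE portal_hotel_access (
    portal_id INT NOT NULL,
    hotel_id  INT NOT NULL,
    PRIMARY KEY (portal_id, hotel_id)
);

-- Everything a given portal is allowed to search:
SELECT b.*
FROM booking b
JOIN portal_hotel_access pha ON pha.hotel_id = b.hotel_id
WHERE pha.portal_id = 3;  -- e.g. portal C

-- "Unplugging" a portal is then just removing its mapping rows:
DELETE FROM portal_hotel_access WHERE portal_id = 1;  -- e.g. portal A

Unplugging a hotel works the same way from the hotel_id side.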
I can suggest 2 approaches. Which one to choose depends on some additional information about the whole system. In fact, the main question is whether or not your system can impersonate (substitute for, in the legal sense) any of the data providers (x, y, z, etc.) from the consumers' point of view (a, b, c, etc.).
Centralized DB
The first approach is based on your original scheme with a centralized API. It implies a single search engine that collects the required data from the data sources, aggregates it in its own DB, and provides it to the data consumers.
This is most likely the preferable solution if the data sources differ in their data representation, so that you need to preprocess the data for uniformity. This variant also protects your clients from connectivity problems: if one of the source sites goes offline for a short period (I think even up to several hours without a great impact on the freshness of the booking data), you can still handle requests for the offline site and store all new documents in the central DB until the problem is solved. On the other hand, this means you must provide some sort of two-way synchronization between your DB and every data source site. The centralized DB should also be built with reliability in mind in the first place, so it should probably be distributed (preferably over different data centers).
As a result, this approach will probably give the best user experience, but will require considerable effort for a robust implementation.
Multiple Local DBs
If every data provider runs its own DB, but all of them (including the backend APIs) are based on a single standard, you can eliminate the need to copy their data into a central DB. Of course, the central point should remain, but it will host middle-layer logic only, without a DB. That layer is actually an API which binds (x, y, z) with the appropriate (a, b, c) - that is configuration, nothing more. Every consumer site will host a widget (it can be plain JavaScript or a fully-fledged web application) loaded from your central point with the appropriate settings embedded in it.
The widget will query all the specified backends directly and aggregate their results into a single list.
This variant is much like how most of today's web applications work; it's simpler to implement, but more error-prone.
I'm writing a job recruitment site for ex-military personnel in PHP/MySQL, and I'm at a stage where I'm not sure which direction to take.
I have a "candidate" table with all the usual firstname, surname, etc. and a unique candidate_id, and an address table linked by the unique candidate_id.
The client has asked for more data to be captured, such as driving licence type, religion, SIA Level (Security Industry Authority), languages spoken, etc.
My question is, with all this different data, is it worth setting up dedicated tables for each? So for example having a driving licence table with all the different types of driving licence, each with a unique id, linked from the candidate table by a driving_licence_id field?
Or should I just serialize all the extra data as text and put it in one field in the candidate table?
My question is, with all this different data, is it worth setting up dedicated tables for each?
Yes.
That is what databases are for.
Dedicated tables versus serialized data is called database normalization and denormalization, respectively. In some cases both options are acceptable, but you should really make an educated choice by reading up on the subject (for example here on about.com).
Personally I usually prefer working with normalized databases, as they are much easier to query for complex aggregated data. Besides, I feel they are easier to maintain as well, since you usually don't have to refactor when adding new fields and tables.
Finally, unless you have a lot of tables, you are unlikely to run into performance problems due to the number of one-to-one joins (the kind of data that's easy to denormalize).
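As a minimal sketch of the dedicated-table approach (names assumed from your description):

-- Lookup table of licence types, referenced from the candidate table.
CREATE TABLE driving_licence (
    driving_licence_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name               VARCHAR(50) NOT NULL
);

ALTER TABLE candidate
    ADD COLUMN driving_licence_id INT NULL,
    ADD FOREIGN KEY (driving_licence_id)
        REFERENCES driving_licence (driving_licence_id);

-- Aggregated queries then stay trivial, e.g. candidates per licence type:
SELECT dl.name, COUNT(*) AS candidates
FROM candidate c
JOIN driving_licence dl ON dl.driving_licence_id = c.driving_licence_id
GROUP BY dl.name;

With serialized text, that last query would mean unserializing every row in PHP.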
It depends on whether you wish to query this data. If so, keep the data normalised (e.g. in its own logically separated table); otherwise, if it's just metadata to be pulled along for the ride, whatever is simplest seems reasonable.
Neither approach necessarily precludes the other in the future; simple migration scripts can be created to move the data from one format to the other. I would suggest doing whatever is simplest so you can start work on the other features of the site soonest.
You must always go for normalization, believe me.
I made the mistake of taking the easy way and storing data improperly (not just serialized, but imploded strings of multidimensional arrays). When the time came I had to redesign the whole thing, and a lot of time was wasted.
I will never go the wrong way again; clients can say "no" today, but they'll ask for reports (queries) tomorrow.
I'm just about to expand the functionality of our own CMS but was thinking of restructuring the database to make it simpler to add/edit data types and values.
Currently, the CMS is quite flat - it requires a manually created database field for every type of stored value.
The first option that comes to mind is simply a table which keeps the data types (i.e. Address 1, Suburb, Email Address, etc.) and another table which holds the values for each of these data types. Just like how WordPress keeps values in the 'options' table, PHP serialize would be used to store an array of values.
The second option is how Drupal works: the CMS creates a table for every data type. Unlike WordPress, this can be a bit of overkill, but it is really useful for SQL queries when ordering and grouping by a particular value.
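Roughly, I picture the two options like this (names are just illustrative):

-- Option 1 (WordPress-style): generic type and value tables; the value
-- column may hold a PHP-serialized array.
CREATE TABLE data_type (
    data_type_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(50) NOT NULL  -- e.g. 'Address 1', 'Suburb'
);

CREATE TABLE data_value (
    entity_id    INT NOT NULL,
    data_type_id INT NOT NULL,
    value        TEXT,
    PRIMARY KEY (entity_id, data_type_id)
);

-- Option 2 (Drupal-style): one dedicated table per data type.
CREATE TABLE field_email_address (
    entity_id INT NOT NULL PRIMARY KEY,
    value     VARCHAR(255) NOT NULL
);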
What are everyone's thoughts?
In my opinion, you should avoid serialization where possible. Your relational database should be relational, and thus structured as such. This would include the 'Drupal method', i.e. one table per data type. This also keeps your database healthy in the sense that it can be indexed and easily queried.
Unless you plan to have lots of different data types added in the future which are unknown now, this is not really going to help you and would be overkill. If you have very wide tables and lots of holes in your data (i.e. lots of columns that seem to be NULL at random), then that is a pattern that is screaming for a separate table for data that may only belong to certain entries, as sketched below.
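A minimal sketch of that pattern, with invented names (a page table with rarely used SEO columns):

-- The optional data moves to its own table; a row exists only for
-- pages that actually have the data, instead of NULLs everywhere.
CREATE TABLE page_seo_settings (
    page_id          INT NOT NULL PRIMARY KEY,
    meta_description VARCHAR(255),
    canonical_url    VARCHAR(255),
    FOREIGN KEY (page_id) REFERENCES page (id)
);

-- Pulled in only when needed:
SELECT p.*, s.meta_description
FROM page p
LEFT JOIN page_seo_settings s ON s.page_id = p.id;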
Keep it simple and logical. Don't abstract for the sake of abstraction. Indeed, storing integers is cheaper in terms of storage space, but unless that is actually a problem, don't do it here.