My website has a followers/following system (like Twitter's). My dilemma is creating the database structure to handle who's following who.
What I came up with was creating a table like this:
id | user_id | followers | following
1 | 20 | 23,58,84 | 11,156,27
2 | 21 | 72,35,14 | 6,98,44,12
... | ... | ... | ...
Basically, I was thinking that each user would have a row with one column for their followers and one for the users they're following, each holding comma-separated user IDs.
Is this an effective way of handling it? If not, what's the best alternative?
That's the worst way to do it. It's against normalization. Have 2 separate tables: Users and User_Followers. Users will store user information. User_Followers will be like this:
id | user_id | follower_id
1 | 20 | 45
2 | 20 | 53
3 | 32 | 20
user_id and follower_id will be foreign keys referring to the id column in the Users table.
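A minimal sketch of that structure (column types are illustrative):
CREATE TABLE Users (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL
    -- ... other user information
);
CREATE TABLE User_Followers (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    user_id     INT NOT NULL,
    follower_id INT NOT NULL,
    UNIQUE KEY (user_id, follower_id),  -- prevent duplicate follows
    FOREIGN KEY (user_id)     REFERENCES Users (id),
    FOREIGN KEY (follower_id) REFERENCES Users (id)
);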
There is a better physical structure than proposed by other answers so far:
CREATE TABLE follower (
    user_id INT,     -- References user.
    follower_id INT, -- References user.
    PRIMARY KEY (user_id, follower_id),
    UNIQUE INDEX (follower_id, user_id)
);
InnoDB tables are clustered, so the secondary indexes behave differently than in heap-based tables and can have unexpected overheads if you are not cognizant of that. Having a surrogate primary key id just adds another index for no good reason¹ and makes indexes on {user_id, follower_id} and {follower_id, user_id} fatter than they need to be (because secondary indexes in a clustered table implicitly include a copy of the PK).
The table above has no surrogate key id and (assuming InnoDB) is physically represented by two B-Trees (one for the primary/clustering key and one for the secondary index), which is about as efficient as it gets for searching in both directions². If you only need one direction, you can abandon the secondary index and go down to just one B-Tree.
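For illustration, the two lookup directions against this table:
-- who follows user 42? (served by the primary-key B-Tree)
SELECT follower_id FROM follower WHERE user_id = 42;
-- whom does user 42 follow? (served by the secondary index)
SELECT user_id FROM follower WHERE follower_id = 42;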
BTW what you did was a violation of the principle of atomicity, and therefore of 1NF.
¹ And every additional index takes space, lowers the cache effectiveness and impacts the INSERT/UPDATE/DELETE performance.
² From followee to follower and vice versa.
One weakness of that representation is that each relationship is encoded twice: once in the follower's row and once in the followed user's row. That makes data integrity harder to maintain and updates tedious.
I would make one table for users and one table for relationships. The relationship table would look like:
id | follower | following
1 | 23 | 20
2 | 58 | 20
3 | 84 | 20
4 | 20 | 11
...
This way adding new relationships is simply an insert, and removing relationships is a delete. It's also much easier to roll up the counts to determine how many followers a given user has.
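For example (a sketch, assuming the relationship table above is named relationship):
-- user 99 starts following user 20
INSERT INTO relationship (follower, following) VALUES (99, 20);
-- user 99 unfollows user 20
DELETE FROM relationship WHERE follower = 99 AND following = 20;
-- how many followers does user 20 have?
SELECT COUNT(*) FROM relationship WHERE following = 20;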
No, the approach you describe has a few problems.
First, storing multiple data points as comma-separated strings has a number of issues. It's difficult to join on (you can join using LIKE, but it will hurt performance), difficult and slow to search on, and it can't be indexed the way you would want.
Second, if you store both a list of followers and a list of people following, you have redundant data (the fact that A is following B will show up in two places), which is both a waste of space, and also creates the potential of data getting out-of-sync (if the database shows A on B's list of followers, but doesn't show B on A's list of following, then the data is inconsistent in a way that's very hard to recover from).
Instead, use a join table. That's a separate table where each row has a user id and a follower id. This allows things to be stored in one place, allows indexing and joining, and also allows you to add additional columns to that row, for example to show when the following relationship started.
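A sketch of such a join table, with a hypothetical started_at column recording when the relationship began (names are illustrative):
CREATE TABLE follows (
    user_id     INT NOT NULL,  -- the user being followed
    follower_id INT NOT NULL,
    started_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, follower_id)
);
-- user 20's followers, newest first
SELECT follower_id, started_at
FROM follows
WHERE user_id = 20
ORDER BY started_at DESC;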
Related
I'm currently developing an application that allows a customer to register for an event through a custom form. That custom form will be built by the event admin for specific input by the customer.
The customer will go to the form, complete the input and pick a venue that will then display the available time-slots. I'm stuck with these two database designs and wondering which one is a better approach.
Pivot table with 3 foreign keys
Table 'Customers' -
| id | name |
Table 'Events' -
| id | name | form_fields (json)
Table 'Venues' -
| id | address | event_id |
Table 'Timeslots' -
| id | datetime | slots | venue_id |
Pivot Table 'Tickets' -
|id | customer_id | timeslot_id | event_id | form_data (json)
Two pivot tables
Table 'Customers' -
| id | name |
Table 'Events' -
| id | name | form_fields (json)
Table 'Venues' -
| id | address | event_id |
Table 'Timeslots' -
| id | datetime | slots | venue_id |
Pivot Table 'Tickets' -
| id | customer_id | timeslot_id |
Pivot Table 'EventCustomers' -
| id | customer id | event_id | form_data (json)
In addition, I will store the HTML markup of the custom form built by the admin in form_fields (json) and have the customer complete the form, storing the values in form_data (json).
Is it also sensible to have the custom form and its data saved as JSON?
Thank you.
To answer your question (even if it's a bit off topic):
None of the above.
To model data we must ask ourselves what the constraints are. Data is often easier to define by what it cannot do than by what it can do.
For example, can you have a Tickets record that:
Does not have a customer record (customer_id = null)
Does not have a timeslot (timeslot_id = null); the timeslot relates to the venue, i.e. the location and time of the event
Does not have an event (event_id = null)
If you answered no to all of these, then we have to bring this data all together at one time (but not necessarily in the same table).
Now in my mind, it's pretty clear you could/should not have a ticket that:
wasn't assigned to a customer
does not have an event
does not have a timeslot
does not have a venue
whose number exceeds the number of slots for the event (this you mostly missed on)
So I will assume these are our "basic" constraints.
Problems with your second case:
you could sell a ticket to a customer for a particular timeslot (at a venue) but for an unknown event: a record in Tickets, and no record in the EventCustomers table
you could also have a customer registered for an event with no ticket or timeslot/venue: a record in EventCustomers, and no record in the Tickets table
To me that seems somewhat illogical, and indeed it violates the constraints I outlined above.
Problems with your first case:
On the surface, the first case looks fine as far as our constraints above go. But as I worked through it, some issues popped up. To understand them, note that as a general rule we always want a unique index across all the foreign keys in a pivot table (a.k.a. a unique compound key).
So in the first case we want this (ideally):
Pivot Table 'Tickets' -
|id | customer_id | timeslot_id | event_id | form_data (json)
//for this table you would want this compound, unique index
Unique Key ticket (customer_id,timeslot_id,event_id)
This led me to the number of "slots", as this implies that a customer could have only one Tickets record per event and timeslot/venue. This relates back to the part I said you mostly missed, i.e. you have no way to track how many slots have been used. At first you might want to allow duplicates in this table. "We can just add some more tickets in, right?" you think; that looks like the easy fix, but it isn't.
Exhibit A:
Pivot Table 'Tickets' -
|id | customer_id | timeslot_id | event_id | form_data (json)
| 1 | 1 | 1 | 1 | {}
| 2 | 1 | 1 | 1 | {}
While contemplating Exhibit A consider some basic DB design rules:
In a good DB design you always want (ideally):
a surrogate primary key, a key with no relation to the data; this is id
a natural key, a unique key that is part of the data. For example, if you had an email field attached to customer, you could make it unique to prevent adding duplicate customers. It's a piece of the data that is by its nature unique.
The first one (the surrogate key) allows you to use the data with no knowledge of the data itself. This is good, as it gives us some separation of concerns, some abstraction between our code and the data: when you join two tables on their primary-key/foreign-key relationship, you don't need to know anything else about the data.
The second (natural key) is essential to prevent duplicate data. In the case of a pivot table the foreign keys, which are surrogate keys in their respective tables, become a natural key in the pivot table. These are now part of the data in the context of the pivot table and they uniquely and naturally identify that data.
Why is uniqueness so important?
Once you allow duplicates with the pivot tables you will run into several issues (especially if you have accessory data like the form_data):
How to tell those records apart?
Which of the duplicates is the authoritative copy, i.e. which one is in charge?
How do you synchronize the accessory data? If you need to change form_data, which record do you change it in? Only one? Which one? Both? How do you keep all the duplicates in sync?
What if an accidental duplicate gets entered? How will you know it was accidental, i.e. a true duplicate and not a valid record?
Even if you knew it was an accidental duplicate, how do you decide which one should be removed? This goes back to which one is the authoritative record.
In short order, it really becomes a mess to deal with.
Finally (what I would suggest)
Table 'customer' -
| id | name |
Table 'event' -
| id | name | form_fields (json)
Table 'venue' -
| id | address | slots |
Table 'show' -
| id | datetime | venue_id | event_id |
Table 'purchase' -
| id | show_id | customer_id | slots | created |
Table 'ticket' ( customers_shows )
| id | purchase_id | guid |
I changed quite a few things (you can use some or all of these changes):
I changed the plural names to singular. I only use plurals for pivot tables that have no accessory data; such a name would be venues_events. This is because a record from customer is a single entity: I don't need any joins to get useful data. A record from our hypothetical venues_events would encompass two entities, so I would know right away that I need a join no matter what, as there is no other data besides the foreign keys.
Now in the case of show, you may notice it is essentially a pivot table. So why did I not name it venues_events, as described above? The reason is that it has a datetime column, which is what I mean by "accessory" data. In this case I could pull data just from show if I wanted only the datetime, and I would not need a join to do it. So it can be considered a single entity that has some many-to-one relationships. (A many-to-many is a many-to-one plus a one-to-many; that's why we need pivot tables.) More on this table later.
Letter casing and spacing: I would suggest all lowercase and no spaces. MySQL table names can be case sensitive (depending on the filesystem) and don't play nice with spaces. It's just easier not to have to remember whether we named it venuesEvents or VenuesEvents or Venuesevents, etc. Consistency in naming convention is paramount in good DB design.
The above is largely opinion based; it's my answer, so it's my opinion. Deal with it.
Table show
I moved the slots column to venue. I am assuming that the venue determines how many slots are available; in my mind this is a physical requirement or attribute of the venue itself. For example, a movie theater has only X seats; what time the movie is at doesn't change how many seats are there. If those assumptions are correct, this saves us a lot of work trying to remember how many seats a venue has every time we enter a show.
The reason I changed timeslot to show is that in both of your original cases there is some disharmony in the data model: some things just don't tie together as well as they should. For example, your timeslots have no direct relation to the event.
Exhibit B (using your structure):
Table 'event' -
| id | name | form_fields (json) |
| 1 | "Event A" | "{}" |
| 2 | "Event B" | "{}" |
Table 'Venues' -
| id | address | event_id |
| 1 | "123 ABC SE" | 1 |
| 2 | "123 AB SE" | 2 | //address entered wrong as AB instead ABC
Table 'Timeslots' -
| id | datetime | slots | venue_id |
| 1 | "2018-01-27 04:41:23" | 200 | 1 |
| 2 | "2018-01-27 04:41:23" | 200 | 2 |
In the above exhibit we can see right away that we have to duplicate the address to create more than one event at a given venue. So if the address was entered wrong, it could be correct in some venues and incorrect in others. This is a real issue: programmatically, how do you know that AB was supposed to be ABC when the venue ID and event ID are both different for this record? Basically, how do you tell those records apart at run time? You will find it very difficult. The main problem is that you have too much data in Venues; you're trying to do too much with it, and the relationship doesn't fit the constraints of the data.
That's not even the worst of it, as a further problem creeps in: because the venue_id is different, we can corrupt our Timeslots table and have two records in there for the same venue at the same time. Then, because the slots are tied to this table, we can also corrupt things downstream, such as selling more tickets than we should for that time and place. Everything just starts to fracture.
Even counting the number of shows at a given venue becomes a real challenge. This "flaw" is in both data models you presented.
The same Data in my Model
# with unique compound key datetime_venue_id (show.datetime, show.venue_id)
Table 'event' -
| id | name | form_fields (json) |
| 1 | "Event A" | "{}" |
#| 2 | "Event B" | "{}" |
Table 'venue' -
| id | address | slots |
| 1 | "123 ABC SE" | 200 |
Table 'show' -
| id | datetime | venue_id | event_id |
| 1 | "2018-01-27 04:41:23" | 1 | 1 |
#| 2 | "2018-01-27 04:41:23" | 1 | 2 |
As you can see, you no longer have the duplicate address. And while it looks like you could enter two shows for the same venue at the same time, that would only be possible without a compound unique key on the datetime and venue_id, a.k.a. UNIQUE KEY datetime_venue_id (datetime, venue_id). If you tried inserting that data with that constraint in place, MySQL would blow up on you. And if you included both inserts (event and show) in the same transaction (which is how I would do it, with the InnoDB engine), the whole thing would fail and get rolled back, and neither the event nor the show would be inserted.
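A sketch of that transaction (assuming the unique key is in place; on the duplicate-key error the application rolls back):
START TRANSACTION;
INSERT INTO event (name, form_fields)
VALUES ('Event B', '{}');
-- fails: duplicate (datetime, venue_id) violates UNIQUE KEY datetime_venue_id
INSERT INTO `show` (datetime, venue_id, event_id)
VALUES ('2018-01-27 04:41:23', 1, LAST_INSERT_ID());
ROLLBACK; -- issued by the application; neither the event nor the show survives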
Now you could try to argue that you could have the same unique constraint on Exhibit B, but as the venue ID is different there, you would be wrong.
Anyway, show is our new main pivot table, with foreign keys to event and venue plus the accessory data datetime.
Besides what I went over above, this setup gives us several advantages over the old structure. In this one table we now have access to:
what and where the event is (by joining on table event)
when the event is (the datetime column)
how many slots are available for the event (by joining on table venue)
This centers everything around the show record. We can build a "show" independent of a customer or tickets. Because really a customer is not part of the show, and including them too soon (or too late, depending on how you look at it) in the data model muddies everything up.
Exhibit C
#in your first case
Pivot Table 'Tickets' -
|id | customer_id | timeslot_id | event_id | form_data (json)
#in your second case
Pivot Table 'Tickets' -
| id | customer_id | timeslot_id |
Pivot Table 'EventCustomers' -
| id | customer id | event_id | form_data (json)
As I said above, you can't put together what I am calling a show (the what, where, and when) without having a customer ID in either of your data models. As you build your application around this later, it will become a huge issue; it may be insurmountable at run time. Basically, you need all that data assembled and waiting on the customer_id. In both of your models that's not the case, and there is data you may not have easy access to. For example, in the first case (of the old structure), how would you know that timeslot_id=20 AND event_id=32 plus a customer equals a valid ticket? There is no direct relationship between timeslot and event outside of the pivot table that contains the customer; timeslot_id=20 could be valid for any event, and you have no way to know that.
It's so much easier to grab, say, show = 32, check how many slots are left, and then just create the purchase record. Everything is ready and waiting for it.
Table purchase
I also added purchase, an order table; even if the "shows" are free, this table provides us with some great utility. It is also a pivot table, but it has some accessory data just like show does (slots and created).
In this table:
we bind the customer table to the show table
we have a created field, so you will know when this record was created, i.e. when the tickets were purchased
we also have the number of slots the customer will use. We can do an aggregate SUM of slots grouped on show_id to see how many slots we have "sold"; with one join from show to venue we can find out how many total slots this "show" has, using the same integer key (show.id) we aggregated on. Then it is a simple matter to compare the two; if you want to get fancy, you may be able to do it all in one query (see the sketch below).
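A sketch of that single query, using the tables above:
-- remaining slots for one show: venue capacity minus slots already sold
SELECT
    v.slots - COALESCE(SUM(p.slots), 0) AS remaining_slots
FROM `show` AS s
JOIN venue AS v ON s.venue_id = v.id
LEFT JOIN purchase AS p ON p.show_id = s.id
WHERE s.id = :show_id
GROUP BY s.id, v.slots;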
Table ticket
Now you may not even need this table. It has a many-to-one relationship to purchase, so one order can have many tickets. The records here would be generated when a purchase is made, the number depending on what is in slots. The primary use of this table is just to provide a unique record for each individual ticket; for that I have a guid column, which can just be a unique hash. Basically, this gives you some tracking ability on individual tickets; I don't really have enough information to know how this will work in your case. You may even be able to replace this table with JSON data if searching on it is not a concern, which would make maintenance easier in the case that some tickets are refunded. But as I hinted, this is very dependent on your particular use case.
Some brief SQL examples
Joining Everything (just to show the relationships):
SELECT
    {some fields}
FROM
    ticket AS t
JOIN
    purchase AS p ON t.purchase_id = p.id
JOIN
    customer AS c ON p.customer_id = c.id
JOIN
    `show` AS s ON p.show_id = s.id -- `show` is a reserved word in MySQL, hence the backticks
JOIN
    venue AS v ON s.venue_id = v.id
JOIN
    event AS e ON s.event_id = e.id
Counting the used slots for a show:
SELECT
    SUM(slots) AS used_slots
FROM
    purchase
WHERE
    show_id = :show_id
GROUP BY show_id
Get the total slots for a show (to compare against the used count above):
SELECT
    v.address,
    v.slots
FROM
    venue AS v
JOIN
    `show` AS s ON s.venue_id = v.id
WHERE
    s.id = :show_id
It also works out nice that all the tables start with a different letter, which makes aliasing a bit easier.
Note: the table name event is a keyword in MySQL but, as far as I know, not a reserved word, so it should work as a table name. show, on the other hand, is a reserved word, which is why it is backticked in the queries above. Coincidentally, this is also why I named purchase that instead of order, as "order" is a reserved word.
I hope that helps and makes sense. I probably spent way more time on this than I should have, but I design things like this for a living and I really enjoy the data architecture part of it, so I can get a bit carried away at times.
Non F-K SCHEMA
human
human_id | name
alien
alien_id | name | planet
comment
comment_id | text
1 | hello
vote
to_id | to_type | who | who_type
1 | human | 1 | alien
1 | comment | 1 | human
F-K SCHEMA
human
human_id | name
alien
alien_id | name | planet
comment
comment_id | text
1 | hello
entity_id
entity_id | id | type
1 | 1 | human
2 | 1 | comment
3 | 1 | alien
vote
to_id | who_id
1 | 3
2 | 1
I want to ask which one is better?
The first one is without foreign keys.
The second one is with foreign keys.
Won't the second one (with the foreign keys) be slow, since I have to do two inserts and unnecessary joins just to get the human/alien name, etc.?
And what will happen if entity_id reaches the maximum of 18446744073709551615?
I suggest you add a supertype to unify Human and Alien and use this supertype in relationships. I also suggest separating votes on comments from votes on users into separate relationships. Consider the following tables:
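Something along these lines (a minimal sketch; names and types are illustrative):
-- user is the supertype; human and alien are subtypes sharing its key
CREATE TABLE user (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(64) NOT NULL
);
CREATE TABLE human (
    user_id INT PRIMARY KEY,
    FOREIGN KEY (user_id) REFERENCES user (user_id)
);
CREATE TABLE alien (
    user_id INT PRIMARY KEY,
    planet  VARCHAR(64) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES user (user_id)
);
CREATE TABLE comment (
    comment_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id    INT NOT NULL, -- the comment's author
    text       TEXT,
    FOREIGN KEY (user_id) REFERENCES user (user_id)
);
-- votes on users and votes on comments kept as separate relationships
CREATE TABLE user_vote (
    to_user_id  INT NOT NULL,
    who_user_id INT NOT NULL,
    PRIMARY KEY (to_user_id, who_user_id),
    FOREIGN KEY (to_user_id)  REFERENCES user (user_id),
    FOREIGN KEY (who_user_id) REFERENCES user (user_id)
);
CREATE TABLE comment_vote (
    comment_id  INT NOT NULL,
    who_user_id INT NOT NULL,
    PRIMARY KEY (comment_id, who_user_id),
    FOREIGN KEY (comment_id)  REFERENCES comment (comment_id),
    FOREIGN KEY (who_user_id) REFERENCES user (user_id)
);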
This is the basic idea, though somewhat oversimplified. It allows a User to have both Human and Alien details. If required, disjoint subtypes can be enforced with a few additional columns and triggers.
You ask whether a foreign key and joins will be slower. An argument can be made that normalized databases are likely to be more efficient, since redundant associations are eliminated. In practice, performance has much more to do with effective use of indexes than avoiding joins.
If an auto_increment column overflows, the database engine will return an error and refuse to insert more rows. In this case you can adjust the column to use a larger type. When you exceed the space of the largest types in MySQL, it's probably time for a different (or even custom) solution.
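For instance, widening a key column to the largest built-in integer type might look like this (a sketch against the entity_id table from the F-K schema above; BIGINT UNSIGNED tops out at 18446744073709551615):
ALTER TABLE entity_id
    MODIFY entity_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;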
I have a table company in which I save company information, and I want to save N company locations for each particular company (country_id, city_id): one company has multiple locations. I have to save country and city in the database in such a way that if a user wants to filter companies by country or by city, the search will be very fast (with indexing applied).
Which option will give me better performance in terms of fast search: JSON or normalization?
Option 1:
Should I maintain country IDs and city IDs in JSON and save that in the company table?
No new table needed; I would add to or update the JSON each time based on the user's selections.
for e.g.
[{"country1" : {city1, city2, city3}},
{"country3" : {city5, city1, city3}}]
Then I can (LIKE) query on this field -> decode json -> return result
Option 2:
Should I create a new table and save the country's and city's PKs along with the company_id FK?
company_id (FK) | country id | city id
1 | 25 | 12
1 | 25 | 16
1 | 25 | 19
1 | 30 | 1
1 | 30 | 69
1 | 30 | 14
then just query and return result
Normalize if you're using traditional SQL.
MongoDB and other similar systems for storing hierarchical data (MarkLogic, etc) have ways of making the search of JSON docs fast.
But searching and updating denormalized data is an unreliable pain in the neck in SQL. With the volume you have, it will be very slow.
Option #2, meaning creating a separate table for company location is the best option. Use the combination of all 3 columns to create the primary key as a clustered index.
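A sketch of that table (the company, country, and city tables are assumed to exist):
CREATE TABLE company_location (
    company_id INT NOT NULL,
    country_id INT NOT NULL,
    city_id    INT NOT NULL,
    PRIMARY KEY (company_id, country_id, city_id), -- clustered in InnoDB
    INDEX (country_id), -- filter companies by country
    INDEX (city_id),    -- filter companies by city
    FOREIGN KEY (company_id) REFERENCES company (id)
);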
Never, under any circumstances, will a delimited-value column be more efficient than lookup tables in a relational database. The cost of using LIKE or parsing the data (not to mention using the LIKE operator to fetch more results than needed and then parsing them in code) is always higher than the cost of querying well-indexed, normalized tables with a simple inner join.
DETAILS
I have a quiz (let’s call it quiz1). Quiz1 uses the same wordlist each time it is generated.
If the user needs to, they can skip words to complete the quiz. I’d like to store those skipped words in mysql and then later perform statistics on them.
At first I was going to store the missed words in one column as a string. Each word would be separated by a comma.
|testid | missedwords | score | userid |
*************************************************************************
| quiz1 | wordlist,missed,skipped,words | 59 | 1 |
| quiz2 | different,quiz,list | 65 | 1 |
The problem with this approach is that I want to show statistics at the end of each quiz about which words were most frequently missed by users who took quiz1.
I'm assuming that storing missed words in one column as above is inefficient for this purpose, as I'd need to extract the information and then tally it (probably tallying with PHP, unless I stored the tallied data in a separate table).
I then thought perhaps I need to create a separate table for the missed words.
The advantage of the table below is that it should be easy to tally the words from it.
|Instance| missed word |
*****************************
| 1 | wordlist |
| 1 | missed |
| 1 | skipped |
Another approach
I could create a table with tallies and update it each time quiz1 was taken.
Testid | wordlist| missed| skipped| otherword|
**************************************************
Quiz1 | 1 | 1| 1| 0 |
The problem with this approach is that I would need a different table for each quiz, because each quiz will use different words. Also, information is lost, because only the tally is kept, not related data such as which user missed which words.
Question
Which approach would you use? Why? Alternative approaches to this task are welcome. If you see any flaws in my logic please feel free to point them out.
EDIT
Users will be able to retake the quiz as many times as they like. Their information will not be updated; instead, a new instance will be created for each quiz they retake.
The best way to do this is to have the word collection completely normalized. This way, analyses will be easy and fast.
quiz_words with wordID, word
quiz_skipped_words with quizID, userID, wordID
To get all the skipped words of a user:
SELECT wordID, word
FROM quiz_words
JOIN quiz_skipped_words USING (wordID)
WHERE userID = ?;
You could add a GROUP BY clause to get counts per word.
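For example, a sketch of tallying the most frequently skipped words for one quiz, across all users (built on the two tables above):
-- most frequently skipped words for a given quiz
SELECT word, COUNT(*) AS times_skipped
FROM quiz_skipped_words
JOIN quiz_words USING (wordID)
WHERE quizID = ?
GROUP BY wordID, word
ORDER BY times_skipped DESC;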
To get the number of times a specific word was skipped:
SELECT COUNT(*)
FROM quiz_skipped_words
JOIN quiz_words USING (wordID)
WHERE word = ?;
According to database normalization theory, the second approach is better: ideally, one relational table cell should store only one value, which is atomic and unsplittable. Each word is an entity instance.
Also, I might suggest not creating per-quiz word tables, but instead reserving another column in the missed-word table for the quiz this word belongs to, then using that column as a foreign key to the quiz table. That way you can probably avoid run-time table generation (which is a "bad practice" in database design).
Why not have a quiz table and a quiz_words table? The quiz_words table would store id, quizID, word as columns. Then, for each quiz instance, create records in the quiz_words table for each word the user missed or skipped.
You could then run MySQL counts on the quiz_words table based on quizID and/or quiz type.
The best solution (from my POV) for what you are trying to achieve is the normalized approach:
test table, which has a test_id column and other columns
missed_words table, which has id (AI PK) and word (UQ); here you can also have a hits column that is incremented each time an association to this word is made in test_missed_words. This way you already have the stats you want compiled, and you don't need to calculate them from a SELECT query
test_missed_words, a link table that has test_id and missed_word_id (composite PK)
With this structure you do not have redundant data (missed words), and you can easily extract the stats that you want.
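A sketch of that structure (types are illustrative):
CREATE TABLE missed_words (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    word VARCHAR(64) NOT NULL UNIQUE,
    hits INT NOT NULL DEFAULT 0 -- pre-compiled skip count
);
CREATE TABLE test_missed_words (
    test_id        INT NOT NULL,
    missed_word_id INT NOT NULL,
    PRIMARY KEY (test_id, missed_word_id)
);
-- each time a word is linked to a test, bump the pre-compiled counter
UPDATE missed_words SET hits = hits + 1 WHERE id = :missed_word_id;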
To keep as much information as possible (and to be able to compile user-specific stats later as well as overall stats now), I would create a table structure similar to:
Stats
quizId | userId | type    | wordId
***********************************
1      | 1      | missed  | 4
1      | 1      | skipped | 7
Here type can either be an int identifying the different types of actions, or a string representation, depending on whether you believe there can ever be more. ^^
Then:
Quizzes
quizId | quizName
*****************
1      | Quiz 1
With the word list made for each quiz like:
WordList (pk: wordId)
quizId | wordId | word
***********************
1      | 1      | Cat
1      | 2      | Dog
You would have your user table however you want; we are just linking its id into this system.
With this, all id fields will be non-unique keys in the stats table. When a user skips or misses a word, you add the id of that word to the stats table along with the relevant quizId and type. This makes it easy to get stats on a per-user, per-word, or per-type basis, or a combination of the three. It also makes the word list for each quiz easily available for building the quizzes. ^^
Hope this helps!
What's the best way to store site statistics for specific users? Basically I want to store how many times a user has done a specific task. The data will be coming from a potentially large table and will be referenced frequently, so I want to avoid COUNT() and store them in their own table.
Method A
Have a table with the following fields, then have a row for each user to store the count for each field:
User_id | posted_comments | comment_replies | post_upvotes | post_downvotes
50 | 12 | 7 | 23 | 54
Method B
Have one table storing the actions, and another storing the count for that action:
Table 1:
Id | Action
1 | posted_comments
2 | comment_replies
3 | post_upvotes
4 | post_downvotes
Table 2
User_id | Action | Count
50 | 1 | 12
50 | 2 | 7
50 | 3 | 23
50 | 4 | 54
I can't see me having more than 25-30 actions in total, but I'm not sure if that is too many to store horizontally as in method A.
I think you answered your own question. If you don't know what the actions are, then store each action in a separate row. That would be the second option.
Be sure that you have the proper indexes on the table. One possibility is (user_id, action, count). With this index, it will be fast to denormalize the table at the user level.
If you have a well-defined problem and won't need to be adding/removing/renaming columns in a table, then the first version is also feasible. Otherwise, just stick with inserting rows. The queries may seem a little bit more complicated, but the application is more flexible.
Seems like a typical BI question to me. The real question is not how many "actions" you have in your dimension, but how often they change.
Table A is denormalized and quick and easy to read: with a "SELECT" you get your information in the proper format.
Table B is normalized and easier to maintain. It is highly recommended if your list of actions is difficult to define in advance, and it is a must if the list is dynamic.
Passing back and forth between Table A and Table B is known as a pivot operation, for which you can find standard tools, but which is never easy to code manually. So do not jump too quickly to the conclusion that Table B is better just because everybody has said so since Codd in 1970.
I suggest you ask yourself how often your COUNT(*) table(s) will be read. If you can live with yesterday's statistics, then compute BOTH tables every night.
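For reference, the pivot from the normalized Method B shape to the Method A row shape can be written with conditional aggregation; a sketch, assuming Table 2 is named user_action_counts and using the action ids from Table 1:
-- rebuild the Method A row shape from the Method B rows
SELECT
    user_id,
    SUM(CASE WHEN action = 1 THEN `count` ELSE 0 END) AS posted_comments,
    SUM(CASE WHEN action = 2 THEN `count` ELSE 0 END) AS comment_replies,
    SUM(CASE WHEN action = 3 THEN `count` ELSE 0 END) AS post_upvotes,
    SUM(CASE WHEN action = 4 THEN `count` ELSE 0 END) AS post_downvotes
FROM user_action_counts
GROUP BY user_id;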