Issue: I am working on an e-commerce platform that has sellers and buyers. In my case a seller can also be a buyer, i.e. every user can both buy and sell.
So I have a single table called users. Now I want to implement a follow vendor/user feature, wherein a user can click follow and then see all the goods listed by that vendor under his account (until he unfollows).
Now my traditional approach was to have a table with a key and two columns to store the follower and the followed, e.g.:
| id | userId | vendorId |
The table keeps growing as users go on following others. But if a user follows many people (say 100), my query may take a lot of time to select those 100 records for each user.
Question: How can I implement the follow mechanism? Is there a better approach than this? I am using PHP and MySQL.
Research: I tried going through how Facebook and Pinterest handle it, but that seemed a bit too big for me to take on now, as I don't expect that many users immediately. Do I need to use Memcache to improve performance and avoid recurring queries? Can I use a document database in any way in parallel with MySQL?
I would like a simple yet powerful implementation that will scale if my user base gradually grows to a few thousand.
Any help or insights would be much appreciated.
Since, from my understanding of this scenario, a user may follow many vendors and a vendor may have many followers, this constitutes a many-to-many relationship, and the only normalised way to model it in a database schema is a link table, exactly as you described.
As for the performance considerations, I wouldn't worry too much: with indexes on userId and vendorId, the queries should be fine.
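A minimal sketch of such a link table (table and column names are only assumptions chosen to match the question):

    CREATE TABLE follows (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
        userId   INT UNSIGNED NOT NULL,  -- the follower
        vendorId INT UNSIGNED NOT NULL,  -- the user/vendor being followed
        PRIMARY KEY (id),
        UNIQUE KEY uq_user_vendor (userId, vendorId),  -- prevents duplicate follows, serves "whom does X follow?"
        KEY idx_vendor (vendorId)                      -- serves "who follows vendor Y?"
    ) ENGINE=InnoDB;

    -- All vendors a given user follows:
    SELECT vendorId FROM follows WHERE userId = 42;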
The junction table is probably the best approach, but a lot still depends on your clustered index.
A table clustered on the surrogate key id can make adding new records a bit faster.
A table clustered on (userId, vendorId) will make the queries that look up the vendors a certain user follows faster.
A table clustered on (vendorId, userId) will make the queries that look up the users who follow a certain vendor faster.
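In InnoDB the primary key is the clustered index, so the second option could be sketched like this (an assumption-laden variant of the table above that drops the surrogate id):

    CREATE TABLE follows (
        userId   INT UNSIGNED NOT NULL,
        vendorId INT UNSIGNED NOT NULL,
        PRIMARY KEY (userId, vendorId),        -- clustered: a follower's rows are stored together
        KEY idx_vendor_user (vendorId, userId) -- secondary index for "who follows this vendor?"
    ) ENGINE=InnoDB;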
Related
I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of their comments has received. Of course, I also need to know who liked which comments (partly so that I can prevent a user from liking a comment more than once) for analytical purposes.
The naive way of adding a Like system to the current comments module is to create a new table in the database that has foreign keys to the CommentID and UserID. Then, for every "like" given to a comment by a user, I would insert a row into this new table with the target comment ID and user ID.
While this might work, the massive number of comments and users will cause this table to grow quickly, and retrieving records from and doing counts on this huge table will become slow and inefficient. I can index either of the columns, but I don't know how effective that would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this with a huge database, how should I design a Like system so that it is more optimised and stable?
For scalability, do not include the count column in the same table with other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code to do the increment/decrement hits a table used for other things (such as the text of the Comment), there will be an unacceptable amount of contention between the two.
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)
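A sketch of what that vertical split could look like; table and column names here are hypothetical, with the hot counter kept in its own small table so LIKE/UNLIKE updates don't contend with the comment text:

    CREATE TABLE comment_likes (           -- one row per (comment, user) like
        comment_id INT UNSIGNED NOT NULL,
        user_id    INT UNSIGNED NOT NULL,
        PRIMARY KEY (comment_id, user_id)  -- also prevents double-liking
    ) ENGINE=InnoDB;

    CREATE TABLE comment_like_counts (     -- counter table, separate from the comments table
        comment_id INT UNSIGNED NOT NULL,
        like_count INT UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (comment_id)
    ) ENGINE=InnoDB;

    -- On a like:
    INSERT INTO comment_like_counts (comment_id, like_count) VALUES (123, 1)
        ON DUPLICATE KEY UPDATE like_count = like_count + 1;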
Your main concern will be the large number of counts, so the easy thing to do is to keep a separate count column in your comments table.
Then you can create a TRIGGER that increments/decrements the count based on a like/unlike.
That way you only use the big table to figure out if a user already voted.
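A sketch of such triggers, assuming a likes(comment_id, user_id) table and a like_count column on comments (both names are assumptions):

    DELIMITER //
    CREATE TRIGGER trg_likes_insert AFTER INSERT ON likes
    FOR EACH ROW
    BEGIN
        -- a row was inserted into likes, so bump the cached count
        UPDATE comments SET like_count = like_count + 1 WHERE id = NEW.comment_id;
    END//

    CREATE TRIGGER trg_likes_delete AFTER DELETE ON likes
    FOR EACH ROW
    BEGIN
        -- an unlike removed the row, so decrement the cached count
        UPDATE comments SET like_count = like_count - 1 WHERE id = OLD.comment_id;
    END//
    DELIMITER ;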
Problem statement: I am working on an application in which a user can follow other users (like Twitter or other e-commerce sites) and get their updates on his wall. It involves merchants and users: a user can follow any merchant, and the user himself can be a merchant, so it is really users following other users (a many-to-many relation).
Issue: The easiest way to go about it was to have a junction table with
id (auto-increment) | follower_user_id | followed_user_id
But I am not sure how well this will scale as the table grows vertically. If a user follows 100 people, there will be 100 entries for that single user. In that case, if I want to get the followers of any user, the query will take longer to execute.
Research: I tried studying Twitter and other websites and their DB designs, but they use different databases, such as graph-based NoSQL stores, to solve their problems. In our case it's MySQL. I also looked at caching mechanisms, but I would like to know if there is any way I could store the values horizontally, i.e. each user having his followers in a single row (comma-separated turned out to be tedious when I tried it).
Can I have a separate database for this feature, something NoSQL-based (Mongo, etc.)? What impact would that have on performance in different cases?
If my approach of going with the easiest way is right, how can I improve the performance for, say, 5-10k users (looking at a small base for now)? Would basic MySQL queries work well?
Any inputs on this would be appreciated.
The system I use (my personal preference) is to add 2 columns to the users table, following and followers, and store a simple encrypted JSON array in each, holding the IDs of the followers and of the users being followed.
The only drawback is that when querying you have to decrypt it and then json_decode it, but it has worked fine for me for almost 2 years.
After going through the comments and doing some research, I came to the conclusion that it would be better to go the normal way: create the followers table, add some indexing, and use a caching mechanism on top of it.
For indexing, the composite indexes suggested above should work well.
For caching I am planning to use Memcache!
I'm building an AWeber-like list management system (for phone numbers, not emails).
There are campaigns. A phone number is associated with each campaign. Users can text that number, after which they are subscribed.
I'm building "Create a New Campaign" page.
My current strategy is to create a separate table for each campaign (campaign_1,campaign_2,...,campaign_n) and store the subscriber data in it.
It's also possible to just create a single table and add a campaign_id column to it.
Each campaign is supposed to have 5k to 25k users.
Which is a better option? #1 or #2?
Option 2 makes more sense and is the widely used approach.
I suppose it really depends on the number of campaigns you're going to have. Let's give you some pros/cons:
Pros for campaign_n:
Faster queries
You can have each instance run with its own code and own database
Cons for campaign_n:
Database modifications are harder (you need to sync all tables)
You get a lot of tables
Personally I'd go for option 2 (campaign_id field), unless you have a really good reason not to.
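A sketch of option 2 as a single subscribers table keyed by campaign (table and column names are hypothetical):

    CREATE TABLE campaigns (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
        name VARCHAR(100) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    CREATE TABLE subscribers (
        id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
        campaign_id   INT UNSIGNED NOT NULL,
        phone_number  VARCHAR(20)  NOT NULL,
        subscribed_at DATETIME     NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY uq_campaign_phone (campaign_id, phone_number),  -- one subscription per number per campaign
        CONSTRAINT fk_subscribers_campaign FOREIGN KEY (campaign_id) REFERENCES campaigns (id)
    ) ENGINE=InnoDB;

At 5k-25k subscribers per campaign this table stays small by MySQL standards, and the composite index keeps per-campaign lookups fast.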
So I'm working on a site that will replace an older site with a lot of traffic, and I will also have a lot of data in the DB, so my question is: what is the best way to design MySQL tables for growth?
I was thinking of splitting, say, a table with 5,000,000 rows into 5 tables of 1,000,000 rows each and creating relationships between them, but I guess this isn't a good option, since I would spend a lot of resources and time figuring out which table my data is in.
Or can you give me some tips, maybe some useful articles?
No, but you're absolutely right about the relationships. The technique is called normalization: you define separate tables because those individual tables change over time and are independent of the other tables.
So if you have a hotel database that keeps track of rooms and guests, you know normalization is necessary because rooms and guests are independent of each other.
But you will have foreign keys/surrogate keys in each table (for instance, room_id) that relate a particular guest to the particular room they are staying in.
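A minimal sketch of that hotel example, with the foreign key relating a guest to a room (all names are hypothetical):

    CREATE TABLE rooms (
        room_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
        room_number VARCHAR(10)  NOT NULL,
        PRIMARY KEY (room_id)
    ) ENGINE=InnoDB;

    CREATE TABLE guests (
        guest_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        name     VARCHAR(100) NOT NULL,
        room_id  INT UNSIGNED NULL,  -- which room this guest is currently in
        PRIMARY KEY (guest_id),
        FOREIGN KEY (room_id) REFERENCES rooms (room_id)
    ) ENGINE=InnoDB;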
Normalization, in your case, could help you optimize those 5,000,000 rows of yours, as it would not be optimal for a loop to go over that many elements and retrieve all the data.
That is a strong example of why normalization is essential in database management.
Partitioning, as mentioned in a comment, is one way to go, but the first path to check out is whether you can break the tables with large amounts of data down into workable chunks based on some internal attribute of the data.
For instance, let's say you have a huge table of contacts. You can essentially break the data down into contacts whose names start with a-d, e-j, etc. Then when you add records you just make sure you add them to the correct table (I'd suggest checking out stored procedures for handling this, so that the logic is kept in the database). You'd probably also set up stored procedures to read data from those tables. By doing this, however, you have to realize that auto-incrementing IDs won't work correctly, as you won't be able to maintain unique IDs across all of the tables without doing some work yourself.
These, of course, are the simple solutions. There are tons of solutions for large data sets, including looking at other storage solutions, clustering, partitioning, etc. Doing some of these things manually yourself can give you a bit of an understanding of some of the possible "manual solutions".
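For comparison, MySQL's built-in range partitioning can do the a-d / e-j style split inside one logical table, so the application doesn't have to route inserts itself. A sketch with hypothetical names (note that the partitioning column must be part of every unique key, including the primary key):

    CREATE TABLE contacts (
        id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
        last_name VARCHAR(50)  NOT NULL,
        email     VARCHAR(100),
        PRIMARY KEY (id, last_name)  -- partition column included in the primary key
    ) ENGINE=InnoDB
    PARTITION BY RANGE COLUMNS (last_name) (
        PARTITION p_a_to_d VALUES LESS THAN ('e'),
        PARTITION p_e_to_j VALUES LESS THAN ('k'),
        PARTITION p_rest   VALUES LESS THAN (MAXVALUE)
    );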
In the site I am currently working on members can favorite other members. Then when a member goes to their favorites page they can see all the members they have favorited throughout time.
I can go about this in 2 ways:
Method #1:
Every time a user favorites another I enter a row in the favorites table which looks like this (the index is user_favoriting_id):
id | user_favorited_id | user_favoriting_id
-------------------------------------------
Then when they load the "My Favorites" page, I do a select on the favorites table to find all the rows where the user_favoriting_id value equals that of the currently logged-in user. I then take the user_favorited_ids, build a single SELECT statement, and look up the respective users from a separate users table.
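For what it's worth, those two steps can usually be collapsed into a single JOIN (a sketch assuming the column names above and a users table with id and username columns):

    SELECT u.id, u.username
    FROM favorites AS f
    JOIN users AS u ON u.id = f.user_favorited_id
    WHERE f.user_favoriting_id = 42;  -- the logged-in user's id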
Method #2:
Each time a user favorites another I will update the favorites field on their row in the users table, which looks something like this (albeit with more fields, the index is id):
id | username | password | email | account_status | timestamp | favorites
--------------------------------------------------------------------------
I will CONCAT the id of the user being favorited onto the favorites field, so that column will hold a comma-separated string like so:
10,44,67 etc...
Then to produce the My Favorites page like method #1 I will just grab all the favorite users with one select. That part is the same.
I know method #1 is the normalized way to do it and is much prettier. But my concern for this particular project is scalability and performance above anything else.
If I choose method #2, it avoids the lookup on the separate favorites table, since the users table has to be selected anyway as soon as the user logs in.
And I'm pretty sure using PHP's explode function to split those CSV values in method #2 wouldn't take nearly as long as method #1's additional DB lookup on the favorites table, but just in case I must ask:
From a purely performance perspective, which of these methods is more optimized?
Also please assume that this website will get a trillion page views a day.
You say that scalability is a concern. This seems to imply that Method #2 won't work for you, because that limits the number of favorites that a user can have. (For example, if you have a million users, then most users will have five-digit IDs. How wide do you want to let favorites be? If it's a VARCHAR(1000), that means that fewer than 200 favorites are allowed.)
Also, do you really expect that you will never want to know which users have "favorited" a given user? Your Method #2 might be O.K. if you know that you will always look up favoritings by "favoriter" rather than "favoritee", but it falls apart completely otherwise. (And even here, it only makes sense if you don't expect to look up anything meaningful about the "favoritee" aside from his/her user ID; otherwise, if you actually look up the "favoritees", then you're basically doing all the hard work of a JOIN yourself, just removing any opportunity for MySQL to do the JOIN intelligently.)
Overall, it's better to start out with best-practices, such as normalization, and to move away from them only when performance requires it. Otherwise something that seems like a performance optimization can have negative consequences, forcing you to write very un-optimal code further down the line.
JOINs take time, but I wouldn't make the change until you have some data that suggests that it's necessary.
Normalization is good for a number of reasons; it's not just an academic exercise.
Concatenation of IDs into a column is a heinous crime against normalization. Don't do it.
You're assuming that your code is faster than all the work that's been done to optimize relational databases. That's a big mistake.
Make sure that you have indexes on primary and foreign keys that participate in the JOINs.
Profile your app when you have real performance issues; don't guess.
Make sure that the real problem isn't your app. Bringing back too much unnecessary information will be more of a performance drag than a normalized schema.
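To make the indexing and profiling points concrete, a small sketch (assuming the favorites table from method #1; EXPLAIN is MySQL's query-plan inspector):

    -- Composite index covering the common lookup direction:
    ALTER TABLE favorites
        ADD INDEX idx_favoriting (user_favoriting_id, user_favorited_id);

    -- Profile the real query instead of guessing:
    EXPLAIN
    SELECT u.id, u.username
    FROM favorites AS f
    JOIN users AS u ON u.id = f.user_favorited_id
    WHERE f.user_favoriting_id = 42;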
Use both. The first (the normalized approach) is preferable from a data normalization, maintainability, and data integrity perspective (and for other reasons as well) - you should always strongly favor this approach.
But there's no reason not to use the other approach as well if the normalized approach is not acceptable for read performance. Often an alternative, denormalized approach will be better for reads. So use the first one as the "master" for keeping track of the data and for ensuring data integrity, and then keep a denormalized "copy" of the data in the other structure for read access... Update the copy from the master any time it changes (inserts, updates, deletes).
But measure the performance of your alternative approach to ensure it is indeed faster, and by a margin sufficient to justify its use.
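A sketch of keeping the denormalized copy in sync with the normalized master inside one transaction, assuming the favorites table from method #1 and a denormalized favorites column on users as in method #2 (all names hypothetical):

    START TRANSACTION;

    -- Master (normalized): record the new favorite
    INSERT INTO favorites (user_favoriting_id, user_favorited_id)
    VALUES (42, 67);

    -- Copy (denormalized): refresh the cached list used for fast reads
    UPDATE users
    SET favorites = (
        SELECT GROUP_CONCAT(user_favorited_id)
        FROM favorites
        WHERE user_favoriting_id = 42
    )
    WHERE id = 42;

    COMMIT;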
As far as I know, denormalization in MySQL is fairly trivial, but if you used something that is not an RDBMS, such as CouchDB or MongoDB, there is a whole engine for manipulating the data in a safe way, and it is really scalable; a non-relational database will work faster for you.
The only method I prefer for optimizing a web app that uses MySQL is to denormalize the table and then hand some of the work to PHP; and of course, using HipHop you will get some really big optimizations there, because you offload MySQL and load PHP, which HipHop can optimize by up to 50%!
Probably not, but it would totally screw your database up for reasons that others have already cited.
Do NOT use a comma-separated-list-of-IDs pattern. It just sucks.
I strongly suspect that you won't have enough users on your site for this to matter, as unless you're Facebook you are unlikely to have > 1M users. Most of those 1M users won't choose anybody as their favourite (because most will be casual users who don't use that feature).
So you're looking at an extremely small table (say, 1M rows maximum if your 1M users have an average of 1 favourite, though most don't use the feature at all) with only two columns. You can potentially improve the scans in InnoDB by making the primary key start with the thing you most commonly want to search by, BUT - get this - you can still add a secondary index on the other column and get reasonable lookup times (actually, very quick, as the table will fit into memory on the tiniest of servers!).
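Applied to the favorites table from method #1, that advice could look like this (a sketch; which column leads the primary key is an assumption about your most common query):

    CREATE TABLE favorites (
        user_favoriting_id INT UNSIGNED NOT NULL,
        user_favorited_id  INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_favoriting_id, user_favorited_id),  -- clustered on "whom did X favorite?"
        KEY idx_favorited (user_favorited_id)                 -- secondary index for "who favorited Y?"
    ) ENGINE=InnoDB;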