I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of their comments has received. Of course, I also need to know who liked which comments (partly so that I can prevent a user from liking a comment more than once) for analytical purposes.
The naive way of adding a Like system to the current comments module is to create a new table in the database with foreign keys to CommentID and UserID. Then, for every "like" a user gives a comment, I would insert a row into this new table with the target comment ID and user ID.
While this might work, the massive number of comments and users is going to cause this table to grow quickly, and retrieving records from and doing counts on this huge table will become slow and inefficient. I could index either one of the columns, but I don't know how effective that would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this with a huge database, how should I design a Like system so that it is better optimised and more stable?
For scalability, do not include the count column in the same table with other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code to do the increment/decrement hits a table used for other things (such as the text of the Comment), there will be an unacceptable amount of contention between the two.
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)
Your main concern will be a lot of counts, so the easy thing to do is to keep a separate count in your comments table.
Then you can create a TRIGGER that increments/decrements the count based on a like/unlike.
That way you only use the big table to figure out if a user already voted.
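A minimal sketch of how that could be wired up, assuming InnoDB; the table, column, and trigger names here are my own illustrative choices, not a prescribed schema:

    -- One row per (comment, user) like; the composite PK also prevents duplicate likes.
    CREATE TABLE comment_likes (
        comment_id INT UNSIGNED NOT NULL,
        user_id    INT UNSIGNED NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (comment_id, user_id),
        INDEX (user_id)                      -- "which comments did this user like?"
    ) ENGINE=InnoDB;

    -- Counter kept apart from the comment text to avoid contention on the comments table.
    CREATE TABLE comment_like_counts (
        comment_id INT UNSIGNED NOT NULL PRIMARY KEY,
        like_count INT UNSIGNED NOT NULL DEFAULT 0
    ) ENGINE=InnoDB;

    -- Triggers keep the counter in sync with likes/unlikes.
    CREATE TRIGGER trg_like_insert AFTER INSERT ON comment_likes
    FOR EACH ROW
        INSERT INTO comment_like_counts (comment_id, like_count)
        VALUES (NEW.comment_id, 1)
        ON DUPLICATE KEY UPDATE like_count = like_count + 1;

    CREATE TRIGGER trg_like_delete AFTER DELETE ON comment_likes
    FOR EACH ROW
        UPDATE comment_like_counts
        SET like_count = like_count - 1
        WHERE comment_id = OLD.comment_id;

With this layout, the hot increment/decrement traffic only touches the narrow counter table, while the per-(comment, user) rows remain available to stop duplicate likes and for analytics.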
So I'm making a web-based game similar to Torn City, where there could potentially be millions of users.
My issue is regarding user inventories. I started out creating dynamic tables based on each user's ID, e.g. table name = [UserID]_Inventory.
From what I've found out, this can open the door to SQL injection and similar attacks because of the dynamic table creation.
My only other option seems to be creating one giant table holding every item that every player has and all the varied details of each item. This seems like it would take longer and longer to query as the user count increases, and a user's inventory will likely be accessed often.
Is there another option?
My only idea so far is to create some kind of temporary inventory that grabs only the active player inventories. That helps the database search time issues but still brings me back to creating dynamic tables.
At this stage I don't really need coding help, rather I need database structure help.
Code is appreciated tho.
Cheers.
Use the big table. Index it optimally. It should not give you trouble until you get well past a billion rows.
Here's a trick to optimizing use of such a table. Instead of
PRIMARY KEY(id),
INDEX(user_id)
have
PRIMARY KEY(user_id, id),
INDEX(id)
Since the PK is "clustered" with the data and the data is ordered according to the PK, this puts all of one user's rows next to each other. In huge tables, this cuts back significantly on I/O and hence improves overall speed. It also reduces pressure on the buffer_pool. (I assume you are using InnoDB?)
The INDEX(id) is sufficient for AUTO_INCREMENT.
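A rough sketch of what that could look like for an inventory table; the non-key columns are only guesses on my part:

    CREATE TABLE inventory (
        user_id INT UNSIGNED NOT NULL,
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        item_id INT UNSIGNED NOT NULL,
        qty     INT UNSIGNED NOT NULL DEFAULT 1,
        PRIMARY KEY (user_id, id),    -- clusters each user's rows together
        INDEX (id)                    -- keeps AUTO_INCREMENT valid
    ) ENGINE=InnoDB;

    -- Fetching one user's whole inventory now reads a contiguous range of rows:
    SELECT item_id, qty FROM inventory WHERE user_id = 123;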
There could be more suggestions, but I need more details. Please provide SHOW CREATE TABLE (as it stands now) and the main SELECTs. I am likely to suggest more changes to the indexes, datatypes, and query formulations.
(Dynamic tables are a mistake, and your troubles in that direction have only begun.)
Issue: I am working on a kind of e-commerce platform which has sellers and buyers. Now, in my case a seller can also be a buyer, i.e. every user can both buy and sell.
So I have a single table called users. Now I want to implement a follow vendor/user feature, wherein a user can click follow and then see all the goods listed by that vendor under his account (until he unfollows).
Now my traditional approach was to have a table with a key and two columns to store the follower and the followed, e.g.:
| id | userId | vendorId |
So the table will keep growing as users go on following others. But if I have a user following many people (say 100), my query may take a lot of time to select those 100 records for each user.
Question: How can I implement the follow mechanism? Is there a better approach than this? I am using PHP and MySQL.
Research: I tried going through how Facebook and Pinterest handle it, but that seemed a bit too big for me to learn now as I don't expect as many users immediately. Do I need to use memcache to improve performance and avoid recurring queries? Can I use a document database in parallel with MySQL in some way?
I would like a simple yet powerful implementation that would scale if my userbase grows gradually to a few thousand.
Any help or insights would be very helpful.
Since, from my understanding of this scenario, a user may follow many vendors and a vendor may have many followers, this constitutes a many-to-many relationship, and thus the only normalised way to model it in a database schema is through a link table, exactly as you described.
As for the performance considerations, I wouldn't worry too much: as long as the table is indexed on userId and vendorId, the queries should be fine.
The junction table is probably the best approach but still a lot depends on your clustered index.
A table clustered with a key on the surrogate key id can make adding new records a bit faster.
A table clustered with a key on (userId, vendorId) will make queries that look up the vendors a certain user follows faster.
A table clustered with a key on (vendorId, userId) will make queries that look up the users who follow a certain vendor faster.
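For illustration, here is what the junction table could look like clustered on (userId, vendorId), with a secondary index covering the other direction; the products table in the query is hypothetical:

    CREATE TABLE follows (
        userId   INT UNSIGNED NOT NULL,    -- the follower
        vendorId INT UNSIGNED NOT NULL,    -- the vendor being followed
        PRIMARY KEY (userId, vendorId),    -- clustered: "whom does this user follow?" is fast
        INDEX (vendorId, userId)           -- covers "who follows this vendor?"
    ) ENGINE=InnoDB;

    -- Goods listed by the vendors a given user follows:
    SELECT p.*
    FROM follows f
    JOIN products p ON p.vendorId = f.vendorId
    WHERE f.userId = 123;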
A friend of mine has a catalogue that currently holds about 500 rows, or 500 items. We are looking at ways we can provide reports on the catalogue, including the number of times an item was viewed and the dates on which it was viewed.
His site is averaging around 25,000 page impressions per month, and if we assume for a minute that half of these were catalogue items, then that's roughly 12,000 catalogue item views each month.
My question is the best way to manage item views in the database.
The first option is to insert the catalogue ID into a table and then increment the number of times it's viewed. The advantage of this is its compact nature: there will only ever be as many rows in the table as there are catalogue items.
`catalogue_id`, `views`
The disadvantage is that no date information is being held, short of maintaining the last time an item was viewed.
The second option is to insert a new row each time an item is viewed.
`catalogue_id`, `timestamp`
If we continue with the assumed figure of 12,000 item views, that means adding 12,000 rows to the table each month, or 144,000 rows each year. The advantage of this is that we know the number of times the item is viewed and also the dates on which it's viewed.
The disadvantage is the size of the table. Is a table with 144,000 rows becoming too large for MySQL?
Interested to hear any thoughts or suggestions on how to achieve this.
Thanks.
As you have mentioned, the first option is a lot more compact but limited. However, consider option 2 in more detail: you may, for instance, wish to store more than just the view count, such as the entry/exit page, host IP, etc. This information may be invaluable for stats and tracking. The other question is: are these 25,000 impressions unique? If not, you are able to track by username, IP, or some other unique identifier, which could let you avoid using as many rows. The answer to your question depends on how much detail you wish to store and how important the data is.
Update:
True, limiting the repeats on a given item within a time interval would be a good solution. Also, knowing whether someone visited the same item could be useful for suggested-item prediction widgets similar to what Amazon does. Knowing that someone visited an item many times also tells me that this is a good item to promote to them or to others in a mail-out, newsletter, or popular-products page. Tracking unique views will give a more honest view count, which you can choose to display or store. On the issue of limiting the value of repeat visitors, this mainly comes into play depending on what information you display. It is all about framing the information in the way that best suits you.
Your problem statement: we want to be able to track the number of views for a particular catalogue item.
Let's review your options.
First Option:
In this option you will be storing the catalogue_id and an integer value holding the number of views of the item.
Advantages:
Well, since you really have a one-to-one relationship, the new table is going to be small. If you have 500 items you will have 500 rows. If you choose this route, I would suggest not creating a new table but instead adding another column to the catalogue table holding the number of views.
Disadvantages:
The problem here is that since you are going to be updating this table relatively frequently, it is going to be a very busy little table. For example, say 10 users are viewing the same item. Those 10 updates will have to run one after the other. Assuming you are using InnoDB, the first view would come in, lock the row, update the counter, and release the lock. The other updates would queue up behind it. So while the table holds little data, it could potentially become a bottleneck later on, especially if you start scaling the system.
You are losing granular data, i.e. you are not keeping track of the raw data. For example, let's say the website starts growing and you have an interested investor who wants to see a breakdown of the views per week over the last 6 months. If you use this option you won't have the data to give them. Essentially you are keeping only a summary.
Second Option:
In this option you would create a logging table with at least the following minimal fields: catalogue_id and timestamp. You could expand this by adding a username/IP address or other information to make it even more granular.
Advantages:
You are keeping granular data, which allows you to summarise it in a variety of ways. You could, for example, add an IP address column, store the visitor's IP, and then produce a monthly report showing products viewed by country (you could do an IP address lookup to get an idea of which country they were from). Another example would be to see which products were viewed the most over the last quarter. This data is pretty essential in helping you make decisions on how to grow your business. If you want to know what is and isn't working as far as products are concerned, this detail is absolutely critical.
Your new table will be a logging table, with insert operations only. Inserts can pretty much happen in parallel. If you go with this option, it will probably scale better as the site grows compared to a constantly updated table.
Disadvantages:
This table will be bigger, probably the biggest table in the database. However, this is not a problem. I regularly deal with tables of 500,000,000+ rows; some of my tables are over 750 GB by themselves and I can still run reporting on them. You just need to understand your queries and how to optimise them. This is really not a problem, as MySQL was designed to handle millions of rows with ease. Also keep in mind that you can archive some information into other tables. Say you archive every 3 years: you could move data older than 3 years into another table. You don't have to keep all the data in one place. Your estimate of 144,000 rows per year means you could probably safely keep 15+ years' worth without ever worrying about the performance of the table.
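If it helps, here is a rough sketch of what such a logging table and one possible summary query might look like; the table and column names are only assumptions:

    CREATE TABLE catalogue_views (
        id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        catalogue_id INT UNSIGNED NOT NULL,
        viewed_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        visitor_ip   VARBINARY(16) NULL,          -- optional extra granularity
        INDEX (catalogue_id, viewed_at)
    ) ENGINE=InnoDB;

    -- Views per item per month over the last 6 months:
    SELECT catalogue_id,
           DATE_FORMAT(viewed_at, '%Y-%m') AS month,
           COUNT(*) AS views
    FROM catalogue_views
    WHERE viewed_at >= CURDATE() - INTERVAL 6 MONTH
    GROUP BY catalogue_id, month;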
My suggestion is to seriously consider the second option. If you decide to go this route, update your question with the proposed table structures and let us have a look at them. Don't be scared of big data; rather, be scared of BAD design, which is much more difficult to deal with.
However as always the choice is yours.
What are some of the techniques for providing personalized search results to a logged-in user? One way I can think of is analyzing the user's browsing history.
Tracking: A log of a user's activities, like pages viewed and 'like' buttons clicked, can be used to bias search results.
Question 1: How do you track a user's browsing history? A table with columns user_id, number_of_hits, page_id? If I have 1,000 daily visitors, each browsing 10 pages on average, won't there be a large number of records to select each time a personalized recommendation is required? The table will grow by 300K rows a month! It will take longer and longer to select the rows each time a search is made. I guess the table for recording 'likes' will use the same design.
Question 2: How do you bias the results of a search? For example, if a user has been searching for Apple products, how does the search engine realise that the user likes Apple products and subsequently bias the search towards them? Tag the pages and accumulate a record of the tags on the pages visited?
You probably don't want to use a relational database for this type of thing; take a look at MongoDB or Cassandra. That's because you basically want to keep adding new columns to the user's history, so a column-oriented database makes more sense.
300K rows per month is not really that much; in fact, that's almost nothing. It doesn't matter whether you use a relational or non-relational database for this.
A straightforward approach is the following:
put entries into the table/collection like this:
timestamp, user, action, misc information
(make sure that you put as much information as possible, such that you don't need to join this data warehousing table with any other table)
partition by timestamp (one partition per month)
never go against this table directly; instead have, say, daily report jobs run over all the data, compute the necessary statistics, and write them to a summary table (see the sketch after this list)
reflect on your report queries and put appropriate partition local indexes
only go against the summary table from your web frontend
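As a sketch of how the pieces fit together; the table names, columns, and partition boundaries below are purely illustrative assumptions:

    CREATE TABLE user_activity (
        ts      DATETIME NOT NULL,
        user_id INT UNSIGNED NOT NULL,
        action  VARCHAR(32) NOT NULL,      -- e.g. 'view', 'like', 'search'
        page_id INT UNSIGNED NULL,
        INDEX (user_id, ts)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(ts)) (    -- one partition per month
        PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Small summary table that the web frontend reads.
    CREATE TABLE user_page_summary (
        user_id INT UNSIGNED NOT NULL,
        page_id INT UNSIGNED NOT NULL,
        hits    INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, page_id)
    ) ENGINE=InnoDB;

    -- Daily report job: roll yesterday's raw rows up into the summary table.
    INSERT INTO user_page_summary (user_id, page_id, hits)
    SELECT user_id, page_id, COUNT(*)
    FROM user_activity
    WHERE action = 'view' AND page_id IS NOT NULL
      AND ts >= CURDATE() - INTERVAL 1 DAY AND ts < CURDATE()
    GROUP BY user_id, page_id
    ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);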
If you stored only the last X results as opposed to everything, it would probably be doable. It might slow things down, but it'd work. Any time you're writing and reading more data, there's going to be an impact. Proper DBA methods such as indexing and query optimisation can help, but no matter what you use, there's going to be an effect.
I'd personally look at storing just a default view for the user in the DB and using the session to keep track of the rest. Sure, when you log in there'd be no history. But you could take advantage of that to highlight a set of special pages that you think are important or relevant to steer the user to. A highlight system of sorts. Faster, easier, and more user-friendly.
As for bias, you could write a set of keywords for each record and array sort them accordingly. Wouldn't be terribly difficult using PHP.
I use MySQL with over 2M records (page views) a month, and we run reports on that table daily and often.
The table is partitioned by month (like already suggested) and indexed where needed.
I also clear out data that is over 6 months old by moving it into a new table called "page_view_YYMM" (YY = year, MM = month) and using some UNIONs when necessary.
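As a rough illustration of those UNIONs (the table and column names below are only placeholders, not my actual schema), a report that needs both live and archived data can be written like this:

    SELECT user_id, page_id, COUNT(*) AS hits
    FROM (
        SELECT user_id, page_id FROM page_view          -- current months
        UNION ALL
        SELECT user_id, page_id FROM page_view_2401     -- archived: Jan '24
        UNION ALL
        SELECT user_id, page_id FROM page_view_2312     -- archived: Dec '23
    ) AS v
    GROUP BY user_id, page_id;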
For the second question, the way I would approach it is by creating a table with the list of your products that is simply:
url, description
The description will be a tag-stripped copy of the content of your page or item (depending on how you want to influence the search). Then add a full-text index on description and run the search against that table, appending any extra terms you have been collecting while the user was surfing your site that you think are relevant (for example a category name or brand).
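A minimal sketch of that idea; the table name, columns, and search terms are assumptions on my part:

    CREATE TABLE product_search (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url         VARCHAR(255) NOT NULL,
        description TEXT NOT NULL,
        FULLTEXT INDEX ft_desc (description)
    ) ENGINE=MyISAM;   -- or InnoDB on MySQL 5.6+, which added InnoDB full-text indexes

    -- The user searched for "macbook"; append terms collected from their browsing
    -- (e.g. the brand "apple") so matching products rank higher:
    SELECT url,
           MATCH(description) AGAINST ('macbook apple') AS score
    FROM product_search
    WHERE MATCH(description) AGAINST ('macbook apple')
    ORDER BY score DESC
    LIMIT 20;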
I have a pretty large social-network-type site that I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for maximum performance under that traffic, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL databases and everything.
Below is a photo I made up of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site they very rarely need to hit that table again unless editing an email or password. I then have a user table which is basically the user's settings and profile data for the site. This is where I have questions: would it be better for performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_"; should I just create a separate settings table? I also have fields marked with "count" which hold the total count of comments, photos, friends, mail messages, etc. So should I create another table to store just the total counts of things?
The reason I have them all in one table now is that I was thinking it might be better if I could cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
Table diagram: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this; it depends on what your application is doing.
What you can do is to measure and extrapolate the results in a dev environment.
On one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the counts, I think it's fine to have them there. Although it is often said that it's better to calculate this kind of thing on the fly, I don't think it hurts you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit of doing so would be. You would probably only gain a 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table; every time you bump them, the entire row is rewritten.
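A small sketch of that separation, assuming a narrow counters table keyed on the user ID (names and columns are only examples):

    -- Hot, frequently bumped columns pulled out of the wide user row:
    CREATE TABLE user_counts (
        user_id        INT UNSIGNED NOT NULL PRIMARY KEY,
        comment_count  INT UNSIGNED NOT NULL DEFAULT 0,
        photo_count    INT UNSIGNED NOT NULL DEFAULT 0,
        friend_count   INT UNSIGNED NOT NULL DEFAULT 0,
        last_active_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                       ON UPDATE CURRENT_TIMESTAMP
    ) ENGINE=InnoDB;

    -- Bumping a counter now rewrites only this narrow row, not the whole profile row:
    UPDATE user_counts SET comment_count = comment_count + 1 WHERE user_id = 42;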
I wouldn't consider your user table terribly large in number of columns, just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.
You should take into account the average size of a single row in order to find out whether retrieval is expensive. Also, try to use indexes when looking up data.
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... it depends on the data saved there.
Also, as the social network site using this data also handles authentication and authorization processes (I assume so), the separation between the login and user tables should offer good performance, because the data involved in login is "short enough", while the profile can be fetched just once, immediately after a successful login. Just apply the right tricks to improve DB performance and you're done.
(Remember to think of tables as entities and to name each one as a single entity, not as a collection of them.)
Two things you will want to consider when deciding whether or not you want to break up a single table into multiple tables is:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance at the potential cost of disk space. One common approach, from what I can tell, is to put fixed-length data in its own table while variable-length data goes somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time, then it may not be worth splitting it up, as you will be slowing down both inserts and quite possibly reads. However, if there is some data in that table that does not get accessed as often, then that would be a good candidate for moving out of the table for performance reasons. (A sketch of such a split follows below.)
I can't find a resource online to substantiate this next statement, but I do recall that in a MySQL performance talk given by Jay Pipes, he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is, but regardless, joins will usually take longer than queries against a single table.
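For illustration only, here is one way such a fixed-length/variable-length split could look; the table and column names are hypothetical:

    -- Fixed-length, frequently read columns in one table:
    CREATE TABLE user_core (
        user_id    INT UNSIGNED NOT NULL PRIMARY KEY,
        birth_date DATE NULL,
        gender     CHAR(1) NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    ) ENGINE=InnoDB;

    -- Variable-length, less frequently read text in another:
    CREATE TABLE user_profile_text (
        user_id   INT UNSIGNED NOT NULL PRIMARY KEY,
        about_me  TEXT NULL,
        signature VARCHAR(500) NULL
    ) ENGINE=InnoDB;

    -- Only pay the join cost when the rarely needed text is actually required:
    SELECT c.user_id, c.birth_date, t.about_me
    FROM user_core c
    LEFT JOIN user_profile_text t ON t.user_id = c.user_id
    WHERE c.user_id = 42;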