I have 4 tables: users, posts, categories, categories_map
posts has id, text, category_id
categories_map contains user_id and category_id
My goal is to build a queue that the user can preview. The user will also be able to skip some posts or edit their text. If the user skips a post, it should never appear in the queue again. However, the user is not able to change the sequence, because a cron job will be executing a script against it.
The first approach I can think of is to create a table containing user_id, post_id, text_modified, is_skipped, last_posted. When the cron job runs, it leaves a timestamp so that the same post won't be grabbed next time, and the user can easily change the text for the post.
The second approach is to create a separate table in which a queue is generated per user: user_id, post_id, category_id, text_modified. The cron job can then simply follow this table and remove each row after it is done. But with this approach, if I have 30 users with an average of 3 categories containing 5,000 posts each, my table will already have 450,000 rows. Yes, if it is indexed properly it should all be fine. But will it scale when I have 100-200 users?
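For concreteness, here is a minimal sketch of the two candidate tables (the column types are just my guesses):

-- Approach 1: one row of per-user state per post, updated in place
CREATE TABLE user_post_state (
    user_id       INT NOT NULL,
    post_id       INT NOT NULL,
    text_modified TEXT NULL,                     -- NULL = use the original post text
    is_skipped    TINYINT(1) NOT NULL DEFAULT 0,
    last_posted   DATETIME NULL,                 -- stamped by the cron job
    PRIMARY KEY (user_id, post_id)
);

-- Approach 2: a pre-generated queue; the cron job deletes each row once posted
CREATE TABLE user_queue (
    user_id       INT NOT NULL,
    post_id       INT NOT NULL,
    category_id   INT NOT NULL,
    text_modified TEXT NULL,
    PRIMARY KEY (user_id, post_id)
);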
Which approach should I go with, or is there another solution?
A lot of things depend on your product. We don't know:
1. How do users interact with each other?
2. Do their actions (skips) need to be persisted, or is it acceptable if a small fraction of them (beyond the 99.9th percentile) is lost?
3. Are their text modifications to the posts globally visible, or visible only to them?
4. Are the users checking posts by category?
Given all these unknowns, I'll take a stab at it:
If the answer to question 4 is YES, then option #2 seems more sound, judging from your PKs.
If the answer to question 4 is NO, then option #1 seems more sound, judging from your PKs.
For database size, I think you're doing a bit of premature optimization. You should take table width into account: since your tables are very narrow (only a couple of columns, mostly ints), you shouldn't worry too much about the row count of any specific table.
When that does become a constraint (which you can benchmark, or detect by watching disk space on the specific servers), you can scale out the database easily by sharding on the user: you basically put different users on different DB servers.
Note: the answer to question 1 will determine how easy the above would be.
Having said all this, keep in mind the performance implications:
The lists are going to get really long.
If one user's modifications affect other users, you are going to have to do quite a bit of fan-out work to publish the updates to the individual queues.
In that case, you might want to take a look at a distributed cache like Memcached or Redis.
Note: depending on the answers to questions 2 & 3, you might not even need to persist the queues.
I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of their comments has received. Of course, I also need to know who liked which comments, partly so that I can prevent a user from liking a comment more than once, and partly for analytical purposes.
The naive way of adding a Like system to the current comments module is to create a new table in the database with foreign keys to the CommentID and UserID. Then, for every "like" a user gives a comment, I would insert a row into this new table with the target comment ID and user ID.
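Roughly like this, as a sketch (the table and column names are just placeholders); note that the composite primary key also prevents duplicate likes:

CREATE TABLE comment_likes (
    comment_id INT NOT NULL,
    user_id    INT NOT NULL,
    PRIMARY KEY (comment_id, user_id),  -- a user can like a comment only once
    KEY idx_user (user_id),             -- for per-user lookups
    FOREIGN KEY (comment_id) REFERENCES comments (id),
    FOREIGN KEY (user_id)    REFERENCES users (id)
) ENGINE=InnoDB;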
While this might work, the massive number of comments and users is going to make this table grow quickly, and retrieving records from, and doing counts on, this huge table will become slow and inefficient. I could index either one of the columns, but I don't know how effective that would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this, with a huge database, how should I design the Like system so that it is more optimised and stable?
For scalability, do not put the count column in the same table as other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code doing the increment/decrement hits a table used for other things (such as the text of the comment), there will be an unacceptable amount of contention between the two.
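A minimal sketch of that split, with assumed names: the hot counter lives in its own narrow table, so like traffic never contends with reads of the comment text.

CREATE TABLE comment_like_counts (
    comment_id INT NOT NULL PRIMARY KEY,
    like_count INT NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- The hot path touches only this narrow table:
UPDATE comment_like_counts
   SET like_count = like_count + 1
 WHERE comment_id = ?;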
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)
Your main concern will be the large number of count lookups, so the easy thing to do is to keep a separate count column in your comments table.
Then you can create a TRIGGER that increments/decrements the count on a like/unlike (see the sketch below).
That way you only use the big table to figure out if a user already voted.
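A sketch of those triggers, assuming a likes table comment_likes(comment_id, user_id) and a like_count column on comments:

DELIMITER //
CREATE TRIGGER comment_likes_ai AFTER INSERT ON comment_likes
FOR EACH ROW
    UPDATE comments SET like_count = like_count + 1 WHERE id = NEW.comment_id;
//
CREATE TRIGGER comment_likes_ad AFTER DELETE ON comment_likes
FOR EACH ROW
    UPDATE comments SET like_count = like_count - 1 WHERE id = OLD.comment_id;
//
DELIMITER ;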
Alright, so I enjoy making forum software with PHP and MySQL, though there's one thing that has always troubled me, and one thing only:
The main page of the forums, where you view the list of forums. Each forum shows its name, the number of posts made in it, the number of discussions, and the last poster. There lies the problem: getting all of that data when those things are stored in different tables. It's not much of a problem to GET it, not really a problem at all, but doing it EFFICIENTLY is what I'm after.
My current approach is this:
Store the current number of posts, the number of discussions, and the last poster statically in the forums table itself, instead of going out and grabbing the data from the other tables ("posts", "discussions", etc.). Then, when a user posts, it updates the forums table, incrementing the number of posts by 1 and updating the last poster, and also incrementing the discussions by 1 if they're starting a new discussion. This just seems inefficient and dirty to me for some reason, but maybe it's just me.
And here's another approach that I fear would be horribly inefficient:
Actually going out to each table ("posts", "discussions", "forums") and grabbing the data. The problem with this is that there can be hundreds of forums on one page... and I'd have to use a COUNT to fetch the number of posts or discussions, meaning I'd have to use subqueries, not to mention a third subquery to fetch the last poster. That being said, the query would be something like this pseudo-code-like thing:
SELECT f.foruminfo,
       (SELECT COUNT(id)
          FROM posts
         WHERE forumId = f.id) AS post_count,
       (SELECT COUNT(id)
          FROM discussions
         WHERE forumId = f.id) AS discussion_count,
       (SELECT postinfo
          FROM posts
         WHERE forumId = f.id
         ORDER BY postdate DESC
         LIMIT 1) AS last_post
  FROM forums f
 ORDER BY position DESC;
So basically those subqueries could be run hundreds of times if I have hundreds of forums being listed. And with hundreds of users viewing the page every second, wouldn't this put quite a bit of strain on the server? I'm not entirely sure whether subqueries cause the same amount of load as normal queries, but if they do, then this certainly seems horribly inefficient.
Any ideas? :(
I've built large-scale forum systems before, and the key to making them performant is to de-normalize anything and everything you can.
You cannot realistically use JOINs on really popular pages. You must keep the number of queries you issue to an absolute minimum. You should never use sub-selects. Always make sure your indexes cover your exact use cases and no more. A query that takes longer than 1-5 ms to run is probably way too slow for a site running at scale. When things suddenly take ten times longer under severe load, a 15 ms query will take a crippling 150 ms or more, while your optimized 1 ms queries will take an acceptable 10 ms. You're aiming for them to be 0.00 s all the time, and it's possible to do this.
Remember that any time you're executing a query and waiting for a response, you're not able to do anything else. If you get a little careless, you'll have requests coming in faster than you can process them, and the whole system will buckle.
Keep your schema simple, even stupidly simple, and by that I mean: think about the layout of your page and the information you're showing, and make the schema match it as exactly as possible. Strip it down to the bare essentials. Represent it in a format that's as close as possible to the final output, without making needless compromises.
If you're showing username, avatar, post title, number of posts, and date of posting, then those are the fields you have in your table. Yes, you will still have a separate users table, but transpose anything and everything you can into a straightforward structure that makes the query as simple as this:
SELECT id, username, user_avatar, post_title, post_count, post_time FROM posts
WHERE forum_id=?
ORDER BY id DESC
Normally you'd have to join against users to get their names, maybe against another table to get their particular avatars, and against the discussions table to get the post count. You can avoid all of that by changing your storage strategy.
In the case I was working on, it was a requirement to be able to post things in the future as well as in the past, so I had to create a specific "sort key" independent of the ID, like your position column. If that's not the case for you, just use the id primary key for ordering, with an index something like this:
INDEX post_order (forum_id, id)
Using SUM or COUNT is completely out of the question. You need counter-cache columns: columns that store the count of how many messages are in a particular forum. Yes, they will drift out of sync once in a while, like any de-normalized data, so you will need tools to keep them in check and to rebuild them completely if required. Usually you can do this as a cron job that runs once daily to repair any minor corruption that might have occurred. Most of the time, if you get your implementation correct, they will stay perfectly in sync.
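For instance, the nightly repair job can simply recompute the counters from the source tables; this is a sketch with assumed column names:

UPDATE forums f
   SET f.post_count = (SELECT COUNT(*) FROM posts p WHERE p.forum_id = f.id),
       f.discussion_count = (SELECT COUNT(*) FROM discussions d WHERE d.forum_id = f.id);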
Another thing to note: split posts into threads if you can. The smaller your tables are, the faster they'll be. Sifting through all posts to find the top-level post of each thread is brutally slow, especially on popular systems.
Also, cache anything you can get away with in something like Memcached, if that's an option. For example, a user's friends list won't change unless a friend is added or removed, so you don't need to select that list from the database constantly. The fastest database query is the one you never make, right?
To do this properly, you'll need to know the layout of each page and what information goes on it. Pages that aren't too popular need less optimization, but anything in the main line of fire will have to be carefully examined. Like a lot of things, there's probably an 80/20 rule going on, where 80% of your traffic hits only 20% of your code base. That's where you'll want to be at your best.
I am making a website with a large pool of images added by users.
I want to randomly choose one image out of this pool and display it to the user, but I want to make sure that the user has never seen this image before.
So I was thinking: when a user views an image, I make a row INSERT in MySQL that says "this USER has watched THIS IMAGE at (TIME)", one row for every view.
But the thing is, since there might be a lot of users and a lot of images, this table could easily grow to tens of thousands of entries quite rapidly.
So alternatively, it might be done like this:
I was thinking of making one row INSERT per USER, and in ONE field I'd insert an array of all the IDs of images that user has watched.
I can even do that to the array:
base64_encode(gzcompress(serialize($array)))
And then:
unserialize(gzuncompress(base64_decode($array)))
What do you think I should do?
Are the encoding/decoding functions fast enough, or at least faster than the conventional way I was describing at the beginning of the post?
Is that compression good enough to store large chunks of data in only ONE database field? (Imagine if the user has viewed thousands of images.)
Thanks a lot
in ONE field I'd insert an array of all the IDs
In almost all cases, serializing values like this is bad practice. Let the database do what it's designed to do: efficiently handle large amounts of data. As long as you ensure that your cross table has an index on the user field, retrieving the list of images a user has seen will not be an expensive operation, regardless of the number of rows in the table. Tens of thousands of entries is nothing.
You should create a new table, UserImageViews, with columns user_id and image_id (additionally, you could record more information about the view, such as date/time, IP, and browser).
That will make queries like "which images has the user (not) seen" much faster.
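A minimal sketch of that table and the "pick an unseen image" query it enables (names are assumptions; ORDER BY RAND() is fine for illustration, though it won't scale to huge image pools):

CREATE TABLE UserImageViews (
    user_id  INT NOT NULL,
    image_id INT NOT NULL,
    PRIMARY KEY (user_id, image_id)
);

-- One random image the given user has not seen yet
SELECT i.id
  FROM images i
  LEFT JOIN UserImageViews v
    ON v.image_id = i.id AND v.user_id = ?
 WHERE v.image_id IS NULL
 ORDER BY RAND()
 LIMIT 1;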
You should use a table. Serializing data into a single field in a database is bad practice: the DBMS has no clue what that data represents, and it cannot be used in ANY queries. For example, if you wanted to see which users had viewed a given image, you wouldn't be able to do it in SQL alone.
Tens of thousands of entries isn't much, BTW. The main application we develop has multiple tables with hundreds of thousands of records, and we're not that big. Some web applications have tables with millions of rows. Don't worry about having "too much data" unless it actually starts becoming a problem; the solutions for that problem are complex and might even slow down your queries until you reach that amount of data.
EDIT: Oh yeah, and joins against those 100k+ row tables happen in under a second. Just some perspective for ya...
I don't really think that tens of thousands of rows will be a problem for a database lookup. I recommend the first approach over the second.
I want to choose randomly one image out of this pool, and display it to the user, but I want to make sure that this user has never seen this image before.
For what it's worth, that's not a random algorithm; that's a shuffle algorithm. (Knowing that will make it easier to Google when you need more details about it.) But that's not your biggest problem.
So I was thinking: when a user views an image, I make a row INSERT in MySQL that says "this USER has watched THIS IMAGE at (TIME)", one row for every view.
Good thought. Using a table that stores the fact that a user has seen a specific image makes sense in your case. Unless I've missed something, you don't need to store the time. (And you probably shouldn't; it doesn't seem to serve any useful business purpose.) Something along these lines should work well:
-- Predicate: the user identified by [user_id] has seen the image identified
-- by [image_filename] at least once.
create table images_seen (
    user_id        integer not null references users (user_id),
    image_filename varchar(255) not null references images (image_filename),  -- type assumed
    primary key (user_id, image_filename)
);
Test that and look at the output of EXPLAIN. If you need a secondary index on image_filename:
create index images_seen_img_filename on images_seen (image_filename);
This still isn't your biggest problem.
The biggest problem is that you didn't test this yourself. If you know any scripting language, you should be able to generate 10,000 rows for testing in a matter of a couple of minutes. If you'd done that, you'd have found that a table like this performs well even with several million rows.
I sometimes generate millions of rows to test my ideas before I answer a question on StackOverflow.
Learning to generate large amounts of random(ish) data for testing is a fundamental skill for database and application developers.
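As an example, here's a sketch that fakes 10,000 views (100 users x 100 images) entirely in MySQL 8+ with a recursive CTE, reusing the images_seen table above (it assumes either no enforced FK constraints or matching users/images rows); on older servers you'd use a numbers table or a script instead:

SET SESSION cte_max_recursion_depth = 10000;  -- default is only 1000

INSERT INTO images_seen (user_id, image_filename)
WITH RECURSIVE seq (n) AS (
    SELECT 0
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 9999
)
SELECT 1 + (n DIV 100),                        -- user_id 1..100
       CONCAT('img_', 1 + (n MOD 100), '.jpg') -- 100 distinct filenames per user
  FROM seq;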
I have a pretty large social-network-type site I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for maximum performance, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL DBs and everything.
Below is a photo I made of a couple of the MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site, they very rarely need to hit that table again unless they're editing an email or password. I then have a user table, which holds the user's settings and profile data for the site. This is where I have questions: would splitting the user table into smaller tables give better performance? For example, if you look at the user table, you will see several fields that I have marked "setting_"; should I just create a separate settings table? I also have fields marked "count", which hold total counts of comments, photos, friends, mail messages, etc. So should I create another table to store just the total counts of things?
The reason I have them all in one table now is that I was thinking it might be better if I could cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
(Image of the login and user tables: http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg)
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
should I just create a seperate setting table?
So should I create another table to store just the total count of things?
There is no single correct answer to this; it depends on what your application is doing.
What you can do is measure, and extrapolate the results from a dev environment.
On the one hand, using a separate table will save you some space, and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) from having to join information from different tables.
As for the counts, I think it's fine to have them there. Although it's often said that it's better to calculate this kind of thing on the fly, I don't think it will hurt you at all in this situation.
But again, the only way to know what's better for you and your specific app is to measure, profile, and find out what the benefit of the change is. You might only gain a 2% improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table, because every time you bump them, the entire row is rewritten.
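A minimal sketch of that split, with assumed column names, where the hot counters live in a narrow sibling table keyed by the same user_id:

CREATE TABLE user_counters (
    user_id       INT NOT NULL PRIMARY KEY,
    comment_count INT NOT NULL DEFAULT 0,
    photo_count   INT NOT NULL DEFAULT 0,
    friend_count  INT NOT NULL DEFAULT 0,
    last_activity DATETIME NULL
);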
I wouldn't consider your user table terribly large in number of columns; just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who share the same settings; that would be a case for breaking the table up.
You should take into account the average size of a single row, in order to find out whether retrieval is expensive. Also, try to use indexes when looking up data.
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... it depends on the data saved there.
Also, since the social network site using this data also handles the authentication and authorization processes (I'm guessing), the separation between the login and user tables should offer good performance, because the data in login is short enough, and the profile needs to be accessed only once, immediately after a successful login. Just apply the right tricks to improve DB performance and it's done.
(Remember to visualize tables as entities, and name each one as an entity, not as a collection of them.)
There are two things you will want to consider when deciding whether or not to break a single table into multiple tables:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance, at the potential cost of disk space. From what I can tell, one common approach is to keep fixed-length data in its own table while the variable-length data goes somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table is normally accessed all at the same time, then it may not be worth splitting it up, as you will slow down both inserts and quite possibly reads. However, if there is some data in that table that is not accessed as often, that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement, but I do recall that in a MySQL performance talk given by Jay Pipes, he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I'm not sure how accurate that magic number is, but regardless, joins will usually take longer than queries against a single table.
Using PHP and MySQL, I have a forum system I'm trying to build. What I want to know is how to make it so that when a user reads a forum entry, it shows as read JUST for that user, no matter what forum they are in, until someone else posts in it.
Currently, for each thread, I have a table with a PostID, the UserID that posted it, the ThreadID to link it to, the actual post (as text), and the date/time it was posted.
For the thread list in each forum, there are the ThreadID (primary key), ThreadName, the ForumID it belongs to, NumPosts, NumViews, LastPostDateTime, and CreateDateTime. Any help?
The traditional solution is a join table, something along the lines of:
CREATE TABLE topicviews (
    userid   INTEGER NOT NULL,
    topicid  INTEGER NOT NULL,
    lastread TIMESTAMP NOT NULL,
    PRIMARY KEY (userid, topicid),
    FOREIGN KEY (userid) REFERENCES users(id),
    FOREIGN KEY (topicid) REFERENCES topics(id)
);
with lastread updated every time a topic is read. When displaying the list of topics, if topics.lastupdated > topicviews.lastread, there are new posts.
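For example, the listing query could look like this sketch (assuming topics has id, title, and lastupdated columns), where a missing topicviews row also counts as unread:

SELECT t.id, t.title,
       (tv.lastread IS NULL OR t.lastupdated > tv.lastread) AS has_new_posts
  FROM topics t
  LEFT JOIN topicviews tv
    ON tv.topicid = t.id
   AND tv.userid = ?;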
The traditional solution is rubbish and will kill your database! Don't do it!
The first problem is that a write on every topic view will soon bring the database server to its knees on a busy forum, especially with MyISAM tables, which only have table-level locks. (Don't use MyISAM tables; use InnoDB for everything except full-text search.)
You can improve this situation a bit by only bothering to write the lastread time when there are actually new messages being read in the topic. If topics.lastupdated < topicviews.lastread, you have nothing to gain by updating the value. Even so, on a heavily used forum this can be a burden.
The second problem is a combinatorial explosion. One row per user per topic soon adds up: with just a thousand users and a thousand topics, you potentially have a million topicview rows to store!
You can improve this situation a bit by limiting the number of topics remembered for each user. For example, you could remove any topic from the views table once it gets older than a certain age, and just assume all old topics are 'read'. This generally needs a cleanup task running in the background.
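The cleanup itself can be as small as this sketch (the 90-day cutoff is an arbitrary assumption):

DELETE FROM topicviews
 WHERE lastread < NOW() - INTERVAL 90 DAY;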
Other, less intensive approaches include:
only storing one lastread time per forum
only storing one lastvisit time per user across the whole site, which would show as 'new' only things updated since the user's previous visit (session)
not storing any lastread information at all, but including the last-update time in a topic's URL itself. If the user's browser has seen the topic recently, it will remember the URL and mark it as visited. You can then use CSS to style visited links as 'topics containing no new messages'.
Maybe store in another table (UserID, ThreadID, LastReadDateTime) when the user last read that thread.
If (LastPostDateTime > LastReadDateTime), you've got an unread post.
Sadly, you have a lot of overhead: for every read, you'll have a write.
The general ideas here are correct, but they've overlooked some obvious solutions to the scalability issue.
#bobince:
The second problem is a combinatorial explosion. One row per user per topic soon adds up: with just a thousand users and a thousand topics, you potentially have a million topicview rows to store!
You don't need to store a record in the topicviews table if somebody has never viewed that thread. You'd simply display a topic as having unread posts if NULL is returned OR if the last_read time is < the last_post time. This will reduce that "million" rows by perhaps an order of magnitude.
#gortok: There are plenty of ways to do it, but each grows exponentially larger as the user visits the site.
In this case, you archive a forum topic after n posts or n weeks, and when you lock it, you clean up the topicviews table.
My first suggestion is obvious and has no downside. My second reduces usability on archived topics, but that's a small price to pay for a speedy forum. Slow forums are just painful to read and post to.
But honestly? You probably won't need to worry about scalability. Even one million rows really isn't all that many.
There's no easy way to do this. There are plenty of ways to do it, but each grows exponentially larger as the user visits the site. The best you can do and still keep performance is to have a timestamp and mark any forums that have been updated since the last visit as 'unread'.
You could just use the functionality of the user's browser and append the last post ID to the link to the thread, like this: "thread.php?id=12&lastpost=122".
By making use of a:visited in your CSS, you can display the posts that the user has already read differently from those that they have not.
Bobince has a lot of good suggestions. Another few potential optimizations:
Write the new "is this new?" information to memcached and to a MySQL "ARCHIVE" table. A batch job can update the "real" table.
For each user, keep an "everything read up until $date" flag (for when "mark all read" is clicked).
When something new is posted, clear all the "it's been read" flags; that keeps the number of flags down, and the table can be just (topic_id, user_id), with no timestamps.
You can use the browser's own functionality and add the last post ID to the thread's link. Then, using a:visited in CSS, you can display the threads the user has not read differently from the rest.