I have written a simple forum in PHP using PostgreSQL. The forum consists of a number of subforums (or categories, if you like) that contain topics. I have a table that stores the last time a user visited a topic. It's something like this: user_id, topic_id, timestamp.
I can easily determine what topics should be marked as unread by comparing the timestamp of the last topic reply with the timestamp of the last user visit.
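For reference, that check boils down to something like this (a sketch; the table and column names are illustrative, with topic_visits being the visit table described above and last_reply_at the timestamp of a topic's last reply):

-- Topics whose last reply is newer than the user's last visit are unread;
-- a missing visit row also counts as unread.
SELECT t.id
FROM topics t
LEFT JOIN topic_visits v ON v.topic_id = t.id AND v.user_id = $1
WHERE v.visited_at IS NULL
   OR v.visited_at < t.last_reply_at;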
My question is: how do I efficiently determine what subforums (categories) should be marked as unread? All I've come up with is this: every time a user visits a topic, update the visit timestamp and check if all the topics from the current subforum are read or unread. If they are all read, mark the subforum as read for the user. Else, mark it as unread. But I think there must be another way.
Thank you in advance.
There are many ways (like yours) to achieve similar behavior. Since you mention efficiency, I will assume performance is important.
The way I handled this before did not involve a database to track unread content at all. With that in mind, my suggestion would be:
On the first visit mark only topics newer than, let's say, 3 days as 'unread'
As the user browses the topics, start throwing the topic IDs and LastUpdate for the thread into a cookie on the client
When the forum pages load, check the cookie to see whether each thread has been updated since the stored LastUpdate; both this check and the cookie handling can easily be done in pure JavaScript.
If the client stays away from the website for a whole week, no problem: he will see everything newer than 3 days (the first-visit rule) as unread.
P.S.: This is 100% related to how important it is for a person to know what they have not read. In my suggestion this is not something crucial, because it is not 100% reliable (we are not using a database/proper persistence, after all).
I'm making a site where users can make posts and comments, where number of comments made by a single user could get over 1000 comments. On their profile, it would show a list of all comments (by latest, splitting them into pages with 20 comments per page) made by that user.
Considering the table used to store comments would get extremely large, I'm wondering what the best approach is. Since users with more comments would likely be more popular, would running a query that searches the entire comments table for the user's id be the best way to go about it?
I was thinking an alternative could be a separate column in the users table that stores all of that user's comment ids; whenever someone visits their page, it would look up comments by those ids (limiting to 20 at a time or so).
Unsure which method would be faster, and whether the second method is practical. Also, is there any better method? First time doing something like this and I would appreciate any guidance.
If you are using SQL Server 2012, new syntax was added to make this really easy. See Implement paging (skip / take) functionality with this query.
Skip 20 * page rows, depending on the page you're looking for.
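A minimal sketch of the OFFSET / FETCH form (the comments table and column names here are assumptions; with 0-based page numbers you would skip @page * 20 rows instead):

DECLARE @page INT = 1, @user_id INT = 42;  -- 1-based page number

SELECT id, body, created_at
FROM comments
WHERE user_id = @user_id
ORDER BY created_at DESC
OFFSET (@page - 1) * 20 ROWS
FETCH NEXT 20 ROWS ONLY;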
Alright, so I enjoy making forum software with PHP and MySQL, though there's one thing that has always troubled me, and one thing only:
The main page of the forums, where you view the list of forums. Each forum shows its name, the number of posts made in it, the number of discussions made in it, and the last poster. Therein lies the problem: getting all of that data when those things are stored in different tables. It's not much of a problem to GET it, not really a problem at all, but doing it EFFICIENTLY is what I'm after.
My current approach is this:
Store the current number of posts, discussions, and the last poster statically in the forums table itself, instead of going out and grabbing the data from the different tables ("posts", "discussions", etc.). Then when a user posts, it updates the "forums" table, incrementing the number of posts by 1 and updating the last poster, and also incrementing the discussions by 1 if they're starting a new discussion. This just seems inefficient and dirty to me for some reason, but maybe it's just me.
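For illustration, the bookkeeping on each new post would be something like this (a sketch; the column names are made up):

UPDATE forums
SET num_posts    = num_posts + 1,
    last_poster  = ?,         -- id of the user who just posted
    last_post_at = NOW()
WHERE id = ?;                 -- the forum being posted in

-- plus, only when the post starts a new discussion:
UPDATE forums
SET num_discussions = num_discussions + 1
WHERE id = ?;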
And here's another approach that I fear would be horribly inefficient:
Actually going out to each table - "posts", "discussions", "forums" - and grabbing the data. The problem with this is, there can be hundreds of forums on one page... And I'd have to use a COUNT statement to fetch the number of posts or discussions, meaning I'd have to use subqueries - not to mention a third subquery to fetch the last poster. That being said... The query would be something like this pseudo-code-like thing:
SELECT f.foruminfo,
       (SELECT COUNT(id)
          FROM posts
         WHERE forumId = f.id) AS post_count,
       (SELECT COUNT(id)
          FROM discussions
         WHERE forumId = f.id) AS discussion_count,
       (SELECT postinfo
          FROM posts
         WHERE forumId = f.id
         ORDER BY postdate DESC
         LIMIT 1) AS last_post
FROM forums f
ORDER BY f.position DESC;
So basically those subqueries could be run hundreds of times if I have hundreds of forums being listed. And with hundreds of users viewing the page every second, would this not put quite a bit of strain on the database? I'm not entirely sure whether subqueries cause the same amount of load as normal queries, but if they do, then it seems like it would certainly be horribly inefficient.
Any ideas? :(
I've built large-scale forum systems before, and the key to making them performant is to de-normalize anything and everything you can.
You cannot realistically use JOIN on really popular pages. You must keep the number of queries you issue to the absolute minimum. You should never use sub-selects. Always be sure your indexes cover your exact use cases and no more. A query that takes longer than 1-5 ms to run is probably way too slow for a site running at scale. When severe load suddenly makes everything take ten times longer, a 15 ms query will take a crippling 150 ms or more, while your optimized 1 ms queries will take an acceptable 10 ms. You're aiming for them to show 0.00s all the time, and it's possible to do this.
Remember that any time you're executing a query and waiting for a response, you're not able to do anything else. If you get a little careless, you'll have requests coming in faster than you can process them and the whole system will buckle.
Keep your schema simple, even stupidly simple, and by that I mean think about the layout of your page, the information you're showing, and make the schema match that as exactly as possible. Strip it down to the bare essentials. Represent it in a format that's as close as possible to the final output without making needless compromises.
If you're showing username, avatar, post title, number of posts, and date of posting, then those are the fields you have in your database. Yes, you will still have a separate users table, but transpose anything and everything you can into a straightforward structure that makes it as simple as this:
SELECT id, username, user_avatar, post_title, post_count, post_time FROM posts
WHERE forum_id=?
ORDER BY id DESC
Normally you'd have to join against users to get their name, maybe another table to get their particular avatar, and the discussions table to get the post count. You can avoid all that by changing your storage strategy.
In the case I was working with, it was a requirement to be able to post things in the future as well as in the past, so I had to create a specific "sort key" independent of ID, like your position column. If this is not the case for you, just use the id primary key for ordering, with an index something like this:
INDEX post_order (forum_id, id)
Using SUM or COUNT is completely out of the question. You need counter-cache columns: columns that store counts of how many messages are in a particular forum. Yes, they will drift out of sync once in a while, like any de-normalized data, so you will need to add tools to keep them in check and to rebuild them completely if required. Usually you can do this as a cron job that runs once daily to repair any minor corruption that might have occurred. Most of the time, if you get your implementation correct, they will be perfectly in sync.
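Such a repair job can be a single statement per counter (a sketch; table and column names assumed):

-- Rebuild the post counter cache from the source of truth.
UPDATE forums f
SET f.post_count = (SELECT COUNT(*) FROM posts p WHERE p.forum_id = f.id);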
Other things to note: split threads and posts into separate tables if you can. The smaller your tables are, the faster they'll be. Sifting through all posts to find the top-level post of each thread is brutally slow, especially on popular systems.
Also, cache anything you can get away with in something like Memcached if that's an option. For example, a user's friends listing won't change unless a friend is added or removed, so you don't need to select that list constantly from the database. The fastest database query is the one you never make, right?
To do this properly, you'll need to know the layout of each page and what information is going on it. Pages that aren't too popular need less optimization, but anything in the main line of fire will have to be carefully examined. Like a lot of things, there's probably an 80/20 rule going on, where 80% of your traffic hits only 20% of your code-base. That's where you'll want to be at your best.
So I have been playing around with a forum I am building and have been stuck on one aspect of it for a while: how to track unread posts and notifications without storing loads of data in the database. After looking at some solutions, I believe I came up with one that may suit my needs, but I need a fresh set of eyes to point out what I didn't think of. Here is the architecture of my idea.
1) When a user logs in, check for posts made between the current time() and the last login time() (see the sketch after this list).
2) If posts are found, add them to an array, then serialize() the array and save it to the member's row in the database.
3) Output the array to the user if it is not empty.
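Step 1 might look like this (a sketch, assuming a posts table with a created_at column):

SELECT id
FROM posts
WHERE created_at > ?   -- the user's last login time
  AND created_at <= ?; -- the current time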
This way it will only check for and store unread posts for users who actually log in to the forum, instead of taking up unnecessary space holding unread IDs for inactive users. I'm still wondering whether this is such a good idea, since if the user doesn't read posts, the serialized data in the database might become too large to manage.
Does anyone see a problem in my way of thinking? If so please let me know.
Don't worry about the space until there's actually a problem. A table storing the post ID (integer) and the user ID (another integer) will be small. Even if you have thousands of posts and thousands of users, you can safely assume that:
a large part of the users will be inactive (one-time registrations to post something and forget the whole issue)
even the active members will not read all the posts, but rather only a (relatively small) part of the ones that are in topics that interest them.
One other thing: don't store unread posts if you really want to minimise space. Store only the last read post in each thread. That's one record per thread per user, and only for threads the user has ever opened.
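A sketch of that record, with illustrative names:

CREATE TABLE thread_read_progress (
    user_id        INTEGER NOT NULL,
    thread_id      INTEGER NOT NULL,
    last_read_post INTEGER NOT NULL,  -- id of the last post the user read in this thread
    PRIMARY KEY (user_id, thread_id)
);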
If the user logs in, but does not read posts, your scheme still marks them as read.
If the user logs in twice at once (as from a desktop computer and an iPad), what will happen?
What is the problem with keeping, for each user, a record of each topic with a flag to indicate whether they have read it? Such a mechanism is obviously useful to expand into upvoting, favorites, etc.
I am in the process of writing my own basic forum to plug into a CodeIgniter site. I'm a little stuck on how to display threads/latest posts unread by a user.
I was thinking of a table that holds each thread_id visited, but this table has the potential to get rather large.
What are some ways to approach this requirement?
A simple idea: record the last datetime that a user visits the site/forum/subforum. This could be as granular as you like, down to the thread or subforum level. Perhaps create/update this key-value pair of thread_id and last_visit_date in a cookie, rather than in your RDBMS. Ask: is this mission-critical data, or an important feature that can/cannot withstand a loss of data?
When the user returns, find all the threads/posts whose create_date is greater than the last_visit_date for the forum.
I'm assuming that the act of visiting the forum (seeing the list of threads) is the same as 'viewing', i.e. that if the info was presented, you've 'viewed' the thread title, regardless of whether you actually drilled into the thread.
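The return-visit lookup could then be a sketch like this (the threads table and column names are assumptions; last_visit_date comes from the cookie):

SELECT id, title
FROM threads
WHERE forum_id = ?
  AND create_date > ?   -- last_visit_date from the cookie
ORDER BY create_date DESC;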
The easiest way out would probably be just to keep a cookie with the time of the user's last visit and query for posts posted/edited after it. You don't get exactly all read threads, but most forums seem to work this way; otherwise you have to save all read threads somewhere.
I don't think you really need to create a table to log thread ids as you planned, because it's going to grow with the size of your user base and the number of threads/posts created. You can just show threads or posts created after the user's last visit as unread. I think that's what I am going to do.
Using PHP and MySQL, I have a forum system I'm trying to build. What I want to know is, how can I set it so that when a user reads a forum entry, it shows as read JUST for that user, no matter what forum they are in, until someone else posts on it.
Currently, for each thread, I have a table with a PostID, and has the UserID that posted it, the ThreadID to link it to, the actual Post (as Text), then the date/time it was posted.
For the thread list in each forum, there is the threadID (Primary Key), the ThreadName, ForumID it belongs to, NumPosts, NumViews, LastPostDateTime, and CreateDateTime. Any help?
The traditional solution is a join table something along the lines of:
CREATE TABLE topicviews (
userid INTEGER NOT NULL,
topicid INTEGER NOT NULL,
lastread TIMESTAMP NOT NULL,
PRIMARY KEY (userid, topicid),
FOREIGN KEY (userid) REFERENCES users(id),
FOREIGN KEY (topicid) REFERENCES topics(id)
);
with lastread updated every time a topic is read. When displaying the list of topics, if topics.lastupdated > topicviews.lastread, there are new posts.
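For completeness, maintaining lastread on each view is a single upsert in MySQL (a minimal sketch using the schema above):

INSERT INTO topicviews (userid, topicid, lastread)
VALUES (?, ?, NOW())
ON DUPLICATE KEY UPDATE lastread = NOW();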
The traditional solution is rubbish and will kill your database! Don't do it!
The first problem is that a write on every topic view will soon bring the database server to its knees on a busy forum, especially on MyISAM tables which only have table-level locks. (Don't use MyISAM tables, use InnoDB for everything except fulltext search).
You can improve this situation a bit by only bothering to write through the lastread time when there are actually new messages being read in the topic. If topic.lastupdated < topicviews.lastread, you have nothing to gain by updating the value. Even so, on a heavily-used forum this can be a burden.
The second problem is a combinatorial explosion. One row per user per topic soon adds up: just a thousand users and a thousand topics and you have potentially a million topicview rows to store!
You can improve this situation a bit by limiting the number of topics remembered for each user. For example you could remove any topic from the views table when it gets older than a certain age, and just assume all old topics are 'read'. This generally needs a cleanup task to be done in the background.
Other, less intensive approaches include:
only storing one lastread time per forum
only storing one lastvisit time per user across the whole site, which would show as 'new' only things updated since the user's previous visit (session)
not storing any lastread information at all, but including the last-update time in a topic's URL itself. If the user's browser has seen the topic recently, it will remember the URL and mark it as visited. You can then use CSS to style visited links as 'topics containing no new messages'.
Maybe store in another table (UserID, ThreadID, LastReadDateTime) when the user last read that thread.
if (LastPostDateTime > LastReadDateTime), you've got an unread post.
Sadly this has significant overhead: for every read you'll have a write.
The general ideas here are correct, but they've overlooked some obvious solutions to the scalability issue.
#bobince:
The second problem is a combinatorial explosion. One row per user per topic soon adds up: just a thousand users and a thousand topics and you have potentially a million topicview rows to store!
You don't need to store a record in the "topicviews" table if somebody hasn't ever viewed that thread. You'd simply display a topic as having unread posts if null is returned OR if the last_read time is < the last_post time. This will reduce that "million" rows by perhaps an order of magnitude.
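A sketch of that check, using the topicviews schema from bobince's answer above (the topics column names are assumptions):

SELECT t.id,
       (tv.lastread IS NULL OR tv.lastread < t.lastupdated) AS has_unread
FROM topics t
LEFT JOIN topicviews tv ON tv.topicid = t.id AND tv.userid = ?
WHERE t.forumid = ?;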
#gortok: There are plenty of ways to do it, but each grows exponentially larger as the user visits the site.
In this case, you archive a forum after n posts or n weeks and, when you lock it, you clean up the "topicviews" table.
My first suggestion is obvious and has no downside. My second reduces usability on archived topics, but that's a small price to pay for a speedy forum. Slow forums are just painful to read and post to.
But honestly? You probably won't need to worry about scalability. Even one million rows really isn't all that many.
There's no easy way to do this. There are plenty of ways to do it, but each grows exponentially larger as the user visits the site. The best you can do and still keep performance is to have a timestamp and mark any forums that have been updated since the last visit as 'unread'.
You could just use the functionality of the user's browser and append the last postid in the link to the thread like this: "thread.php?id=12&lastpost=122"
By making use of a:visited in your CSS, you can display the posts that the user already read different from those that he did not read.
Bobince has a lot of good suggestions. Another few potential optimizations:
Write the new "is this new?" information to memcached and to a MySQL "ARCHIVE" table. A batch job can update the "real" table.
For each user, keep a "everything read up until $date" flag (for when "mark all read" is clicked).
When something new is posted, clear all the "it's been read" flags -- that keeps the number of "flags" down and the table can just be (topic_id, user_id) -- no timestamps (see the sketch below).
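A sketch of that last scheme (the table name is assumed; a row means "this user has read this topic"):

-- when a user views a topic, flag it read (MySQL):
INSERT IGNORE INTO topic_read_flags (topic_id, user_id) VALUES (?, ?);

-- when a new post arrives, everything becomes unread again:
DELETE FROM topic_read_flags WHERE topic_id = ?;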
Use the functionality of the user's browser and add the last post ID to the link of the thread. Then, by using a:visited in CSS, you can display the threads that the user has not yet read.