Implementing Recursive Comments in PHP/MySQL

I'm trying to write a commenting system where people can comment on other comments, and these are displayed as recursive threads on the page (Reddit's commenting system is an example of what I'm trying to achieve). However, I'm confused about how to implement such a system without it being very slow and computationally expensive.
I imagine that each comment would be stored in a comments table and contain a parent_id, which would be a foreign key to another comment. My problem is how to get all of this data without a ton of queries, and then how to efficiently organize the comments into the order they belong in. Does anyone have any ideas on how best to implement this?

Try using a nested set model. It is described in Managing Hierarchical Data in MySQL.
The big benefit is that you don't have to use recursion to retrieve child nodes, and the queries are pretty straightforward. The downside is that inserting and deleting takes a little more work.
It also scales really well. I know of one extremely huge system which stores discussion hierarchies using this method.
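To make that concrete, here is a sketch of retrieval with nested sets, assuming hypothetical lft/rgt columns on a comments table (the names are mine, not from the article). The whole thread comes back in display order from a single query:

SELECT node.commentID, node.comment,
       COUNT(parent.commentID) - 1 AS depth
FROM comments AS node
JOIN comments AS parent
  ON node.lft BETWEEN parent.lft AND parent.rgt
WHERE node.pageID = :pageid
  AND parent.pageID = :pageid
GROUP BY node.commentID, node.comment
ORDER BY node.lft;   -- depth-first thread order; indent by the depth column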

Here's another site providing information on that method + some source code.

This is just a suggestion, but since I'm facing the same problem right now:
how about adding a sequence field (int) and a depth field to the comments table, and updating them as new comments are inserted?
The sequence field would serve the purpose of ordering the comments, and the depth field would indicate the nesting level of each comment.
The hard part would be doing the right updates as users insert new comments. I don't know yet how hard this is to implement, but I'm pretty sure that once implemented, it would perform better than nested-set-based solutions.
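To make the idea concrete, here is a minimal sketch of the read side (table and column names are assumptions). The whole thread comes back pre-ordered in one query, and depth drives the indentation:

SELECT id, comment, depth
FROM comments
WHERE post_id = :post_id
ORDER BY sequence;   -- thread order comes straight from one indexed column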

I created a small tutorial explaining the basic concepts behind the recursive approach. As people have said above, the recursive approach doesn't scale as well; however, inserts are far more efficient.
Here are the links:
http://www.evanpetersen.com/index.php/item/php-and-mysql-recursion.html
and
http://www.evanpetersen.com/index.php/item/php-mysql-revisited.html

I normally work with a parent-child system.
For example, consider the following:
Table comments(
commentID,
pageID,
userID,
comment
[, parentID]
)
parentID is a foreign key to commentID (in the same table) and is optional (can be NULL).
For selecting comments use this for a 'root' comment:
SELECT * FROM comments WHERE pageID=:pageid AND parentID IS NULL
And this for a child:
SELECT * FROM comments WHERE pageID=:pageid AND parentID=:parentid
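Wiring those two queries together gives a straightforward recursive renderer. A sketch using PDO (the function name is mine); note that it issues one query per comment, which is exactly the N+1 cost the question worries about:

// Recursively print a page's comment tree using the two queries above.
function printComments(PDO $db, $pageId, $parentId = null, $depth = 0) {
    if ($parentId === null) {
        $stmt = $db->prepare('SELECT * FROM comments WHERE pageID = ? AND parentID IS NULL');
        $stmt->execute(array($pageId));
    } else {
        $stmt = $db->prepare('SELECT * FROM comments WHERE pageID = ? AND parentID = ?');
        $stmt->execute(array($pageId, $parentId));
    }
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        echo str_repeat('    ', $depth) . htmlspecialchars($row['comment']) . "\n";
        printComments($db, $pageId, $row['commentID'], $depth + 1); // one query per node
    }
}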

I had to implement recursive comments too.
I broke my head on the nested set model; let me explain why.
Let's say you want comments for an article.
Let's call root comments the comments attached directly to the article, and reply comments the comments that answer another comment.
I noticed (unfortunately) that I wanted the root comments ordered by date descending, BUT I wanted the reply comments ordered by date ascending. Paradoxical!
So the nested set model didn't help me reduce the number of queries.
Here is my solution:
Create a comment table with following fields :
id
article_id
parent_id (nullable)
date_creation
email
whateverYouLike
sequence
depth
The 3 key fields of this implementation are parent_id, sequence and depth.
parent_id and depth help with inserting new nodes.
sequence is the real key field; it's a kind of nested set emulation.
Each time you insert a new root comment, its sequence is a multiple of x.
I chose x = 1000, which basically means I can have at most 1000 nested comments per root comment (that's the only drawback I found with this system, but the limit can easily be raised; it's enough for my needs for now).
The most recent root comment is the one with the greatest sequence number.
Now for reply comments there are two cases: a reply to a root comment, or a reply to a reply.
In both cases the algorithm is the same: take the parent's sequence and subtract one to get your sequence number.
Then you have to shift the sequence numbers that lie below the parent's sequence and above the base sequence, where the base sequence is the sequence of the root comment just below the root comment concerned.
I don't expect you to understand all of this since I'm not a very good explainer, but I hope it gives you new ideas.
(At least it worked better for me than the nested set model would have: fewer queries, which is the real goal.)
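For what it's worth, here is my reading of that insertion algorithm as a sketch (PDO and InnoDB assumed for the transaction and row lock; table and column names are guesses, and the base calculation assumes x = 1000):

// Insert a reply under $parentId, shifting the sub-thread to make room.
function insertReply(PDO $db, $articleId, $parentId, $comment) {
    $db->beginTransaction();

    $stmt = $db->prepare('SELECT sequence, depth FROM comments WHERE id = ? FOR UPDATE');
    $stmt->execute(array($parentId));
    $parent = $stmt->fetch(PDO::FETCH_ASSOC);

    // Base = the root comment just below this thread (roots sit at multiples of 1000).
    $base = floor(($parent['sequence'] - 1) / 1000) * 1000;

    // Shift everything between the base and the parent down one slot...
    $shift = $db->prepare('UPDATE comments SET sequence = sequence - 1
                            WHERE article_id = ? AND sequence > ? AND sequence < ?');
    $shift->execute(array($articleId, $base, $parent['sequence']));

    // ...so the new reply can take the slot directly under its parent.
    $ins = $db->prepare('INSERT INTO comments (article_id, parent_id, sequence, depth, comment)
                          VALUES (?, ?, ?, ?, ?)');
    $ins->execute(array($articleId, $parentId,
                        $parent['sequence'] - 1, $parent['depth'] + 1, $comment));

    $db->commit();
}

Reading the thread back is then a single SELECT ... ORDER BY sequence DESC.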

I'm taking a simple approach.
Save the root id (for comments, that's the post_id).
Save the parent_id.
Then fetch all comments with that post_id in one query and recursively order them on the client.
I don't care if there are 1000 comments; this happens in memory.
It's one database call, and that's the expensive part.
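A sketch of that approach with PDO (column names are assumptions): one query, then the tree is assembled from a parent => children map in memory:

// One round trip: fetch the flat list, bucket rows by parent, then walk the tree.
$stmt = $db->prepare('SELECT * FROM comments WHERE post_id = ? ORDER BY created_at');
$stmt->execute(array($postId));

$children = array();
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $children[(string)$row['parent_id']][] = $row;  // root comments land under ''
}

function renderThread(array $children, $parentId = '', $depth = 0) {
    if (empty($children[$parentId])) {
        return;
    }
    foreach ($children[$parentId] as $row) {
        echo str_repeat('    ', $depth) . htmlspecialchars($row['comment']) . "\n";
        renderThread($children, (string)$row['id'], $depth + 1);
    }
}

renderThread($children);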

Related

Loop through MySQL database until field = 'specified value'

I need some help please! Basically I have a system that has an unlimited number of categories, and the way it works is through unique IDs. The system finds the root folder and matches all subfolders based on their parent's UID. An endless loop...
But now I want to do the opposite of that in a single MySQL statement (if possible).
Basically I want it to do this... (by the way, this isn't my actual code, it's just how I want it to work)
SELECT UID FROM Table
WHERE UID = 'value'
--AND ALSO:
SELECT * FROM SameTable
WHERE UID = The Parent UID just fetched...
And do this until the UID = 'Specified Value'.
I seriously hope that makes sense!
Is it even possible? I know I could do it with multiple queries in a PHP loop, but that just feels like the long way around, and bad practice.
What you have is called "hierarchical data". You should read up on it. In short, there are three main ways to represent it in a two-dimensional table:
Adjacency list (what you have). You can scarcely handle it with a single query.
Materialized path (my favorite). Natural and readable. Not so efficient, though.
Nested set. The most complicated, yet the most powerful.
You can choose any system you like or stick with your current one. A single query is not a Holy Grail to pursue at any cost.
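To illustrate the materialized path option (the path format here is an assumption): each row stores its full ancestry, so the walk described in the question collapses into one query in either direction:

-- each row stores its full path, e.g. the row for folder 9 has path = '/1/4/9/'
SELECT * FROM folders
WHERE '/1/4/9/' LIKE CONCAT(path, '%')   -- folder 9 plus all of its ancestors
ORDER BY LENGTH(path);

SELECT * FROM folders
WHERE path LIKE '/1/4/%';                -- everything underneath folder 4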

MongoDB (PHP) - Custom "id", and OrderWith number

First, let me say that I'm new to MongoDB and document-oriented DBs in general.
After some trouble with embedded documents in MongoDB (being unable to select only a nested document, e.g. a single comment in a blog post), I redesigned the DB. Now I have two collections, posts and comments (not the real deal; I'm using the blog example for convenience's sake).
Example - posts collection document:
Array {
'_id' : MongoId,
'title' : 'Something',
'body' : 'Something awesome'
}
Example - comments document:
Array {
'_id' : MongoId,
'postId' : MongoId,
'userId' : MongoId,
'commentId' : 33,
'comment' : 'Punch the punch line!'
}
As you can see, I have multiple comment documents (as I said before, I want to be able to select a single comment, not an array of them).
My plan is this: I want to select a single comment from the collection using postId and commentId (commentId is unique only among comments with the same postId).
Oh, and commentId needs to be an int, so that I can use the value to calculate next and previous documents, as a sort of "orderWith" number.
Now I can get a comment like this:
URI: mongo.php?post=4de526b67cdfa94f0f000000&comment=4
Code: $comment = $collection->findOne(array("postId" => $theObjId, "commentId" => (int)$commentId));
I have a few questions.
Am I doing it right?
What is the best way to generate that kind of commentId?
What is the best way to ensure that commentId is unique among comments with the same postId (upsert?)?
How to deal with concurrent queries?
Am I doing it right?
This is a really difficult question. Does it work? Does it meet your performance needs? Are you comfortable maintaining it?
MongoDB doesn't have any notion of "normalization" or the "the one true way". You model your data in a way that works for you.
What is the best way to generate that kind of commentId?
What is the best way to ensure that commentId is unique among comments with the same postId (upsert?)?
This is really a complex problem. If you want to generate monotonically increasing integer IDs (like auto-increment), then you need a central authority for generating these integers. That doesn't tend to scale very well.
The commonly suggested method is to use the ObjectId/MongoId. That will give you a unique ID.
However, you really want an integer. So take a look at findAndModify. You can keep a "last_comment_id" on your post and then update it when creating a new comment.
How to deal with concurrent queries?
Why would concurrent queries be a problem? Two readers should be able to access the same data.
Are you worried about concurrent comments being created? Then see the findAndModify docs.
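A sketch of that counter pattern with the legacy PECL mongo driver (collection and field names follow the question; findAndModify needs driver 1.3+):

// Atomically claim the next comment number for this post.
$post = $db->posts->findAndModify(
    array('_id' => $postId),                          // the post being commented on
    array('$inc' => array('last_comment_id' => 1)),   // bump its counter
    null,
    array('new' => true)                              // return the updated document
);

$db->comments->insert(array(
    'postId'    => $postId,
    'userId'    => $userId,
    'commentId' => $post['last_comment_id'],          // unique within this post
    'comment'   => $text,
));

// A unique compound index is a cheap safety net against races:
$db->comments->ensureIndex(
    array('postId' => 1, 'commentId' => 1),
    array('unique' => true)
);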
I don't know if The Big Picture will allow you to do this, but here is how I'd do it.
I'd have an array of comments contained inside each post. This means no joins are needed. In your case, normalization of comments doesn't give any benefit. I'd replace CommentID with CreatedAt as the time of creation.
This will let you have an easy data model to work with, as well as the ability to sort it.
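In code, appending an embedded comment would look roughly like this (a sketch with the legacy PECL mongo driver; field names are mine):

// Push a comment onto the post's embedded array; createdAt doubles as the sort key.
$db->posts->update(
    array('_id' => $postId),
    array('$push' => array('comments' => array(
        'userId'    => $userId,
        'createdAt' => new MongoDate(),
        'comment'   => $text,
    )))
);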

mysql: use SET or lots of columns?

I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there is a fixed number of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with yes/no enums indicating membership in each category? Or should I use the MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event may now also include one or more INSERT/UPDATE/DELETEs to the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked, so if your database is busy with a significant number of writes you're going to have larger contention problems than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT * FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear search of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and are changing them, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries because of contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
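The contrast in query shape makes the point (column names are hypothetical):

-- "lots of columns": the index on category_x can satisfy this directly
SELECT id FROM events WHERE category_x = 1;

-- SET: no index can help here, so MySQL scans every row
SELECT id FROM events WHERE FIND_IN_SET('category_x', categories) > 0;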
It's good that the number of categories is fixed; if it weren't, you couldn't use either approach.
Check the "Why You Shouldn't Use SET" section on the page you linked. I think that should give you a comprehensive guide.
I think the most important point is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many to many relationship, as echo says, but a simple xref table will leave you with a problem: If you want to query for all descendants of any given node, then you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes and updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions subtree, you query for all categories with left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
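Using the article's example tree (lft/rgt columns as in the article), the Televisions subtree and its size each come back in one indexed query:

-- the TELEVISIONS subtree (lft = 2, rgt = 9 in the article's example)
SELECT * FROM nested_category
WHERE lft >= 2 AND rgt <= 9
ORDER BY lft;

-- number of descendants, without fetching their rows
SELECT (rgt - lft - 1) / 2 AS descendants
FROM nested_category
WHERE name = 'TELEVISIONS';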
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
event_id int,
event_category_id int,
foreign key(event_id) references event(id),
foreign key (event_category_id) references event_category(id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.
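Querying membership is then a single indexed lookup, or a join when you need the full event rows (a sketch against the table above):

-- all event ids in category #52
SELECT event_id FROM event_category_event_xref WHERE event_category_id = 52;

-- or the events themselves
SELECT e.*
FROM event e
JOIN event_category_event_xref x ON x.event_id = e.id
WHERE x.event_category_id = 52;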

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP and MySQL based, and I want to do things right. The user table is quite massive and is expected to grow even more with the new version, as a lot of money will be spent on promotion.
The current version, which I guess is 7-8 years old, was probably done by someone not very knowledgeable in PHP and MySQL, so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile, and I have to be able to search by at least 30-40 of them.
As you can imagine, I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two: one with the most used and searched columns, and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I tried this model once with a table of 120+ attributes (growing by 5-10 every year), adding about 100k+ rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a generous length, thus bloating the indexes. In my case, attribs could be from 3 to 130 chars, and the value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to run on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT(), due to length limitations.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either have lookup tables for attribute values or use ENUM().
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I'd have to join against that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user, such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile-related information can go in its own table: an infinitely expandable table with a key => value nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
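In MySQL terms, a sketch of those two tables (column sizes are guesses; note that OPTION is a reserved word in MySQL, hence option_name):

create table user (
    id int auto_increment primary key,
    firstname varchar(100),
    lastname varchar(100),
    email varchar(255),
    username varchar(50),
    role_id int,
    registration_date datetime
);

create table user_profile (
    user_id int,
    option_name varchar(64),
    value text,
    primary key (user_id, option_name),
    foreign key (user_id) references user(id)
);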
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than having 100 (and growing) columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing I would do is create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine: 1 mln rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You should definitely go down the two-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points-scoring system:
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things, including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes, i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to that stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys, using transactions and the InnoDB engine.

PHP join help with two tables

I am just learning PHP as I go along, and I'm completely lost here. I've never really used JOIN before, and I think I need to here, but I don't know. I'm not expecting anyone to do it for me, but if you could just point me in the right direction it would be amazing. I've tried reading up on joins, but there are like 20 different methods and I'm just lost.
Basically, I hand coded a forum, and it works fine but is not efficient.
I have board_posts (for posts) and board_forums (for forums, the categories as well as the sections).
The part I'm redoing is how I get the information about the last post for the index page. The way I set it up, to avoid using joins, the info for the latest post is stored in the board_forums table. So say there is a section called "Off Topic"; it has fields like "forum_lastpost_username/userid/posttitle/posttime", which I update whenever a user posts. But this is bad, and I'm trying to grab it all dynamically and get rid of those fields.
Right now my query is just like:
SELECT * FROM board_forums WHERE forum_parent='$forum_id'
And then I have the stuff where I grab the info for that forum (name, description, etc) and all the data for the last post is there:
$last_thread_title = $forumrow["forum_lastpost_title"];
$last_thread_time = $forumrow["forum_lastpost_time"];
$lastpost_username = $forumrow["forum_lastpost_username"];
$lastpost_threadid = $forumrow["forum_lastpost_threadid"];
But I need to get rid of that and get it from board_posts. The way board_posts is set up, if a row is a thread, post_parentpost is NULL; if it's a reply, that field holds the id of the thread (the first post of the topic). So I need to grab the latest post_date and see which user posted it, THEN check whether post_parentpost is NULL. If it is NULL, the last post is a new thread, so I can get the title and user right there; if it's not, I need to get the info (title, id) of the first post in that thread, which I can find by looking up the id stored in post_parentpost and taking the title from that row.
Does that make any sense? If so please help me out :(
Any help is greatly appreciated!!!!
Updating board_forums whenever a post or a reply is inserted is, performance-wise, not the worst idea. To display the index page you then only have to select data from one table, board_forums, which is definitely much faster than hitting a second table for the "last post" information, even with a clever join.
You are better off just updating the stats on each action: new post, delete post, etc.
The other actions would not likely require any stats update (deletion of a thread would trigger a forum update, to show one less topic in the topic count).
Think about all the actions a user performs; in most cases you don't need to update any stats, so computing the counts on the fly is very inefficient, and you are right to think so.
It looks like you've already done the right thing.
If you were to join, you'd do it like this:
SELECT * FROM board_forums
JOIN board_posts ON board_posts.forum_id = board_forums.id
WHERE forum_parent = '$forum_id'
The problem with that is that it gets you every post, which is not useful (and very slow). What you would want is something like this:
SELECT * FROM board_forums
JOIN board_posts ON board_posts.forum_id = board_forums.id
WHERE forum_parent = '$forum_id'
ORDER BY board_posts.id DESC LIMIT 1
except that doesn't do what you want either: the ORDER BY and LIMIT apply to the whole result set, not to each joined forum, so you'd get one row in total rather than the last post per forum. This is the classic greatest-n-per-group problem, and the naive way out is to fetch every row and scan it in code (which sucks).
In short, don't worry. Use joins for the actual case where you do want to load all forums and all posts in one hit.
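For completeness, the standard workaround for that greatest-n-per-group problem is to join against a per-forum MAX() subquery. A sketch, assuming board_posts has forum_id and post_date columns:

SELECT f.*, p.*
FROM board_forums f
JOIN board_posts p ON p.forum_id = f.id
JOIN (
    SELECT forum_id, MAX(post_date) AS last_date
    FROM board_posts
    GROUP BY forum_id
) latest ON latest.forum_id = p.forum_id
        AND latest.last_date = p.post_date
WHERE f.forum_parent = '$forum_id'
-- ties on post_date can yield duplicate rows; keying on MAX(id) instead avoids that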
The simple solution will result in numerous queries, some optional, as you've already discovered.
The classic approach to this is to cache the results and only refresh them once in a while. The cache doesn't have to live long; even two or three seconds on a busy site will make a significant difference.
De-normalizing the data into a table you're already reading anyway will help. This approach saves you from figuring out the optional queries, and it can be a bit of a cheap win because it's just one more update when an insert is already happening. But it shifts some data integrity to the application.
As an aside, you might be running into the recursive-query problem with your threads. Relational databases do not store hierarchical data all that well if you use a "simple" algorithm. A better way is something sometimes called 'set trees' (nested sets). It's a bit hard to Google, unfortunately, so here are some links.
