So I'm trying to create a comment system in which you can reply to comments that are already replies (allowing you to create theoretically infinite threads of replies). I want them to display in chronological order (newest on top), but of course the replies should be directly underneath the original comment. If there are multiple comments replying to the same comment, the replies should also be in chronological order (still underneath the original comment). I also want to limit the number of comment groups (a set of comments with a single comment that is not a reply at all) to, say, 25. How should I set up the MySQL table, and what sort of query would I use to extract what I want?
Here's a simplified version of my DB:
ID int(11) NOT NULL AUTO_INCREMENT,
DatePosted datetime NOT NULL,
InReplyTo int(11) NOT NULL DEFAULT '0',
Sorry if this is kind of confusing, I'm not sure how to word it any differently. I've had this problem in the back of my mind for a couple months now, and every time I solve one problem, I end up with another...
There are many ways. Here's one approach that I like (and use on a regular basis).
The database
Consider the following database structure:
CREATE TABLE comments (
id int(11) unsigned NOT NULL auto_increment,
parent_id int(11) unsigned default NULL,
parent_path varchar(255) NOT NULL,
comment_text varchar(255) NOT NULL,
date_posted datetime NOT NULL,
PRIMARY KEY (id)
);
your data will look like this:
+-----+-------------------------------------+--------------------------+---------------+
| id | parent_id | parent_path | comment_text | date_posted |
+-----+-------------------------------------+--------------------------+---------------+
| 1 | null | / | I'm first | 1288464193 |
| 2 | 1 | /1/ | 1st Reply to I'm First | 1288464463 |
| 3 | null | / | Well I'm next | 1288464331 |
| 4 | null | / | Oh yeah, well I'm 3rd | 1288464361 |
| 5 | 3 | /3/ | reply to I'm next | 1288464566 |
| 6 | 2 | /1/2/ | this is a 2nd level reply| 1288464193 |
... and so on...
It's fairly easy to select everything in a useable way:
select id, parent_path, parent_id, comment_text, date_posted
from comments
order by parent_path, date_posted;
ordering by parent_path, date_posted will usually produce results in the order you'll need them when you generate your page; but you'll want to be sure that you have an index on the comments table that'll properly support this -- otherwise the query works, but it's really, really inefficient:
create index comments_hier_idx on comments (parent_path, date_posted);
For any given single comment, it's easy to get that comment's entire tree of child-comments. Just add a where clause:
select id, parent_path, parent_id, comment_text, date_posted
from comments
where parent_path like '/1/%'
order by parent_path, date_posted;
the added where clause will make use of the same index we already defined, so we're good to go.
Notice that we haven't used the parent_id yet. In fact, it's not strictly necessary. But I include it because it allows us to define a traditional foreign key to enforce referential integrity and to implement cascading deletes and updates if we want to. Foreign key constraints and cascading rules are only available in INNODB tables:
ALTER TABLE comments ENGINE=InnoDB;
ALTER TABLE comments
ADD FOREIGN KEY ( parent_id ) REFERENCES comments
ON DELETE CASCADE
ON UPDATE CASCADE;
Managing The Hierarchy
In order to use this approach, of course, you'll have to make sure you set the parent_path properly when you insert each comment. And if you move comments around (which would admittedly be a strange usecase), you'll have to make sure you manually update each parent_path of each comment that is subordinate to the moved comment. ... but those are both fairly easy things to keep up with.
If you really want to get fancy (and if your db supports it), you can write triggers to manage the parent_path transparently -- I'll leave this an exercise for the reader, but the basic idea is that insert and update triggers would fire before a new insert is committed. they would walk up the tree (using the parent_id foreign key relationship), and rebuild the value of the parent_path accordingly.
It's even possible to break the parent_path out into a separate table that is managed entirely by triggers on the comments table, with a few views or stored procedures to implement the various queries you need. Thus completely isolating your middle-tier code from the need to know or care about the mechanics of storing the hierarchy info.
Of course, none of the fancy stuff is required by any means -- it's usually quite sufficient to just drop the parent_path into the table, and write some code in your middle-tier to ensure that it gets managed properly along with all the other fields you already have to manage.
Imposing limits
MySQL (and some other databases) allows you to select "pages" of data using the the LIMIT clause:
SELECT * FROM mytable LIMIT 25 OFFSET 0;
Unfortunately, when dealing with hierarchical data like this, the LIMIT clause alone won't yield the desired results.
-- the following will NOT work as intended
select id, parent_path, parent_id, comment_text, date_posted
from comments
order by parent_path, date_posted
LIMIT 25 OFFSET 0;
Instead, we need to so a separate select at the level where we want to impose the limit, then we join that back together with our "sub-tree" query to give the final desired results.
Something like this:
select
a.*
from
comments a join
(select id, parent_path
from comments
where parent_id is null
order by parent_path, post_date DESC
limit 25 offset 0) roots
on a.parent_path like concat(roots.parent_path,roots.id,'/%') or a.id=roots.id)
order by a.parent_path , post_date DESC;
Notice the statement limit 25 offset 0, buried in the middle of the inner select. This statement will retrieve the most recent 25 "root-level" comments.
[edit: you may find that you have to play with stuff a bit to get the ability to order and/or limit things exactly the way you like. this may include adding information within the hierarchy that's encoded in parent_path. for example: instead of /{id}/{id2}/{id3}/, you might decide to include the post_date as part of the parent_path: /{id}:{post_date}/{id2}:{post_date2}/{id3}:{post_date3}/. This would make it very easy to get the order and hierarchy you want, at the expense of having to populate the field up-front, and manage it as the data changes]
hope this helps.
good luck!
You should consider nesting your comments in a tree - I'm not that well familiar with data trees, but I can accomplish something relatively easy - I'm open to any suggestions (and explanations) for optimizing the code - but an idea would be something like this:
<?php
$mysqli = new mysqli('localhost', 'root', '', 'test');
/** The class which holds the comments */
class Comment
{
public $id, $parent, $content;
public $childs = array();
public function __construct($id, $parent, $content)
{
$this->id = $id;
$this->parent = $parent;
$this->content = $content;
}
public function addChild( Comment $obj )
{
$this->childs[] = $obj;
}
}
/** Function to locate an object from it's id to help nest the comments in a hieraci */
function locateObject( $id, $comments )
{
foreach($comments as $commentObject)
{
if($commentObject->id == $id)
return $commentObject;
if( count($commentObject->childs) > 0 )
return locateObject($id, $commentObject->childs);
}
}
/** Function to recursively show comments and their nested child comments */
function showComments( $commentsArray )
{
foreach($commentsArray as $commentObj)
{
echo $commentObj->id;
echo $commentObj->content;
if( count($commentObj->childs) > 0 )
showComments($commentObj->childs);
}
}
/** SQL to select the comments and order dem by their parents and date */
$sql = "SELECT * FROM comment ORDER BY parent, date ASC";
$result = $mysqli->query($sql);
$comments = array();
/** A pretty self-explainatory loop (I hope) */
while( $row = $result->fetch_assoc() )
{
$commentObj = new Comment($row["id"], $row["parent"], $row["content"]);
if($row["parent"] == 0)
{
$comments[] = $commentObj;
continue;
}
$tObj = locateObject($row["parent"], $comments);
if( $tObj )
$tObj->addChild( $commentObj );
else
$comments[] = $commentObj;
}
/** And then showing the comments*/
showComments($comments);
?>
I hope you get the general idea, and I'm certain that some of the other users here can provide with some experienced thoughts about my suggestion and helt optimize it.
In database, you may create a table with foreign key column (parent_comment), which references to comments table itself. For example:
CREATE TABLE comments (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parent_comment INT FOREIGN KEY REFERENCES comments(id),
date_posted DATETIME,
...)
In order to show comments to single item, you'll have to select all comments for particular item, and parse them recursively in your script with depth-first algorithm. Chronological order should be taken into account in traversal algorithm.
I would consider a nested set, for storing this type of hierarchical data. See http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/ for an example.
You might find this method helpful which involves a single call to a non-recursive stored procedure.
Full script can be found here : http://pastie.org/1259785
Hope it helps :)
Example stored procedure call:
call comments_hier(1);
Example php script:
<?php
$conn = new mysqli("localhost", "foo_dbo", "pass", "foo_db", 3306);
$result = $conn->query(sprintf("call comments_hier(%d)", 3));
while($row = $result->fetch_assoc()){
...
}
$result->close();
$conn->close();
?>
SQL script:
drop table if exists comments;
create table comments
(
comment_id int unsigned not null auto_increment primary key,
subject varchar(255) not null,
parent_comment_id int unsigned null,
key (parent_comment_id)
)engine = innodb;
insert into comments (subject, parent_comment_id) values
('Comment 1',null),
('Comment 1-1',1),
('Comment 1-2',1),
('Comment 1-2-1',3),
('Comment 1-2-2',3),
('Comment 1-2-2-1',5),
('Comment 1-2-2-2',5),
('Comment 1-2-2-2-1',7);
delimiter ;
drop procedure if exists comments_hier;
delimiter #
create procedure comments_hier
(
in p_comment_id int unsigned
)
begin
declare v_done tinyint unsigned default 0;
declare v_depth smallint unsigned default 0;
create temporary table hier(
parent_comment_id smallint unsigned,
comment_id smallint unsigned,
depth smallint unsigned default 0
)engine = memory;
insert into hier select parent_comment_id, comment_id, v_depth from comments where comment_id = p_comment_id;
/* http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html */
create temporary table tmp engine=memory select * from hier;
while not v_done do
if exists( select 1 from comments c inner join hier on c.parent_comment_id = hier.comment_id and hier.depth = v_depth) then
insert into hier
select c.parent_comment_id, c.comment_id, v_depth + 1 from comments c
inner join tmp on c.parent_comment_id = tmp.comment_id and tmp.depth = v_depth;
set v_depth = v_depth + 1;
truncate table tmp;
insert into tmp select * from hier where depth = v_depth;
else
set v_done = 1;
end if;
end while;
select
c.comment_id,
c.subject,
p.comment_id as parent_comment_id,
p.subject as parent_subject,
hier.depth
from
hier
inner join comments c on hier.comment_id = c.comment_id
left outer join comments p on hier.parent_comment_id = p.comment_id
order by
hier.depth, hier.comment_id;
drop temporary table if exists hier;
drop temporary table if exists tmp;
end #
delimiter ;
call comments_hier(1);
call comments_hier(5);
Related
I want to execute a query where I can find one ID in a list of ID.
table user
id_user | name | id_site
-------------------------
1 | james | 1, 2, 3
1 | brad | 1, 3
1 | suko | 4, 5
and my query (doesn't work)
SELECT * FROM `user` WHERE 3 IN (`id_site`)
This query work (but doesn't do the job)
SELECT * FROM `user` WHERE 3 IN (1, 2, 3, 4, 6)
That's not how IN works. I can't be bothered to explain why, just read the docs
Try this:
SELECT * FROM `user` WHERE FIND_IN_SET(3,`id_site`)
Note that this requires your data to be 1,2,3, 1,3 and 4,5 (ie no spaces). If this is not an option, try:
SELECT * FROM `user` WHERE FIND_IN_SET(3,REPLACE(`id_site`,' ',''))
Alternatively, consider restructuring your database. Namely:
CREATE TABLE `user_site_links` (
`id_user` INT UNSIGNED NOT NULL,
`id_site` INT UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`,`site_id`)
);
INSERT INTO `user_site_links` VALUES
(1,1), (1,2), (1,3),
(2,1), (2,3),
(3,4), (3,5);
SELECT * FROM `user` JOIN `user_site_links` USING (`id_user`) WHERE `id_site` = 3;
Try this: FIND_IN_SET(str,strlist)
NO! For relation databases
Your table doesn't comfort first normal form ("each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain") of a database and you:
use string field to contain numbers
store multiple values in one field
To work with field like this you would have to use FIND_IN_SET() or store data like ,1,2,3, (note colons or semicolons or other separator in the beginning and in the end) and use LIKE "%,7,%" to work in every case. This way it's not possible to use indexes[1][2].
Use relation table to do this:
CREATE TABLE user_on_sites(
user_id INT,
site_id INT,
PRIMARY KEY (user_id, site_id),
INDEX (user_id),
INDEX (site_id)
);
And join tables:
SELECT u.id, u.name, uos.site_id
FROM user_on_sites AS uos
INNER JOIN user AS u ON uos.user_id = user.id
WHERE uos.site_id = 3;
This way you can search efficiently using indexes.
The problem is that you are searching within several lists.
You need something more like:
SELECT * FROM `user` WHERE id_site LIKE '%3%';
However, that will also select 33, 333 and 345 so you want some more advanced text parsing.
The WHERE IN clause is useful to replace many OR conditions.
For exemple
SELECT * FROM `user` WHERE id IN (1,2,3,4)
is cleaner than
SELECT * FROM `user` WHERE id=1 OR id=2 OR id=3 OR id=4
You're just trying to use it in a wrong way.
Correct way :
WHERE `field` IN (list_item1, list_item2 [, list_itemX])
The requirement is to find the first available identifier where an identifier is an alphanumeric string, such as:
ABC10000
ABC10345
ABC88942
ABC90123
The database table has a structure such as:
id, user, identifier
Note that the alpha component ABC is consistent throughout and won't change. The numeric component should be between 10000 and 99999.
How best to tackle this? It does not seem like an overly complex problem - looking for simplest solution using either just MySQL or a combination of SQL and PHP. The current solution pulls each record from the database into an array and then loops from 10000 onwards, prepending ABC and checking availability, which seems like it could be improved significantly.
Edit: Original question was not clear enough in that a certain amount of identifiers have been assigned already, and I am looking to fill in the gaps. From the short list I provided, the next available would be ABC10001. Eventually, however, it would be ABC10346 and then ABC88943 and so on
Edit: Sorry for a poorly structured question. To avoid any further confusion, here is the actual table structure:
CREATE TABLE IF NOT EXISTS `User_Loc` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user` int(11) DEFAULT NULL,
`value` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_64FB41DA17323CBC` (`user`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=4028 ;
You have to self join the table and look for the first NULL value in the joined table
SELECT CONCAT('ABC', SUBSTRING(t1.value, 4)+1) AS next_value
FROM test t1
LEFT JOIN test t2 on SUBSTRING(t1.value, 4)+1 = SUBSTRING(t2.value, 4)
WHERE ISNULL(t2.value)
ORDER BY t1.value ASC
LIMIT 1
http://sqlfiddle.com/#!2/d69105/22
edit
With the comment about some 'specialities' at ncatnow. There are slight adjusments to make with the help of subselects for ridding the 'ABC' and UNION for having a default value
SELECT
CONCAT('ABC', t1.value+1) AS next_value
FROM
((SELECT '09999' AS value) UNION (SELECT SUBSTRING(value, 4) AS value FROM test)) t1
LEFT JOIN
((SELECT '09999' AS value) UNION (SELECT SUBSTRING(value, 4) AS value FROM test)) t2
ON t1.value+1 = t2.value
WHERE
ISNULL(t2.value)
AND t1.value >= '09999'
ORDER BY t1.value ASC
LIMIT 1
http://sqlfiddle.com/#!2/28acf6/50
Similar to the above reply by #HerrSerker, but this will cope with existing identifiers which have the numeric part starting with a zero.
SELECT CONCAT('ABC',SUBSTRING(CONCAT('00000', CAST((CAST(SUBSTRING(a.identifier, 4) AS SIGNED) + 1) AS CHAR)), -5)) AS NextVal
FROM SomeTable a
LEFT OUTER JOIN SomeTable b
ON b.identifier = CONCAT('ABC',SUBSTRING(CONCAT('00000', CAST((CAST(SUBSTRING(a.identifier, 4) AS SIGNED) + 1) AS CHAR)), -5))
WHERE b.identifier IS NULL
ORDER BY NextVal
LIMIT 1
what comes to my mind is one table with all indentifiers and use this sql
SELECT identifier FROM allIdentifiersTable WHERE identifier NOT IN (SELECT identifier FROM yourTable) LIMIT 1
Reconsidering this from your edit, you should go the PHP route and add another table or other means to store the last filled id:
$identifier = 0;
$i = mysql_query("SELECT identifier FROM last_identifier");
if ($row = mysql_fetch_assoc($i)) $identifier = $row["identifier"];
if ($identifier < 10000) $identifier = 10000;
do {
$identifier += 1;
$result = mysql_query("
INSERT IGNORE INTO table (id, user, identifier)
VALUES ('[...]', '[...]',
'ABC" . str_pad($identifier, 5, "0", STR_PAD_LEFT) . "'
)");
if (mysql_affected_rows($result) < 1) continue;
} while (false);
mysql_query("UPDATE last_identifier SET identifier = '$identifier'");
Of course, you need to add a UNIQUE index on the identifier field.
I have a query that gets all the info I need for a messaging system's main page (including unread message count, etc)... but it currently retrieves the original threads message. I would like to augment the below query to grab the most recent message in each thread instead.
This query is very close, however my mediocre SQL skills are keeping me from wrapping things up...
$messages = array();
$unread_messages_total = 0;
$messages_query = "
SELECT m.*
, COUNT(r.id) AS num_replies
, MAX(r.datetime) AS reply_datetime
, (m.archived NOT LIKE '%,".$cms_user['id'].",%') AS message_archive
, (m.viewed LIKE '%,".$cms_user['id'].",%') AS message_viewed
, SUM(r.viewed NOT LIKE '%,".$cms_user['id'].",%') AS unread_replies
, CASE
WHEN MAX(r.datetime) >= m.datetime THEN MAX(r.datetime)
ELSE m.datetime
END AS last_datetime
FROM directus_messages AS m
LEFT JOIN directus_messages as r ON m.id = r.reply
WHERE m.active = '1'
AND (m.to LIKE '%,".$cms_user['id'].",%' OR m.to = 'all' OR m.from = '".$cms_user['id']."')
GROUP BY m.id
HAVING m.reply = '0'
ORDER BY last_datetime DESC";
foreach($dbh->query($messages_query) as $row_messages){
$messages[] = $row_messages;
$unread_messages_total += (strpos($row_messages['archived'], ','.$cms_user['id'].',') === false && ( (strpos($row_messages['viewed'], ','.$cms_user['id'].',') === false && $row_messages['unread_replies'] == NULL) || ($row_messages['unread_replies']>0 && $row_messages['unread_replies'] != NULL) ) )? 1 : 0;
}
Thanks in advance for any help you can provide!
EDIT: (Database)
CREATE TABLE `cms_messages` (
`id` int(10) NOT NULL auto_increment,
`active` tinyint(1) NOT NULL default '1',
`subject` varchar(255) NOT NULL default '',
`message` text NOT NULL,
`datetime` datetime NOT NULL default '0000-00-00 00:00:00',
`reply` int(10) NOT NULL default '0',
`from` int(10) NOT NULL default '0',
`to` varchar(255) NOT NULL default '',
`viewed` varchar(255) NOT NULL default ',',
`archived` varchar(255) NOT NULL default ',',
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
EDIT 2: (Requirements)
Return all parent messages for a specific user_id: $cms_user['id']
Return the number of replies for that parent message: num_replies
Return the number of unread replies for that parent message: unread_replies
Return the date of the parent message or it's most recent reply: last_datetime
Return whether the message is in the archive: message_archive
Return whether the message has been viewed: message_viewed
Return all messages in DESC datetime order
Return the newest message, from the parent or replies if there are some (like gmail)
If you have only 2 levels of messages (i.e., only parent messages and direct answers), you might try this query:
select
root_message.id,
root_message.active,
root_message.subject,
case
when max_reply_id.max_id is null then
root_message.message
else
reply_message.message
end as message,
root_message.datetime,
root_message.reply,
root_message.from,
root_message.to,
root_message.viewed,
root_message.archived
from
-- basic data
cms_messages as root_message
-- ID of last reply for every root message
left join (
select
max(id) as max_id,
reply as parent_id
from
cms_messages
where
reply <> 0
group by
reply
) as max_reply_id on max_reply_id.parent_id = root_message.id
left join cms_messages as reply_message on reply_message.id = max_reply_id.max_id
where
root_message.reply = 0
It uses subquery max_reply_id as source of data to select ID of the latest answer. If it exists (i.e., if there are answers), reply_message.message is used. If it does not exist (no answer has been found for root message), then root_message.message is used.
You should also think about structure of table. E.g., it would make more sense if reply contained either NULL, if it is parent message, or ID of existing message. Currently, you set it to 0 (ID of non-existent message), which is wrong. Types of viewed and archived are also weird.
Edit: you should also avoid using having clause. Use where instead, when possible.
Here's a new query that should fulfil your requirements. If there is any problem with it (i.e., if it returns wrong data), let me know.
Like the first query, it:
uses subquery reply_summary to accumulate data about replies (ID of last reply, number of replies and number of unread replies);
joins this subquery to the base table;
joins cms_messages as reply_message to the subquery, based on reply_summary.max_reply_id, to get data about the last reply (message, datetime).
I've simplified the way how you determine last_datetime - it now takes either time of last reply (if there is any reply), or time of original post (when no replies are found).
I have not filtered replies by from and to fields. If it is necessary, where clause of reply_summary subquery should be updated.
select
parent_message.id,
parent_message.subject,
parent_message.message,
parent_message.from,
parent_message.to,
coalesce(reply_summary.num_replies, 0) as num_replies,
last_reply_message.datetime as reply_datetime,
(parent_message.archived NOT LIKE '%,{$cms_user['id']},%') AS message_archive,
(parent_message.viewed LIKE '%,{$cms_user['id']},%') AS message_viewed,
reply_summary.unread_replies,
coalesce(last_reply_message.message, parent_message.message) as last_message,
coalesce(last_reply_message.datetime, parent_message.datetime) as last_datetime
from
cms_messages as parent_message
left join (
select
reply as parent_id,
max(id) as last_reply_id,
count(*) as num_replies,
sum(viewed not like '%,{$cms_user['id']},%') as unread_replies
from
cms_messages
where
reply <> 0 and
active = 1
group by
reply
) as reply_summary on reply_summary.parent_id = parent_message.id
left join cms_messages as last_reply_message on last_reply_message.id = reply_summary.last_reply_id
where
parent_message.reply = 0 and
parent_message.active = 1 and
(parent_message.to like '%,{$cms_user['id']},%' or parent_message.to = 'all' or parent_message.from = '{$cms_user['id']}')
order by
last_datetime desc;
your problem is that you are fetching only m records no matter what the order of the r records.
try adding
SELECT m.*, r.*
or
SELECT r.*, m.*
if you are using PDO::FETCH_ASSOC as your PDO fetch mode (assuming you are using PDO to access your database), the result will be an associative array where if the result set contains multiple columns with the same name, PDO::FETCH_ASSOC returns only a single value per column name. not sure which order takes presidence, so you would have to try both.
if your columns are defined in the right order, they will return the r.* value if one exists, or the m.* value if no r records exist. does this make sense? this way your result set will contain the latest record no matter which table (m or r) contains them.
http://www.php.net/manual/en/pdo.constants.php
I am afraid that you wont be able to solve this problem with a single query. Either you have to use more queries and gather the informations in the surrounding code or you will have to redesign the database structure for your messaging system a litte (tables: threads, posts, etc.). If you decide to redesign the database structure, you should also take care of the way you handle the viewed and archived fields. The way you use the fields (varchar 255 only!) might work for some users, but as soon as there are more users and higher user IDs your message system will break down.
This post is taking a substantial amount of time to type because I'm trying to be as clear as possible, so please bear with me if it is still unclear.
Basically, what I have are a table of posts in the database which users can add privacy settings to.
ID | owner_id | post | other_info | privacy_level (int value)
From there, users can add their privacy details, allowing it to be viewable by all [privacy_level = 0), friends (privacy_level = 1), no one (privacy_level = 3), or specific people or filters (privacy_level = 4). For privacy levels specifying specific people (4), the query will reference the table "post_privacy_includes_for" in a subquery to see if the user (or a filter the user belongs to) exists in a row in the table.
ID | post_id | user_id | list_id
Also, the user has the ability to prevent some people from viewing their post in within a larger group by excluding them (e.g., Having it set for everyone to view but hiding it from a stalker user). For this, another reference table is added, "post_privacy_exclude_from" - it looks identical to the setup as "post_privacy_includes_for".
My problem is that this does not scale. At all. At the moment, there are about 1-2 million posts, the majority of them set to be viewable by everyone. For each post on the page it must check to see if there is a row that is excluding the post from being shown to the user - this moves really slow on a page that can be filled with 100-200 posts. It can take up to 2-4 seconds, especially when additional constraints are added to the query.
This also creates extremely large and complex queries that are just... awkward.
SELECT t.*
FROM posts t
WHERE ( (t.privacy_level = 3
AND t.owner_id = ?)
OR (t.privacy_level = 4
AND EXISTS
( SELECT i.id
FROM PostPrivacyIncludeFor i
WHERE i.user_id = ?
AND i.thought_id = t.id)
OR t.privacy_level = 4
AND t.owner_id = ?)
OR (t.privacy_level = 4
AND EXISTS
(SELECT i2.id
FROM PostPrivacyIncludeFor i2
WHERE i2.thought_id = t.id
AND EXISTS
(SELECT r.id
FROM FriendFilterIds r
WHERE r.list_id = i2.list_id
AND r.friend_id = ?))
OR t.privacy_level = 4
AND t.owner_id = ?)
OR (t.privacy_level = 1
AND EXISTS
(SELECT G.id
FROM Following G
WHERE follower_id = t.owner_id
AND following_id = ?
AND friend = 1)
OR t.privacy_level = 1
AND t.owner_id = ?)
OR (NOT EXISTS
(SELECT e.id
FROM PostPrivacyExcludeFrom e
WHERE e.thought_id = t.id
AND e.user_id = ?
AND NOT EXISTS
(SELECT e2.id
FROM PostPrivacyExcludeFrom e2
WHERE e2.thought_id = t.id
AND EXISTS
(SELECT l.id
FROM FriendFilterIds l
WHERE l.list_id = e2.list_id
AND l.friend_id = ?)))
AND t.privacy_level IN (0, 1, 4))
AND t.owner_id = ?
ORDER BY t.created_at LIMIT 100
(mock up query, similar to the query I use now in Doctrine ORM. It's a mess, but you get what I am saying.)
I guess my question is, how would you approach this situation to optimize it? Is there a better way to set up my database? I'm willing to completely scrap the method I have currently built up, but I wouldn't know what to move onto.
Thanks guys.
Updated: Fix the query to reflect the values I defined for privacy level above (I forgot to update it because I simplified the values)
Your query is too long to give a definitive solution for, but the approach I would follow is to simply the data lookups by converting the sub-queries into joins, and then build the logic into the where clause and column list of the select statement:
select t.*, i.*, r.*, G.*, e.* from posts t
left join PostPrivacyIncludeFor i on i.user_id = ? and i.thought_id = t.id
left join FriendFilterIds r on r.list_id = i.list_id and r.friend_id = ?
left join Following G on follower_id = t.owner_id and G.following_id = ? and G.friend=1
left join PostPrivacyExcludeFrom e on e.thought_id = t.id and e.user_id = ?
(This might need expanding: I couldn't follow the logic of the final clause.)
If you can get the simple select working fast AND including all the information needed, then all you need to do is build up the logic in the select list and where clause.
Had a quick stab at simplifying this without re-working your original design too much.
Using this solution your web page can now simply call the following stored procedure to get a list of filtered posts for a given user within a specified period.
call list_user_filtered_posts( <user_id>, <day_interval> );
The whole script can be found here : http://pastie.org/1212812
I haven't fully tested all of this and you may find this solution isn't performant enough for your needs but it may help you in fine tuning/modifying your existing design.
Tables
Dropped your post_privacy_exclude_from table and added a user_stalkers table which works pretty much like the inverse of user_friends. Kept the original post_privacy_includes_for table as per your design as this allows a user restrict a specific post to a subset of people.
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists user_friends;
create table user_friends
(
user_id int unsigned not null,
friend_user_id int unsigned not null,
primary key (user_id, friend_user_id)
)
engine=innodb;
drop table if exists user_stalkers;
create table user_stalkers
(
user_id int unsigned not null,
stalker_user_id int unsigned not null,
primary key (user_id, stalker_user_id)
)
engine=innodb;
drop table if exists posts;
create table posts
(
post_id int unsigned not null auto_increment primary key,
user_id int unsigned not null,
privacy_level tinyint unsigned not null default 0,
post_date datetime not null,
key user_idx(user_id),
key post_date_user_idx(post_date, user_id)
)
engine=innodb;
drop table if exists post_privacy_includes_for;
create table post_privacy_includes_for
(
post_id int unsigned not null,
user_id int unsigned not null,
primary key (post_id, user_id)
)
engine=innodb;
Stored Procedures
The stored procedure is relatively simple - it initially selects ALL posts within the specified period and then filters out posts as per your original requirements. I have not performance tested this sproc with large volumes but as the initial selection is relatively small it should be performant enough as well as simplifying your application/middle tier code.
drop procedure if exists list_user_filtered_posts;
delimiter #
create procedure list_user_filtered_posts
(
in p_user_id int unsigned,
in p_day_interval tinyint unsigned
)
proc_main:begin
drop temporary table if exists tmp_posts;
drop temporary table if exists tmp_priv_posts;
-- select ALL posts in the required date range (or whatever selection criteria you require)
create temporary table tmp_posts engine=memory
select
p.post_id, p.user_id, p.privacy_level, 0 as deleted
from
posts p
where
p.post_date between now() - interval p_day_interval day and now()
order by
p.user_id;
-- purge stalker posts (0,1,3,4)
update tmp_posts
inner join user_stalkers us on us.user_id = tmp_posts.user_id and us.stalker_user_id = p_user_id
set
tmp_posts.deleted = 1
where
tmp_posts.user_id != p_user_id;
-- purge other users private posts (3)
update tmp_posts set deleted = 1 where user_id != p_user_id and privacy_level = 3;
-- purge friend only posts (1) i.e where p_user_id is not a friend of the poster
/*
requires another temp table due to mysql temp table problem/bug
http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html
*/
-- the private posts (1) this user can see
create temporary table tmp_priv_posts engine=memory
select
tp.post_id
from
tmp_posts tp
inner join user_friends uf on uf.user_id = tp.user_id and uf.friend_user_id = p_user_id
where
tp.user_id != p_user_id and tp.privacy_level = 1;
-- remove private posts this user cant see
update tmp_posts
left outer join tmp_priv_posts tpp on tmp_posts.post_id = tpp.post_id
set
tmp_posts.deleted = 1
where
tpp.post_id is null and tmp_posts.privacy_level = 1;
-- purge filtered (4)
truncate table tmp_priv_posts; -- reuse tmp table
insert into tmp_priv_posts
select
tp.post_id
from
tmp_posts tp
inner join post_privacy_includes_for ppif on tp.post_id = ppif.post_id and ppif.user_id = p_user_id
where
tp.user_id != p_user_id and tp.privacy_level = 4;
-- remove private posts this user cant see
update tmp_posts
left outer join tmp_priv_posts tpp on tmp_posts.post_id = tpp.post_id
set
tmp_posts.deleted = 1
where
tpp.post_id is null and tmp_posts.privacy_level = 4;
drop temporary table if exists tmp_priv_posts;
-- output filtered posts (display ALL of these on web page)
select
p.*
from
posts p
inner join tmp_posts tp on p.post_id = tp.post_id
where
tp.deleted = 0
order by
p.post_id desc;
-- clean up
drop temporary table if exists tmp_posts;
end proc_main #
delimiter ;
Test Data
Some basic test data.
insert into users (username) values ('f00'),('bar'),('alpha'),('beta'),('gamma'),('omega');
insert into user_friends values
(1,2),(1,3),(1,5),
(2,1),(2,3),(2,4),
(3,1),(3,2),
(4,5),
(5,1),(5,4);
insert into user_stalkers values (4,1);
insert into posts (user_id, privacy_level, post_date) values
-- public (0)
(1,0,now() - interval 8 day),
(1,0,now() - interval 8 day),
(2,0,now() - interval 7 day),
(2,0,now() - interval 7 day),
(3,0,now() - interval 6 day),
(4,0,now() - interval 6 day),
(5,0,now() - interval 5 day),
-- friends only (1)
(1,1,now() - interval 5 day),
(2,1,now() - interval 4 day),
(4,1,now() - interval 4 day),
(5,1,now() - interval 3 day),
-- private (3)
(1,3,now() - interval 3 day),
(2,3,now() - interval 2 day),
(4,3,now() - interval 2 day),
-- filtered (4)
(1,4,now() - interval 1 day),
(4,4,now() - interval 1 day),
(5,4,now());
insert into post_privacy_includes_for values (15,4), (16,1), (17,6);
Testing
As I mentioned before I've not fully tested this but on the surface it seems to be working.
select * from posts;
call list_user_filtered_posts(1,14);
call list_user_filtered_posts(6,14);
call list_user_filtered_posts(1,7);
call list_user_filtered_posts(6,7);
Hope you find some of this of use.
Is there anyway I can erase all the duplicate entries from a certain table (users)? Here is a sample of the type of entries I have. I must say the table users consists of 3 fields, ID, user, and pass.
mysql_query("DELETE FROM users WHERE ???") or die(mysql_error());
randomtest
randomtest
randomtest
nextfile
baby
randomtest
dog
anothertest
randomtest
baby
nextfile
dog
anothertest
randomtest
randomtest
I want to be able to find the duplicate entries, and then delete all of the duplicates, and leave one.
You can solve it with only one query.
If your table has the following structure:
CREATE TABLE `users` (
`id` int(10) unsigned NOT NULL auto_increment,
`username` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;
you could do something like that (this will delete all duplicate users based on username with and ID greater than the smaller ID for that username):
DELETE users
FROM users INNER JOIN
(SELECT MIN(id) as id, username FROM users GROUP BY username) AS t
ON users.username = t.username AND users.id > t.id
It works and I've already use something similar to delete duplicates.
You can do it with three sqls:
create table tmp as select distinct name from users;
drop table users;
alter table tmp rename users;
This delete script (SQL Server syntax) should work:
DELETE FROM Users
WHERE ID NOT IN (
SELECT MIN(ID)
FROM Users
GROUP BY User
)
I assume that you have a structure like the following:
users
-----------------
| id | username |
-----------------
| 1 | joe |
| 2 | bob |
| 3 | jane |
| 4 | bob |
| 5 | bob |
| 6 | jane |
-----------------
Doing the magic with temporary is required since MySQL cannot use a sub-select in delete query that uses the delete's target table.
CREATE TEMPORARY TABLE IF NOT EXISTS users_to_delete (id INTEGER);
INSERT INTO users_to_delete (id)
SELECT MIN(u1.id) as id
FROM users u1
INNER JOIN users u2 ON u1.username = u2.username
GROUP BY u1.username;
DELETE FROM users WHERE id NOT IN (SELECT id FROM users_to_delete);
I know the query is a bit hairy but it does the work, even if the users table has more than 2 columns.
You need to be a bit careful of how the data in your table is used. If this really is a users table, there is likely other tables with FKs pointing to the ID column. In which case you need to update those tables to use ID you have selected to keep.
If it's just a standalone table (no table reference it)
CREATE TEMPORARY TABLE Tmp (ID int);
INSERT INTO Tmp SELECT ID FROM USERS GROUP BY User;
DELETE FROM Users WHERE ID NOT IN (SELECT ID FROM Tmp);
Users table linked from other tables
Create the temporary tables including a link table that holds all the old id's and the respective new ids which other tables should reference instead.
CREATE TEMPORARY TABLE Keep (ID int, User varchar(45));
CREATE TEMPORARY TABLE Remove (OldID int, NewID int);
INSERT INTO Keep SELECT ID, User FROM USERS GROUP BY User;
INSERT INTO Remove SELECT u1.ID, u2.ID FROM Users u1 INNER JOIN Keep u2 ON u2.User = u1.User WHERE u1.ID NOT IN (SELECT ID FROM Users GROUP BY User);
Go through any tables which reference your users table and update their FK column (likely called UserID) to point to the New unique ID which you have selected, like so...
UPDATE MYTABLE t INNER JOIN Remove r ON t.UserID = r.OldID
SET t.UserID = r.NewID;
Finally go back to your users table and remove the no longer referenced duplicates:
DELETE FROM Users WHERE ID NOT IN (SELECT ID FROM Keep);
Clean up those Tmp tables:
DROP TABLE KEEP;
DROP TABLE REMOVE;
A very simple solution would be to set an UNIQUE index on the table's column you wish to have unique values. Note that you subsequently cannot insert the same key twice.
Edit: My mistake, I hadn't read that last line: "I want to be able to find the duplicate entries".
I would get all the results, put them in an array of IDs and VALUES. Use a PHP function to work out the dupes, log all the IDs in an array, and use those values to delete the records.
I don't know your db schema, but the simplest solution seems to be to do SELECT DISTINCT on that table, keep the result in a variable (i.e. array), delete all records from the table and then reinsert the list returne by SELECT DISTINCT previously.
The temporary table is an excellent solution, but I'd like to provide a SELECT query that grabs duplicate rows from the table as an alternative:
SELECT * FROM `users` LEFT JOIN (
SELECT `name`, COUNT(`name`) AS `count`
FROM `users` GROUP BY `name`
) AS `grouped`
WHERE `grouped`.`name` = `users`.`name`
AND `grouped`.`count`>1
Select your 3 columns as per your table structure and apply condition as per your requirements.
SELECT user.userId,user.username user.password FROM user As user
GROUP BY user.userId, user.username
HAVING (COUNT(user.username) > 1));
Every answer above and/or below didn't work for me, therefore I decided to write my own little script. It's not the best, but it gets the job done.
Comments are included throughout, but this script is customized for my needs, and I hope the idea helps you.
I basically wrote the database contents to a temp file, called the temp file, applied the function to the called file to remove the duplicates, truncated the table, and then input the data right back into the SQL. Sounds like a lot, I know.
If you're confused as to what $setprofile is, it's a session that's created upon logging into my script (to establish a profile), and is cleared upon logging out.
<?php
// session and includes, you know the drill.
session_start();
include_once('connect/config.php');
// create a temp file with session id and current date
$datefile = date("m-j-Y");
$file = "temp/$setprofile-$datefile.txt";
$f = fopen($file, 'w'); // Open in write mode
// call the user and pass via SQL and write them to $file
$sql = mysql_query("SELECT * FROM _$setprofile ORDER BY user DESC");
while($row = mysql_fetch_array($sql))
{
$user = $row['user'];
$pass = $row['pass'];
$accounts = "$user:$pass "; // the white space right here is important, it defines the separator for the dupe check function
fwrite($f, $accounts);
}
fclose($f);
// **** Dupe Function **** //
// removes duplicate substrings between the seperator
function uniqueStrs($seperator, $str) {
// convert string to an array using ' ' as the seperator
$str_arr = explode($seperator, $str);
// remove duplicate array values
$result = array_unique($str_arr);
// convert array back to string, using ' ' to glue it back
$unique_str = implode(' ', $result);
// return the unique string
return $unique_str;
}
// **** END Dupe Function **** //
// call the list we made earlier, so we can use the function above to remove dupes
$str = file_get_contents($file);
// seperator
$seperator = ' ';
// use the function to save a unique string
$new_str = uniqueStrs($seperator, $str);
// empty the table
mysql_query("TRUNCATE TABLE _$setprofile") or die(mysql_error());
// prep for SQL by replacing test:test with ('test','test'), etc.
// this isn't a sufficient way of converting, as i said, it works for me.
$patterns = array("/([^\s:]+):([^\s:]+)/", "/\s++\(/");
$replacements = array("('$1', '$2')", ", (");
// insert the values into your table, and presto! no more dupes.
$sql = 'INSERT INTO `_'.$setprofile.'` (`user`, `pass`) VALUES ' . preg_replace($patterns, $replacements, $new_str) . ';';
$product = mysql_query($sql) or die(mysql_error()); // put $new_str here so it will replace new list with SQL formatting
// if all goes well.... OR wrong? :)
if($product){ echo "Completed!";
} else {
echo "Failed!";
}
unlink($file); // delete the temp file/list we made earlier
?>
This will work:
create table tmp like users;
insert into tmp select distinct name from users;
drop table users;
alter table tmp rename users;
If you have a Unique ID / Primary key on the table then:
DELETE FROM MyTable AS T1
WHERE MyID <
(
SELECT MAX(MyID)
FROM MyTable AS T2
WHERE T2.Col1 = T1.Col1
AND T2.Col2 = T1.Col2
... repeat for all columns to consider duplicates ...
)
if you don't have a Unique Key select all distinct values into a temporary table, delete all original rows, and copy back from temporary table - but this will be problematic if you have Foreign Keys referring to this table