MySQL: will it take long to read through a database? - php

Let's say I have a table called comments in my database. Each comment has an id, an assigned post, and its data (the actual text submitted by the user). Say I have 10,000 posts and each post has 10 comments, for a total of 100,000 comments, and a user loads up a post and its comments.
Will the code I use to load the comments read through all one hundred thousand comments to pick out the ones with a postID of (let's say) 11, or will it instantly see that there are 10 comments with a postID of 11 and pick them out (which would be a lot quicker than looking through 100,000 comments)? What does MySQL do in this scenario?

I recommend the following in the table definition (along with other columns):
post_id INT UNSIGNED NOT NULL, -- for JOINing to Posts table
comment_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
PRIMARY KEY(post_id, comment_id), -- clusters all the comments together for a post
INDEX(comment_id) -- keeps AUTO_INCREMENT happy
And use ENGINE=InnoDB so that the PRIMARY KEY is clustered with the data.
SELECT ...
WHERE post_id = 123
AND comment_id > 0 -- see note below
ORDER BY comment_id ASC
LIMIT 10
With this, all 10 comments will come from one block (or possibly a very small number of blocks), and will thereby be as efficient as possible.
Since your next question will be about finding the 'next' 10 comments, I have jumped ahead: remember where you left off with the first 10 comments, then change the 0 in comment_id > 0 to the last comment_id found.
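The "remember where you left off" pattern above can be sketched in Python, using a sorted in-memory list as a stand-in for the clustered (post_id, comment_id) index. The names (fetch_page, comments) and data are illustrative, not from the answer itself:

```python
# Comments sorted the way the clustered primary key (post_id, comment_id)
# would store them: all of a post's comments together, in id order.
comments = sorted(
    (11, cid, f"comment {cid}") for cid in range(1, 31)
)  # (post_id, comment_id, text)

def fetch_page(post_id, last_comment_id, limit=10):
    """Mirror of: WHERE post_id = ? AND comment_id > ?
    ORDER BY comment_id ASC LIMIT ?"""
    rows = [r for r in comments if r[0] == post_id and r[1] > last_comment_id]
    return rows[:limit]

page1 = fetch_page(11, 0)                 # first page: comment_id > 0
page2 = fetch_page(11, page1[-1][1])      # resume from the last id seen
```

Each page query starts from the last comment_id already shown, so no rows are scanned and discarded the way a growing OFFSET would.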
I have given you a lot of keywords to research and understand. Please study the manuals before coming back with your next question.

Related

Database logic for conversation (like Forum) PHP, MySql

I want to create a page where it is possible to create topics and users are able to comment.
I created the table discussion with a recursive relation.
I don't know if the idea is good.
How can I find the id of the parent when a user comments?
I don't know if this is clear...
Below you will find 3 screenshots to explain the situation better.
I would not have the parentID point to the comment that precedes it; that's bad practice all around. The parentID should point to the topic, and then when selecting all comments from the database, make a query that orders them by time or by ID to see what order they were posted in.
For example...
SELECT * FROM `discussion` WHERE `parentID` = 1 ORDER BY `time` DESC;
The parentID should be null for topics and not null (integer) for replies (the integer should point to the ID of the topic). I hope I understood what you were asking correctly.

need some guidance on a php point system

I am trying to build a point system which checks how many points a user has and gives them a specific title.
I have prepared a table which the php script can refer to when checking which title should be given to a member.
MYSQL Table structure as follows:
name: ptb
structure: pts , title
For example, if you have 100 points you gain the title "Veteran", and if you have 500 points you gain the title "Pro". Let's say I have the rows pts:100, title:veteran and pts:500, title:pro in the ptb table.
However, I have stumbled upon a confusing fact.
How can I use PHP to determine which title to give the user from the ptb table data?
A user with 100 or more points should gain the title Veteran, BUT 500 is also MORE THAN 100, which means the PHP script also needs to make sure the user is below 500 pts.
I am still not sure how to do this in PHP, as I am confused myself.
I hope someone could understand and provide me some guidelines.
THANKS!
You select all records with enough points, sort the one with the highest score to the top, and cut out the rest.
SELECT title FROM ptb WHERE pts <= $points ORDER BY pts DESC LIMIT 1
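The logic of that one-row query can be sketched in Python; the ptb rows mirror the question's example and best_title is an illustrative name, not anything from the answer:

```python
# Rows from the ptb table as (pts, title) pairs.
ptb = [(100, "Veteran"), (500, "Pro")]

def best_title(points):
    """Mirror of: SELECT title FROM ptb WHERE pts <= ?
    ORDER BY pts DESC LIMIT 1 -- None plays the role of an empty result."""
    eligible = [row for row in ptb if row[0] <= points]
    if not eligible:
        return None
    # The highest threshold the user still qualifies for wins.
    return max(eligible, key=lambda row: row[0])[1]
```

A user with 120 points gets "Veteran", one with 600 gets "Pro", and one with 50 gets no title; there is no need to check an upper bound explicitly, because sorting descending and taking one row does it implicitly.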
(PiTheNumber's solution doesn't work very well if you want to retrieve titles for multiple users)
Since the points will change over time and multiple users can have the same title, it sounds like this should be 2 tables:
CREATE TABLE users (
userid ...whatever type,
points INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY(userid)
);
CREATE TABLE titles (
title VARCHAR(50),
minpoints INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (title),
UNIQUE INDEX (minpoints)
);
Then....
SELECT u.userid, u.points, t.title
FROM users u, titles t
WHERE u.points>=t.minpoints
AND ....other criteria for filtering output....
AND NOT EXISTS (
SELECT 1
FROM titles t2
WHERE t2.minpoints<=u.points
AND t2.minpoints>t.minpoints
);
(there are other ways to write the query)
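The NOT EXISTS query picks, for each user, the qualifying title with the highest threshold. A Python sketch of the same logic, with made-up data mirroring the two-table design:

```python
# users: userid -> points; titles: (title, minpoints) rows.
users = {"alice": 600, "bob": 120, "carol": 50}
titles = [("Veteran", 100), ("Pro", 500)]

def titles_for_users(users, titles):
    """For each user, keep the title with the highest minpoints
    that is still <= the user's points (users below every
    threshold get no title, like the inner join)."""
    result = {}
    for userid, points in users.items():
        qualifying = [(t, m) for t, m in titles if m <= points]
        if qualifying:
            result[userid] = max(qualifying, key=lambda tm: tm[1])[0]
    return result
```

The NOT EXISTS clause in the SQL does the same job as the max() here: it rejects any candidate title for which a higher qualifying threshold exists.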

Storing user activity? PHP, MySQL and Database Design

Ok, so a user comes to my web application and gets points and the like for activity, sort of similar to (but not as complex as) this site. They can vote, comment, submit, favorite, vote for comments, write descriptions and so on.
At the moment I store a user action in a table against a date like so
Table user_actions
action_id - PK AI int
user_id - PK int
action_type - varchar(20)
date_of_action - datetime
So for example if a user comes along and leaves a comment or votes on a comment, then the rows would look something like this
action_id = 4
user_id = 25
action_type = 'new_comment'
date_of_action = '2011-11-21 14:12:12';
action_id = 5
user_id = 25
action_type = 'user_comment_vote'
date_of_action = '2011-12-01 14:12:12';
All good, I hear you say. But not quite: remember that these rows would reside in the user_actions table, which is a different table to the ones in which the comments and user comment votes are stored.
So how do I know what comment links to what row in the user_actions?
Well, I could just link the unique comment_id from the comments table to a new column, called target_primary_key, in the user_actions table?
Nope. Can't do that, because the action could equally have been a user_comment_vote, which has a composite key (double key).
So the thought I am left with is: do I just add the primary keys in a column, comma-delimit them, and let PHP parse it out?
So taking the example above, the lines below show how I would store the target primary keys
new_comment
target_primary_keys - 12 // the unique comment_id from the comments table
user_comment_vote
target_primary_keys - 22,12 // the unique comment_id from the comments table
So basically a user makes an action, the user_actions is updated and so is the specific table, but how do I link the two while still allowing for multiple keys?
Has anyone had experience with storing user activity before?
Any thoughts are welcome, no wrong answers here.
You do not need a user actions table.
To calculate the "score" you can run one query over multiple tables and multiply the count of matching comments, ratings, etc. by a multiplier (25 points for a comment, 10 for a rating, ...).
To speed up your page you can store the total score in an extra table or the user table and refresh the total score with triggers if the score changes.
If you want to display the number of ratings or comments you can do the same.
Get the details from the existing tables and store the total number of comments and ratings in an extra table.
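The multiplier idea can be sketched in a few lines of Python; the weights and activity names below are made up for illustration, not taken from the answer:

```python
# Points awarded per row in each activity table (illustrative values).
WEIGHTS = {"comments": 25, "ratings": 10, "favorites": 5}

def total_score(counts, weights=WEIGHTS):
    """counts: mapping of activity type -> number of rows for one user.
    Unknown activity types simply contribute nothing."""
    return sum(weights.get(kind, 0) * n for kind, n in counts.items())
```

In SQL this would be one query with a COUNT(*) per activity table, each multiplied by its weight; caching the result in an extra column, refreshed by a trigger, avoids recomputing it on every page view.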
The simplest answer is to just use another table, which can contain multiple matches for any key and allow great indexing options:
create table users_to_actions (
user_id int(20) not null,
action_id int(20) not null,
action_type varchar(25) not null,
category_or_other_criteria ...
);
create index uta_u_a on users_to_actions(user_id, action_id);
To expand on this a bit, you would then select items by joining them with this table:
select
*
from
users_to_actions as uta join comments as c using(action_id)
where
uta.action_type = 'comment' and user_id = 25
order by
c.post_date
Or maybe a nested query depending on your needs:
select * from users where user_id in(
select
user_id
from
users_to_actions uta
where
uta.action_type = 'comment'
);

MySQL and NoSQL: Help me to choose the right one

There is a big database, 1,000,000,000 rows, called threads (these threads actually exist; I'm not making things harder just because I enjoy it). Threads has only a few fields in it, to make things faster: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)
Query:
select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100
Since there are 1G of records, it's quite a slow query. So I thought: let's split this 1G of records into as many tables as I have forums (categories)! That is almost perfect. Having many tables, I have fewer records to search through and it's really faster. The query now becomes:
select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100
This is really faster with 99% of the forums (categories), since most of those have only a few topics (100k-1M). However, because there are some with about 10M records, some queries are still too slow (0.1/0.2 seconds, too much for my app! I'm already using indexes!).
I don't know how to improve this using MySQL. Is there a way?
For this project I will use 10 Servers (12GB ram, 4x7200rpm hard disk on software raid 10, quad core)
The idea was to simply split the databases among the servers, but with the problem explained above that is still not enough.
If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?
What should I do? Keep working with MySQL with a distributed database on multiple machines, or build a Cassandra cluster?
I was asked to post what are the indexes, here they are:
mysql> show index in thread;
PRIMARY id
forumid
dateline
replycount
Select explain:
mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| 1 | SIMPLE | thread | ref | forumid | forumid | 4 | const,const | 221575 | Using where; Using filesort |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
You should read the following and learn a little about the advantages of a well designed InnoDB table and how best to use clustered indexes (only available with InnoDB!):
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
then design your system something along the lines of the following simplified example:
Example schema (simplified)
The important features are that the tables use the innodb engine and the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id. e.g.
threads - primary key (forum_id, thread_id)
forum_id thread_id
======== =========
1 1
1 2
1 3
1 ...
1 2058300
2 1
2 2
2 3
2 ...
2 2352141
...
Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.
forum_id title next_thread_id
======== ===== ==============
1 forum 1 2058300
2 forum 2 2352141
3 forum 3 2482805
4 forum 4 3740957
...
64 forum 64 3243097
65 forum 65 15000000 -- ooh a big one
66 forum 66 5038900
67 forum 67 4449764
...
247 forum 247 0 -- still loading data for half the forums !
248 forum 248 0
249 forum 249 0
250 forum 250 0
The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:
select * from threads where thread_id = y;
you have to do:
select * from threads where forum_id = x and thread_id = y;
However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...
Here's the simplified schema:
drop table if exists forums;
create table forums
(
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum
)engine=innodb;
drop table if exists threads;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index
)engine=innodb;
delimiter #
create trigger threads_before_ins_trig before insert on threads
for each row
begin
declare v_id int unsigned default 0;
select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
set new.thread_id = v_id;
update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#
delimiter ;
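The trigger above implements a per-forum counter: each forum hands out its own thread_ids starting at 1. A minimal in-memory Python sketch of that behaviour, where a dict plays the role of the forums.next_thread_id column:

```python
# forum_id -> last issued thread_id (the forums.next_thread_id column).
next_thread_id = {}

def insert_thread(forum_id):
    """Mimic threads_before_ins_trig: read the forum's counter, bump it,
    and use the new value as the thread's id within that forum."""
    new_id = next_thread_id.get(forum_id, 0) + 1
    next_thread_id[forum_id] = new_id
    return (forum_id, new_id)
```

Each forum's sequence is independent, which is what lets the composite (forum_id, thread_id) key hold 4 billion threads per forum rather than 4 billion in total.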
You may have noticed I've included reply_count as part of the primary key, which is a bit strange as the (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O for queries that use reply_count. Please refer to the 2 links above for further info on this.
Example queries
I'm still loading data into my example tables and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:
250 forums * 5 million threads = 1,250,000,000 (1.25 billion rows)
I've deliberately made some of the forums contain more than 5 million threads for example, forum 65 has 15 million threads:
forum_id title next_thread_id
======== ===== ==============
65 forum 65 15000000 -- ooh a big one
Query runtimes
select sum(next_thread_id) from forums;
sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)
Under InnoDB, summing the next_thread_ids to give a total thread count is much faster than the usual:
select count(*) from threads;
How many threads does forum 65 have:
select next_thread_id from forums where forum_id = 65
next_thread_id
==============
15,000,000 (15 million)
again this is faster than the usual:
select count(*) from threads where forum_id = 65
Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;
runtime = 0.022 secs
select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;
runtime = 0.027 secs
Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)
Further optimisations
These would include:
partitioning by range
sharding
throwing money and hardware at it
etc...
hope you find this answer helpful :)
EDIT: Your one-column indices are not enough. You would need, at the very least, an index covering the three involved columns.
More advanced solution: replace replycount > 1 with hasreplies = 1 by creating a new hasreplies field that equals 1 when replycount > 1. Once this is done, create an index on the three columns, in that order: INDEX(forumid, hasreplies, dateline). Make sure it's a BTREE index to support ordering.
You're selecting based on:
a given forumid
a given hasreplies
ordered by dateline
Once you do this, your query execution will involve:
moving down the BTREE to find the subtree that matches forumid = X. This is a logarithmic operation (duration : log(number of forums)).
moving further down the BTREE to find the subtree that matches hasreplies = 1 (while still matching forumid = X). This is a constant-time operation, because hasreplies is only 0 or 1.
moving through the dateline-sorted subtree in order to get the required results, without having to read and re-sort the entire list of items in the forum.
My earlier suggestion to index on replycount was incorrect, because it would have been a range query and thus prevented the use of a dateline to sort the results (so you would have selected the threads with replies very fast, but the resulting million-line list would have had to be sorted completely before looking for the 100 elements you needed).
IMPORTANT: while this improves performance in all cases, your huge OFFSET value (10000!) is going to decrease performance, because MySQL does not seem to be able to skip ahead despite reading straight through a BTREE. So, the larger your OFFSET is, the slower the request will become.
I'm afraid the OFFSET problem is not automagically solved by spreading the computation over several computations (how do you skip an offset in parallel, anyway?) or moving to NoSQL. All solutions (including NoSQL ones) will boil down to simulating OFFSET based on dateline (basically saying dateline > Y LIMIT 100 instead of LIMIT Z, 100 where Y is the date of the item at offset Z). This works, and eliminates any performance issues related to the offset, but prevents going directly to page 100 out of 200.
Part of the question relates to the NoSQL vs. MySQL choice, and there is one fundamental thing hidden here. SQL is easy for a human to write but a bit difficult for a computer to read. In high volume databases I would recommend avoiding an SQL backend, as it requires an extra step: command parsing. I have done extensive benchmarking, and there are cases where the SQL parser is the slowest point. There is nothing you can do about it, though you can possibly use pre-parsed statements.
BTW, it is not widely known, but MySQL grew out of a NoSQL database. The company where the authors of MySQL, David and Monty, worked was a data warehousing company, and they often had to write custom solutions for uncommon tasks. This led to a big stack of homebrew C libraries used to manually write database functions when Oracle and others were performing poorly. SQL was added to this nearly 20-year-old zoo in 1996, for fun. What came after, you know.
Actually, you can avoid the SQL overhead with MySQL. Usually SQL parsing is not the slowest part, but it is good to know about. To test parser overhead you can just benchmark "SELECT 1", for example ;).
You should not be trying to fit a database architecture to hardware you're planning to buy, but instead plan to buy hardware to fit your database architecture.
Once you have enough RAM to keep the working set of indexes in memory, all your queries that can make use of indexes will be fast. Make sure your key buffer is set large enough to hold the indexes.
So if 12GB is not enough, don't use 10 servers with 12GB of RAM, use fewer with 32GB or 64GB of RAM.
Indices are a must - but remember to choose the right type of index: BTREE is more suitable when using queries with "<" or ">" in your WHERE clauses, while HASH is more suitable when you have many distinct values in one column and you are using "=" or "<=>" in your WHERE clause.
Further reading http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

How to implement related posts feature without tagging each post?

We have 2 text fields ('post_text' & 'post_slug') in our database.
Let's say, post_text = "Hello World!", so its post_slug = "hello-world".
How can I implement a related posts feature without tagging each post, using only the existing fields? (PHP, MySQL)
P.S. the database contains a lot of posts.
Just a note: you definitely do not want to compute your related posts on the fly every time you display a page because on a large database that would be somewhere between too much work and impossible.
You would normally periodically run whatever algorithm you are using and save the information in a separate table referencing your original table. Displaying a post would then be just a simple join.
Check out the similar_text function:
http://jp2.php.net/manual/en/function.similar-text.php
Or maybe split each text into words by spaces and calculate the relation with your own algorithm.
And if you're able to add a table to MySQL, you should create a table that holds the calculated relation of each pair of posts.
CREATE TABLE blog_table.`posts_relation` (
`post_id` INT UNSIGNED NOT NULL ,
`related_post_id` INT UNSIGNED NOT NULL ,
`relation` FLOAT UNSIGNED NOT NULL ,
INDEX ( `post_id` , `related_post_id` )
)
Update it each time you add a post, or maybe once a day, and grab your results with something like:
SELECT posts.* FROM posts, posts_relation WHERE posts_relation.post_id = {$post_id} AND posts.post_id = posts_relation.related_post_id ORDER BY posts_relation.relation DESC LIMIT 5
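One "own algorithm" along the lines suggested above is a simple word-overlap (Jaccard) score between two posts' texts; this is an illustrative Python sketch, not the similar_text algorithm:

```python
def relation(text_a, text_b):
    """Split on whitespace and score the word overlap between two texts:
    |A ∩ B| / |A ∪ B|, in [0.0, 1.0]. Case-insensitive."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

The precomputed relation value for each (post_id, related_post_id) pair is what would go into the posts_relation table, so page loads only do the simple ORDER BY relation DESC join.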
