I'm working on a commenting application and I would like some feedback on the method I am using to keep track of the number of replies or likes that a comment has. Comments and replies are stored in the same table; to determine whether a comment is a reply I use the parent_id field: if it is anything other than 0, the comment is a reply.
Please note that I won't be including all the columns of the table below:
cid | parent_id | replies | likes
----+-----------+---------+------
  2 |         0 |       3 |     0
  3 |         2 |       0 |     0
  4 |         2 |       0 |     2
  5 |         2 |       0 |     0
In the table above, the comments with id (cid) 3, 4, and 5 are replies to comment #2. The replies and likes columns are integers that hold the count of replies and likes respectively. The integrity and accuracy of these columns is maintained and updated through the PHP code: for example, if another reply to comment #2 is added, the replies column is increased by one, and decreased by one if a reply is deleted.
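For instance, the counter maintenance could be as simple as this (a minimal sketch; the table is named comments, as in the query further down):

-- after inserting a new reply whose parent_id = 2
UPDATE comments SET replies = replies + 1 WHERE cid = 2;

-- after deleting one of comment #2's replies
UPDATE comments SET replies = replies - 1 WHERE cid = 2;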
I'm also aware that I could dynamically calculate the reply count in the SQL query that fetches the comments, but I thought it would add more stress to the SQL server. That query would look something like this:
SELECT cid, parent_id, (
    SELECT COUNT(*)
    FROM comments AS RC
    WHERE RC.parent_id = C.cid
) AS replies
FROM comments AS C
WHERE thread = {thread_id}
Am I doing it right by storing the replies and likes in actual columns in the table? Or am I exaggerating the stress that a query such as the one above would place on the MySQL server, and should I use the more complex query instead?
Any feedback would be appreciated, thanks
I don't think you need the column called 'replies'; it just occupies additional unwanted space.
Do a combined index on parent_id and cid (see the sketch below). That should be good enough, and queries should be fast.
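A minimal sketch of that index on the comments table from the question; note it leads with parent_id so the reply-count lookup can actually use it:

CREATE INDEX idx_parent_cid ON comments (parent_id, cid);

With this in place, the correlated COUNT(*) subquery from the question can be resolved entirely from the index, so computing the reply count on the fly stays cheap.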
By having the column, you are adding more stress to both the app code and MySQL (the app code for maintaining integrity, and MySQL because there are two writes in place of one whenever a comment is entered).
But if you are talking about millions of rows, I wouldn't choose MySQL for it, but rather Mongo: the data can be structured as a beautiful JSON document and dumped into Mongo.
I am finalizing a comments system and have a question.
I have a table for blogs and one for news, and both accept comments.
My comments table receives the text and the id.
I wonder if I need to (or should) keep some sort of reference to know where each comment comes from.
table comment
id | id_content | text | ref
 1 |          1 | test | blog
 2 |          1 | test | news
thanks
Depending on the number of comments you expect to receive, there are two ways of doing this:
1 - parent_tbl, parent_id columns in one big comment table
2 - two comment tables with a parent_id, one for each primary table
Either way you need to index properly. The second will always work faster, but it doesn't expand well: if you later add, say, "press_releases", you have to duplicate code, tables, and whatnot. A sketch of option 1 is below.
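A minimal sketch of option 1; the exact column types and the ENUM values are assumptions for illustration:

CREATE TABLE comment (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    parent_tbl ENUM('blog', 'news') NOT NULL,  -- which primary table the comment belongs to
    parent_id INT UNSIGNED NOT NULL,           -- id of the row in that table
    text TEXT NOT NULL,
    INDEX idx_parent (parent_tbl, parent_id)   -- covers per-parent lookups
);

-- all comments for blog post #1
SELECT id, text FROM comment WHERE parent_tbl = 'blog' AND parent_id = 1;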
I have a reporting table where I store descriptions:
tableA
 sno | Project | name  | description        | mins
   1 | prjA    | nameA | ABC -10% task done |   30
 ...
3000 | prjA    | nameB | ABC -70% task done |   70
I want to query the description field and save the result in another table:
tableB
id | valueStr | total_mins | last_sno
 1 | ABC      |        100 |     3000
If there is no entry in the second table, I create one with default values.
If there is an entry in the second table, I update it with the new total_mins and set last_sno to the latest sno processed, say 3300, so that the next time I query the first table I only read rows after last_sno.
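That insert-or-update step could be a single statement; a sketch, assuming a UNIQUE key on valueStr (which the question does not state):

INSERT INTO tableB (valueStr, total_mins, last_sno)
VALUES ('ABC', 100, 3000)
ON DUPLICATE KEY UPDATE
    total_mins = total_mins + VALUES(total_mins),
    last_sno = VALUES(last_sno);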
Query:
SELECT last_sno FROM tableB WHERE valueStr = 'ABC'
('ABC' being the first 3 characters of the description field), then:
SELECT MAX(sno), SUM(mins) FROM tableA
WHERE sno > last_sno AND description LIKE 'ABC%'
Since the first table has millions of rows, I search it with sno > last_sno; that should help performance, right?
But EXPLAIN shows that it scans the same number of rows as when I query the first table from the very first sno.
The use of the index may not help you, because MySQL still has to scan the index from last_sno to the end of the data. You would be better off with an index on tableA(description), because such an index can be used for description LIKE 'ABC%'.
In fact, this might be a case where the index can hurt you. Instead of sequentially reading the pages in the table, the index reads them randomly -- which is less efficient.
EDIT: (too long for comment)
Try running the query with an IGNORE INDEX hint to see whether it runs better without the index. It is possible that the index is actually making things worse.
However, the "real" solution is to store the prefix you are interested in as a separate column. You can then add an index on this column, and the query will work efficiently with basic SQL. You won't have to spend your time trying to optimize a simple process, because the data will be stored correctly for it. Both ideas are sketched below.
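A sketch of both suggestions; the index name idx_sno, the prefix length of 3, and the literal last_sno value are assumptions for illustration:

-- check whether the query runs better without the sno index
SELECT MAX(sno), SUM(mins)
FROM tableA IGNORE INDEX (idx_sno)
WHERE sno > 3000 AND description LIKE 'ABC%';

-- store the prefix in its own indexed column
ALTER TABLE tableA ADD COLUMN prefix VARCHAR(3);
UPDATE tableA SET prefix = LEFT(description, 3);
CREATE INDEX idx_prefix_sno ON tableA (prefix, sno);

-- the aggregate can now be driven by the prefix index
SELECT MAX(sno), SUM(mins)
FROM tableA
WHERE prefix = 'ABC' AND sno > 3000;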
DETAILS
I have a quiz (let’s call it quiz1). Quiz1 uses the same wordlist each time it is generated.
If the user needs to, they can skip words to complete the quiz. I'd like to store those skipped words in MySQL and then later perform statistics on them.
At first I was going to store the missed words in one column as a string, with each word separated by a comma.
| testid | missedwords                   | score | userid |
*************************************************************
| quiz1  | wordlist,missed,skipped,words |    59 |      1 |
| quiz2  | different,quiz,list           |    65 |      1 |
The problem with this approach is that I want to show statistics at the end of each quiz about which words were most frequently missed by users who took quiz1.
I'm assuming that storing missed words in one column as above is inefficient for this purpose, as I'd need to extract the information and then tally it (probably in PHP, unless I stored the tallied data in a separate table).
I then thought perhaps I should create a separate table for the missed words.
The advantage of the table below is that it should be easy to tally the words from it.
| Instance | missed word |
*****************************
| 1        | wordlist    |
| 1        | missed      |
| 1        | skipped     |
Another approach
I could create a table with tallies and update it each time quiz1 is taken.
Testid | wordlist | missed | skipped | otherword
**************************************************
Quiz1  |        1 |      1 |       1 |         0
The problem with this approach is that I would need a different table for each quiz, because each quiz will use different words. Information is also lost, because only the tally is kept, not related data such as which user missed which words.
Question
Which approach would you use? Why? Alternative approaches to this task are welcome. If you see any flaws in my logic, please feel free to point them out.
EDIT
Users will be able to retake the quiz as many times as they like. Their information will not be updated; instead, a new instance will be created for each retake.
The best way to do this is to have the word collection completely normalized. This way, analyses will be easy and fast.
quiz_words with wordID, word
quiz_skipped_words with quizID, userID, wordID
To get all the skipped words of a user:
SELECT wordID, word
FROM quiz_words
JOIN quiz_skipped_words USING (wordID)
WHERE userID = ?;
You could add a GROUP BY clause to get per-word counts; see the sketch below.
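A sketch of that GROUP BY variant, using the same two tables:

SELECT word, COUNT(*) AS times_skipped
FROM quiz_words
JOIN quiz_skipped_words USING (wordID)
WHERE userID = ?
GROUP BY wordID, word
ORDER BY times_skipped DESC;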
To get the number of times a specific word was skipped:
SELECT COUNT(*)
FROM quiz_skipped_words
JOIN quiz_words USING (wordID)
WHERE word = ?;
According to database normalization theory, the second approach is better, because ideally one relational table cell should store only one value, which is atomic and unsplittable. Each word is an entity instance.
Also, I might suggest not creating per-quiz word tables, but instead reserving another column in the missed-word table for the quiz in which the word was missed, and using that column as a foreign key to the quiz table. That way you can avoid generating tables at run time (which is a "bad practice" in database design). A sketch follows.
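A sketch of that single-table variant; the names and types here are illustrative:

CREATE TABLE missed_word (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    quiz_id INT UNSIGNED NOT NULL,  -- which quiz the word was missed in
    word VARCHAR(64) NOT NULL,
    FOREIGN KEY (quiz_id) REFERENCES quiz (id)
);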
Why not have a quiz table and a quiz_words table? The quiz_words table would store id, quizID, word as columns. Then for each quiz instance, create records in the quiz_words table for each word the user skipped.
You could then run MySQL counts on the quiz_words table based on quizID and/or quiz type.
The best solution (from my point of view) for what you are trying to achieve is the normalized approach:
test table, which has a test_id column and other columns
missed_words table, which has id (AI PK) and word (UQ); here you can also have a hits column that is incremented each time an association to the word is made in the test_missed_words table. This way you have the stats you want already compiled, and you don't need to calculate them with a SELECT query.
test_missed_words, which is a link table that has test_id and missed_word_id (composite PK)
This way you do not store redundant data (the missed words) and you can easily extract the stats you want. A sketch of this schema follows.
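A minimal sketch of that schema; the column types and the two-statement update are assumptions for illustration:

CREATE TABLE test (
    test_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    -- other columns ...
);

CREATE TABLE missed_words (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    word VARCHAR(64) NOT NULL UNIQUE,
    hits INT UNSIGNED NOT NULL DEFAULT 0  -- pre-compiled skip count
);

CREATE TABLE test_missed_words (
    test_id INT UNSIGNED NOT NULL,
    missed_word_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (test_id, missed_word_id)
);

-- when a word is skipped: link it, then bump the compiled counter
INSERT INTO test_missed_words (test_id, missed_word_id) VALUES (?, ?);
UPDATE missed_words SET hits = hits + 1 WHERE id = ?;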
Keeping as much information as possible (and being able to compile user-specific stats later as well as overall stats now), I would create a table structure similar to:
Stats
quizId | userId | type    | wordId
******************************************
     1 |      1 | missed  |      4
     1 |      1 | skipped |      7
Here type can either be an int identifying the different types of actions, or a string representation, depending on whether you believe there can ever be more. ^^
Then:
Quizzes
quizId | quizName
********************
     1 | Quiz 1
With the word list made for each quiz like:
WordList (pk: wordId)
quizId | wordId | word
***************************
     1 |      1 | Cat
     1 |      2 | Dog
You would have your user table however you want; we are just linking its id into this system.
With this, all id fields will be non-unique keys in the stats table. When a user skips or misses a word, you add the id of that word to the stats table along with the relevant quizId and type. Getting stats this way is easy on a per-user, per-word, or per-type basis, or any combination of the three (see the sketch below). It also makes the word list for each quiz readily available for building the quizzes. ^^
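A sketch of one such stats query against the tables above (the 'missed' value follows the example rows):

-- most frequently missed words for quiz 1, across all users
SELECT w.word, COUNT(*) AS times_missed
FROM Stats s
JOIN WordList w ON w.quizId = s.quizId AND w.wordId = s.wordId
WHERE s.quizId = 1 AND s.type = 'missed'
GROUP BY w.wordId, w.word
ORDER BY times_missed DESC;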
Hope this helps!
Although I have researched this topic and came across a few solutions like using a LEFT JOIN or subqueries, I am still unable to get the result I want, as I am not strong in MySQL. I am more of a web designer trying to use simple PHP to make my website better for a school project.
I am trying to create a web application similar to a blog. I want to count how many comments there are for a post and display the number for my users to see, but if there are no comments for a post, my query returns nothing instead of 0.
This is my query below:
SELECT post.post_id, COUNT(comment)
FROM `comment`, post
WHERE `comment`.post_id = post.post_id
GROUP BY post.post_id
The result:
Record | post_id | COUNT(comment)
     1 |      12 |              2
     2 |      13 |              1
     3 |      15 |              1
     4 |      16 |              1
As you can see, post_id 14 has no comments, so my query returns nothing for it. What must I do to make my result look like this?
Record | post_id | COUNT(comment)
     1 |      12 |              2
     2 |      13 |              1
     3 |      14 |              0
     4 |      15 |              1
     5 |      16 |              1
Also, it would be nice of you guys to give me references or links to understand the concept behind the solution, as I want to learn more about PHP :)
So actually, when you do that (which is what you do, reformulated as an explicit JOIN):
SELECT post.post_id, COUNT(comment)
FROM `comment`
INNER JOIN post ON `comment`.post_id = post.post_id
GROUP BY post.post_id;
You gather only post rows having at least one reference in comment.
If you alter the JOIN type to a LEFT JOIN, with post on the left side, this way:
SELECT post.post_id, COUNT(comment)
FROM post
LEFT JOIN `comment` ON `comment`.post_id = post.post_id
GROUP BY post.post_id;
Then the rows from post are all there, and NULL values are filled in for the comment columns when no related comment exists (that's a left join). So if comment is a column of the comment table, it will be present for every row of the post table, but with a NULL value when there is no comment. After the GROUP BY on post_id, the subset of comments related to such a post contains only one NULL value, and since COUNT ignores NULLs, the count returns 0.
SELECT COUNT(NULL);
returns 0.
Now, you could use a subquery instead, but that's often a bad idea. Subqueries are usually written where a LEFT JOIN would do; sometimes that's justified, but it's really often a mistake. When you do a left join, indexes are used to compare the key values of the two tables (the ON clause) and build one final 'temporary' result of rows mixing values from both tables (and then, or maybe at the same time, the filters from the other parts of your query are applied). When you use a correlated subquery, a new query is run against the second table for each row of the first table (not always, but that's another subject), so the cost is really much bigger for the database engine.
Query the post table and do a subquery for the count on the comment table.
SELECT post.post_id, (
    SELECT COUNT(comment)
    FROM `comment`
    WHERE `comment`.post_id = post.post_id
) AS comments
FROM post
This may get extremely slow with lots of rows, so add a limit with a pager when you get to that point.
This is for an upcoming project. I have two tables: the first one keeps track of photos, and the second one keeps track of each photo's rank.
Photos:
+-------+-----------+------------------+
| id | photo | current_rank |
+-------+-----------+------------------+
| 1 | apple | 5 |
| 2 | orange | 9 |
+-------+-----------+------------------+
The photo rank keeps changing on a regular basis, and this is the table that tracks it:
Ranks:
+-------+-----------+----------+-------------+
| id | photo_id | ranks | timestamp |
+-------+-----------+----------+-------------+
| 1 | 1 | 8 | * |
| 2 | 2 | 2 | * |
| 3 | 1 | 3 | * |
| 4 | 1 | 7 | * |
| 5 | 1 | 5 | * |
| 6 | 2 | 9 | * |
+-------+-----------+----------+-------------+ * = current timestamp
Every rank is tracked for reporting/analysis purposes.
[Edit] Users will have access to the statistics on demand.
I talked to someone who has experience in this field, and he told me that storing ranks like above is the way to go. But I'm not so sure yet.
The problem here is data redundancy. There are going to be tens of thousands of photos. The photo rank changes on an hourly basis (often within minutes) for recent photos, but less frequently for older photos. At this rate the table will have millions of records within months. And since I do not have experience working with large databases, this makes me a little nervous.
I thought of this:
Ranks:
+-------+-----------+--------------------+
| id | photo_id | ranks |
+-------+-----------+--------------------+
| 1 | 1 | 8:*,3:*,7:*,5:* |
| 2 | 2 | 2:*,9:* |
+-------+-----------+--------------------+ * = current timestamp
That means some extra code in PHP to split the rank/time pairs (and sort them), but that looks OK to me.
Is this a correct way to optimize the table for performance? What would you recommend?
The first one. Period.
Actually, with the second design you'll lose space rather than save it: a timestamp stored in an int column occupies only 4 bytes, while the same timestamp stored in string format takes 10 bytes.
Your first design is correct for a relational database. The redundancy in the key columns is preferable because it gives you a lot more flexibility in how you validate and query the rankings. You can do sorts, counts, averages, etc. in SQL without having to write any PHP code to split your string six ways from Sunday.
It sounds like you would like to use a non-SQL database like CouchDB or MongoDB. These would allow you to store a semi-structured list of rankings right in the record for the photo, and subsequently query the rankings efficiently. With the caveat that you don't really know that the rankings are in the right format, as you do with SQL.
I would stick with your first approach. With the second, you will have a lot of data stored in a single row, and it keeps growing as time goes by; a photo could accumulate thousands and thousands of rankings.
The first approach is also more maintainable, for example if you wish to delete a rank.
I'd say the database 'hit' of over-normalization (querying the ranks table over and over) is nicely avoided by 'caching' the last rank in current_rank. It does not really matter that ranks grows tremendously if it is seldom queried (analysis/reporting, as you said) and never updated, but just gets records inserted at the end: even a very light box would have no problem holding millions of rows in that table.
Your alternative would require lots of updates at different locations on the disk, possibly resulting in degraded performance.
Of course, if you need all the old data, and always by photo_id, you could plan a scheduled run that moves each month's rows to another table rankings_old (say with photo_id, year, month and the rankings including timestamps) once the month is over, so retrieving old data stays easily possible while both tables still only receive inserts at the end, never updates. A sketch of such a run follows below.
And take it from me: millions of records in a pure logging table should be absolutely no problem.
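A sketch of that monthly archive run; the rankings_old layout and the cutoff expression are assumptions:

-- copy everything older than the current month, then trim the hot table
INSERT INTO rankings_old (photo_id, ranks, timestamp)
SELECT photo_id, ranks, timestamp
FROM Ranks
WHERE timestamp < DATE_FORMAT(NOW(), '%Y-%m-01');

DELETE FROM Ranks
WHERE timestamp < DATE_FORMAT(NOW(), '%Y-%m-01');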
Normalized data or denormalized data: you will find thousands of articles about that. :)
It really depends on your needs.
If you want to build your database only with performance (speed, RAM consumption, ...) in mind, you should only trust the numbers. To do that, you have to profile your queries with the expected data volume (you can generate the data with a script you write). To profile your queries, learn how to read the results of the two following statements:
EXPLAIN EXTENDED ...
SHOW STATUS
Then learn what to do to improve the figures (MySQL settings, data structure, hardware, etc.); a minimal profiling session is sketched below.
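A sketch along those lines; the example query against the Ranks table is illustrative:

-- see how MySQL plans to execute the query
EXPLAIN EXTENDED
SELECT photo_id, AVG(ranks) FROM Ranks GROUP BY photo_id;

-- compare handler counters before and after running the query
FLUSH STATUS;
SELECT photo_id, AVG(ranks) FROM Ranks GROUP BY photo_id;
SHOW SESSION STATUS LIKE 'Handler%';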
As a starter, I really recommend these two great articles:
http://www.xaprb.com/blog/2006/10/12/how-to-profile-a-query-in-mysql/
http://ajohnstone.com/archives/mysql-php-performance-optimization-tips/
If you want to build for the academic beauty of normalization: just follow the books and the general recommendations. :)
Out of the two options - like everyone before me said - it has to be option 1.
What you should really be concerned about are the bottlenecks in the application itself. Are users going to refer to the historical data often, or does it only show up for a few select users? If everyone gets to see the historical rank data, then option 1 is good enough. If you are not going to refer to the historical ranks that often, you could create a third "archive" table, and before updating the ranks, copy the rows of the original rank table into the archive table. This keeps the number of rows minimal in the main table that is being queried.
Remember, if you're updating tens of thousands of rows, it might be more fruitful to compute the results in your code (PHP/Python/etc.), truncate the table, and insert the results back in, rather than updating row by row, as that would be a potential bottleneck.
You may want to look up sharding as well (horizontal partitioning) - http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
And never forget to index well.
Hope that helped.
You stated the rank is only linked to the image, in which case all you need is table 1, updating the rank in real time; table 2 just stores unnecessary data. The disadvantage of this approach is that a user can't change his vote.
You said the second table is for analysis/statistics, so it isn't something that needs to be stored in the database. My suggestion is to get rid of the second table and use a logging facility to record rank changes.
Your second design is very dangerous in case a photo gets 1 million votes: can PHP handle splitting a string that long?
With the first design you can do all the math at the database level, which returns you a small result set; see the sketch below.
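For instance, a sketch of that database-level math against the Ranks table from the question:

-- per-photo vote count and average rank, computed entirely in MySQL
SELECT photo_id, COUNT(*) AS votes, AVG(ranks) AS avg_rank
FROM Ranks
GROUP BY photo_id;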