This post is a follow-up to this answered question: Best method for storing a list of user IDs.
I took cletus and Mehrdad Afshari's epic advice and used a normalized database approach. Are the following tables set up properly for optimization? I'm fairly new to MySQL efficiency, so I want to make sure this is effective.
Also, when it comes to finding the average rating for a game and the total number of votes, should I use the following two queries, respectively?
SELECT avg(vote) FROM votes WHERE uid = $uid AND gid = $gid;
SELECT count(uid) FROM votes WHERE uid = $uid AND gid = $gid;
CREATE TABLE IF NOT EXISTS `games` (
`id` int(8) NOT NULL auto_increment,
`title` varchar(50) NOT NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1 ;
CREATE TABLE IF NOT EXISTS `users` (
`id` int(8) NOT NULL auto_increment,
`username` varchar(20) NOT NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1 ;
CREATE TABLE IF NOT EXISTS `votes` (
`uid` int(8) NOT NULL,
`gid` int(8) NOT NULL,
`vote` int(1) NOT NULL,
KEY `uid` (`uid`,`gid`)
) ;
Average rating for a game: SELECT AVG(vote) FROM votes WHERE gid = $gid;
Number of votes for a game: SELECT COUNT(uid) FROM votes WHERE gid = $gid;
As you will not have any user or game IDs smaller than 0, you could make them unsigned integers (int(8) unsigned NOT NULL).
If you want to enforce that a user can only make a single vote for a game, then create a primary key over uid and gid in the votes table instead of just a normal index.
CREATE TABLE IF NOT EXISTS `votes` (
`uid` int(8) unsigned NOT NULL,
`gid` int(8) unsigned NOT NULL,
`vote` int(1) NOT NULL,
PRIMARY KEY (`gid`, `uid`)
) ;
The order of the primary key's fields (first gid, then uid) is important so the index is sorted by gid first. That makes the index especially useful for selects with a given gid. If you want to select all the votes a given user has made then add another index with just uid.
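For example, a minimal sketch of that secondary index (the index name is illustrative):
ALTER TABLE `votes` ADD INDEX `idx_uid` (`uid`);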
I would recommend InnoDB as the storage engine, because under high load especially, MyISAM's table-level locks will kill your performance. For read performance you can add a caching layer using APC, Memcached or others.
Looks good.
I would have used users_id and games_id instead of uid and gid, which read like 'unique id' and 'global id'.
Whatever you end up doing, make sure you test it with a large data-set (even if you don't plan on having a huge number of users)
Write a script that generates 100,000 games, 50,000 users and a million votes. That may be slightly excessive, but if your queries don't take hours at that scale, volume will never be an issue.
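For instance, a quick-and-dirty generator might look like this (a sketch assuming the votes table with the (gid, uid) primary key proposed above; the seed_votes name is made up, and users/games are assumed to have ids 1..50000 and 1..100000; a row-by-row loop is slow but fine for a one-off seed):
DELIMITER //
CREATE PROCEDURE seed_votes()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 1000000 DO
    -- Random user 1..50000, game 1..100000, vote 1..5; collisions on the
    -- (gid, uid) primary key simply overwrite the earlier vote.
    INSERT INTO votes (uid, gid, vote)
    VALUES (FLOOR(1 + RAND() * 50000), FLOOR(1 + RAND() * 100000), FLOOR(1 + RAND() * 5))
    ON DUPLICATE KEY UPDATE vote = VALUES(vote);
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;
CALL seed_votes();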
Looks good so far. Don't forget indices and foreign keys. In my experience most issues don't arise from not-so-well-thought-out designs but from the lack of indices and foreign keys.
Also, regarding storage engine selection, I have yet to see a reason (in a reasonably complex/sized app) for not using InnoDB, and not just because of its transactional semantics.
You might want to add a voted_on (DATETIME) column too. That way you could, say, see a game's trend over a certain timespan, or, should vote spam ever happen, delete the unwanted votes accurately.
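A sketch of both the column and a trend query over, say, the last week (the interval is arbitrary):
ALTER TABLE `votes` ADD COLUMN `voted_on` DATETIME NOT NULL;
SELECT AVG(vote) FROM votes WHERE gid = $gid AND voted_on >= NOW() - INTERVAL 7 DAY;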
I have a table where I log member activity. There are 1,486,044 records in it.
SELECT * FROM `user_log` WHERE user = '1554143' order by id desc
This query, however, takes 5 seconds. What do you recommend?
Table construction below;
CREATE TABLE IF NOT EXISTS `user_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user` int(11) NOT NULL,
`operation_detail` varchar(100) NOT NULL,
`ip_adress` varchar(50) NOT NULL,
`l_date` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
For this query:
SELECT * FROM `user_log` WHERE user = 1554143 order by id desc
You want an index on (user, id desc).
Note that I removed the single quotes around the filtering value for user, since this column is a number. This does not necessarily speed things up, but it is cleaner.
Also: SELECT * is not good practice, and not good for performance. You should enumerate the columns you want in the resultset (if you don't need them all, do not select them all). If you do want all the columns, since your table does not have many, you might want to try a covering index on all 5 columns, like (user, id desc, operation_detail, ip_adress, l_date).
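In MySQL terms, sketches of both suggestions (index names are illustrative; note that only MySQL 8.0+ actually builds descending indexes, while older versions parse DESC but create an ascending index, which can still be scanned in reverse for the ORDER BY):
ALTER TABLE `user_log` ADD INDEX `idx_user_id` (`user`, `id` DESC);
-- Covering variant, so the whole resultset is served from the index:
ALTER TABLE `user_log` ADD INDEX `idx_user_covering` (`user`, `id`, `operation_detail`, `ip_adress`, `l_date`);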
In addition to the option of creating an index on (user, id), which has already been mentioned, a likely better option is to convert the table to InnoDB and create an index on just (user).
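Something like this (a sketch; the index name is illustrative):
ALTER TABLE `user_log` ENGINE=InnoDB;
-- InnoDB secondary indexes implicitly end with the primary key columns,
-- so this index behaves like (user, id) and can serve ORDER BY id DESC.
ALTER TABLE `user_log` ADD INDEX `idx_user` (`user`);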
I have a MySQL (5.6.26) database with a large amount of data, and I have a problem with a COUNT select over a table join.
This query takes about 23 seconds to execute:
SELECT COUNT(0) FROM user
LEFT JOIN blog_user ON blog_user.id_user = user.id
WHERE email IS NOT NULL
AND blog_user.id_blog = 1
Table user is MyISAM and contains user data like id, email, name, etc...
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`username` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT '',
`hash` varchar(100) DEFAULT NULL,
`last_login` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `id` (`id`) USING BTREE,
UNIQUE KEY `email` (`email`) USING BTREE,
UNIQUE KEY `hash` (`hash`) USING BTREE,
FULLTEXT KEY `email_full_text` (`email`)
) ENGINE=MyISAM AUTO_INCREMENT=5728203 DEFAULT CHARSET=utf8
Table blog_user is InnoDB and contains only id, id_user and id_blog (a user can have access to more than one blog). id is the PRIMARY KEY and there are indexes on id_blog, id_user and (id_blog, id_user).
CREATE TABLE `blog_user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_blog` int(11) NOT NULL DEFAULT '0',
`id_user` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `id_blog_user` (`id_blog`,`id_user`) USING BTREE,
KEY `id_user` (`id_user`) USING BTREE,
KEY `id_blog` (`id_blog`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=5250695 DEFAULT CHARSET=utf8
I deleted all other tables and there is no other connection to MySQL server (testing environment).
What I've found so far:
When I delete some columns from user table, duration of query is shorter (like 2 seconds per deleted column)
When I delete all columns from user table (except id and email), duration of query is 0.6 seconds.
When I change blog_user table also to MyISAM, duration of query is 46 seconds.
When I change user table to InnoDB, duration of query is 0.1 seconds.
The question is why is MyISAM so slow executing the command?
First, some comments on your query (after fixing it up a bit):
SELECT COUNT(*)
FROM user u LEFT JOIN
blog_user bu
ON bu.id_user = u.id
WHERE u.email IS NOT NULL AND bu.id_blog = 1;
Table aliases make a query easier both to write and to read. More importantly, you have a LEFT JOIN, but your WHERE clause is turning it into an INNER JOIN. So, write it that way:
SELECT COUNT(*)
FROM user u INNER JOIN
blog_user bu
ON bu.id_user = u.id
WHERE u.email IS NOT NULL AND bu.id_blog = 1;
The difference is important because it affects choices that the optimizer can make.
Next, indexes will help this query. I am guessing that blog_user(id_blog, id_user) and user(id, email) are the best indexes.
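The blog_user side is already covered by the existing unique key on (id_blog, id_user); the user side could be added like this (a sketch; the index name is illustrative):
ALTER TABLE `user` ADD INDEX `idx_id_email` (`id`, `email`);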
The reason the number of columns affects your original query is that the query is doing a lot of I/O. The fewer the columns, the fewer pages are needed to store the records, and the faster the query runs. Proper indexes should work better and more consistently.
To answer the real question (why MyISAM is slower than InnoDB here), I can't give an authoritative answer.
But it is certainly related to one of the more important differences between the two storage engines: InnoDB supports foreign keys and MyISAM doesn't. Foreign keys are important for joining tables.
I don't know if defining a foreign key constraint will improve speed further, but it will certainly guarantee data consistency.
Another note: you observe that the time decreases as you delete columns. This indicates that the query requires a full table scan, which can be avoided by creating an index on the email column. user.id and blog_user.id_user hopefully already have indexes; if they don't, that is an error. Columns that participate in a foreign key, explicit or not, must always have an index.
This is a long time after the event to be of much use to the OP, and all the foregoing suggestions for speeding up the query are entirely appropriate, but I wonder why no one has remarked on the output of EXPLAIN. Specifically, why the index on email was chosen, and how that relates to the definition of the email column in the user table.
The optimizer has selected the index on the email column, presumably because email is included in the WHERE clause. key_len for this index is comparatively long, and it's a reasonably large table given the AUTO_INCREMENT value, so the memory requirements for this index would be considerably greater than if it had chosen the id column (4 bytes against 303 bytes). The email column is nullable but has a default of the empty string, so unless the application explicitly sets a NULL, you are not going to find any NULLs in this column anyway. Neither will you find more than one record with the default value, given the UNIQUE constraint. The column's DEFAULT and its UNIQUE constraint appear to be completely at odds with each other.
Given the above, and the fact that we only want the count, I'd wonder whether the email part of the WHERE clause serves any purpose other than slowing the query down as each value is compared to NULL. Without it, the optimizer would probably pick the primary key and do a much better job. Better yet would be a query that ignores the user table entirely and takes the count from the covering index on blog_user that Gordon Linoff highlighted.
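That count-only query is simply the following, answered entirely from blog_user's (id_blog, id_user) index:
SELECT COUNT(*) FROM blog_user WHERE id_blog = 1;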
There's another indexing issue here worth mentioning:
On the user table
UNIQUE KEY `id` (`id`) USING BTREE,
is redundant since id is the PRIMARY KEY and therefore UNIQUE by definition.
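Dropping it is a one-liner:
ALTER TABLE `user` DROP INDEX `id`;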
To answer your last question,
The question is why is MyISAM so slow executing the command?
MyISAM caches only index blocks in memory and relies on the operating system and disk for the data rows, so it is largely bound by the speed of your hard drive. InnoDB caches both data and index pages in its buffer pool, so once the data has been read it operates at RAM speed. The first time the query runs it may be loading data from disk; the second and later runs avoid the hard drive until the pages age out of RAM.
A couple of years ago I designed a reward system for 11-16yo students in PHP, JavaScript and MySQL.
The premise is straightforward:
Members of staff issue points to students under various categories ("Positive attitude and behaviour", "Model citizen", etc)
Students accrue these points then spend them in our online store (iTunes vouchers, etc)
Existing system
The database structure is also straightforward (probably too much so):
Transactions
239,189 rows
CREATE TABLE `transactions` (
`Transaction_ID` int(9) NOT NULL auto_increment,
`Datetime` date NOT NULL,
`Giver_ID` int(9) NOT NULL,
`Recipient_ID` int(9) NOT NULL,
`Points` int(4) NOT NULL,
`Category_ID` int(3) NOT NULL,
`Reason` text NOT NULL,
PRIMARY KEY (`Transaction_ID`),
KEY `Giver_ID` (`Giver_ID`),
KEY `Datetime` (`Datetime`),
KEY `DatetimeAndGiverID` (`Datetime`,`Giver_ID`),
KEY `Recipient_ID` (`Recipient_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=249069 DEFAULT CHARSET=latin1
Categories
34 rows
CREATE TABLE `categories` (
`Category_ID` int(9) NOT NULL,
`Title` varchar(255) NOT NULL,
`Description` text NOT NULL,
`Default_Points` int(3) NOT NULL,
`Groups` varchar(125) NOT NULL,
`Display_Start` datetime default NULL,
`Display_End` datetime default NULL,
PRIMARY KEY (`Category_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Rewards
82 rows
CREATE TABLE `rewards` (
`Reward_ID` int(9) NOT NULL auto_increment,
`Title` varchar(255) NOT NULL,
`Description` text NOT NULL,
`Image_URL` varchar(255) NOT NULL,
`Date_Inactive` datetime NOT NULL,
`Stock_Count` int(3) NOT NULL,
`Cost_to_User` float NOT NULL,
`Cost_to_System` float NOT NULL,
PRIMARY KEY (`Reward_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=latin1
Purchases
5,889 rows
CREATE TABLE `purchases` (
`Purchase_ID` int(9) NOT NULL auto_increment,
`Datetime` datetime NOT NULL,
`Reward_ID` int(9) NOT NULL,
`Quantity` int(4) NOT NULL,
`Student_ID` int(9) NOT NULL,
`Student_Name` varchar(255) NOT NULL,
`Date_DealtWith` datetime default NULL,
`Date_Collected` datetime default NULL,
PRIMARY KEY (`Purchase_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=6133 DEFAULT CHARSET=latin1
Problems
The system ran perfectly well for a period of time. It's now starting to slow down massively on certain queries.
Essentially, every time I need to access a student's reward points total, the required query takes ages. Here are a few example queries and their run-times:
Top 15 students, excluding attendance categories, across whole school
SELECT CONCAT( s.Firstname, " ", s.Surname ) AS `Student` , s.Year_Group AS `Year Group`, SUM( t.Points ) AS `Points`
FROM frog_rewards.transactions t
LEFT JOIN frog_shared.student s ON t.Recipient_ID = s.id
WHERE t.Datetime > '2013-09-01' AND t.Category_ID NOT IN ( 12, 13, 14, 26 )
GROUP BY t.Recipient_ID
ORDER BY `Points` DESC
LIMIT 0 , 15
Run-time: 44.8425 sec
SELECT Recipient_ID, SUM(Points) AS Total_Points FROM transactions GROUP BY Recipient_ID
Run-time: 9.8698 sec
Now, I appreciate that (especially with the second query) I shouldn't ever be running a call which returns such a vast quantity of rows, but the limitations of the framework within which the system runs meant I had no other choice if I wanted to display students' total reward points for teachers/tutors/year managers/leadership to view and analyse.
Time for a solution
Fortunately the framework we've been forced to use is changing. We'll now be using OAuth rather than a horrible, outdated JavaScript widget format.
Unfortunately - or, I guess, fortunately - it means we'll have to rewrite quite a lot of the system.
One of the main areas I intend to look at when rewriting the system is the database structure. As time goes on it will only get bigger, so I need to do a bit of future-proofing.
As such, my main question is this: what is the most efficient and effective way of storing students' point totals?
The only idea I can come up with is to have a separate table called totals with Student_ID and Points fields. Every time a member of staff gives out some points, it adds a row into the transactions table but also updates the totals table.
Is that efficient? Would it be efficient to also have a Points_Since_Monday type field? How would I update/keep on top of that?
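For what it's worth, a minimal sketch of that running-total idea (the totals table follows your description; the literal values are placeholders, and wrapping both writes in one transaction keeps the tally from drifting):
CREATE TABLE `totals` (
`Student_ID` int(9) NOT NULL,
`Points` int(9) NOT NULL DEFAULT 0,
PRIMARY KEY (`Student_ID`)
) ENGINE=InnoDB;
START TRANSACTION;
INSERT INTO `transactions` (`Datetime`, `Giver_ID`, `Recipient_ID`, `Points`, `Category_ID`, `Reason`)
VALUES ('2013-09-02', 1, 2, 5, 3, 'Model citizen');
INSERT INTO `totals` (`Student_ID`, `Points`) VALUES (2, 5)
ON DUPLICATE KEY UPDATE `Points` = `Points` + VALUES(`Points`);
COMMIT;
A Points_Since_Monday figure is probably better computed on demand from the indexed Datetime column than maintained as a second counter that needs resetting every week.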
On top of the main question, if anyone has suggestions for general improvement with regard to optimisation of the database table, please let me know.
Thanks in advance,
Duncan
There is nothing particularly wrong with your design that should make it as slow as you have reported. I'm thinking there must be other factors at work, such as the server it is running on being overloaded or slow. Only you will be able to find out if that is the case.
In order to test your design I recreated it on the SQL Server 2008 instance I have running on my desktop computer. I have a standard computer with a single hard disc, no SSD, no RAID etc., so on a proper database server the results should be even better. I had to make some changes to the design as you are using MySQL, but none of the changes should affect performance; it's just so I can run it on my database.
Here's the table structure I used. I had to guess at what you would have in the Student and Staff tables, as you do not describe those. I also took the liberty of renaming the Giver_ID and Recipient_ID fields in the Transaction table, as I assume only staff give points and students receive them.
I generated random data to fill the tables with the same number of rows you said you have in your database.
I ran the two queries you said were taking a long time. I've changed them to suit my design, but (I hope) the results are the same:
SELECT TOP 15
Firstname + ' ' + Surname
,Year_Group
,SUM(Points) AS Points
FROM points.[Transaction]
INNER JOIN points.Student ON points.[Transaction].Student_ID = points.Student.Student_ID
WHERE [Datetime] > '2013-09-01'
AND Category_ID NOT IN ( 12, 13, 14, 26 )
GROUP BY Firstname + ' ' + Surname
,Year_Group
ORDER BY SUM(Points) DESC
SELECT Student_ID
,SUM(Points) AS Total_Points
FROM points.[Transaction]
GROUP BY Student_ID
Both queries returned results in about 1s. I have not created any additional indexes on the tables other than the CLUSTERED indexes generated by default on the primary keys. Looking at the execution plan, the query processor estimates that implementing the following index could improve the query cost by 81.0309%:
CREATE NONCLUSTERED INDEX [<Name of Missing Index>]
ON [points].[Transaction] ([Datetime],[Category_ID])
INCLUDE ([Student_ID],[Points])
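The rough MySQL equivalent for your schema (MySQL has no INCLUDE clause, so the extra columns are simply appended to the key; the index name is illustrative):
ALTER TABLE `transactions` ADD INDEX `idx_datetime_category` (`Datetime`, `Category_ID`, `Recipient_ID`, `Points`);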
As others have commented I would look elsewhere for bottlenecks before spending a lot of time redesigning your database.
Update:
I realised I never actually addressed your specific question:
what is the most efficient and effective way of storing students'
point totals?
The only idea I can come up with is to have a separate table called
totals with Student_ID and Points fields. Every time a member of staff
gives out some points, it adds a row into the transactions table but
also updates the totals table.
I would not recommend keeping a separate point total unless you have explored every other possible way to speed up the database. A separate tally can become out of sync with the transactions and then you have to reconcile everything and track down what went wrong, and what the correct total should be.
You should always focus on maintaining the correctness and consistency of the data before trying to increase speed. Most of the time a correct (normalised) data model will operate quickly enough.
In one place I worked we found the most cost effective way to speed up our database was simply to upgrade the hardware; much quicker and cheaper than spending many man-hours redesigning the database :)
I have a table whose structure is as follows:
CREATE TABLE `table_name` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`ttype` int(1) DEFAULT '19',
`title` mediumtext,
`tcode` char(2) DEFAULT NULL,
`tdate` int(11) DEFAULT NULL,
`visit` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `tcode` (`tcode`),
KEY `ttype` (`ttype`),
KEY `tdate` (`tdate`)
) ENGINE=MyISAM;
I have two queries on x.php, as follows:
SELECT * FROM table_name WHERE id='10' LIMIT 1
UPDATE table_name SET visit=visit+1 WHERE id='10' LIMIT 1
My first question is whether updating visit in the table causes reindexing and hurts performance or not. Note that visit is not part of any key.
A second method might be to create a new table that contains visit, like so:
CREATE TABLE `table_name2` (
`newid` int(10) unsigned NOT NULL,
`visit` int(11) DEFAULT '0',
PRIMARY KEY (`newid`)
) ENGINE=MyISAM;
and then to select and update with:
SELECT w.*,q.visit FROM table_name w LEFT JOIN table_name2 q
ON (w.id=q.newid) WHERE w.id='10' LIMIT 1
UPDATE table_name2 SET visit=visit+1 WHERE newid='10' LIMIT 1
Is the second method preferable to the first? Which one would perform better and be quicker?
Note: all SQL queries are run from PHP (via mysql_query). Also, I need the first table's indexes for other queries on other pages.
I'd say your first method is the best, and simplest. Updating visit will be very fast and no updating of indexes needs to be performed.
I'd prefer the first, and have used that for similar things in the past with no problems. You can remove the LIMIT clause: since id is your primary key, you will never have more than one result (though the query optimizer probably works this out for you anyway).
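So the pair of statements reduces to (a sketch):
SELECT * FROM table_name WHERE id = 10;
UPDATE table_name SET visit = visit + 1 WHERE id = 10;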
There was a question someone asked earlier to which I responded with a solution you may want to consider as well. When you use 'count' columns you lose the ability to mine the data later. With a transaction table, not only can you get 'views' counts, you can also query for date ranges etc. Sure, you will carry the weight of storing potentially hundreds of thousands of rows, but the table is narrow and the indices numeric.
I cannot see a solution on the database side... Perhaps you can do it in PHP: if the user has a PHP session, you could, for example, only update the visit count every 10th time, like:
<?php
session_start();
// Initialize the per-session counter on first use.
if (!isset($_SESSION['count'])) {
    $_SESSION['count'] = 0;
}
$_SESSION['count'] += 1;
if ($_SESSION['count'] >= 10) {
    // Write the 10 accumulated views to the database in a single query.
    do_the_function_that_updates_the_count_plus_10();
    $_SESSION['count'] = 0;
}
Of course you lose some counts this way, but perhaps that is not so important?
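The flush function would then issue a single batched update, something like this (a sketch; the id value stands in for whatever row is being counted):
UPDATE table_name SET visit = visit + 10 WHERE id = 10;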
I have a social network similar to MySpace, but I use PHP and MySQL. I have been looking for the best way to show users bulletins posted only by themselves and by users they are confirmed friends with.
This involves 3 tables
friend_friend = this table stores records for who is whose friend
friend_bulletins = this stores the bulletins
friend_reg_user = this is the main user table with all user data like name and photo url
I will post the bulletin and friend table schemas below; for the user table I will only list the fields that are used.
-- Table structure for table friend_bulletin
CREATE TABLE IF NOT EXISTS `friend_bulletin` (
`auto_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(10) NOT NULL DEFAULT '0',
`bulletin` text NOT NULL,
`subject` varchar(255) NOT NULL DEFAULT '',
`color` varchar(6) NOT NULL DEFAULT '000000',
`submit_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`status` enum('Active','In Active') NOT NULL DEFAULT 'Active',
`spam` enum('0','1') NOT NULL DEFAULT '1',
PRIMARY KEY (`auto_id`),
KEY `user_id` (`user_id`),
KEY `submit_date` (`submit_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=245144 ;
-- Table structure for table friend_friend
CREATE TABLE IF NOT EXISTS `friend_friend` (
`autoid` int(11) NOT NULL AUTO_INCREMENT,
`userid` int(10) DEFAULT NULL,
`friendid` int(10) DEFAULT NULL,
`status` enum('1','0','3') NOT NULL DEFAULT '0',
`submit_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`alert_message` enum('yes','no') NOT NULL DEFAULT 'yes',
PRIMARY KEY (`autoid`),
KEY `userid` (`userid`),
KEY `friendid` (`friendid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=2657259 ;
friend_reg_user table fields that will be used
auto_id = this is the users ID number
disp_name = this is the users name
pic_url = this is a thumbnail image path
Bulletins should show all bulletins posted by user IDs in our friend list.
They should also show all bulletins that we posted ourselves.
This needs to scale well; the friends table is several million rows.
// 1 Old method uses a subselect
SELECT auto_id, user_id, bulletin, subject, color, fb.submit_date, spam
FROM friend_bulletin AS fb
WHERE (user_id IN (SELECT userid FROM friend_friend WHERE friendid = $MY_ID AND status =1) OR user_id = $MY_ID)
ORDER BY auto_id
// Another old method that I used on accounts with a small number of friends, because it requires a second query
// that returns a string of all their friend IDs in this format: $str_friend_ids = "1,2,3,4,5,6,7,8"
select auto_id,subject,submit_date,user_id,color,spam
from friend_bulletin
where user_id=$MY_ID or user_id in ($str_friend_ids)
order by auto_id DESC
I know these are not good for performance as my site is getting really large, so I have been experimenting with JOINs.
I believe the query below gets everything I need, except that it needs to be modified to also fetch bulletins posted by myself. When I add that to the WHERE clause it seems to break and return multiple results for each bulletin posted, I think because it returns results for the users I am a friend of and then also tries to treat me as my own friend, which doesn't work well.
The main point of this whole post, though, is that I am open to opinions on the most performant way to do this task. Many big social networks have a similar function that returns a list of items posted only by your friends, so there must be other, faster ways. I keep reading that JOINs are not great for performance, but how else can I do this? Keep in mind that I do use indexes and have a dedicated database server, but my userbase is large and there is no way around that.
SELECT fb.auto_id, fb.user_id, fb.bulletin, fb.subject, fb.color, fb.submit_date, fru.disp_name, fru.pic_url
FROM friend_bulletin AS fb
LEFT JOIN friend_friend AS ff ON fb.user_id = ff.userid
LEFT JOIN friend_reg_user AS fru ON fb.user_id = fru.auto_id
WHERE (
ff.friendid =1
AND ff.status =1
)
LIMIT 0 , 30
First of all, you can try to partition out the database so that you're only accessing a table with the primary rows you need. Move rows that are less often used to another table.
JOINs can impact performance but from what I've seen, subqueries are not any better. Try refactoring your query so that you're not pulling all that data at once. It also seems like some of that query can be run once elsewhere in your app, and those results either stored in variables or cached.
For example, you can cache an array of friends who are connected for each user and just reference that when running the query, and only update the cache when a new friend is added/removed.
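For example, with the cached friend IDs inlined, the bulletin query collapses to a single IN list that also covers your own posts (a sketch; the 1,2,3 values stand in for the cached IDs):
SELECT fb.auto_id, fb.user_id, fb.bulletin, fb.subject, fb.color, fb.submit_date, fru.disp_name, fru.pic_url
FROM friend_bulletin AS fb
JOIN friend_reg_user AS fru ON fru.auto_id = fb.user_id
WHERE fb.user_id IN ($MY_ID, 1, 2, 3)
ORDER BY fb.auto_id DESC
LIMIT 0, 30;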
It also depends on the structure of your systems and your code architecture; your bottleneck may not be entirely in the db.