Fastest MySQL row rank of big table

Fastest MySQL row rank of big table - php

Info: I have this table (PERSONS):
PERSON_ID int(10)
POINTS int(6)
4 OTHER COLUMNS which are of type int(5 or 6)
The table consist of 25M rows and is growing 0.25M a day. The distribution of points is around 0 to 300 points and 85% of the table has 0 points.
Question: I would like to return to the user which rank he/she has if they got at least 1 point. How and where would be the fastest way to do it, in SQL or PHP or combination?
Extra Info: Those lookups can happen every second 100 times. The solutions I have seen so far are not fast enough, if more info needed please ask.
Any advice is welcome, as you understand I am new to PHP and MySQL :)

Create an index on t(points) and on t(person_id, points). Then run the following query:
select count(*)
from persons p
where p.points >= (select points from persons p where p.person_id = <particular person>)
The subquery should use the second index as a lookup. The first should be an index scan on the first index.
Sometimes MySQL can be a little strange about optimization. So, this might actually be better:
select count(*)
from persons p cross join
(select points from persons p where p.person_id = <particular person>) const
where p.points > const.points;
This just ensures that the lookup for the points for the given person happens once, rather than for each row.

Partition your table into two partitions - one for people with 0 points and one for people with one or more points.
Add one index on points to your table and another on person_id (if these indexes don't already exist).
To find the dense rank of a specific person, run the query:
select count(distinct p2.points)+1
from person p1
join person p2 on p2.points > p1.points
where p1.person_id = ?
To find the non-dense rank of a specific person, run the query:
select count(*)
from person p1
join person p2 on p2.points >= p1.points
where p1.person_id = ?
(I would expect the dense rank query to run significantly faster.)

Related

Optimal joins in MySQL or offloading to application layer

I have 3 tables in a MySQL database: courses, users and participants, which contains about 30mil, 30k and 3k entries respectively.
My goal is to (efficiently) figure out the number of users that have been assigned to courses that matches our criteria. The criteria is a little more complex, but for this example we only care about users where deleted_at is null and courses where deleted_at is null and active is 1.
Simplified these are the columns:
users:
id
deleted_at
1
null
2
2022-01-01
courses:
id
active 
deleted_at
1
1
null
1
1
2020-01-01
2
0
2020-01-01
participants:
id
participant_id 
course_id
1
1
1
2
1
2
3
2
2
Based on the data above, the number we would get would be 1 as only user 1 is not deleted and that user assigned to some course (id 1) that is active and not deleted.
Here is a list of what I've tried.
Joining all the tables and do simple where's.
Joining using subqueries.
Pulling the correct courses and users out to the application layer (PHP), and querying participants using WHERE IN.
Pulling everything out and doing the filtering in the application layer.
Calling using EXPLAIN to add better indexes - I, admittedly, do not do this often and may not have done this well enough.
A combination of all the above.
An example of a query would be:
SELECT COUNT(DISTINCT participant_id)
FROM `participants`
INNER JOIN
(SELECT `courses`.`id`
FROM `courses`
WHERE (`active` = '1')
AND `deleted_at` IS NULL) AS `tempCourses` ON `tempCourses`.`id` = `participants`.`course_id`
WHERE `participant_type` = 'Eloomi\\Models\\User'
AND `participant_id` in
(SELECT `users`.`id`
FROM `users`
WHERE `users`.`deleted_at` IS NULL)
From what I can gather doing this will create a massive table, which only then will start applying where's. In my mind it should be possible to short circuit a lot of that because once we get a match for a user, we can disregard that going forward. That would be how to handle it, in my mind, in the application layer.
We could do this on a per-user basis in the application layer, but the number of requests to the database would make this a bad solution.
I have tagged it as PHP as well as MySQL, not because it has to be PHP but because I do not mind offloading some parts to the application layer if that is required. It's my experience that joins do not always use indexes optimally
Edit:
To specify my question: Can someone help me provide a efficient way to pull out the number of non-deleted users that have been assigned to to active non-deleted courses?

I would write it this way:
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL may join the tables in another order, not the order you list the tables in the query.
I hope that courses is the first table MySQL accesses, because it's probably the smallest table. Especially after filtering by active and deleted_at. The following index will help to narrow down that filtering, so only matching rows are examined:
ALTER TABLE courses ADD KEY (active, deleted_at);
Every index implicitly has the table's primary key (e.g. id) appended as the last column. That column being part of the index, it is used in the join to participants. So you need an index in participants that the join uses to find the corresponding rows in that table. The order of columns in the index is important.
ALTER TABLE participants ADD KEY (course_id, participant_type, participant_id);
The participant_id is used to join to the users table. MySQL's optimizer will probably prefer to join to users by its primary key, but you also want to restrict that by deleted_at, so you might need this index:
ALTER TABLE users ADD KEY (id, deleted_at);
And you might need to use an index hint to coax the optimizer to prefer this secondary index over the primary key index.
SELECT COUNT(DISTINCT p.participant_id)
FROM courses AS c
INNER JOIN participants AS p
ON c.id = p.course_id
INNER JOIN users AS u USE INDEX(deleted_at)
ON p.participant_id = u.id
WHERE u.deleted_at IS NULL
AND c.active = 1 AND c.deleted_at IS NULL
AND p.participant_type = 'Eloomi\\Models\\User';
MySQL knows how to use compound indexes even if some conditions are in join clauses and other conditions are in the WHERE clause.
Caveat: I have not tested this. Choosing indexes may take several tries, and testing the EXPLAIN after each try.

Calculation of Points in a Database in 3rd NF

I have a database where the results from a shooter game are stored. I put them to 3NF to allow extensions of the system. So it looks like this:
Player
-------------------
GameId integer
PlayerId integer
TeamId integer
Hits
-------------------
GameId integer
FromId integer
ToId integer
Hits integer
So basically for every game there is a ID and every Player and Team has its ID (with their names stored in other databases)
Now I want to calculate points for each player. I need the points for each game but more importantly the total per player. The points are basically: 3 Points for each hit on opponent, -2 points for each hit of a team member and -2 points for each hit taken.
Alone the calculation of the number of team hits requires a JOIN with 3 tables and I fear for performance in production environment. (Each game has ~8 players-> PlayerDB-Size is 8n and HitsDB-Size is (8-1)^2*n)
And at the end: I need to calculate the points per player for each game and sum those up because the minimum points per game should be zero. And finally get a rank for each player (player x has the 2nd most total points etc)
I feel like I'm getting lost in overly complicated queries that will kill the database' performance at some point.
Could anyone judge the design and maybe give me some pointers where to start looking further? I though about storing the TeamHits and Points per Game in the players Database (Points for summing over them, teamHits for statistical purposes) but that would of course break normalization.
PS: I'm working with PHP 5 and MYSQL. I also thought about getting each game from the database, calculating the points in PHP (which I'm already doing when I show the game) and writing this back (optimally on putting in the game to the DB but also when the parameters for the points change)
Edit: Idea to avoid subselects would be:
SELECT p.*, SUM(h.Hits) AS TeamHits, SUM(h2.Hits) as Hits
FROM player p
LEFT JOIN
(hits h
INNER JOIN player p2
ON h.GameId=p2.GameId AND h.ToId=p2.PlayerId
)
ON p.GameId=p2.GameId AND h.FromId=p.PlayerId AND p.TeamId=p2.TeamId
GROUP BY p.PlayerId, p.GameId
LEFT JOIN hits h2
ON h2.GameId=p.GameId AND h2.FromId=p.PlayerId
But of course this does not work. Is it even possible to combine groupings with joins or will I have to use subqueries?
Best I have is:
SELECT p.PlayerId, SUM((-2-3)*IFNULL(th.TeamHits, 0) + (3)*IFNULL(h.Hits, 0) + (-2)*IFNULL(ht.HitsTaken, 0)) AS Points
FROM player p
LEFT JOIN
(SELECT p.GameId, p.PlayerId, SUM(h.Hits) AS TeamHits
FROM player p
INNER JOIN hits h
ON h.GameId=p.GameId AND p.PlayerId=h.FromId
INNER JOIN player p2
ON p.GameId=p2.GameId AND p2.PlayerId=h.ToId AND p.TeamId=p2.TeamId
GROUP BY p.PlayerId, p.GameId) th
ON p.GameId=th.GameId AND p.PlayerId=th.PlayerId
LEFT JOIN
(SELECT p.GameId, p.PlayerId, SUM(h.Hits) AS Hits
FROM player p
INNER JOIN hits h
ON h.GameId=p.GameId AND p.PlayerId=h.FromId
GROUP BY p.PlayerId, p.GameId) h
ON p.GameId=h.GameId AND p.PlayerId=h.PlayerId
LEFT JOIN
(SELECT p.GameId, p.PlayerId, SUM(h.Hits) AS HitsTaken
FROM player p
INNER JOIN hits h
ON h.GameId=p.GameId AND p.PlayerId=h.ToId
INNER JOIN player p2
ON p.GameId=p2.GameId AND p2.PlayerId=h.FromId AND p.TeamId!=p2.TeamId
GROUP BY p.PlayerId, p.GameId) ht
ON p.GameId=ht.GameId AND p.PlayerId=ht.PlayerId
GROUP BY p.PlayerId
Fiddle: http://sqlfiddle.com/#!9/dc0cb/4
Current problem: For a database with about 10,000 games calculating the points for all players takes about 18s. This is unusable, so I need to improve this...

Joins are not that expensive, subqueries are. as long as you can avoid subqueries you're not hitting too bad.
Remember, a database is built for this stuff these days.
Just make sure you have the proper indexes on the right fields so its optimised. Like teamID and GameID and playerID should be indexes.
Just run it in phpmyadmin and see how many milliseconds it takes to execute. if it takes more than 50 its a heavy query, but usually its pretty hard to hit this... I once managed to make a very heavy query that joined 100.000+ rows out of different tables and views and still did that in 5ms...
What numbers of requests a hour are we talking about? 200 players a day? 200.000 players a day? How often do the requests happen? 10 per second per player? once a minute? how loaded is your database?
I think that all these parameters are low, so you shouldnt worry about this optimisation yet.
Get your game up and running, clean up the php code where real gains can be had, and stay clear of complex subqueries or views.
As long as your table does joins and unions its pretty darn fast. and if you must do a subquery see if there is not an alternative way by using a linking table to link certain results to certain other tables so you can do a join instead of a subquery.

Highscores (top scores) rank calculation & update query?

I have a high scoring (top scores) system, which is calculating positions by players's eperience.
But now I need to use the player's rank in other places just the web, maybe more places in the web too like personal
high scores, and it will show the player's rank in that skill.
Therefore just looping & playing with the loop cycle like rank++ won't really work, cause I need to save that rank for
other places.
What I could do is loop through all players and then send a query to update that player's rank, but what if i have 1000 players? or more?
that means 1000 queries per load.
I have thought if there could be a SQL query I can use to do the same action, in one or two queries.
How can I do this? I calculate ranks by ordering by player's eperience, so my table structure looks like this:
Tables:
Players
id (auto_increment) integer(255)
displayname varchar(255) unique
rank integer(255) default null
experience bigint(255)

This should give you the rank for user with id = 1. If you want every player, just remove the WHERE clause:
SELECT a.id, a.displayname, a.rank, a.experience
FROM (
SELECT id, displayname, #r:=#r+1 AS rank, experience
FROM players, (SELECT #rank:=0) tmp
ORDER BY experience DESC) a
WHERE a.id = 1
I wouldn't have rank in the players table directly, since this would mean that you would have to recalculate it every time a user changes experience. You could do this query anytime you want to get the rank for a player or for a leaderboard.
If you still want to update it, You can do an INNER JOIN with this query to UPDATE the original table with the rank from this query.

How to optimize a score/rank table with different specific scores and ranks?

in our project we've got an user table where userdata with name and different kind of scores (overall score, quest score etc. is stored). How the values are calculated doesn't matter, but take them as seperated.
Lets look table 'users' like below
id name score_overall score_trade score_quest
1 one 40000 10000 20000
2 two 20000 15000 0
3 three 30000 1000 50000
4 four 80000 60000 3000
For showing the scores there are then a dummy table and one table for each kind of score where the username is stored together with the point score and a rank. All the tables look the same but have different names.
id name score rank
They are seperated to allow the users to search and filter the tables. Lets say there is one row with the player "playerX" who has rank 60. So if I filter the score for "playerX" I only see this row, but with rank 60. That means the rank are "hard stored" and not only displayed dynamically via a rownumber or something like that.
The different score tables are filled via a cronjob (and under the use of a addional dummy table) which does the following:
copies the userdata to a dummy table
alters the dummy table by order by score
copies the dummy table to the specific score table so the AI primary key (rank) is automatically filled with the right values, representing the rank for each user.
That means: Wheren there are five specific scores there are also five score tables and the dummy table, making a total of 6.
How to optimize?
What I would like to do is to optimize the whole thing and to drop duplicate tables (and to avoid the dummy table if possible) to store all the score data in one table which has the following cols:
userid, overall_score, overall_rank, trade_score, trade_rank, quest_score, quest_rank
My question is now how I could do this the best way and is there another way as the one shown above (with all the different tables)? MYSQL-Statements and/or php-code is welcome.
Some time ago I tried using row numbers but this doesn't work a) because they can't be used in insert statements and b) because when filtering every player (like 'playerX' in the example above) would be on rank 1 as it's the only row returning.

Well, you can try creating a table with the following configuration:
id | name | score_overall | score_trade | score_quest | overall_rank | trade_rank | quest_rank
If you do that, you can use the following query to populate the table:
SET #overall_rank:=-(SELECT COUNT(id) FROM users);
SET #trade_rank:=#overall_rank;
SET #quest_rank:=#overall_rank;
SELECT *
FROM users u
INNER JOIN (SELECT id,
#overall_rank:=#overall_rank+1 AS overall_rank
FROM users
ORDER BY score_overall DESC) ovr
ON u.id = ovr.id
INNER JOIN (SELECT id,
#trade_rank:=#trade_rank+1 AS trade_rank
FROM users
ORDER BY score_trade DESC) tr
ON u.id = tr.id
INNER JOIN (SELECT id,
#quest_rank:=#quest_rank+1 AS quest_rank
FROM users
ORDER BY score_quest DESC) qr
ON u.id = qr.id
ORDER BY u.id ASC
I've prepared an SQL-fiddle for you.
Although I think performance will weigh in if you start getting a lot of records.
A bit of explanation: the #*_rank things are SQL variables. They get increased with 1 on every new row.

How to improve the performance of MYSQL query with large data?

I am using MySQL tables that have the following data:
users(ID, name, email, create_added) (about 10000 rows)
points(user_id, point) (about 15000 rows)
And my query:
SELECT u.*, SUM(p.point) point
FROM users u
LEFT JOIN points p ON p.user_id = u.ID
WHERE u.id > 0
GROUP BY u.id
ORDER BY point DESC
LIMIT 0, 10
I only get the top 10 users having best point, but then it dies. How can I improve the performance of my query?

Like #Grim said, you can use INNER JOIN instead of LEFT JOIN. However, if you truly look for optimization, I would suggest you to have an extra field at table users with a precalculate point. This solution would beat any query optimization with your current database design.

Swapping the LEFT JOIN for an INNER JOIN would help a lot. Make sure points.point and points.user_id are indexed. I assume you can get rid of the WHERE clause, as u.id will always be more than 0 (although MySQL probably does this for you at the query optimisation stage).

It doesn't really matter than you are getting only 10 rows. MySQL has to sum up the points for every user, before it can sort them ("Using filesort" operation.) That LIMIT is applied last.
A covering index ON points(user_id,point) is going to be the best bet for optimum performance. (I'm really just guessing, without any EXPLAIN output or table definitions.)
The id column in users is likely the primary key, or at least a unique index. So it's likely you already have an index with id as the leading column, or primary key cluster index if it's InnoDB.)
I'd be tempted to test a query like this:
SELECT u.*
, s.total_points
FROM ( SELECT p.user_id
, SUM(p.point) AS total_points
FROM points p
WHERE p.user_id > 0
GROUP BY p.user_id
ORDER BY total_points DESC
LIMIT 10
) s
JOIN user u
ON u.id = s.user_id
ORDER BY s.total_points DESC
That does have the overhead of creating a derived table, but with a suitable index on points, with a leading column of user_id, and including the point column, it's likely that MySQL can optimize the group by using the index, and avoiding one "Using filesort" operation (for the GROUP BY).
There will likely be a "Using filesort" operation on that resultset, to get the rows ordered by total_points. Then get the first 10 rows from that.
With those 10 rows, we can join to the user table to get the corresponding rows.
BUT.. there is one slight difference with this result, if any of the values of user_id that are in the top 10 which aren't in the user table, then this query will return less than 10 rows. (I'd expect there to be a foreign key defined, so that wouldn't happen, but I'm really just guessing without table definitions.)
An EXPLAIN would show the access plan being used by MySQL.

Ever thought about partitioning?
I'm currently working with large database and successfully improve sql query.
For example,
PARTITION BY RANGE (`ID`) (
PARTITION p1 VALUES LESS THAN (100) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (200) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (300) ENGINE = InnoDB,
... and so on..
)
It allows us to get better speed while scanning mysql table. Mysql will scan only partition p 1 that contains userid 1 to 99 even if there are million rows in table.
Check out this http://dev.mysql.com/doc/refman/5.5/en/partitioning.html

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.