Best way to update user rankings without killing the server - php

I have a website where user ranking is a central feature, but the user count has grown to over 50,000, and looping through all of them to update the rank every 5 minutes is putting a strain on the server. Is there a better method that can update the ranks at least every 5 minutes? It doesn't have to be PHP; it could be something run as a Perl script or similar if that would do the job better (though I'm not sure why it would, I'm just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i = 0;
while ($a = mysql_fetch_array($get_users)) {
    $i++;
    mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than half a second to execute and update all 50,000 rows (make userRanks.rank an auto-incrementing primary key, as suggested by Tom Haigh).
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");

Make userRanks.rank an autoincrementing primary key. If you then insert userids into userRanks in descending rank order it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.userid;
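For reference, a hedged sketch of what the userRanks table itself might look like; only the auto-incrementing rank primary key and the userid column are implied by the answer, the exact column types are assumptions:

CREATE TABLE userRanks (
    `rank` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    userid INT UNSIGNED NOT NULL
) ENGINE=InnoDB;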

My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming the "status = '1'" indicates that a user's rank has changed so, rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
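As a hedged illustration of that idea (my reading of it, not necessarily the poster's exact intent): when a single user's month_score changes, that user's rank can be derived on demand from the scores, without touching the other rows. Here $user_id is a placeholder for the affected user, and tied scores share a rank:

SELECT COUNT(*) + 1 AS month_rank
FROM users
WHERE status = '1'
  AND month_score > (SELECT month_score FROM users WHERE id = $user_id);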
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.

A simple alternative for bulk update might be something like:
SET @rnk = 0;
UPDATE users
SET month_rank = (@rnk := @rnk + 1)
ORDER BY month_score DESC;
This uses a MySQL user variable (@rnk) that is incremented for each row. Because the update is applied over the ordered list of rows, the month_rank column is set to the incremented value for each row.

Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000
DECLARE @tmp TABLE
(
    [MonthRank] [INT] IDENTITY(1,1) NOT NULL,  -- IDENTITY so insertion order assigns the rank
    [UserId] [INT] NOT NULL
)

INSERT INTO @tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC

UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM @tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.

Any time you have a loop of any significant size that executes queries inside it, you've very likely got an antipattern. With more information about the schema and the processing requirement, we could see whether the whole job can be done without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?

Your problem can be handled in a number of ways. Honestly, more details from your server may point you in a totally different direction, but doing it that way you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition; inserts into a table no one is reading from are probably going to fare better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but might improve your situation. But again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, mysql config, database connections, etc.

Possibly you could shard by time or some other category, but read up carefully on the trade-offs of sharding before going down that path...

You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.
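A rough PHP sketch of that split, assuming the tables and columns from the question and the same old mysql_* API the question uses; this is an illustration, not the poster's code:

// Pass 1: do all the reads and build the list of updates in memory.
$updates = array();
$i = 0;
$res = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
while ($row = mysql_fetch_assoc($res)) {
    $i++;
    $updates[] = "UPDATE users SET month_rank = " . $i . " WHERE id = " . (int)$row['id'];
}

// Pass 2: only now run the cached UPDATE statements.
foreach ($updates as $sql) {
    mysql_query($sql);
}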

Related

Select query takes too long

These 2 queries take too long to produce a result (sometimes 1 minute, or they sometimes even end in an error) and put a really heavy load on the server:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
The table has about 4 million rows.
The table structure screenshot is omitted here; the only index I have is on traffic_id.
Selecting anything from the traffic_stats table takes forever; inserting into it, however, is normal.
Is it possible to reduce the time spent on executing this query? I use PDO and I am new to all this.
ORDER BY takes a lot of time, and since you only need aggregate data (summing and counting are commutative), the ORDER BY does a lot of useless sorting, costing you time and server power.
You will need to make sure your indexing is right; you will probably need an index on userid and one on (userid, created).
Is userid numeric? If not, you might consider converting it to a numeric type, INT for example.
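A hedged sketch of those index additions; the literal table name traffic_stats is an assumption, since the question hides the real name behind a PREFIX constant:

ALTER TABLE traffic_stats ADD INDEX idx_userid (userid);
ALTER TABLE traffic_stats ADD INDEX idx_userid_created (userid, created);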
These suggestions improve your query and structure, but let's improve the concept as well. Are insertions and modifications very frequent? Do you absolutely need real-time data, or can you get by with quasi-realtime data?
If insertions/modifications are not very frequent, or older data is acceptable, or the problem is causing huge trouble, then you could run a cron job periodically to calculate these values and cache them. The application would then read them from the cache.
I'm not sure why you accepted an answer, when you really didn't get to the heart of your problem.
I also want to clarify that this is a mysql question, and the fact that you are using PDO or PHP for that matter is not important.
People advised you to utilize EXPLAIN. I would go one further and tell you that you need to use EXPLAIN EXTENDED possibly with the format=json option to get a full picture of what is going on. Looking at your screen shot of the explain, what should jump out at you is that the query looked at over 1m rows to get an answer. This is why your queries are taking so long!
At the end of the day, if you have properly indexed your tables, your goal should be in a large table like this, to have number of rows examined be fairly close to the final result set.
So let's look at the 2nd query, which is quite simple:
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
In this case the only thing that is really important is that you have an index on traffic_stats.userid.
I would recommend that, if you are uncertain at this point, you drop all indexes other than the original primary key (traffic_id) index and start with only an index on the userid column. Run your query. What is the result, and how long does it take? Look at the EXPLAIN EXTENDED output. Given the simplicity of the query, you should see that only the index is being used and the rows examined should match the result.
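For example, a sketch of that check; the literal table name and the userid value here are placeholders:

EXPLAIN EXTENDED
SELECT COUNT(userid) AS total_clicks
FROM traffic_stats
WHERE userid = 123;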
Now to your first query:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
Looking at the WHERE clause there are these criteria:
userid =
from_unixtime(created) > CURRENT_DATE
You already have an index on userid. Despite the advice given previously, it is not necessarily correct to have an index on (userid, created), and in your case it is of no value whatsoever.
The reason for this is that you are utilizing a mysql function from_unixtime(created) to transform the raw value of the created column.
Whenever you do this, an index on that column can't be used. If created were a native TIMESTAMP column you would have no concerns comparing it with CURRENT_DATE, but in this case, to handle the mismatch, you simply need to convert CURRENT_DATE rather than the created column.
You can do this by passing CURRENT_DATE as a parameter to UNIX_TIMESTAMP.
mysql> select UNIX_TIMESTAMP(), UNIX_TIMESTAMP(CURRENT_DATE);
+------------------+------------------------------+
| UNIX_TIMESTAMP() | UNIX_TIMESTAMP(CURRENT_DATE) |
+------------------+------------------------------+
|       1490059767 |                   1490054400 |
+------------------+------------------------------+
1 row in set (0.00 sec)
As you can see from this quick example, UNIX_TIMESTAMP() by itself is the current time, but UNIX_TIMESTAMP(CURRENT_DATE) is essentially the start of the day, which is apparently what you are looking for.
I'm willing to bet that the number of rows for the current date is going to be smaller than the total rows for a user over the history of the system, so this is why you would not want an index on (userid, created) as previously advised in the accepted answer. You might benefit from an index on (created, userid).
My advice would be to start with an individual index on each of the columns separately.
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND created > UNIX_TIMESTAMP(CURRENT_DATE)", $user->data->userid)
And with your re-written query, again assuming that the result set is relatively small, you should see a clean EXPLAIN with rows matching your final result set.
As for whether or not you should apply an ORDER BY, this shouldn't be something you eliminate for performance reasons, but rather because it isn't relevant to your desired result. If you need or want the results ordered by user, then leave it. Unless you are producing a large result set, it shouldn't be a major problem.
In the case of that particular query, since you are doing a SUM(), there is no value in ordering the data, because you are only going to get one row back, so in that case I agree with Lajos. But there are many times when you might be using a GROUP BY, and in that case you might want the final results ordered.

very slow search and update database operation

i have a table "table1" which has almost 400,000 records. There is another table "table2" which has around 450,000 records.
I need to delete all the rows in table1 which are duplicate in table2. I been trying to do it with php and the script was running for hours and not completed yet. Does it really takes that much time?
The field asin is a VARCHAR(20) in table1.
The field ASIN is indexed and is a CHAR(10) in table2.
$duplicat = 0;
$sql = "SELECT asin from asins";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
    while ($row = $result->fetch_assoc()) {
        $ASIN = $row['asin'];
        $sql2 = "select id from asins_chukh where ASIN='$ASIN' limit 1";
        $result2 = $conn->query($sql2);
        if ($result2->num_rows > 0) {
            $duplicat++;
            $sql3 = "UPDATE `asins` SET `duplicate` = '1' WHERE `asins`.`asin` = '$ASIN';";
            $result3 = $conn->query($sql3);
            if ($result3) {
                echo "duplicate = $ASIN <br/>";
            }
        }
    }
}
echo "totaal :$duplicat";
You can run one single SQL command instead of a loop, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id);
Warning! I didn't test the SQL above, so you may need to verify the syntax.
For this kind of database operation, using PHP to loop and join is never a good idea. Most of the time will be wasted on network data transfer between your PHP server and your MySQL server.
If even the above SQL takes too long, you can consider limiting the query to some range, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id
and t2.id > [range_start] and t2.id < [range_end] );
This way, you can kick off several updates running in parallel.
Yes, processing RBAR (Row By Agonizing Row) is going to be slow. There is overhead associated with each of those individual SELECT and UPDATE statements that get executed... sending the SQL text to the database, parsing the tokens for valid syntax (keywords, commas, expressions), validating the semantics (table references and column references valid, user has required privileges, etc.), evaluating possible execution plans (index range scan, full index scan, full table scan), converting the selected execution plan into executable code, executing the query plan (obtaining locks, accessing rows, generating rollback, writing to the innodb and mysql binary logs, etc.), and returning the results.
All of that takes time. For a statement or two, the time isn't that noticeable, but put thousands of executions into a tight loop, and it's like watching individual grains of sand falling in an hour glass.
MySQL, like most relational databases, is designed to efficiently operate on sets of data. Give the database work to do, and let the database crank, rather than spend time round tripping back and forth to the database.
It's like you've got a thousand tiny items to deliver, all to the same address. You can individually handle each item. Get a box, put the item into the box with a packing slip, seal the package, address the package, weigh the package and determine postage, affix postage, and then put it into the car, drive to the post office, drop the package off. Then drive back, and handle the next item, put it into a box, ... over and over and over.
Or, we could handle a lot of tiny items together, as a larger package, and reduce the amount of overhead work (time) packaging and round trips to and from the post office.
For one thing, there's really no need to run a separate SELECT statement, to find out if we need to do an UPDATE. We could just run the UPDATE. If there are no rows to be updated, the query will return an "affected rows" count of 0.
(Running the separate SELECT is like making another round trip in the car to the post office, to check the list of packages that need to be delivered, before each round trip to the post office to drop off a package. Instead of two round trips, we can take the package with us on the first trip.)
So, that could improve things a bit. But it doesn't really get to the root of the performance problem.
The real performance boost comes from getting more work done in fewer SQL statements.
How would we identify ALL of the rows that need to be updated?
SELECT t.asin
FROM asins t
JOIN asins_chukh s
ON s.asin = t.asin
WHERE NOT ( t.duplicate <=> '1' )
(If asin isn't unique, we need to tweak the query a bit, to avoid returning "duplicate" rows. The point is, we can write a single SELECT statement that identifies all of the rows that need to be updated.)
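For instance, one hedged way to do that tweak is to collapse the repeats with DISTINCT:

SELECT DISTINCT t.asin
FROM asins t
JOIN asins_chukh s
  ON s.asin = t.asin
WHERE NOT ( t.duplicate <=> '1' )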
For non-trivial tables, for performance, we need to have suitable indexes available. In this case, we'd want indexes with a leading column of asin. If such an index doesn't exist, for example...
... ON asins_chukh (asin)
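Written out in full, with a hypothetical index name, that would be something like:

CREATE INDEX asins_chukh_ix_asin ON asins_chukh (asin);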
If that query doesn't return a huge number of rows, we can handle the UPDATE in one fell swoop:
UPDATE asins t
JOIN asins_chukh s
ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
We need to be careful about the number of rows. We want to avoid holding blocking locks for a long time (impacting concurrent processes that may be accessing the asins table), and we want to avoid generating a huge amount of rollback.
We can break the work up into more manageable chunks.
(Referring back to the shipping tiny items analogy... if we have millions of tiny items, and putting all of those into a single shipment would create a package larger and heaver than a container ship container... we can break the shipment into manageably sized boxes.)
For example, we could handle the UPDATE in "batches" of 10,000 id values (assuming id is a unique (or nearly unique), is the leading column in the cluster key, and the id values are grouped fairly well into mostly contiguous ranges, we can get the update activity localized into one section of blocks, and not have to revist most of those same blocks again...
The WHERE clause could be something like this:
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 0
AND t.id < 0 + 10000
For the next batch...
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 10000
AND t.id < 10000 + 10000
Then
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 20000
AND t.id < 20000 + 10000
And so on, repeating that until we're past the maximum id value. (We could run a SELECT MAX(id) FROM asins as the first step, before the loop.)
(We want to test these statements as SELECT statements first, before we convert to an UPDATE.)
Using the id column might not be the most appropriate way to create our batches.
Our objective is to create manageable "chunks" we can put into a loop, where the chunks don't overlap the same database blocks... we won't need to revisit the same block over and over, with multiple statements, to make changes to rows within the same block multiple times.
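Putting those pieces together, here is a rough PHP sketch of the batching loop; the batch size, the use of the $conn mysqli handle from the question, and the asins.id column are assumptions, and as noted above the statements should be tried as SELECTs first:

// Find the highest id so we know when to stop.
$row = $conn->query("SELECT MAX(id) AS max_id FROM asins")->fetch_assoc();
$maxId = (int)$row['max_id'];
$batch = 10000;

// Flag duplicates in manageable id ranges rather than one huge statement.
for ($start = 0; $start <= $maxId; $start += $batch) {
    $end = $start + $batch;
    $conn->query(
        "UPDATE asins t
           JOIN asins_chukh s ON s.asin = t.asin
            SET t.duplicate = '1'
          WHERE NOT ( t.duplicate <=> '1' )
            AND t.id >= $start
            AND t.id < $end"
    );
}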

Long polling with PHP and jQuery - issue with update and delete

I wrote a small script which uses the concept of long polling.
It works as follows:
1. jQuery sends the request with some parameters (say lastId) to PHP.
2. PHP gets the latest id from the database and compares it with lastId.
3. If lastId is smaller than the newly fetched id, it kills the script and echoes the new records.
4. From jQuery, I display this output.
I have taken care of all security checks. The problem is that when a record is deleted or updated, there is no way to know about it.
The nearest solution I can come up with is to count the number of rows and match it against some saved row-count variable. But then, if I have 1000 records, I have to echo out all 1000 records, which can be a big performance issue.
The CRUD functionality of this application is completely separated and runs on a different server, so I don't get to know which record was deleted.
I don't need any help coding-wise, but I am looking for suggestions to make this work for updates and deletes.
Please note, WebSockets (my favourite) and Node.js are not an option for me.
Instead of using a certain ID from your table, you could also check when the table itself was modified the last time.
SQL:
SELECT UPDATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
If successful, the statement should return something like
UPDATE_TIME
2014-04-02 11:12:15
Then use the resulting timestamp instead of the lastid. I am using a very similar technique to display and auto-refresh logs, works like a charm.
You have to adjust the statement to your needs, and replace yourdb and yourtable with the values needed for your application. It also requires you to have access to information_schema.tables, so check if this is available, too.
Two alternative solutions:
If the solution described above is too imprecise for your purpose (it might lead to issues when the table is changed multiple times per second), you might combine that timestamp with your current mechanism with lastid to cover new inserts.
Another way would be to implement a table in which the current state is logged; this is where your AJAX requests check the current state. Then generate triggers on your data tables which update this table.
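A minimal sketch of that idea, with hypothetical table and trigger names (you would add matching INSERT and DELETE triggers as well):

CREATE TABLE table_state (
    table_name   VARCHAR(64) NOT NULL PRIMARY KEY,
    last_changed DATETIME NOT NULL
);

CREATE TRIGGER yourtable_after_update
AFTER UPDATE ON yourtable
FOR EACH ROW
    REPLACE INTO table_state (table_name, last_changed) VALUES ('yourtable', NOW());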
You can get the highest ID by
SELECT id FROM table ORDER BY id DESC LIMIT 1
but this is not reliable in my opinion, because you can have IDs of 1, 2, 3, 7 and then insert a new row that gets the ID 5.
Keep in mind: the highest ID, is not necessarily the most recent row.
The current auto increment value can be obtained by
SELECT AUTO_INCREMENT FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
Maybe a timestamp + microtime is an option for you?

GROUP BY and ORDER BY too slow. How to make faster?

I'm trying to create some stats for my table, but it has over 3 million rows so it is really slow.
I'm trying to find the most popular value in the name column and also show how many times it pops up.
I'm using this at the moment, but it doesn't work because it's too slow and I just get errors.
$total = mysql_query("SELECT `name`, COUNT(*) as b FROM `people` GROUP BY `name` ORDER BY `b` DESC LIMIT 0,5;")or die(mysql_error());
As you can see, I'm trying to get all the names and how many times each has been used, but only show the top 5 to hopefully speed it up.
I would then like to get the values like this:
while($row = mysql_fetch_array($result)){
echo $row['name'].': '.$row['b']."\r\n";
}
And it should show something like this:
Bob: 215
Steve: 120
Sophie: 118
RandomGuy: 50
RandomGirl: 50
I don't care much about the ordering of names afterwards, like RandomGirl and RandomGuy being the wrong way round.
I think I've provided enough information. :) I would like the names to be case-insensitive if possible, though: Bob should be the same as BoB, bOb, BOB and so on.
Thank-you for your time
Paul
Limiting the results to the top 5 won't give you a lot of speed-up; you'll save time in result retrieval, but on the MySQL side the whole table still needs to be scanned (to count).
You will speed up your count query by having an index on the name column, as then only the index needs to be scanned and not the table.
Now if you really want to speed up the result and avoid scanning the name index each time you need it (which will still be quite slow if you really have millions of rows), then the only other solution is computing the stats when inserting, deleting or updating rows in this table, that is, using triggers on this table to maintain a statistics table alongside it. Then you only need a simple SELECT on this statistics table, reading just 5 rows. But you will slow down your insert, delete and update operations (which are already quite slow, especially if you maintain indexes), so if the stats are important you should study this solution.
Do you have an index on name? It might help.
Since you are doing the counting/grouping and then sorting, an index on name doesn't help at all; MySQL has to go through all the rows every time, and there is no way to optimize this. You need to have a separate stats table like this:
CREATE TABLE name_stats( name VARCHAR(n), cnt INT, UNIQUE( name ), INDEX( cnt ) )
and you should update this table whenever you add a new row to 'people' table like this:
INSERT INTO name_stats VALUES( 'Bob', 1 ) ON DUPLICATE KEY UPDATE cnt = cnt + 1;
Querying this table for the list of top names should give you the results instantaneously.
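For example, the top-5 lookup then becomes a cheap indexed read:

SELECT name, cnt FROM name_stats ORDER BY cnt DESC LIMIT 5;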

mysql/php - How to implementing a unique view system?

Introduction
I am wondering what the best way is to implement a unique-views system... I think I know how, so let me explain how I plan to do it and you can point out any mistakes or improvements.
Obviously I will have to keep a log table containing a video id and something which (relatively) uniquely identifies the user. At first I considered a combination of request headers and IP, but decided to keep it simple and use just the IP. That way a user also cannot increase the views of their own video by using a different browser.
This is how I would think to do it:
When a user visits, I do a SELECT similar to this:
SELECT 1 FROM tbl_log WHERE IP = $usersip AND video_id = $video_id
If there is no result, then I must insert a record:
INSERT INTO tbl_log (IP, video_id) VALUES ($usersip, $video_id)
and increase the views by 1:
SELECT views FROM tbl_video WHERE video_id = $video_id
UPDATE tbl_video SET views = $result['views'] + 1 WHERE video_id = $video_id
Questions
1. I guess I do not want to have millions of log records slowing down my site, so should I run a cron job to empty the log table once a day?
2. Should I make the views transactional? (I guess a slightly depreciated view count is less important than a slow site because of row locks.)
3. Is there a way to reduce the load on the MySQL server? I fear that if every view requires an increased view count and an IP log it will be pretty expensive. I have seen that YouTube and the like do not update the views instantly... do they cache the updates somehow and then run them at once? If so, how?
4. How efficient is my system? Can you think of any improvements?
Here are some ideas for improvements you can make.
Set a primary key on tbl_log to be IP + video_id. Then you can simply do a
REPLACE INTO tbl_log (IP,video_id) VALUES ($usersip, $video_id)
(Be sure to escape those php variables to avoid SQL injection.)
Now you're updating your log table with only one query. Next, you can update the views field in tbl_video periodically with something like:
UPDATE tbl_video SET views = (select count(*) from tbl_log where video_id = $video_id) where video_id = $video_id
You can do that with a cron job, or you can add a 'last_count_update' field and update the video when it is accessed if the last count time is older than 2 hours or whatever. This will be a little less work if you have a bunch of videos that aren't visited often.
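A hedged sketch of that on-access variant, assuming a last_count_update DATETIME column has been added to tbl_video and that $video_id is properly escaped or bound:

UPDATE tbl_video
   SET views = (SELECT COUNT(*) FROM tbl_log WHERE video_id = $video_id),
       last_count_update = NOW()
 WHERE video_id = $video_id
   AND last_count_update < NOW() - INTERVAL 2 HOUR;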
I guess I do not want to have millions of log records slowing down my site so should I run a cron job to empty the log table once a day?
Consider using mysql's ON DUPLICATE KEY UPDATE syntax to avoid using a SELECT which would have an expensive WHERE clause. If your log table also had a timestamp column, you could refresh that value.
INSERT into tbl_log (IP,video_id) VALUES ($usersip, $video_id) ON DUPLICATE KEY UPDATE time_recorded = now();
This would require you to have a UNIQUE constraint on the IP and video_id columns.
Should I make the views transactional? (I guess a slightly depreciated view count is less important than a slow site because of row locks)
No, because you can achieve this with a single UPDATE query.
UPDATE tbl_video SET views = views + 1 WHERE video_id = $video_id
Is there a way to reduce the load on the mysql server.... I fear if every view requires an increased view count and an IP log that it will be pretty expensive. I have seen that youtube and the like do not update the views instantly... do they cache the updates some how and then run them at once? if so how?
It's not too bad - there's really no other way to reliably capture record-view data. In the case of YouTube, it's more likely delayed writes or replication causing the delay you notice, since they have hundreds of servers (although it's possible they are caching the value as well).
How efficient is my system? Can you think of any improvements?
Other than what I mentioned here already, not off the top of my head.
