What I am trying to do is build a trending algorithm, and I need help with the SQL, as I can't get it to work.
There are three aspects to the algorithm (and I am completely open to ideas for a better trending algorithm):
1. Plays during the last 24h / total plays of the song
2. Plays during the last 7d / total plays of the song
3. Plays during the last 24h / plays of the most-played item over the last 24h (whatever item leads the play count over 24h)
Each aspect is worth 0.33, for a maximum possible value of 1.0.
The third aspect is necessary because newly uploaded items would automatically sit at the top unless there were a way to drop them down.
The table is called aud_plays and the columns are:
PlayID: Just an auto-incrementing ID for the table
AID: The id of the song
IP: IP address of the user listening
time: UNIX timestamp
I have tried a few SQL queries, but I'm pretty stuck and can't get this to work.
In your aud_songs table (the one AID points to), add the following columns:
Last24hrPlays INT -- use BIGINT if you plan on getting billion+
Last7dPlays INT
TotalPlays INT
In your aud_plays table, create an AFTER INSERT trigger that increments aud_songs.TotalPlays:
UPDATE aud_songs SET TotalPlays = TotalPlays + 1 WHERE id = INSERTED.AID
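If the database is actually MySQL rather than SQL Server (the UNIX time column suggests it might be), a minimal sketch of that trigger could look like the following, reusing the AID column from the question and assuming the songs table's key is simply id:

DELIMITER //
CREATE TRIGGER aud_plays_after_insert
AFTER INSERT ON aud_plays
FOR EACH ROW
BEGIN
    -- keep the song's running total in sync with every new play
    UPDATE aud_songs SET TotalPlays = TotalPlays + 1 WHERE id = NEW.AID;
END//
DELIMITER ;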
Calculating your trending score in real time for every request would be taxing on your server, so it's best to run a job that refreshes the data every ~5 minutes. Create a SQL Agent Job (or your database's equivalent scheduled job) to run every X minutes and update Last7dPlays and Last24hrPlays:
UPDATE aud_songs SET Last7dPlays = (SELECT COUNT(*) FROM aud_plays WHERE aud_plays.aid = aud_songs.id AND aud_plays.time BETWEEN GetDate()-7 AND GetDate()),
Last24hrPlays = (SELECT COUNT(*) FROM aud_plays WHERE aud_plays.aid = aud_songs.id AND aud_plays.time BETWEEN GetDate()-1 AND GetDate())
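Since time in the question is a UNIX timestamp, on MySQL that scheduled refresh might look more like this sketch (the job itself could be a cron job or a MySQL EVENT):

UPDATE aud_songs s
SET s.Last7dPlays = (SELECT COUNT(*) FROM aud_plays p
                     WHERE p.AID = s.id AND p.time >= UNIX_TIMESTAMP() - 7*24*60*60),
    s.Last24hrPlays = (SELECT COUNT(*) FROM aud_plays p
                       WHERE p.AID = s.id AND p.time >= UNIX_TIMESTAMP() - 24*60*60);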
I would also recommend removing old records from aud_plays (anything older than 7 days, say, since the TotalPlays trigger keeps the running total).
It should be easy to figure out how to calculate aspects 1 and 2 from the question. Here's the SQL for aspect 3:
SELECT cast(Last24hrPlays as float) / (SELECT MAX(Last24hrPlays) FROM aud_songs) FROM aud_songs WHERE aud_songs.id = #ID
NOTE: I kept the T-SQL fairly generic and unoptimized to illustrate how the process works.
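Putting the three aspects together, the final score could be computed in a single query along these lines (a sketch; NULLIF guards against dividing by zero for songs with no plays yet):

SELECT s.id,
       0.33 * s.Last24hrPlays / NULLIF(s.TotalPlays, 0)
     + 0.33 * s.Last7dPlays / NULLIF(s.TotalPlays, 0)
     + 0.33 * s.Last24hrPlays / NULLIF((SELECT MAX(Last24hrPlays) FROM aud_songs), 0) AS trend_score
FROM aud_songs s
ORDER BY trend_score DESC;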
I have a big table of online records (the onlines table) with more than 40 million rows.
Now I want to show which users were online during a given period, but the query fails on the server.
For example, when I request the users online during the last week, it does not work (because the table has so many records).
This is my example PHP code:
$d = $_GET['date'];
$time = time() - 60 * 60 * 24 * $d;
$phql = "SELECT DISTINCT aid FROM onlines WHERE time > '$time'";
So, do you have any better tips? Thanks.
Use EXPLAIN SELECT ... to see how MySQL executes your query and which indexes it uses. Especially for big tables, make sure the columns you filter on are indexed; in this case, time.
You can create an index by:
CREATE INDEX time_index ON onlines (time);
This should speed up the query. If you do not care about potential data loss or persistence, you might also look into using an in-memory table to avoid I/O. That speeds up queries significantly, but the table will be emptied if the server restarts or MySQL is shut down.
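To confirm the index is actually used, run the query through EXPLAIN again; the literal timestamp below is just a placeholder for the value your PHP computes:

EXPLAIN SELECT DISTINCT aid FROM onlines WHERE time > 1389000000;

The key column of the output should now show time_index, and the estimated number of examined rows should drop far below the table's 40 million.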
I'm using PHP 7, MySQL, and a small custom-built forum, with a query that grabs 7 columns through 2 SQL JOINs for a "latest posts" page. When the time comes that I hit 1 million rows, will LIMIT 30 stop after 30 rows, or will it have to sort the entire table on each run?
The reason I'm asking is that I'm trying to wrap my head around how to paginate this custom forum I've built, and whether that pagination will be OK once it (theoretically) has to read through a million rows.
EDIT: My current query uses LIMIT 30 with a descending sort.
EDIT 2: Currently I'm getting about 500-600 posts a day, give or take 50. It's quickly adding up, so I'm trying to plan ahead before I reach 1 million. That said, I'm only looking up one table right now, tblTopics, for topic_id, topic_name, and topic_author (a FK). Then I do another lookup using the topic's own foreign keys, topic_rating and topic_category. The original lookup is where I have the sort and limit.
Sort is applied to the complete set; limit is applied after the sort. So adding a limit to an ORDER BY query does not, by itself, make it much faster.
It depends.
SELECT ... FROM tbl ORDER BY x LIMIT 30;
INDEX(x)
will probably use the index and stop after 30 rows, not 1 million.
SELECT ... FROM tbl GROUP BY zz ORDER BY x LIMIT 30;
will scan all million rows, do the grouping, write to a tmp table, sort that tmp table, and only then deliver 30 rows.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
INDEX(yy)
will probably prefer INDEX(yy), and it is hard to say how efficient it will be.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
INDEX(yy, x)
will be very efficient -- not only can it use the index for filtering, but also for the ORDER BY and the LIMIT. Only 30 rows will be touched.
SELECT ... FROM tbl LIMIT 30;
is of dubious use. You will get some 30 rows, but who knows which 30? But it will be fast.
Well, this still doesn't answer your question. Your question involves a JOIN. Can you guess how much more complex things become once a JOIN is involved?
If you would like to discuss your specific query, please provide the query and SHOW CREATE TABLE for each table and how many rows in each table.
If you are joining a 1-row table to a million row table, the 1-row table probably does not add any complexity.
If you are joining two million-row tables together without any indexes, then you are looking at a trillion intermediate 'rows' to work with!
Oh, and then you will want the 'second' 30 rows? That adds another dimension of complexity. I could spend a few more paragraphs on what can go wrong with OFFSET.
If this forum is somewhat open-ended, where anyone can post topics and be the originating author, you probably want at a minimum a topics table with a PK ID, name, and author as you have, but also the date added, the most recent post date, and a count of posts against it. Too often people build web sites that want counters all over the place and try to compute them with aggregates on every request. Speaking of the most recent post, hold the ID of the most recent post too, so you don't have to find the max date; then the join can be based on that ID.
Then a secondary table would hold the details associated with a given post.
Then, via a trigger on your detail table for whatever you are posting against, you can update the parent topic row: increment the count, stamp the most recent date with the current time, and set the last-post ID to the ID of the newly created record.
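For example, assuming a hypothetical tblPosts detail table with post_id, topic_id, and created_at columns, and assuming post_count, last_post_at, and last_post_id are the counter columns added to tblTopics, the trigger might look roughly like this (a sketch, not a drop-in):

DELIMITER //
CREATE TRIGGER tblPosts_after_insert
AFTER INSERT ON tblPosts
FOR EACH ROW
BEGIN
    -- bump the counter and remember the newest post on the parent topic row
    UPDATE tblTopics
    SET post_count = post_count + 1,
        last_post_at = NEW.created_at,
        last_post_id = NEW.post_id
    WHERE topic_id = NEW.topic_id;
END//
DELIMITER ;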
So now, joining to get that most recent post is a simple join and not overly complex.
Index your topics table on the most recent post date so you get, for example, the 30 most recently active topics, which is not necessarily the most recent 30 posts (3 busy topics could account for all 30 of those). Get 30 distinct topics, then let the user see the details as they select the topic of interest. Your top-level query never goes against the underlying details, as sketched below.
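The top-of-page query then only touches the topics table, something along these lines (again using the assumed last_post_at and post_count columns; MySQL can scan the index backwards to satisfy the descending order):

CREATE INDEX idx_topics_last_post ON tblTopics (last_post_at);

SELECT topic_id, topic_name, post_count, last_post_id
FROM tblTopics
ORDER BY last_post_at DESC
LIMIT 30;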
Obviously this is brief without the true context of your website, but hopefully these suggestions make sense for you to run with.
I wrote a small script which uses the concept of long polling.
It works as follows:
jQuery sends a request with some parameters (say lastId) to PHP.
PHP gets the latest id from the database and compares it with lastId.
If lastId is smaller than the newly fetched id, the script stops and echoes the new records.
From jQuery, I display this output.
I have taken care of all the security checks. The problem is that when a record is deleted or updated, there is no way to know it.
The nearest solution I can come up with is to count the number of rows and compare it with a saved row-count variable. But then, if I have 1000 records, I have to echo out all 1000 of them, which can be a big performance issue.
The CRUD functionality of this application is completely separate and runs on a different server, so I don't get to know which record was deleted.
I don't need any help coding-wise, but I am looking for suggestions to make this work for updates and deletes.
Please note that websockets (my favourite) and Node.js are not an option for me.
Instead of tracking a certain ID from your table, you could also check when the table itself was last modified.
SQL:
SELECT UPDATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
If successful, the statement should return something like
UPDATE_TIME
2014-04-02 11:12:15
Then use the resulting timestamp instead of the lastId. I am using a very similar technique to display and auto-refresh logs, and it works like a charm.
You have to adjust the statement to your needs, replacing yourdb and yourtable with the values for your application. It also requires access to information_schema.tables, so check whether that is available, too.
Two alternative solutions:
If the solution described above is too imprecise for your purpose (it might lead to issues when the table is changed multiple times per second), you could combine that timestamp with your current lastId mechanism to cover new inserts.
Another way would be to implement a table in which the current state is logged. Your AJAX requests check this state table, and triggers on your data tables keep it updated.
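A minimal sketch of that state-table idea; the names table_state and records are assumptions, and you would add similar triggers for INSERT and UPDATE and for each data table:

CREATE TABLE table_state (
    table_name VARCHAR(64) PRIMARY KEY,
    last_change TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER records_after_delete
AFTER DELETE ON records
FOR EACH ROW
BEGIN
    -- note that something changed in the records table
    INSERT INTO table_state (table_name) VALUES ('records')
    ON DUPLICATE KEY UPDATE last_change = CURRENT_TIMESTAMP;
END//
DELIMITER ;

The polling script then only needs to compare last_change with the timestamp it saw on the previous request.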
You can get the highest ID by
SELECT id FROM yourtable ORDER BY id DESC LIMIT 1
but in my opinion this is not reliable, because you can have IDs of 1, 2, 3, 7 and then insert a new row with the ID 5.
Keep in mind: the highest ID is not necessarily the most recent row.
The current auto increment value can be obtained by
SELECT AUTO_INCREMENT FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
Maybe a timestamp + microtime is an option for you?
What are some of the strategies being used for pagination of data sets that involve complex queries? COUNT(*) takes ~1.5 seconds, so we don't want to hit the DB for every page view. The query currently returns ~45k rows.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes. I should have done this first, before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable to a minimum, and the pagination queries could hammer away at the atbat materialized view. I've simplified the above, leaving out the actual data manipulation, which is unimportant.
Memcache. I then created a wrapper around my database connection to cache these paginated results in memcache. This was a huge performance win. However, it was still not good enough.
Batch generation. I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all of the pages, from the oldest changed record to the most recent, onto the webserver's filesystem. With a bit of mod_rewrite, I could check whether a page existed on disk and serve it up directly. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE COUNT(*): with MyISAM tables, COUNT didn't create a problem once I was able to reduce the read-write contention on the table. If I were using InnoDB, I would create a trigger that updates an adjacent table with the row count. That trigger would just add or subtract 1 depending on INSERT or DELETE statements.
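A sketch of that adjacent counter table for InnoDB; mytable is the table name used above, and mytable_rowcount is an assumed name:

CREATE TABLE mytable_rowcount (total INT UNSIGNED NOT NULL);
INSERT INTO mytable_rowcount SELECT COUNT(*) FROM mytable;

-- single-statement triggers, so no DELIMITER change is needed
CREATE TRIGGER mytable_count_insert AFTER INSERT ON mytable
FOR EACH ROW UPDATE mytable_rowcount SET total = total + 1;

CREATE TRIGGER mytable_count_delete AFTER DELETE ON mytable
FOR EACH ROW UPDATE mytable_rowcount SET total = total - 1;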
RE page-pickers (thumbwheels): when I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch-generating the pages I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation got simpler because it became a predictable filesystem pattern that actually only needed the largest page number. The smallest page number was always 1.
Windowed thumbwheel. The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, as long as you know your maximum number of pages.
My suggestion is to ask MySQL for one row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next-page link.
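In practice that just means requesting 31 rows for a page size of 30; the table and column names here are only placeholders:

SELECT id, title
FROM posts
ORDER BY created_at DESC
LIMIT 31;

If 31 rows come back, display the first 30 and show the next-page link; if 30 or fewer come back, you are on the last page.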
MySQL has a specific mechanism to get the count of a result set without the LIMIT clause: SQL_CALC_FOUND_ROWS combined with FOUND_ROWS().
MySQL is quite good at optimizing LIMIT queries. That means it picks an appropriate join buffer, filesort buffer, etc., just large enough to satisfy the LIMIT clause.
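A sketch of the FOUND_ROWS() mechanism, reusing the mytable, col1 and col2 names from the example below (note that SQL_CALC_FOUND_ROWS is deprecated as of MySQL 8.0.17):

SELECT SQL_CALC_FOUND_ROWS *
FROM mytable
WHERE col1 = :myvalue
  AND col2 = :othervalue
LIMIT 30;

-- number of rows the previous query would have returned without the LIMIT
SELECT FOUND_ROWS();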
Also note that with 45k rows you probably don't need an exact count. Approximate counts can be obtained using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
       (SELECT COUNT(*) FROM mytable) / 1000
FROM (
    SELECT 1
    FROM mytable
    WHERE col1 = :myvalue
      AND col2 = :othervalue
    LIMIT 1000
) AS sample -- the derived table needs an alias in MySQL
This is much more efficient with MyISAM.
If you give an example of your complex query, I can probably say something more definite about how to improve its pagination.
I'm by no means a MySQL expert, but perhaps give up the COUNT(*) and go with COUNT(id)?
I have a website where user ranking is a central feature, but the user count has grown to over 50,000, and looping through all of them to update ranks every 5 minutes is putting a strain on the server. Is there a better method that can update the ranks at least every 5 minutes? It doesn't have to be PHP; it could be something run as a Perl script or similar if that would do the job better (though I'm not sure why it would, I'm just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i = 0;
while ($a = mysql_fetch_array($get_users)) {
    $i++;
    mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than half a second to execute and update all 50,000 rows (make rank an auto-incrementing primary key, as suggested by Tom Haigh).
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");
Make userRanks.rank an auto-incrementing primary key. If you then insert user ids into userRanks in descending rank order, it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.userid;
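For completeness, a sketch of the userRanks table this assumes (TRUNCATE resets AUTO_INCREMENT, so rank starts at 1 on every run):

CREATE TABLE userRanks (
    `rank` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- backticks needed on MySQL 8+, where RANK is a reserved word
    userid INT UNSIGNED NOT NULL
);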
My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes happen in response to some event, and you can localize the changes to a few rows in the database when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming "status = '1'" indicates that a user's rank has changed, so rather than just setting that flag when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.
A simple alternative for bulk update might be something like:
SET @rnk = 0;
UPDATE users
SET month_rank = (@rnk := @rnk + 1)
ORDER BY month_score DESC;
This code uses a user variable (@rnk) that is incremented for each updated row. Because the update is applied over the ordered list of rows, the month_rank column is set to the incremented value for each row.
Updating the users table row by row will be a time-consuming task. It would be better if you could reorganise your query so that row-by-row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000
DECLARE @tmp TABLE
(
    [MonthRank] [INT] IDENTITY(1,1) NOT NULL, -- IDENTITY so the rank is assigned in insert order
    [UserId] [INT] NOT NULL
)

INSERT INTO @tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC

UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM @tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.
Any time you have a loop of any significant size that executes queries inside it, you've very likely got an antipattern. With more info we could look at the schema and processing requirements and see if the whole job can be done without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?
Your problem can be handled in a number of ways, and honestly, more details from your server may point you in a totally different direction. But doing it that way, you are taking 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition. Inserts into a table no one is reading from are probably going to be better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but it might improve your situation. Then again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, MySQL config, database connections, etc.
Possibly you could use shards by time or other category. But read this carefully before...
You can split the rank processing from the update execution. Run through all the data and compute the ranks, adding each UPDATE statement to a cache. When the processing is complete, run the updates. The WHERE portion of each UPDATE should reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from users who were processed before them (if one user's rank affects another's). It also prevents the database from clearing out its table caches because of the SELECTs your processing code does.