GROUP BY and ORDER BY too slow. How to make faster? - php

I've trying to create some stats for my table but it has over 3 million rows so it is really slow.
I'm trying to find the most popular value for column name and also showing how many times it pops up.
I'm using this at the momment but it doesn't work cause its too slow and I just get errors.
$total = mysql_query("SELECT `name`, COUNT(*) as b FROM `people` GROUP BY `name` ORDER BY `b` DESC LIMIT 0,5;")or die(mysql_error());
As you may see I'm trying to get all the names and how many times that name has been used but only show the top 5 to hopefully speed it up.
I would like to be able to then do get the values like
while($row = mysql_fetch_array($result)){
echo $row['name'].': '.$row['b']."\r\n";
}
And it will show things like this;
Bob: 215
Steve: 120
Sophie: 118
RandomGuy: 50
RandomGirl: 50
I don't care much about ordering the names afterwards like RandomGirl and RandomGuy been the wrong way round.
I think I've have provided enough information. :) I would like the names to be case-insensitive if possible though. Bob should be the same as BoB, bOb, BOB and so on.
Thank-you for your time
Paul

Limiting results on the top 5 won't give you a lot of speed-up, you'll gain time in the result retrieval, but in mySQL side the whole table still needs to be parsed (to count).
You will speed-up your count query having index on name column, of course as only the index will be parsed and not the table.
Now if you really want to speed up the result and avoid parsing the name index when you need this result (which will still be quite slow if you really have millions of rows), then the only other solution is computing the stats when inserting, deleting or updating rows on this table. That is using triggers on this table to maintain a statistics table near this one. Then you will really only have a simple select query on this statistics table, with only 5 rows parsed. But you will slow down your inserts, delete and update operations (which are already quite slow, especially if you maintain indexes, so if the stats are important you should study this solution).

Do you have an index on name? It might help.

Since you are doing the counting/grouping and then sorting an index on name doesn't help at all MySql should go through all rows every time, there is no way to optimize this. You need to have a separate stats table like this:
CREATE TABLE name_stats( name VARCHAR(n), cnt INT, UNIQUE( name ), INDEX( cnt ) )
and you should update this table whenever you add a new row to 'people' table like this:
INSERT INTO name_stats VALUES( 'Bob', 1 ) ON DUPLICATE KEY UPDATE cnt = cnt + 1;
Querying this table for the list of top names should give you the results instantaneously.

Related

After reaching a large MySQL DB (say 1 mil rows) does providing LIMIT help optimizations?

I'm using PHP 7, MySQL and a small custom-built forum and a query for grabbing 7 columns with 2 SQL join statements into a "latest post" page. When the time comes that I hit 1 million rows will the limit 30 stop at 30 rows or will it have to sort the entire DB each run?
The reason I'm asking is I'm trying to wrap my head around how to paginate this custom forum I've built and if that pagination will be "ok" once it has to (theoretically) read through a million rows?
EDIT: My current query is a limit 30, sort desc.
EDIT2: Currently I'm getting about 500-600 posts give or take 50 a day. It's quickly adding up so I'm trying to monitor this before I get 1 million. That being said I'm only looking up one table right now, tblTopics and topic_id, topic_name, and topic_author (a fk). Then I'm doing another another lookup after that with the topic itself's foreign keys, topic_rating, and topic_category. The original lookup is where I have the sort and limit.
Sort is applied on the complete set, limit is applied after the sort, so adding a limit to an ORDER BY query does not make it a lot faster.
It depends.
SELECT ... FROM tbl ORDER BY x LIMIT 30;
INDEX(x)
will probably use the index and stop after 30 rows, not 1 million.
SELECT ... FROM tbl GROUP BY zz ORDER BY x LIMIT 30;
will scan all million rows, do the grouping, write to a tmp table, sort that tmp table, and only then deliver 30 rows.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
INDEX(yy)
will probably prefer INDEX(yy), and it is hard to say how efficient it will be.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
INDEX(yy, x)
will be very efficient -- not only can it use the index for filtering, but also for the ORDER BY and the LIMIT. Only 30 rows will be touched.
SELECT ... FROM tbl LIMIT 30;
is of dubious use. You will get some 30 rows, but who knows which 30? But it will be fast.
Well, this is still not answering you question. Your question involves a JOIN. Can you guess how much more complex the question becomes with JOIN involved?
If you would like to discuss your specific query, please provide the query and SHOW CREATE TABLE for each table and how many rows in each table.
If you are joining a 1-row table to a million row table, the 1-row table probably does not add any complexity.
If you are joining two million-row tables together without any indexes, then you are looking at a trillion intermediate 'rows' to work with!
Oh, and then you will want the 'second' 30 rows? That adds another dimension of complexity. I could spend a few more paragraphs on what can go wrong with OFFSET.
If this forum is somewhat open-ended where anyone can post "topics" and be the originating author, you probably want at a minimum a topics table with a PKID, Name, Author as you have, but also date added and most recent post and also count of posts against it. Too many times people build web sites that want counters all over the place and try to do aggregates, or the most recent, etc. Come to mention the most recent post, hold the ID of the most recent post too so you don't have to find the max date, then get the join base on that.
Then secondary table would be the details associated for a given post.
Then, via a trigger on your detail table for whatever you are posting against, you can do an update to the parent topic id and stamp it with count +1, most recent date of now, and the last ID with the ID of the newest record just created.
So now, joining to get that most recent context entry is a simple join and not overly complex.
Index on your topics table on the most recent post date so you are now getting ex: the most recent 30 topics, not necessarily the most recent 30 posts, such as 3 posts have a bunch of hits and account for all 30. Get 30 distinct topics, then let user see the details as they select the topic of interest. Your query at the top level is never going against the underlying details.
Obviously brief on true context of your website, but hopefully suggestions make sense for you to run with.

Count content views in Php, Mysql

Hi, I am creating a simple news website and I need to count news views. Currently I have 25000 rows and 25 columns. The hits count increases per page reload like Joomla. How should I structure the tables?
I have 2 approaches to this issue:
Create column named hits in the content table.
Create a new table that has 2 columns: content_id and hits.
I used the first approach and I think that slows my site.Will the second approaches perform better than the first one? Is there a better approach?
Well I don't know what's your logic in MySQL or PHP or what is you current table structure for news but I would suggest you to use Stored Procedure in MySQL as
Begin
update tblnews set hits = hits+1;
select news from tblnews;
End
and off course use PHP PDO Prepared Statement for performance
and if you are trying to get last 10 news or something like that then must set Indexing for content_id say like Primary key with auto_increment for better retrieval of query otherwise don't even use content_id column. I don't think there should be any hard structre for table. This would definitely increase performance and more than 100000000 rows would not make any big difference I hope. I don't think there would be any other better solution because these 2 queries needs to be performed at every page view.
Option 1 sounds the best. Option 2 is redundant because your just storing the hits. Again you would a join query to pull the hits

Optimal mySQL table index structure for faster SELECT of a large range of daily data

I am wondering the best format to lay out my data in a mySQL table so that it can be queried in the fastest manner to gather an array of daily values to be further utilized by php.
So far, I have laid out the table as such:
item_id price_date price_amount
1 2000-03-01 22.4
2 2000-03-01 19.23
3 2000-03-01 13.4
4 2000-03-01 14.95
1 2000-03-02 22.5
2 2000-03-02 19.42
3 2000-03-02 13.4
4 2000-03-02 13.95
with item_id defined as an index.
Also, I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
This second "SELECT" statement is actually within a loop where I am selecting/storing-in-array the daily prices of each item_id requested.
All is currently functioning and pulling the data from mySQL properly, however, both the above listed "SELECT" statements are taking approx 4-5 seconds to complete per each run, and when looping through 100+ products to create a summary, adds up to a very inefficient/slow information system.
Is there any more-efficient way that I could structure the mySQL table and/or SELECT statements to retrieve the results faster? Perhaps defining a different index on a different column? I have used the EXPLAIN command to return information per the queries but am unsure how to use the EXPLAIN information to increase the efficiency of my queries.
Thanks in advance for any mySQL wizards that may be able to assist.
Single column index
I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
This query can be executed more efficiently if you create an index for the price_date column:
ALTER TABLE table_name ADD INDEX price_idx (price_date);
Mutiple column index
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
For the second query, you should create an index covering both the item_id and price_date column:
ALTER TABLE table_name ADD INDEX item_price_idx (item_id, price_date);
I know this is a bit late, but i stumbled across this and thought I would throw my thoughts into the mix.
Indexes used well are very helpful in speeding up queries (Explain shows some really godd results around which indexes are being chosen - if any - for a specific query). However efficient PHP will help even more.
In your case you do not show the PHP, but it looks like you offer a list of dates and then loop through finding all the items in that date to get the prices. It would be more efficient to do something like the following:
Select item_id, price_amount from table_name where price_date= order by item_id, price_amount
with an index (preferably a Unique Index) on price_date,item_id,price_amount
You then have a single loop through the resultant SQL not a loop with multiple SQL connections (this is especially true if your SQL server is separate from the PHP box as an external network connection can have an overhead).
4-5 seconds for a single query though is very slow )by a factor of at least 100x) so it would indicate a problem (very large table with no key to use) or disk issues (potentially).

CREATE VIEW for MYSQL for last 30 days

I know i am writing query's wrong and when we get a lot of traffic, our database gets hit HARD and the page slows to a grind...
I think I need to write queries based on CREATE VIEW from the last 30 days from the CURDATE ?? But not sure where to begin or if this will be MORE efficient query for the database?
Anyways, here is a sample query I have written..
$query_Recordset6 = "SELECT `date`, title, category, url, comments
FROM cute_news
WHERE category LIKE '%45%'
ORDER BY `date` DESC";
Any help or suggestions would be great! I have about 11 queries like this, but I am confident if I could get help on one of these, then I can implement them to the rest!!
Putting a wildcard on the left side of a value comparison:
LIKE '%xyz'
...means that an index can not be used, even if one exists. Might want to consider using Full Text Searching (FTS), which means adding full text indexing.
Normalizing the data would be another step to consider - categories should likely be in a separate table.
SELECT `date`, title, category, url, comments
FROM cute_news
WHERE category LIKE '%45%'
ORDER BY `date` DESC
The LIKE '%45%' means a full table scan will need to be performed. Are you perhaps storing a list of categories in the column? If so creating a new table storing category and news_article_id will allow an index to be used to retrieve the matching records much more efficiently.
OK, time for psychic debugging.
In my mind's eye, I see that query performance would be improved considerably through database normalization, specifically by splitting the category multi-valued column into a a separate table that has two columns: the primary key for cute_news and the category ID.
This would also allow you to directly link said table to the categories table without having to parse it first.
Or, as Chris Date said: "Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else)."
Anything with LIKE '%XXX%' is going to be slow. Its a slow operation.
For something like categories, you might want to separate categories out into another table and use a foreign key in the cute_news table. That way you can have category_id, and use that in the query which will be MUCH faster.
Also, I'm not quite sure why you're talking about using CREATE VIEW. Views will not really help you for speed. Not unless its a materialized view, which MySQL doesn't suppose natively.
If your database is getting hit hard, the solution isn't to make a view (the view is still basically the same amount of work for the database to do), the solution is to cache the results.
This is especially applicable since, from what it sounds like, your data only needs to be refreshed once every 30 days.
I'd guess that your category column is a list of category values like "12,34,45,78" ?
This is not good relational database design. One reason it's not good is as you've discovered: it's incredibly slow to search for a substring that might appear in the middle of that list.
Some people have suggested using fulltext search instead of the LIKE predicate with wildcards, but in this case it's simpler to create another table so you can list one category value per row, with a reference back to your cute_news table:
CREATE TABLE cute_news_category (
news_id INT NOT NULL,
category INT NOT NULL,
PRIMARY KEY (news_id, category),
FOREIGN KEY (news_id) REFERENCES cute_news(news_id)
) ENGINE=InnoDB;
Then you can query and it'll go a lot faster:
SELECT n.`date`, n.title, c.category, n.url, n.comments
FROM cute_news n
JOIN cute_news_category c ON (n.news_id = c.news_id)
WHERE c.category = 45
ORDER BY n.`date` DESC
Any answer is a guess, show:
- the relevant SHOW CREATE TABLE outputs
- the EXPLAIN output from your common queries.
And Bill Karwin's comment certainly applies.
After all this & optimizing, sampling the data into a table with only the last 30 days could still be desired, in which case you're better of running a daily cronjob to do just that.

Best way to update user rankings without killing the server

I have a website that has user ranking as a central part, but the user count has grown to over 50,000 and it is putting a strain on the server to loop through all of those to update the rank every 5 minutes. Is there a better method that can be used to easily update the ranks at least every 5 minutes? It doesn't have to be with php, it could be something that is run like a perl script or something if something like that would be able to do the job better (though I'm not sure why that would be, just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i=0;
while ($a = mysql_fetch_array($get_users)) {
$i++;
mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than 1/2 of a second to execute and update all 50,000 rows (make rank the primary key as suggested by Tom Haigh).
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");
Make userRanks.rank an autoincrementing primary key. If you then insert userids into userRanks in descending rank order it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id;
My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming the "status = '1'" indicates that a user's rank has changed so, rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.
A simple alternative for bulk update might be something like:
set #rnk = 0;
update users
set month_rank = (#rnk := #rnk + 1)
order by month_score DESC
This code uses a local variable (#rnk) that is incremented on each update. Because the update is done over the ordered list of rows, the month_rank column will be set to the incremented value for each row.
Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000
DECLARE #tmp TABLE
(
[MonthRank] [INT] NOT NULL,
[UserId] [INT] NOT NULL,
)
INSERT INTO #tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC
UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM #tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.
Any time you have a loop of any significant size that executes queries inside, you've got a very likely antipattern. We could look at the schema and processing requirement with more info, and see if we can do the whole job without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?
Your problem can be handled in a number of ways. Honestly more details from your server may point you in a totally different direction. But doing it that way you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition. Inserts into a table no one is reading from are probably going to be better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but might improve your situation. But again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, mysql config, database connections, etc.
Possibly you could use shards by time or other category. But read this carefully before...
You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.

Categories