Select query takes too long - php

These 2 querys take too long to produce a result (sometimes 1 min or even sometime end up on some error) and put really heavy load on the server:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
The table has about 4 million rows.
This is the table structure:
I have one index on traffic_id:
If you select anything from traffic_stats table it will take forever, however inserting to this table is normal.
Is it possible to reduce the time spent on executing this query? I use PDO and I am new to all this.

ORDER BY will take a lot of time and since you only need aggregate data (adding numbers or counting numbers is commutative), the ORDER BY will do a lot of useless sorting, costing you time and server power.
You will need to make sure that your indexing is right, you will probably need an index for user_id and for (user_id, created).
Is user_id numeric? If not, then you might consider converting it into numeric type, int for example.
These are improving your query and structure. But let's improve the concept as well. Are insertions and modifications very frequent? Do you absolutely need real-time data, or you can do with quasi-realtime data as well?
If insertions/modifications are not very frequent, or you can do with older data, or the problem is causing huge trouble, then you could do this by running periodically a cron job which would calculate these values and cache them. The application would read them from the cache.

I'm not sure why you accepted an answer, when you really didn't get to the heart of your problem.
I also want to clarify that this is a mysql question, and the fact that you are using PDO or PHP for that matter is not important.
People advised you to utilize EXPLAIN. I would go one further and tell you that you need to use EXPLAIN EXTENDED possibly with the format=json option to get a full picture of what is going on. Looking at your screen shot of the explain, what should jump out at you is that the query looked at over 1m rows to get an answer. This is why your queries are taking so long!
At the end of the day, if you have properly indexed your tables, your goal should be in a large table like this, to have number of rows examined be fairly close to the final result set.
So let's look at the 2nd query, which is quite simple:
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
In this case the only thing that is really important is that you have an index on traffic_stats.userid.
I would recommend, that, if you are uncertain at this point, drop all indexes other than the original primary key (traffic_id) index, and start with only an index on the userid column. Run your query. What is the result, and how long does it take? Look at the EXPLAIN EXTENDED. Given the simplicity of the query, you should see that only the index is being used and the rows should match the result.
Now to your first query:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
Looking at the WHERE clause there are these criteria:
userid =
from_unixtime(created) > CURRENT_DATE
You already have an index on userid. Despite the advice given previously, it is not necessarily correct to have an index on userid, created, and in your case it is of no value whatsoever.
The reason for this is that you are utilizing a mysql function from_unixtime(created) to transform the raw value of the created column.
Whenever you do this, an index can't be used. You would not have any concerns in doing a comparison with the CURRENT_DATE if you were using the native TIMESTAMP type but in this case, to handle the mismatch, you simply need to convert CURRENT_DATE rather than the created column.
You can do this by passing CURRENT_DATE as a parameter to UNIX_TIMESTAMP.
mysql> select UNIX_TIMESTAMP(), UNIX_TIMESTAMP(CURRENT_DATE);
+------------------+------------------------------+
| UNIX_TIMESTAMP() | UNIX_TIMESTAMP(CURRENT_DATE) |
+------------------+------------------------------+
| 1490059767 | 1490054400 |
+------------------+------------------------------+
1 row in set (0.00 sec)
As you can see from this quick example, UNIX_TIMESTAMP by itself is going to be the current time, but CURRENT_DATE is essentially the start of day, which is apparently what you are looking for.
I'm willing to bet that the number of rows for the current date are going to be fewer in number than the total rows for a user over the history of the system, so this is why you would not want an index on user, created as previously advised in the accepted answer. You might benefit from an index on created, userid.
My advice would be to start with an individual index on each of the columns separately.
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND created > UNIX_TIMESTAMP(CURRENT_DATE)", $user->data->userid)
And with your re-written query, again assuming that the result set is relatively small, you should see a clean EXPLAIN with rows matching your final result set.
As for whether or not you should apply an ORDER BY, this shouldn't be something you eliminate for performance reasons, but rather because it isn't relevant to your desired result. If you need or want the results ordered by user, then leave it. Unless you are producing a large result set, it shouldn't be a major problem.
In the case of that particular query, since you are doing a SUM(), there is no value of ORDERING the data, because you are only going to get one row back, so in that case I agree with Lajos, but there are many times when you might be utilizing a GROUP BY, and in that case, you might want the final results ordered.

Related

MySQL Select efficient first and last row

I want to get two rows from my table in a MySQL databse. these two rows must be the first one and last one after I ordered them. To achieve this i made two querys, these two:
SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin ASC LIMIT 1;";
SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin DESC LIMIT 1;";
I decided to not get the entire set and pick the first and last in PHP to avoid possibly very large arrays. My problem is, that I have two querys and I do not really know how efficient this is. I wanted to combine them for example with UNION, but then I would still have to order an unsorted list twice which I also want to avoid, because the second sorting does exactly the same as the first
I would like to order once and then select the first and last value of this ordered list, but I do not know a more efficient way then the one with two querys. I know the perfomance benefit will not be gigantic, but nevertheless I know that the lists are growing and as they get bigger and bigger and I execute this part for some tables I need the most efficient way to do this.
I found a couple of similar topics, but none of them adressed this particular perfomance question.
Any help is highly appreciated.
(This is both an "answer" and a rebuttal to errors in some of the comments.)
INDEX(dateTimeBegin)
will facilitate SELECT ... ORDER BY dateTimeBegin ASC LIMIT 1 and the corresponding row from the other end, using DESC.
MAX(dateTimeBegin) will find only the max value for that column; it will not directly find the rest of the columns in that row. That would require a subquery or JOIN.
INDEX(... DESC) -- The DESC is ignored by MySQL. This is almost never a drawback, since the optimizer is willing to go either direction through an index. The case where it does matter is ORDER BY x ASC, y DESC cannot use INDEX(x, y), nor INDEX(x ASC, y DESC). This is a MySQL deficiency. (Other than that, I agree with Gordon's 'answer'.)
( SELECT ... ASC )
UNION ALL
( SELECT ... DESC )
won't provide much, if any, performance advantage over two separate selects. Pick the technique that keeps your code simpler.
You are almost always better off having a single DATETIME (or TIMESTAMP) field than splitting out the DATE and/or TIME. SELECT DATE(dateTimeBegin), dateTimeBegin ... works simply, and "fast enough". See also the function DATE_FORMAT(). I recommend dropping the dateBegin column and adjusting the code accordingly. Note that shrinking the table may actually speed up the processing more than the cost of DATE(). (The diff will be infinitesimal.)
Without an index starting with dateTimeBegin, any of the techniques would be slow, and get slower as the table grows in size. (I'm pretty sure it can find both the MIN() and MAX() in only one full pass, and do it without sorting. The pair of ORDER BYs would take two full passes, plus two sorts; 5.6 may have an optimization that almost eliminates the sorts.)
If there are two rows with exactly the same min dateTimeBegin, which one you get will be unpredictable.
Your queries are fine. What you want is an index on worktime(dateTimeBegin). MySQL should be smart enough to use this index for both the ASC and DESC sorts. If you test it out, and it is not, then you'll want two indexes: worktime(dateTimeBegin asc) and worktime(dateTimeBegin desc).
Whether you run one query or two is up to you. One query (connected by UNION ALL) is slightly more efficient, because you have only one round-trip to the database. However, two might fit more easily into your code, and the difference in performance is unimportant for most purposes.

Long polling with PHP and jQuery - issue with update and delete

I wrote a small script which uses the concept of long polling.
It works as follows:
jQuery sends the request with some parameters (say lastId) to php
PHP gets the latest id from database and compares with the lastId.
If the lastId is smaller than the newly fetched Id, then it kills the
script and echoes the new records.
From jQuery, i display this output.
I have taken care of all security checks. The problem is when a record is deleted or updated, there is no way to know this.
The nearest solution i can get is to count the number of rows and match it with some saved row count variable. But then, if i have 1000 records, i have to echo out all the 1000 records which can be a big performance issue.
The CRUD functionality of this application is completely separated and runs in a different server. So i dont get to know which record was deleted.
I don't need any help coding wise, but i am looking for some suggestion to make this work while updating and deleting.
Please note, websockets(my fav) and node.js is not an option for me.
Instead of using a certain ID from your table, you could also check when the table itself was modified the last time.
SQL:
SELECT UPDATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
If successful, the statement should return something like
UPDATE_TIME
2014-04-02 11:12:15
Then use the resulting timestamp instead of the lastid. I am using a very similar technique to display and auto-refresh logs, works like a charm.
You have to adjust the statement to your needs, and replace yourdb and yourtable with the values needed for your application. It also requires you to have access to information_schema.tables, so check if this is available, too.
Two alternative solutions:
If the solution described above is too imprecise for your purpose (it might lead to issues when the table is changed multiple times per second), you might combine that timestamp with your current mechanism with lastid to cover new inserts.
Another way would be to implement a table, in which the current state is logged. This is where your ajax requests check the current state. Then generade triggers in your data tables, which update this table.
You can get the highest ID by
SELECT id FROM table ORDER BY id DESC LIMIT 1
but this is not reliable in my opinion, because you can have ID's of 1, 2, 3, 7 and you insert a new row having the ID 5.
Keep in mind: the highest ID, is not necessarily the most recent row.
The current auto increment value can be obtained by
SELECT AUTO_INCREMENT FROM information_schema.tables
WHERE TABLE_SCHEMA = 'yourdb'
AND TABLE_NAME = 'yourtable';
Maybe a timestamp + microtime is an option for you?

Show relationship using two table JOIN, or use PHP functions?

I'm making a micro-blogging website. The users can follow each other. I've to make stream of posts (activity stream) for the current user ( $userid ) based on the users the current user is following, like in Twitter. I know two ways of implementing this. Which one is better?
Tables:
Table: posts
Columns: PostID, AuthorID, TimeStamp, Content
Table: follow
Columns: poster, follower
The first way, by joining these two tables:
select `posts`.* from `posts`,`follow` where `follow`.`follower`='$userid' and
`posts`.`AuthorID`=`follow`.`poster` order by `posts`.`postid` desc
The second way is by making an array of users the $userid is following (posters), then doing php implode on this array, and then doing where in:
One thing I'll like to tell here that I'm storing the the number of users a user is following in the `following` record of the `user` table, so here I'll use this number as a limit when extracting the list of posters - the 'followingList':
function followingList($userid){
$listArray=array();
$limit="select `following` from `users` where `userid`='$userid' limit 1";
$limit=mysql_query($limit);
$limit=mysql_fetch_row($limit);
$limit= (int) $limit[0];
$sql="select `poster` from `follow` where `follower`='$userid' limit $limit";
$result=mysql_query($sql);
while($data = mysql_fetch_row($result)){
$listArray[] = $data[0];
}
$posters=implode("','",$listArray);
return $posters;
}
Now I've a comma separated list of user IDs the current $userid is following.And now selecting the posts to make the activity stream:
$posters=followingList($userid);
$sql = "select * from `posts` where (`AuthorID` in ('$posters'))
order by `postid` desc";
Which of the two methods is better?
And can knowing the total number of following (number of users the current user is following), make things faster in the first method as it's doing in the second method?
Any other better method?
You should go all the way with the first option. Always try as much as possible to process the data on the mysql server instead of in your PHP code. PHP will not implicitly cache the results of the operations while MySQL will do it.
The most important thing is to make sure you index your data correctly. Try using "EXPLAIN" statements to make sure you have optimized your database as much as possible and use #1 to link your data together.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
This will allow you later to compute statistics also, while the second method requires you to process a part of the statistics.
The first important point is that PHP is good at building pages but very bad are managing data, everything manipulated by PHP will fill the memory and no special behavior can be applied in PHP to prevent using to much memory, except crashing.
On the other side the datatase job is to analyse relation between the tables, real number used by the query (cardinality of indexes and statictics on rows and index usage in fact), and a lot of different mechanism can be choosen by the engine depending on the size of data (merge joins, temporary tables, etc). That means you could have 256.278.242 posts and 145.268 users, with 5.684 average followers the datatabase job would be to find the fastest way to give you an answer. Well, when you hit really big numbers you'll see that all databases are not equal, but that's another problem.
On the PHP side Retrieving the list of users from the fisrt query coudl became very long (with a big number of followed users, let's say 15.000. Simply building the query string with 15 000 identifiers inside would take a quite big amount a memory. Trasnferring this new query to the SQL server would also be slow. It's definitively the wrong way.
Now be careful of the way you build your SQL request. A request is something you should be able to read from the top to the end, explaining what you really want. This will help the SQL (good) engine in choosing the right solution.
select `posts`.*
from `posts`
INNER JOIN `follow` ON posts`.`AuthorID`=`follow`.`poster`
where `follow`.`follower`='#userid'
order by `posts`.`postid` desc
LIMIT 15
Several remarks:
I have used an INNER JOIN.I want an INNER JOIN, let's write it, it will be easier to read for me later and it should be the same for the query analyser.
if #userid is an int do not use quotes. Please use ints for identifiers (this is really faster than strings). And on the PHP side cast the int "SELECT ..." . (int) $user_id ." ORDER ... or use query with parameters (This is for security).
I have used a LIMIT 15, maybe an offset could be used as well, if you want to show some pagination control around the posts. Let's say this query will retrieve 15.263 documents from my 5.642 folowwed users, you do not want, and the user do not want, to show theses 15.263 documents on a web page. And knowing with $limit that the number is 15.263 is a good thing but certainly not for a request limit. You know this number, but the database may know it as well if it has a good query analyser and some good internal statistics.
The request limit has several goals
1. Limit the size of data transfered from the database to your PHP script
2. Limit the memory usage of your PHP script (an array with 15.263 documents containg some HTMl stuff... ouch)
3. Limit the size of the final user output (and get a faster response)

Optimizing SQL query

I have to get all entries in database that have a publish_date between two dates. All dates are stored as integers because dates are in UNIX TIMESTAMP format...
Following query works perfect but it takes "too long". It returns all entries made between 10 and 20 dazs ago.
SELECT * FROM tbl_post WHERE published < (UNIX_TIMESTAMP(NOW())-864000)
AND published> (UNIX_TIMESTAMP(NOW())-1728000)
Is there any way to optimize this query? If I am not mistaken it is calling the NOW() and UNIX_TIMESTAMP on evey entry. I thought that saving the result of these 2 repeating functions into mysql #var make the comparison much faster but it didn't. 2nd code I run was:
SET #TenDaysAgo = UNIX_TIMESTAMP(NOW())-864000;
SET #TwentyDaysAgo = UNIX_TIMESTAMP(NOW())-1728000;
SELECT * FROM tbl_post WHERE fecha_publicado < #TenDaysAgo
AND fecha_publicado > #TwentyDaysAgo;
Another confusing thing was that PHP can't run the bove query throught mysql_query(); ?!
Please, if you have any comments on this problem it will be more than welcome :)
Luka
Be sure to have an index on published.And make sure it is being used.
EXPLAIN SELECT * FROM tbl_post WHERE published < (UNIX_TIMESTAMP(NOW())-864000) AND published> (UNIX_TIMESTAMP(NOW())-1728000)
should be a good start to see what's going on on the query. To add an index:
ALTER TABLE tbl_post ADD INDEX (published)
PHP's mysql_query function (assuming that's what you're using) can only accept one query per string, so it can't execute the three queries that you have in your second query.
I'd suggest moving that stuff into a stored procedure and calling that from PHP instead.
As for the optimization, setting those variables is about as optimized as you're going to get for your query. You need to make the comparison for every row, and setting a variable provides the quickest access time to the lower and upper bounds.
One improvement in the indexing of the table, rather than the query itself would be to cluster the index around fecha_publicado to allow MySQL to intelligently handle the query for that range of values. You could do this easily by setting fecha_publicado as PRIMARY KEY of the table.
The obvious things to check are, is there an index on the published date, and is it being used?
The way to optimize would be to partition the table tbl_post on the published key according to date ranges (weekly seems appropriate to your query). This is a feature that is available for MySQL, PostgreSQL, Oracle, Greenplum, and so on.
This will allow the query optimizer to restrict the query to a much narrower dataset.
I agree with BraedenP that a stored procedure would be appropriate here. If you can't use one or really don't want to, you can always either generate the dates on the PHP side, but they might not match exactly with the database unless you have them synced.
You can also do it more quickly as 3 separate queries likely. Query for the begin data, query for the end date, then use those values as input into your target query.

Best way to update user rankings without killing the server

I have a website that has user ranking as a central part, but the user count has grown to over 50,000 and it is putting a strain on the server to loop through all of those to update the rank every 5 minutes. Is there a better method that can be used to easily update the ranks at least every 5 minutes? It doesn't have to be with php, it could be something that is run like a perl script or something if something like that would be able to do the job better (though I'm not sure why that would be, just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i=0;
while ($a = mysql_fetch_array($get_users)) {
$i++;
mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than 1/2 of a second to execute and update all 50,000 rows (make rank the primary key as suggested by Tom Haigh).
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");
Make userRanks.rank an autoincrementing primary key. If you then insert userids into userRanks in descending rank order it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id;
My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming the "status = '1'" indicates that a user's rank has changed so, rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.
A simple alternative for bulk update might be something like:
set #rnk = 0;
update users
set month_rank = (#rnk := #rnk + 1)
order by month_score DESC
This code uses a local variable (#rnk) that is incremented on each update. Because the update is done over the ordered list of rows, the month_rank column will be set to the incremented value for each row.
Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000
DECLARE #tmp TABLE
(
[MonthRank] [INT] NOT NULL,
[UserId] [INT] NOT NULL,
)
INSERT INTO #tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC
UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM #tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.
Any time you have a loop of any significant size that executes queries inside, you've got a very likely antipattern. We could look at the schema and processing requirement with more info, and see if we can do the whole job without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?
Your problem can be handled in a number of ways. Honestly more details from your server may point you in a totally different direction. But doing it that way you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition. Inserts into a table no one is reading from are probably going to be better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but might improve your situation. But again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, mysql config, database connections, etc.
Possibly you could use shards by time or other category. But read this carefully before...
You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.

Categories