manipulating 15+ million records in mysql with php?

manipulating 15+ million records in mysql with php? - php

I got a user table containing 15+ million records and while doing the registration function i wish to check whether the username already exist. I did indexing for username column and when i run the query "select count(uid) from users where username='webdev'" ,. hmmm, its keep on loading blank screen finally hanged up. I'm doing this in my localhost with php 5 & mysql 5. So suggest me some technique to handle this situation.
Is that mongodb is good alternative for handling this process in our local machine?
Thanks,
Nithish.

If you just want to check that it exists or not, try not using the count. Just a simple select username from users where username='webdev' LIMIT 1 may be faster.
ALSO, change the column type to varchar, if it's not already so. Don't user text type. It's much much slower.

This might be a moot point, but to test and see if the user name already exists, I would issue the following query (a slight modification on shamittomar's query):
SELECT DISTINCT `username` FROM `users` WHERE `username` = 'webdev';
This will, by default, return the only instance of "webdev" in the "username" column; if you add more parameters, though, it could change your results. An example being, if you run
SELECT DISTINCT `user_id`, `username` FROM `users` WHERE `username` = 'webdev';
it would return all unique combinations of "user_id" and "username".

One thing you can do is change the indexing of the username from index to unique that will make the search much much faster and like shamittomar said add a limit 1 at the end even though it will only help if the value already exists.

your user name is unique so you must set limit of 1 in your query it will be more faster
select count(uid) from users where username='webdev' limit 1

Related

Select query takes too long

These 2 querys take too long to produce a result (sometimes 1 min or even sometime end up on some error) and put really heavy load on the server:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
The table has about 4 million rows.
This is the table structure:
I have one index on traffic_id:
If you select anything from traffic_stats table it will take forever, however inserting to this table is normal.
Is it possible to reduce the time spent on executing this query? I use PDO and I am new to all this.

ORDER BY will take a lot of time and since you only need aggregate data (adding numbers or counting numbers is commutative), the ORDER BY will do a lot of useless sorting, costing you time and server power.
You will need to make sure that your indexing is right, you will probably need an index for user_id and for (user_id, created).
Is user_id numeric? If not, then you might consider converting it into numeric type, int for example.
These are improving your query and structure. But let's improve the concept as well. Are insertions and modifications very frequent? Do you absolutely need real-time data, or you can do with quasi-realtime data as well?
If insertions/modifications are not very frequent, or you can do with older data, or the problem is causing huge trouble, then you could do this by running periodically a cron job which would calculate these values and cache them. The application would read them from the cache.

I'm not sure why you accepted an answer, when you really didn't get to the heart of your problem.
I also want to clarify that this is a mysql question, and the fact that you are using PDO or PHP for that matter is not important.
People advised you to utilize EXPLAIN. I would go one further and tell you that you need to use EXPLAIN EXTENDED possibly with the format=json option to get a full picture of what is going on. Looking at your screen shot of the explain, what should jump out at you is that the query looked at over 1m rows to get an answer. This is why your queries are taking so long!
At the end of the day, if you have properly indexed your tables, your goal should be in a large table like this, to have number of rows examined be fairly close to the final result set.
So let's look at the 2nd query, which is quite simple:
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
In this case the only thing that is really important is that you have an index on traffic_stats.userid.
I would recommend, that, if you are uncertain at this point, drop all indexes other than the original primary key (traffic_id) index, and start with only an index on the userid column. Run your query. What is the result, and how long does it take? Look at the EXPLAIN EXTENDED. Given the simplicity of the query, you should see that only the index is being used and the rows should match the result.
Now to your first query:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
Looking at the WHERE clause there are these criteria:
userid =
from_unixtime(created) > CURRENT_DATE
You already have an index on userid. Despite the advice given previously, it is not necessarily correct to have an index on userid, created, and in your case it is of no value whatsoever.
The reason for this is that you are utilizing a mysql function from_unixtime(created) to transform the raw value of the created column.
Whenever you do this, an index can't be used. You would not have any concerns in doing a comparison with the CURRENT_DATE if you were using the native TIMESTAMP type but in this case, to handle the mismatch, you simply need to convert CURRENT_DATE rather than the created column.
You can do this by passing CURRENT_DATE as a parameter to UNIX_TIMESTAMP.
mysql> select UNIX_TIMESTAMP(), UNIX_TIMESTAMP(CURRENT_DATE);
+------------------+------------------------------+
| UNIX_TIMESTAMP() | UNIX_TIMESTAMP(CURRENT_DATE) |
+------------------+------------------------------+
| 1490059767 | 1490054400 |
+------------------+------------------------------+
1 row in set (0.00 sec)
As you can see from this quick example, UNIX_TIMESTAMP by itself is going to be the current time, but CURRENT_DATE is essentially the start of day, which is apparently what you are looking for.
I'm willing to bet that the number of rows for the current date are going to be fewer in number than the total rows for a user over the history of the system, so this is why you would not want an index on user, created as previously advised in the accepted answer. You might benefit from an index on created, userid.
My advice would be to start with an individual index on each of the columns separately.
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND created > UNIX_TIMESTAMP(CURRENT_DATE)", $user->data->userid)
And with your re-written query, again assuming that the result set is relatively small, you should see a clean EXPLAIN with rows matching your final result set.
As for whether or not you should apply an ORDER BY, this shouldn't be something you eliminate for performance reasons, but rather because it isn't relevant to your desired result. If you need or want the results ordered by user, then leave it. Unless you are producing a large result set, it shouldn't be a major problem.
In the case of that particular query, since you are doing a SUM(), there is no value of ORDERING the data, because you are only going to get one row back, so in that case I agree with Lajos, but there are many times when you might be utilizing a GROUP BY, and in that case, you might want the final results ordered.

Why is this query abnormally long?

I'm timing various part of the site's "initialisation" code (including such things as verifying the user is logged in, connecting to the database, importing functions...)
This query is currently taking up abouve half the total initialisation time all by itself:
$sql = "update `users` set `lastclick`=now(),".(substr($_SERVER['PHP_SELF'],0,6) == "/ajax/" ? "" : " `lastactive`=now(),")." `lastip`='".addslashes($_SERVER['REMOTE_ADDR'])."' where `id`=".$userdata['id'];
Generating the query takes no time at all, it's the running that's the problem. Example result query:
update `users` set `lastclick`=now(), `lastactive`=now(), `lastip`='192.168.0.1' where `id`=1
Simple enough query, right? I am the only user on the server right now, there is literally nothing else running. So why does a simple update take up more time than connecting to the database, SELECTing the user data in the first place, validating the cookies, and defining a bunch of functions all combined?
(I just tried replacing now() with a literal value, but that made no difference - in fact it ended up taking 13ms the first time instead of 4...)
EDIT: As requested:
explain select * from `users` where `id`=1
1 row returned
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users const PRIMARY PRIMARY 4 const 1

Solved my own mystery. Turns out one of the fields being updated (lastactive) was in an index, and the slowness was coming from rebuilding that index.
Since the only time that index might be used is in updating the list of users who are online, and that only happens by cron every set interval, I've dropped the index and now the query runs a heck of a lot faster.
Thanks to those who tried to help - you did help me find the problem, indirectly!

How to Concat in mysql Query

Im not even sure if this is possible (Im new to php)
Anyway, what I want to do is this:
mysql_query("SELECT * FROM user_table WHERE concat(username,'#',domain)='$username' LIMIT=1");
Ok, so the $username is an email address that is submitted by a user to search the database to check that they exist. In the user_table usernames are not stored in a single column, they are stored in several with the domain and the actual username being separate.
for example username could be bob and the domain could be website.com.au
Then when the user wants to search for that user the type in bob#website.com.au
This goes to the query above.
So, should it work or not? If not how can I make this work or what suggestions do you have for me?

As BobbyJack has mentioned, this is a slow way of locating a user record.
If you cannot store email address in a single column and place an index on that column, split the string in PHP and make your query:
SELECT * FROM user_table WHERE `username` = '$username' AND `domain` = '$domain'
You could then create a unique index combining domain + username so you wouldn't need LIMIT 1

probably worded the question slightly wrong.
Anyway this is what I have done "SELECT * FROM virtual_user WHERE concat_ws('#',username,domain)='$username'"
I no longer need to use the LIMIT=1, I probably never needed to as all results in the table are individual, so it will always only return a limit of 1 or nothing at all.
It isn't slow in my opinion, but then again Im not really sure what to compare it to. We have about 7000+ records it sorts through so yeah. Is there anyway to get it to tell you how long the query took to complete?
I would like to put both the username and domain into just a single indexed field but its for a postfix mail server and I'm not allowed or game to play with the queries it uses. Especially not on a functioning server that actually handles mail.

How to check the existence of multiple IDs with a single query

I'm trying to find a way to check if some IDs are already in the DB, if an ID is already in the DB I'd naturally try to avoid processing the row it represents
Right now I'm doing a single query to check for the ID, but I think this is too expensive in time because if I'm checking 20 id's the script is taking up to 30 seconds
I know i can do a simple WHERE id=1 OR id=2 OR id=3 , but I'd like to know of a certain group of IDs which ones are already in the database and which ones are not
I don't know much about transactions but maybe this could be useful or something
any thoughts are highly appreciated!

Depends how you determine the "Group of IDs"
If you can do it with a query, you can likely use a join or exists clause.
for example
SELECT firstname
from people p
where not exists (select 1 from otherpeople op where op.firstname = p.firstname)
This will select all the people who are not in the otherpeople table
If you just have a list of IDs, then use WHERE NOT IN (1,3,4...)

30 seconds for 20 queries on a single value is a long time. Did you create an index on the ID field to speed things up?
Also if you create a unique key on the ID field you can just insert all ID's. The database will throw errors and not insert those those ID's that already exist, but you can ignore those errors.

Best way to update user rankings without killing the server

I have a website that has user ranking as a central part, but the user count has grown to over 50,000 and it is putting a strain on the server to loop through all of those to update the rank every 5 minutes. Is there a better method that can be used to easily update the ranks at least every 5 minutes? It doesn't have to be with php, it could be something that is run like a perl script or something if something like that would be able to do the job better (though I'm not sure why that would be, just leaving my options open here).
This is what I currently do to update ranks:
$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i=0;
while ($a = mysql_fetch_array($get_users)) {
$i++;
mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}
UPDATE (solution):
Here is the solution code, which takes less than 1/2 of a second to execute and update all 50,000 rows (make rank the primary key as suggested by Tom Haigh).
mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");

Make userRanks.rank an autoincrementing primary key. If you then insert userids into userRanks in descending rank order it will increment the rank column on every row. This should be extremely fast.
TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id;

My first question would be: why are you doing this polling-type operation every five minutes?
Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.
I'm assuming the "status = '1'" indicates that a user's rank has changed so, rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?
That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.
Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.

A simple alternative for bulk update might be something like:
set #rnk = 0;
update users
set month_rank = (#rnk := #rnk + 1)
order by month_score DESC
This code uses a local variable (#rnk) that is incremented on each update. Because the update is done over the ordered list of rows, the month_rank column will be set to the incremented value for each row.

Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.
I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000
DECLARE #tmp TABLE
(
[MonthRank] [INT] NOT NULL,
[UserId] [INT] NOT NULL,
)
INSERT INTO #tmp ([UserId])
SELECT [id]
FROM [users]
WHERE [status] = '1'
ORDER BY [month_score] DESC
UPDATE users
SET month_rank = [tmp].[MonthRank]
FROM #tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]
In MS SQL Server 2005/2008 you would probably use a CTE.

Any time you have a loop of any significant size that executes queries inside, you've got a very likely antipattern. We could look at the schema and processing requirement with more info, and see if we can do the whole job without a loop.
How much time does it spend calculating the scores, compared with assigning the rankings?

Your problem can be handled in a number of ways. Honestly more details from your server may point you in a totally different direction. But doing it that way you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition. Inserts into a table no one is reading from are probably going to be better.
Consider
mysql_query("delete from month_rank_staging;");
while(bla){
mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");
That'll cause one (bigger) lock on the table, but might improve your situation. But again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, mysql config, database connections, etc.

Possibly you could use shards by time or other category. But read this carefully before...

You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

manipulating 15+ million records in mysql with php? - php

If you just want to check that it exists or not, try not using the count. Just a simple select username from users where username='webdev' LIMIT 1 may be faster. ALSO, change the column type to varchar, if it's not already so. Don't user text type. It's much much slower.

One thing you can do is change the indexing of the username from index to unique that will make the search much much faster and like shamittomar said add a limit 1 at the end even though it will only help if the value already exists.

your user name is unique so you must set limit of 1 in your query it will be more faster select count(uid) from users where username='webdev' limit 1

Related

Select query takes too long

Why is this query abnormally long?

How to Concat in mysql Query

How to check the existence of multiple IDs with a single query

Best way to update user rankings without killing the server

Categories

Resources