Large Comparison - MySQL and PHP

I've developed a user rating system that analyzes users and saves their information with a score in a db.
I'm getting close to 1 Million users rated and stored.
I'm having issues with taking a certain set of users from the table (score < 50) and then comparing their ids against another set of ids without the whole thing crashing down.
The result of the (score < 50) query is around 65k rows and the comparison is against probably 1,000 user ids so the whole thing is running 65k * 1,000.
Is my bottleneck at the db? Or is it at the comparison of ids? Is there a better way to split this up?
Query -> "select username, userscore from users where userscore < 50"
then
Foreach compares values

Since you haven't provided any table or index information, here's what I'm going to suggest.
Make sure there's an index on userscore. If you have close to a million rows in your table and you're doing a query with "WHERE userscore < 50", that column needs an index.
Make sure your query is using that index. Run your query manually with EXPLAIN at the front, i.e. EXPLAIN SELECT username, userscore FROM users WHERE userscore < 50, and optimize based on the results.
You haven't mentioned how you're doing the id comparison, so I'll assume it's a loop that checks each row against the array. You might be better off putting all 1,000 ids into the query and limiting your SELECT to users with score < 50 AND an id in that set.
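A quick sketch of that last suggestion, assuming the ids are already in a PHP array (the name $idsToCompare is made up here) and that id is an integer column:
$placeholders = implode(',', array_map('intval', $idsToCompare));
$sql = "SELECT username, userscore
        FROM users
        WHERE userscore < 50
          AND id IN ($placeholders)";
That turns the 65k * 1,000 comparison in PHP into a single indexed lookup on the database side.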
If you post more information about your tables, indexes, and comparisons, I can probably be more specific.

Seems easy enough to determine whether it is the db or not. Just prior to your query, do an fopen of a log file in /tmp. Then fwrite the result of a microtime() into the file. Just after your query, fwrite the result of another microtime(). Run your script once, and you will be able to see the following:
1) Are you even getting to the pre-query spot?
2) Is the script failing in the middle of the query?
3) How long does the query take if it doesn't crash the script?
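Something like this, using the float form of microtime() (the mysql_* API is assumed, since that is what the question implies):
$log = fopen('/tmp/query_timing.log', 'a');
fwrite($log, 'before query: ' . microtime(true) . "\n");
$result = mysql_query("SELECT username, userscore FROM users WHERE userscore < 50");
fwrite($log, 'after query: ' . microtime(true) . "\n");
fclose($log);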

Related

Periodically retrieve all rows to find the highest 100 counters

I'm working on a Cassandra database storing the number of times a word occurred. I want to find out which 100 words occur the most times. In a relational database, it'd be something like this:
select * FROM wordcounter ORDER BY counts DESC LIMIT 100;
but ordering by a counter column in Cassandra is impossible.
So, instead I'll have to periodically (probably once per day) fetch all rows and write the 100 words with the highest counters to the db. The following is not an option:
select * FROM wordcounter
Because that would return way too much data. I'll have to do it in increments, but how (and how many rows per query is acceptable)?
UPDATE
It's supposedly possible to iterate over all Cassandra rows, but I am using PHP PDO to communicate with Cassandra and it certainly doesn't have an iterate feature as far as I've seen. But I found I can query by token, so this is possible:
select * FROM wordcounter LIMIT 100;
And then keep looping over the following until 0 results are returned:
select * FROM wordcounter WHERE token(word) > token('lastword') LIMIT 100;
So this is basically the equivalent of an OFFSET, which will allow me to process parts of the dataset without having to query it all at once. But I guess this does mean I can't distribute the query over multiple systems. Does anyone know of any alternatives?
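For reference, here is a rough sketch of that loop as I picture it, assuming a PDO connection in $pdo and the word/counts columns from the query above (the top-100 bookkeeping is left as a stub):
$lastWord = null;
do {
    if ($lastWord === null) {
        $stmt = $pdo->query("SELECT word, counts FROM wordcounter LIMIT 100");
    } else {
        $stmt = $pdo->prepare("SELECT word, counts FROM wordcounter
                               WHERE token(word) > token(?) LIMIT 100");
        $stmt->execute(array($lastWord));
    }
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        // ...keep a running top-100 by counts here...
        $lastWord = $row['word'];
    }
} while (count($rows) > 0);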

Server-side Pagination: total row count for expensive query?

I have a simple query using server-side pagination. The issue is that the WHERE clause makes a call to an expensive function, and the function's argument is the user input, e.g. what the user is searching for.
SELECT *
FROM
  ( SELECT /*+ FIRST_ROWS(numberOfRows) */
      query.*,
      ROWNUM rn
    FROM
      ( SELECT myColumns
        FROM myTable
        WHERE expensiveFunction(:userInput) = 1
        ORDER BY id ASC
      ) query
  )
WHERE rn >= :startIndex
  AND ROWNUM <= :numberOfRows
This works and is quick assuming numberOfRows is small. However, I would also like to have the total row count of the query. Depending on the user input and database size, the query can take up to minutes. My current approach is to cache this value, but that still means the user needs to wait minutes to see the first result.
The results should be displayed in the jQuery DataTables plugin, which greatly helps with things like server-side paging. It does, however, require the server to return a value for the total records in order to correctly display the paging controls.
What would be the best approach? (Note: PHP)
I thought of returning the first page immediately with a fake (better would be an estimated) row count. After the page is loaded, an ajax call would hit a method that determines the total row count of the query (what happens if the user pages during that time?) and then updates the faked/estimated total.
However, I have no clue how to do an estimate. I tried count(*) * 1000 with SAMPLE (0.1), but for whatever reason that actually takes longer than the full count query. Also, just returning a fake/random value seems a bit hacky. It would need to be bigger than one page size so that the "Next" button is enabled.
Other ideas?
One way to do it is, as I said in the comments, to use a 'countless' approach. Modify the client-side script so that the Next button is always enabled, and fetch rows until there are none, then disable the Next button. You can always add a notification message saying there are no more rows, to make it more user friendly.
Considering that you are expecting a significant number of records, I doubt that the user will paginate through all the results.
Another way is to schedule a cron job that does the counting of the records in the background and stores the result in a table called totals. The running interval of the job should be set based on the frequency of inserts/deletions.
Then in the frontend, just use the count previously stored in totals. It should be a decent approximation of the amount.
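A minimal sketch of that cron job, assuming PDO and a totals table with table_name/row_count columns (both names invented here):
// counter.php - run from cron at an interval tuned to insert/delete frequency
$pdo   = new PDO($dsn, $dbUser, $dbPass); // connection details assumed
$count = (int) $pdo->query("SELECT COUNT(*) FROM myTable")->fetchColumn();
$stmt  = $pdo->prepare("UPDATE totals SET row_count = ? WHERE table_name = 'myTable'");
$stmt->execute(array($count));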
Depends on your DB engine.
In MySQL, the solution looks like this:
mysql> SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name
-> WHERE id > 100 LIMIT 10;
mysql> SELECT FOUND_ROWS();
Basically, you add another attribute to your SELECT (SQL_CALC_FOUND_ROWS) that tells MySQL to count the rows as if the LIMIT clause were not present while executing the query, and FOUND_ROWS() then retrieves that number.
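In PHP, the two-step pattern could look roughly like this (mysqli assumed; fetch_all() needs the mysqlnd driver, and FOUND_ROWS() must be read on the same connection, immediately after the query):
$result = $mysqli->query("SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name
                          WHERE id > 100 LIMIT 10");
$rows  = $result->fetch_all(MYSQLI_ASSOC);
$total = $mysqli->query("SELECT FOUND_ROWS()")->fetch_row();
$total = (int) $total[0];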
For Oracle, see this article:
How can I perform this query in oracle
Other DBMS might have something similar, but I don't know.

Show relationship using two table JOIN, or use PHP functions?

I'm making a micro-blogging website. Users can follow each other, and I have to build a stream of posts (activity stream) for the current user ($userid) based on the users they are following, like on Twitter. I know two ways of implementing this. Which one is better?
Tables:
Table: posts
Columns: PostID, AuthorID, TimeStamp, Content
Table: follow
Columns: poster, follower
The first way, by joining these two tables:
select `posts`.* from `posts`,`follow` where `follow`.`follower`='$userid' and
`posts`.`AuthorID`=`follow`.`poster` order by `posts`.`postid` desc
The second way is by making an array of the users $userid is following (the posters), then doing a PHP implode on this array, and then using WHERE IN:
One thing I'd like to mention here is that I'm storing the number of users a user is following in the `following` field of the `users` table, so here I'll use this number as the LIMIT when extracting the list of posters - the 'followingList':
function followingList($userid){
    $listArray = array();

    $limit = "select `following` from `users` where `userid`='$userid' limit 1";
    $limit = mysql_query($limit);
    $limit = mysql_fetch_row($limit);
    $limit = (int) $limit[0];

    $sql = "select `poster` from `follow` where `follower`='$userid' limit $limit";
    $result = mysql_query($sql);
    while($data = mysql_fetch_row($result)){
        $listArray[] = $data[0];
    }

    $posters = implode("','", $listArray);
    return $posters;
}
Now I have a comma-separated list of the user IDs the current $userid is following. And now selecting the posts to make the activity stream:
$posters=followingList($userid);
$sql = "select * from `posts` where (`AuthorID` in ('$posters'))
order by `postid` desc";
Which of the two methods is better?
And can knowing the total number of following (the number of users the current user is following) make things faster in the first method, as it does in the second method?
Any other better method?
You should go all the way with the first option. Always try as much as possible to process the data on the MySQL server instead of in your PHP code. PHP will not implicitly cache the results of these operations, while MySQL will.
The most important thing is to make sure you index your data correctly. Try using "EXPLAIN" statements to make sure you have optimized your database as much as possible and use #1 to link your data together.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
This will also allow you to compute statistics later, while the second method requires you to do part of that processing yourself.
The first important point is that PHP is good at building pages but very bad at managing data: everything manipulated by PHP fills memory, and nothing in PHP will prevent it from using too much memory, short of crashing.
On the other side, the database's job is to analyze the relations between the tables and the real numbers involved in the query (cardinality of indexes and statistics on rows and index usage, in fact), and a lot of different mechanisms can be chosen by the engine depending on the size of the data (merge joins, temporary tables, etc.). That means you could have 256,278,242 posts and 145,268 users with 5,684 followers on average, and the database's job would be to find the fastest way to give you an answer. Well, when you hit really big numbers you'll see that all databases are not equal, but that's another problem.
On the PHP side, retrieving the list of users from the first query could become very long with a big number of followed users, let's say 15,000. Simply building the query string with 15,000 identifiers inside would take quite a big amount of memory. Transferring this new query to the SQL server would also be slow. It's definitively the wrong way.
Now be careful of the way you build your SQL request. A request is something you should be able to read from top to bottom, explaining what you really want. This will help a (good) SQL engine choose the right solution.
select `posts`.*
from `posts`
INNER JOIN `follow` ON `posts`.`AuthorID`=`follow`.`poster`
where `follow`.`follower`='#userid'
order by `posts`.`postid` desc
LIMIT 15
Several remarks:
I have used an INNER JOIN. I want an INNER JOIN, so let's write it; it will be easier for me to read later and it should be the same for the query analyser.
If #userid is an int, do not use quotes. Please use ints for identifiers (this is really faster than strings). On the PHP side, cast the int: "SELECT ..." . (int) $user_id . " ORDER ...", or use a query with parameters (this is for security; see the sketch after these remarks).
I have used a LIMIT 15; an offset could be used as well if you want to show some pagination controls around the posts. Let's say this query would retrieve 15,263 documents from my 5,642 followed users: you do not want, and the user does not want, to see those 15,263 documents on a web page. Knowing via $limit that the number is 15,263 is a good thing, but certainly not as a request limit. You know this number, but the database may know it as well if it has a good query analyser and good internal statistics.
The request limit has several goals:
1. Limit the size of data transfered from the database to your PHP script
2. Limit the memory usage of your PHP script (an array with 15,263 documents containing some HTML stuff... ouch)
3. Limit the size of the final user output (and get a faster response)
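As promised above, a sketch of the parameterized version of that query, assuming mysqli (get_result() needs the mysqlnd driver):
$stmt = $mysqli->prepare(
    "SELECT `posts`.*
     FROM `posts`
     INNER JOIN `follow` ON `posts`.`AuthorID` = `follow`.`poster`
     WHERE `follow`.`follower` = ?
     ORDER BY `posts`.`postid` DESC
     LIMIT 15");
$stmt->bind_param('i', $userid); // bound as an int, no quoting needed
$stmt->execute();
$result = $stmt->get_result();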

My(SQL) selecting random rows, novel way, help to evaluate how good is it?

I have again run into the problem of selecting a random subset of rows, and as many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters and orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that there are as many random numbers generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query such as:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course, the simplest and easiest way to do what I want would be the one which I'm trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now, after some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now I'm not exactly sure how MySQL handles this, but my guess is that it first selects the rows matching <some conditions> (using indexes?) and then generates a random value per row and decides whether to return or discard it. But what do I know :) that's why I'm asking here. So, some observations about this method:
first, it does not guarantee that the requested number of rows will be returned, even if there are many more matching rows than needed; this is especially true for x values close to 1.0 (or close to 0.0 if we use <)
the returned objects don't really have a random ordering; they are just objects selected randomly, ordered by the index used or by the way they are stored(?) (of course this might not matter at all in some cases)
we probably need to choose x according to the size of the result set: if we have a large result set and x is, let's say, 0.1, it is very likely that only some of the first rows will be returned most of the time; on the other hand, if we have a small result set and choose a large x, it is likely that we get fewer objects than we want, although there are enough of them (a possible heuristic for choosing x is sketched below)
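One possible heuristic for that last point (invented here, untested): aim for roughly ten times more expected survivors than the rows wanted, given a rough estimate of how many rows match the conditions:
$wanted    = 10;   // rows we need
$estimated = 2500; // rough count of rows matching <some conditions>
// expected survivors are (1 - x) * estimated, so keep ~10x head-room
$x   = max(0.0, 1 - (10 * $wanted) / $estimated);
$sql = "SELECT id FROM <table> WHERE <some conditions> AND RAND() > $x LIMIT $wanted";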
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, so that is 200 requests per second, each requesting said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). Also, I executed all queries with SQL_NO_CACHE to prevent the query cache from having an effect. The average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9; as expected, a larger x is slower since it has to discard more rows
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and saw that it did not work out? Maybe you can better explain how MySQL handles such a query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough: the point is that I need random rows of a query, not simply of a table, thus I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id; not that it matters much, since this is not random rows from a table but from a query, and there will be quite a lot of such queries, so suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory, then MySQL will create a temporary table on disk and then perform a filesort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the size of the data sent back to the client.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query, I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative way, which probably would not be faster, but who knows :)
Either use a table status query to determine the next auto_increment or the row count, or use SELECT COUNT(*). Then you can decide ahead of time the auto_increment value of a random item, and select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method in the case of zero rows returned. The best case might be a big saving; the worst case would be comparable to your current method.
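A rough sketch of that idea with the retry loop (mysqli assumed; the table name and the five-try cap are made up here):
$max = $mysqli->query("SELECT MAX(id) FROM mytable")->fetch_row();
$max = (int) $max[0];
$row = null;
for ($try = 0; $try < 5 && $row === null; $try++) {
    $id  = mt_rand(1, $max);
    $row = $mysqli->query("SELECT * FROM mytable WHERE id = $id LIMIT 1")->fetch_assoc();
}
// if $row is still null after five tries, fall back to the failsafe method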
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n — something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
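A sketch of that lookup with PDO (r = 10 here; emulated prepares are assumed so that an integer can be bound into the LIMIT clause):
$n = (int) $pdo->query("SELECT COUNT(id) FROM tablename")->fetchColumn();
$picked = array();
while (count($picked) < 10) {
    $picked[mt_rand(0, $n - 1)] = true; // array keys de-duplicate the picks
}
$ids  = array();
$stmt = $pdo->prepare("SELECT id FROM tablename ORDER BY id LIMIT 1 OFFSET :i");
foreach (array_keys($picked) as $i) {
    $stmt->bindValue(':i', $i, PDO::PARAM_INT);
    $stmt->execute();
    $ids[] = $stmt->fetchColumn();
}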
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id),max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id),min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts should sum to b-a+1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.

MySQL vs PHP when retrieving a random item

Which is more efficient (when managing over 100K records):
A. Mysql
SELECT * FROM user ORDER BY RAND();
Of course, after that I would already have all the fields from that record.
B. PHP
Use memcached to have $cache_array hold all the data from "SELECT id_user FROM user ORDER BY id_user" for an hour or so... and then:
$id = array_rand($cache_array);
Of course, after that I have to make a MySQL call with:
SELECT * FROM user WHERE id_user = $id;
So... which is more efficient, A or B?
The proper way to answer this kind of question is to do a benchmark. Do a quick and dirty implementation each way and then run benchmark tests to determine which one performs better.
Having said that, ORDER BY RAND() is known to be slow because it's impossible for MySQL to use an index. MySQL will basically run the RAND() function once for each row in the table and then sort the rows based on what came back from RAND().
Your other idea of storing all user_ids in memcached and then selecting a random element from the array might perform better if the overhead of memcached proves to be less than the cost of a full table scan. If your dataset is large or staleness is a problem, you may run into issues, though. Also, you're adding some complexity to your application. I would try to look for another way.
I'll give you a third option which might outperform both your suggestions: select a count(user_id) of the rows in your user table, then have PHP generate a random number between 0 and count(user_id) minus 1, inclusive. Then do SELECT * FROM user LIMIT 1 OFFSET random-number-generated-by-php; (a sketch follows).
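That third option, sketched with mysqli:
$count  = $mysqli->query("SELECT COUNT(id_user) FROM user")->fetch_row();
$offset = mt_rand(0, (int) $count[0] - 1);
$user   = $mysqli->query("SELECT * FROM user LIMIT 1 OFFSET $offset")->fetch_assoc();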
Again, the proper way to answer these types of questions is to benchmark. Anything else is speculation.
The first one is incredibly slow because MySQL creates a temporary table with all the result rows and assigns each one of them a random sorting index. The results are then sorted and returned.
It's elaborated on in this blog post.
$random_no = mt_rand(0, $total_record_count - 1); // upper bound must be count - 1, or the offset can point past the last row
$query = "SELECT * FROM user ORDER BY __KEY__ LIMIT {$random_no}, 1";
