Best way to get result count before LIMIT was applied - php

When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.

Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
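Since the question is tagged PHP, here is a minimal PDO sketch of consuming the window-function query above. The DSN, the table bar's columns category / created_at / id and the request parameters are placeholders (not from the question), and the empty-page corner case falls back to a plain count:
<?php
// Hypothetical connection and filter; adapt names to your schema.
$pdo = new PDO('pgsql:host=localhost;dbname=mydb', 'me', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pageSize = 20;
$page     = max(1, (int) ($_GET['page'] ?? 1));
$category = $_GET['cat'] ?? 'news';

$stmt = $pdo->prepare(
    'SELECT *, count(*) OVER() AS full_count
     FROM   bar
     WHERE  category = :cat
     ORDER  BY created_at DESC, id DESC
     LIMIT  :pagesize OFFSET :offset'
);
$stmt->bindValue(':cat', $category);
$stmt->bindValue(':pagesize', $pageSize, PDO::PARAM_INT);
$stmt->bindValue(':offset', ($page - 1) * $pageSize, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if ($rows) {
    $fullCount = (int) $rows[0]['full_count'];   // same value on every row
} else {
    // Corner case from above: OFFSET past the end returns no rows, hence no full_count.
    $count = $pdo->prepare('SELECT count(*) FROM bar WHERE category = :cat');
    $count->execute([':cat' => $category]);
    $fullCount = (int) $count->fetchColumn();
}
$totalPages = (int) ceil($fullCount / $pageSize);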
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. WHERE clause (and JOIN conditions, though none in your example) filter qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
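For example (a sketch only; the connection string, table and filter are placeholders), pg_num_rows() reports how many rows the just-executed statement returned, i.e. the count after LIMIT / OFFSET was applied, not the full count asked about above:
<?php
$conn = pg_connect('host=localhost dbname=mydb user=me password=secret');

$result = pg_query_params(
    $conn,
    'SELECT * FROM bar WHERE category = $1 ORDER BY created_at DESC LIMIT 20 OFFSET 40',
    ['news']
);

// Rows returned by this statement only (at most 20 here).
$rowsOnThisPage = pg_num_rows($result);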
Related:
Calculate number of rows affected by batch query in PostgreSQL

As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entirety, even if the limit clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the column you are ordering by for the last row of the previous page (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
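A rough PDO sketch of that approach, assuming the ordering column is an indexed id and a hypothetical items table (neither is from the answer above):
<?php
$pdo = new PDO('pgsql:host=localhost;dbname=mydb', 'me', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Value of the ordering column on the last row of the previous page,
// passed along in the "next page" link instead of an OFFSET.
$after = (int) ($_GET['after'] ?? 0);

$stmt = $pdo->prepare(
    'SELECT id, title
     FROM   items
     WHERE  id > :after      -- the index on id does the skipping, no 1010-row fetch
     ORDER  BY id
     LIMIT  10'
);
$stmt->bindValue(':after', $after, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if ($rows) {
    $nextLink = 'list.php?after=' . end($rows)['id'];   // drives the next page
}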

You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say, 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
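A small sketch of that idea, assuming the APCu extension is available; the cache key, the 5-minute TTL and the query are illustrative only:
<?php
function cachedFullCount(PDO $pdo, string $category): int
{
    $key   = 'full_count_' . md5($category);
    $count = apcu_fetch($key, $hit);
    if (!$hit) {
        $stmt = $pdo->prepare('SELECT count(*) FROM bar WHERE category = ?');
        $stmt->execute([$category]);
        $count = (int) $stmt->fetchColumn();
        apcu_store($key, $count, 300);   // re-run the COUNT() at most every 5 minutes
    }
    return (int) $count;
}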

Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.

Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.
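A crude sketch of the first suggestion (the file path, TTL and query are made up; it ignores locking, per-user filters and cache invalidation):
<?php
$pdo      = new PDO('pgsql:host=localhost;dbname=mydb', 'me', 'secret');
$sql      = 'SELECT id, title FROM bar WHERE category = ' . $pdo->quote('news') . ' ORDER BY id';
$pageSize = 20;
$page     = max(1, (int) ($_GET['page'] ?? 1));

$cacheFile = sys_get_temp_dir() . '/results_' . md5($sql) . '.dat';

if (!is_file($cacheFile) || filemtime($cacheFile) < time() - 300) {
    // Run the full query once and keep the whole result on disk.
    $allRows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    file_put_contents($cacheFile, serialize($allRows));
} else {
    $allRows = unserialize(file_get_contents($cacheFile));
}

$totalPages = (int) ceil(count($allRows) / $pageSize);
$pageRows   = array_slice($allRows, ($page - 1) * $pageSize, $pageSize);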

Related

When should I consider saving the total count in a field?

For example, if I have to count the comments belonging to an article, it's obvious I don't need to cache the comments total.
But what if I want to paginate a gallery (WHERE status = 1) containing 1 million photos? Should I save that in a table called counts, or is running SELECT count(id) AS total every time fine?
Are there other solutions?
Please advise. Thanks.
For MySQL, you don't need to store the counts, you can use SQL_CALC_FOUND_ROWS to avoid two queries.
E.g.,
SELECT SQL_CALC_FOUND_ROWS *
FROM Gallery
WHERE status = 1
LIMIT 10;
SELECT FOUND_ROWS();
From the manual:
In some cases, it is desirable to know how many rows the statement would have returned without the LIMIT, but without running the statement again. To obtain this row count, include a SQL_CALC_FOUND_ROWS option in the SELECT statement, and then invoke FOUND_ROWS() afterward.
Sample usage here.
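A minimal mysqli version of that two-statement pattern (connection details are placeholders; fetch_all() assumes the mysqlnd driver, and note that MySQL 8.0.17 deprecates SQL_CALC_FOUND_ROWS / FOUND_ROWS() in favour of a separate COUNT(*) query):
<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$result = $mysqli->query('SELECT SQL_CALC_FOUND_ROWS * FROM Gallery WHERE status = 1 LIMIT 10');
$photos = $result->fetch_all(MYSQLI_ASSOC);

// FOUND_ROWS() must be read on the same connection, right after the query above.
$total = (int) $mysqli->query('SELECT FOUND_ROWS()')->fetch_row()[0];
$pages = (int) ceil($total / 10);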
It depends a bit on the amount of queries that are done on that table with 1 million records. Consider just taking care of good indexes, especially multi-column indexes (because they are easily forgotten; see here). That will do a lot. And be sure the queries are also cached well on your server.
If you use this count very regularly, consider saving it (if it can't be cached by MySQL), as things could become slow. But most of the time good indexing will take care of it.
Best try: set up some tests to find out whether a query can still be fast and performance does not drop when you execute it many times in a row.
EXPLAIN [QUERY]
Use that command (in MySQL) to get information about how the query is executed and whether it can be improved.
Doing the count every time would be OK.
During paging, you can use SQL_CALC_FOUND_ROWS anyway
Note:
A denormalized count will become stale
No one will page through that many items

What's the best way to count MySQL records

I have a search engine on a shared host that uses MySQL. This search engine potentially has millions/trillions etc of records.
Each time a search is performed I return a count of the records that can then be used for pagination purposes.
The count tells you how many results there are in regard to the search performed. MySQL's count is, I believe, considered quite slow.
Order of search queries:
Search executed and results returned
Count query executed
I don't perform a PHP count as this will be far slower in larger data sets.
The question is: do I need to worry about MySQL "count", and at what stage should I worry about it? How do the big search engines perform this task?
In almost all cases the answer is indexing. The larger your database gets the more important it is to have a well designed and optimized indexing strategy.
The importance of indexing on a large database cannot be overstated.
You are absolutely right about not looping in code to count DB records. Your RDBMS is optimized for operations like that; your programming language is not. Wherever possible you want to do any sorting, grouping, counting, and filtering operations within the SQL language provided by your RDBMS.
As for efficiently getting the count on a "paginated" query that uses a LIMIT clause, check out SQL_CALC_FOUND_ROWS.
SQL_CALC_FOUND_ROWS tells MySQL to calculate how many rows there would be in the result set, disregarding any LIMIT clause. The number of rows can then be retrieved with SELECT FOUND_ROWS(). See Section 11.13, "Information Functions".
If your MySQL database reaches several million records, that's a sign you'll be forced to stop using a monolithic data store - meaning you'll have to split reads and writes and most likely use a different storage engine than the default one.
Once that happens, you'll stop using the actual count of the rows and start using an estimate, caching the search results and so on, in order to alleviate the work on the database. Even Google uses caching and displays an estimate of the number of records.
Anyway, for now, you've got 2 options:
1 - Run 2 queries, one to retrieve the data and the other one where you use COUNT() to get the number of rows.
2 - Use SQL_CALC_FOUND_ROWS like #JohnFX suggested.
Percona has an article about what's faster, though it might be outdated now.
The biggest problem you're facing is the way MySQL uses LIMIT OFFSET, which means you probably won't like your users using large offset numbers.
In case you indeed get millions of records - I don't foresee a bright future for your MySQL monolithic storage on a shared server. However, good luck to you and your project.
If I understand what you are trying to do properly, you can execute the one query, and perform the mysql_num_rows() function on the result in PHP... that should be pretty zippy.
http://php.net/manual/en/function.mysql-num-rows.php
Since you're using PHP, you could use the mysql_num_rows method to tell you the count after the query is done. See here: http://www.php.net/manual/en/function.mysql-num-rows.php
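For what it's worth, the old mysql_* extension was removed in PHP 7, so a present-day version of the same idea would use mysqli (or PDO); the query below is a placeholder, and keep in mind the buffering caveat: num_rows counts rows of a result set that has already been sent to PHP in full.
<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$term   = $mysqli->real_escape_string($_GET['q'] ?? '');
$result = $mysqli->query("SELECT id, title FROM documents WHERE title LIKE '%$term%'");

$count = $result->num_rows;   // rows in this (fully buffered) result set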

My(SQL) selecting random rows, novel way, help to evaluate how good is it?

I have again run into the problem of selecting a random subset of rows. And as many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters and orders by these random values, and then returns the set. The biggest performance impact seems to be from the fact that there are as many random numbers generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for such a query:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course the simplest and easiest way to do what I want would be the one which I try to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now after some thinking I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now I'm not exactly sure how MySQL handles this, but my guess is that it first selects rows matching <some conditions> (using index[es]?) and then generates a random value and sees whether it should return or discard the row. But what do I know :) that's why I ask here. So, some observations about this method:
first, it does not guarantee that the asked-for number of rows will be returned even if there are many more matching rows than needed, especially true for x values close to 1.0 (or close to 0.0 if we use <)
the returned objects don't really have random ordering, they are just objects selected randomly, ordered by the index used or by the way they are stored(?) (of course this might not matter in some cases at all)
we probably need to choose x according to the size of the result set, since if we have a large result set and x is, let's say, 0.1, it will be very likely that only some random first results will be returned most of the time; on the other hand, if we have a small result set and choose a large x, it is likely that we might get fewer objects than we want, although there are enough of them
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a test using JMeter that starts 200 threads, so that is 200 requests per second, and requests said PHP script. I ran it until about 3k requests were done (average response time stabilizes well before this). Also I executed all queries with SQL_NO_CACHE to prevent the query cache from having some effect. Average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9; as expected, a larger x is slower since it has to discard more rows
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and saw that it did not work out? Maybe you can better explain how MySQL handles such a query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough; the point is that I need random rows from the query, not simply from the table, thus I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id, not that it matters much since this is not random rows from the table but from the query, and there will be quite a lot of such queries, so those suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory then MySQL will create a temporary table on disk, and then perform a file sort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will reduce the size of the data to send back to the client obviously.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding in the SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative way which probably would not be faster, but might be, who knows :)
Either use a table status query to determine the next auto_increment, or the row count, or use (select count(*)). Then you can decide ahead of time the auto_increment value of a random item and then select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method in the case of zero rows returned. Best case might be a big savings, worst case would be comparable to your current method.
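A sketch of that idea with PDO (the table and column names are made up; note this draws a random row from the whole table, not from an arbitrary filtered query, which is what the question's edit actually asks for):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$maxId = (int) $pdo->query('SELECT MAX(id) FROM items')->fetchColumn();
$row   = false;

// Retry a few times in case the random id hits a gap in auto_increment.
for ($try = 0; $try < 5 && $row === false && $maxId > 0; $try++) {
    $stmt = $pdo->prepare('SELECT * FROM items WHERE id = ?');
    $stmt->execute([random_int(1, $maxId)]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
}

if ($row === false) {
    // Failsafe: fall back to the slow but reliable method.
    $row = $pdo->query('SELECT * FROM items ORDER BY RAND() LIMIT 1')->fetch(PDO::FETCH_ASSOC);
}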
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n — something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
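A sketch of those three steps in PHP/PDO against MySQL (the table name is as in the answer, everything else is assumed; emulated prepares are turned off so the OFFSET placeholder is sent as a real integer parameter):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);

$r = 10;
$n = (int) $pdo->query('SELECT count(id) FROM tablename')->fetchColumn();   // step 1

// Step 2: r distinct randomish offsets in [0, n) (pick-and-discard-repeats variant).
$offsets = [];
while (count($offsets) < min($r, $n)) {
    $offsets[random_int(0, $n - 1)] = true;
}

// Step 3: one indexed lookup per offset.
$stmt = $pdo->prepare('SELECT id FROM tablename ORDER BY id LIMIT 1 OFFSET :i');
$ids  = [];
foreach (array_keys($offsets) as $i) {
    $stmt->bindValue(':i', $i, PDO::PARAM_INT);
    $stmt->execute();
    $ids[] = $stmt->fetchColumn();
}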
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id), max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id), min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts should equal b-a+1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.

MYSQL and the LIMIT clause

I was wondering if adding a LIMIT 1 to a query would speed up the processing?
For example...
I have a query that will most of the time return 1 result, but will occasionally return 10's, 100's or even 1000's of records. But I will only ever want the first record.
Would the limit 1 speed things up or make no difference?
I know I could use GROUP BY to return 1 result but that would just add more computation.
It depends if you have an ORDER BY. An ORDER BY needs the entire result set anyway, so it can be ordered.
If you don't have any ORDER BY it should run faster.
It will in all cases run at least a bit faster since the entire result set needn't be sent of course.
Yep! It will!
But to be sure, it should only take a sec to add 'LIMIT 1' to the end of your SQL statement, so why not give it a shot and see.
It all depends on the query itself. If you're doing an indexed lookup (WHERE indexedColumn = somevalue) or a sort on an indexed column (with no WHERE clause), then LIMIT 1 will really speed it up. If you have joins or multiple where/order clauses, then things get really complicated really quickly. But the major thing to take away: using "LIMIT 1" will NEVER slow down a query. It will sometimes speed it up, but it will never slow it down.
Now, there is another issue when dealing with PHP. By default, PHP will buffer the entire result set before returning from the query (mysql_query or mysqli->query will only return after all the records are downloaded). So while the query time may be altered little by the LIMIT 1, the time and memory that PHP uses to buffer the result are significant. Imagine each row has 200 bytes of data. Now, your query returns 10,000 rows. That means PHP has to allocate an additional 2 MB of memory (actually closer to 10 MB with the overhead of PHP's variable structure) that you'll never use. Allocating memory can be very expensive, so the general rule is to only ever allocate what you need (or think you will need). Downloading 10,000 rows when you only want 1 is just wasteful.
Combine these two effects, and you can see why if you want only 1 row, you should ALWAYS use "LIMIT 1".
Here is some relevant documentation on the subject from the MySQL site.
It seems it can speed things up in different ways, depending on the other parts of the query. I'm not sure if it helps when you have no ORDER or GROUP or HAVING clauses, aside from being able to stop immediately rather than give back every single result row (which may be a big enough speed up if you are getting back 100,000 records).
This is the fundamental purpose of the LIMIT clause actually :P If you know how many results you want, you should ALWAYS specify that number as the LIMIT not only because it is faster (unless you are doing an ORDER BY), but to be more efficient with memory in your PHP script.
Note: I'm assuming you're using MySQL with PHP since you added PHP to the tags. If you're just selecting from MySQL directly (outside of a scripting language) the advantage to using LIMIT when also using ORDER BY is purely to make the results easier to manage.

How to efficiently paginate large datasets with PHP and MySQL?

As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records, let's say their id's are: 5,8,9,...,55,64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse the next/prev page. An index on "id" will improve the performance, even with big OFFSET values.
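The PHP glue for those queries could look roughly like this (mysqli, with made-up connection details; fetch_all() assumes mysqlnd). When no id is passed, id > 0 behaves like the first-page query above, assuming positive ids:
<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$lastId = (int) ($_GET['id'] ?? 0);
$page   = (int) ($_GET['page'] ?? 1);

$result = $mysqli->query("SELECT * FROM my_table WHERE id > $lastId ORDER BY id LIMIT 20");
$rows   = $result->fetch_all(MYSQLI_ASSOC);

// The next-page link carries the id of the last row, e.g. list.php?page=2&id=64
if ($rows) {
    $nextLink = 'list.php?page=' . ($page + 1) . '&id=' . end($rows)['id'];
}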
A solution might be to not use the limit clause, and use a join instead -- joining on a table used as some kind of sequence.
For more information, on SO, I found this question / answer, which gives an example -- that might help you ;-)
There are basically 3 approaches to this, each of which have their own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
SELECT * FROM my_table LIMIT 10000, 20;
means show 20 records starting from record #10000 in the search. If you are using primary keys in the WHERE clause, there will not be a heavy load on MySQL.
Any other method for pagination, like using a join, will take a really huge load.
I'm not aware of the performance decrease that you've mentioned, and I don't know of any other solution for pagination; however, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and, for every newly inserted row, increment this field. After that you can use WHERE YOUR_INDEX_FIELD BETWEEN 10000 AND 10020.
It will be much faster.
Some other options:
Partition the tables per page so you can ignore the limit
Store the results in a session (a good idea would be to create a hash of that data using md5, then use that to cache the session across multiple users)
