I have a search engine on a shared host that uses MySQL. This search engine potentially has millions/trillions etc of records.
Each time a search is performed I return a count of the records that can then be used for pagination purposes.
The count tells you how many results there are in regard to the search performed. MySQL count is I believe considered quite slow.
Order of search queries:
Search executed and results returned
Count query executed
I don't perform a PHP count as this will be far slower in larger data sets.
Question is, do I need to worry about MySQL "count" and at what stage should I worry about it. How do the big search engines perform this task?
In almost all cases the answer is indexing. The larger your database gets the more important it is to have a well designed and optimized indexing strategy.
The importance of indexing on a large database can not be overstated.
You are absolutely right about not looping in code to count DB records. Your RDBMS is optimized for operations like that, your programming language is no. Wherever possible you want to do any sorting, grouping, counting, filtering operations within the SQL language provided by your RDBMS.
As for efficiently getting the count on a "paginated" query that uses a LIMIT clause, check out SQL_CALC_FOUND_ROWS.
SQL_CALC_FOUND_ROWS tells MySQL to calculate how many rows there would
be in the result set, disregarding any LIMIT clause. The number of
rows can then be retrieved with SELECT FOUND_ROWS(). See Section
11.13, “Information Functions”.
If MySQL database reaches several millions of records, that's a sign you'll be forced to stop using monolithic data store - meaning you'll have to split reads, writes and most likely use a different storage engine than the default one.
Once that happens, you'll stop using the actual count of the rows and you'll start using the estimate, cache the search results and so on in order to alleviate the work on the database. Even Google uses caching and displays an estimate of number of records.
Anyway, for now, you've got 2 options:
1 - Run 2 queries, one to retrieve the data and the other one where you use COUNT() to get the number of rows.
2 - Use SQL_CALC_FOUND_ROWS like #JohnFX suggested.
Percona has an article about what's faster, tho it might be outdated now.
The biggest problem you're facing is the way MySQL uses LIMIT OFFSET, which means you probably won't like your users using large offset numbers.
In case you indeed get millions of records - I don't forsee a bright future for your MySQL monolithic storage on a shared server. However, good luck to you and your project.
If I understand what you are trying to do properly, you can execute the one query, and perform the mysql_num_rows() function on the result in PHP... that should be pretty zippy.
http://php.net/manual/en/function.mysql-num-rows.php
Since you're using PHP, you could use the mysql_num_rows method to tell you the count after the query is done. See here: http://www.php.net/manual/en/function.mysql-num-rows.php
Related
I can think of a couple ways to count the number of rows in a table with Laravel (version 3).
DB::table('threads')->count();
Threads::count();
Threads::max('id');
DB::table('threads')->max('id);
DB::query('SELECT COUNT(*) FROM threads;');
Are any of these notably faster than the others? Is there any one fastest way to run this query? Later on it's going to be part of an expression: ceil(DB::table('threads')->count() / $threads_per_page); and it's executed on every page load so it's good to be optimized.
Database/table is MySQL and the InnoDB engine.
MAX(ID) is not the same as counting rows, so that rules out two of five alternatives.
And then it is your task to actually do a performance comparison between the remaining three methods to get the count. I'd think that actually executing an SQL statement directly might remove plenty of unnecessary ORM-layer overhead and be actually faster, but this would be premature optimization unless proven by facts.
DB::table('threads')->count();
Threads::count();
DB::query('SELECT COUNT(*) FROM threads;');
I was looking for the same thing.
These 3 results are exactly the same query I tested it (You can watch this with laravel debugbar).
Laravel perform "SELECT COUNT(*) as aggregate FROM threads";
It's already optimised with eloquent, but if you do ->get()->count() it's not optimised !
No performance difference with Threads::count();
Max('id') is totally different as it output the max id, it will never count the number of rows.
i dont think that it really that matter.. just be consistent in your code..
any way there is no need to run that query on evey page load.. use some caching to cache that number..
Is there any advantages to having nested queries instead of separating them?
I'm using PHP to frequently query from MySQL and would like to separate them for better organization. For example:
Is:
$query = "SELECT words.unique_attribute
FROM words
LEFT JOIN adjectives ON adjectives.word_id = words.id
WHERE adjectives = 'confused'";
return $con->query($query);
Faster/Better than saying:
$query = "SELECT word_id
FROM adjectives
WHERE adjectives = 'confused';";
$id = getID($con->query($query));
$query = "SELECT unique_attribute
FROM words
WHERE id = $id;";
return $con->query($query);
The second option would give me a way to make a select function, where I wouldn't have to repeat so much query string code, but if making so many additional calls(these can get very deeply nested) will be very bad for performance, I might keep it. Or at least look out for it.
Like most questions containing 'faster' or 'better', it's a trade-off and it depends on which part you want to speed up and what your definition of 'better' is.
Compared with the two separate queries, the combined query has the advantages of:
speed: you only need to send one query to the database system, the database only needs to parse one query string, only needs to compose one query plan, only needs to push one result back up and through the connection to PHP. The difference (when not executing these queries thousands of times) is very minimal, however.
atomicity: the query in two parts may deliver a different result from the combined query if the words table changes between the first and second query (although in this specific example this is probably not a constantly-changing table...)
At the same time the combined query also has the disadvantage of (as you already imply):
re-usability: the split queries might come in handy when you can re-use the first one and replace the second one with something that selects a different column from the words table or something from another table entirely. This disadvantage can be mitigated by using something like a query builder (not to be confused with an ORM!) to dynamically compose your queries, adding where clauses and joins as needed. For an example of a query builder, check out Zend\Db\Sql.
locking: depending on the storage engine and storage engine version you are using, tables might get locked. Most select statements do not lock tables however, and the InnoDB engine definitely doesn't. Nevertheless, if you are working with an old version of MySQL on the MyISAM storage engine and your tables are under heavy load, this may be a factor. Note that even if the combined statement locks the table, the combined query will offer faster average completion time because it is faster in total while the split queries will offer faster initial response (to the first query) while still needing a higher total time (due to the extra round trips et cetera).
It would depend on the size of those tables and where you want to place the load. If those tables are large and seeing a lot of activity, then the second version with two separate queries would minimise the lock time you might see as a result of the join. However if you've got a beefy db server with fast SSD storage, you'd be best off avoiding the overhead of dipping into the database twice.
All things being equal I'd probably go with the former - it's a database problem so it should be resolved there. I imagine those tables wouldn't be written to particularly often so I'd ensure there's plenty of MySQL cache available and keep an eye on the slow query log.
MyPHP Application sends a SELECT statement to MySQL with HTTPClient.
It takes about 20 seconds or more.
I thought MySQL can’t get result immediately because MySQL Administrator shows stat for sending data or copying to tmp table while I'd been waiting for result.
But when I send same SELECT statement from another application like phpMyAdmin or jmater it takes 2 seconds or less.10 times faster!!
Dose anyone know why MySQL perform so difference?
Like #symcbean already said, php's mysql driver caches query results. This is also why you can do another mysql_query() while in a while($row=mysql_fetch_array()) loop.
The reason MySql Administrator or phpMyAdmin shows result so fast is they append a LIMIT 10 to your query behind your back.
If you want to get your query results fast, i can offer some tips. They involve selecting only what you need and when you need:
Select only the columns you need, don't throw select * everywhere. This might bite you later when you want another column but forget to add it to select statement, so do this when needed (like tables with 100 columns or a million rows).
Don't throw a 20 by 1000 table in front of your user. She cant find what she's looking for in a giant table anyway. Offer sorting and filtering. As a bonus, find out what she generally looks for and offer a way to show that records with a single click.
With very big tables, select only primary keys of the records you need. Than retrieve additional details in the while() loop. This might look like illogical 'cause you make more queries but when you deal with queries involving around ~10 tables, hundreds of concurrent users, locks and query caches; things don't always make sense at first :)
These are some tips i learned from my boss and my own experince. As always, YMMV.
Dose anyone know why MySQL perform so difference?
Because MySQL caches query results, and the operating system caches disk I/O (see this link for a description of the process in Linux)
This is a really broad question, but I have come across it a couple of times in the last few weeks and I was wondering what the general consensus is regarding good practice and efficiency.
1)
SELECT COUNT(*) FROM table WHERE id='$id', name='$name', owner='$owner_id'
and then based on if there is one result then the record matches.
2)
SELECT * FROM table WHERE id='$id'
and then a series of if commands to check the results match.
Now obviously there are advantages to the second solution as it allows for accurate error reports as to the field that does not match... but if that is not required which is more efficient, considered better practice and is there a difference to the load on the mySQL server between the two?
Option 1 by a long shot. Let SQL do what it is designed to do best, and better than procedural code. That is, filtering and sorting data.
Also, it is a much more efficient use of resources (bandwidth, DB utilization, etc) to pull down only the data you need from the server.
Use 1). Mysql is very efficient in selecting data based on certain conditions.
Large query can take .1 to 5.1 or more seconds, you need to find, run it and find it. Usually multiple if are way better as PHP is very fast. I did that when I was using it with 5 joins in table with 5 billion products, then I reduce one join and then use if statement to fix it up. Query was taking 4.2 seconds, when I reduced join, it took 3.8s but as you know PHP is way faster.
When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
WHERE clause (and JOIN conditions, though none in your example) filter qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
ORDER BY
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping how many rows where affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entireity, even if the limit clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way to do is to remember the value of the row you are ordering by for the previous row (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've go a problem). The disadvantage being that if new elements are added between page views, this can get a little out of synch (but then again, it may not be observable by visitors and can be a big performance gain).
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching things, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.