Recently I ran into a Doctrine 2.1 performance issue. My query has a lot of LEFT JOIN, WHERE, and WHERE IN clauses, and I need the total count plus a specific set of records based on a limit and offset.
First of all, Doctrine 2.1's DQL doesn't support MySQL's LIMIT clause. I tried appending LIMIT to the query, but every time I got "Error: expected end of string, got 'LIMIT'".
I tried to use
$queryBuilder
    ->setMaxResults(25)
    ->setFirstResult(10);
as suggested in many articles. But, as the documentation says (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/dql-doctrine-query-language.html#first-and-max-result-items-dql-query-only):
If your query contains a fetch-joined collection specifying the result
limit methods are not working as you would expect.
This means that if you expect 25 rows per page, you might get, for example, only 18.
Doctrine 2.1 doesn't ship with a paginator by default the way Doctrine 2.2 does, so everyone suggests using the Doctrine extension https://github.com/wiredmedia/doctrine-extensions/blob/master/lib/DoctrineExtensions/Paginate/Paginate.php, but that is also clunky and hurts performance. As I understand it, it sends two queries: the first fetches all the data without a limit and offset in order to compute the total count, and the second is a WHERE IN query with the IDs of the specific entities. That's too slow for me.
So you probably want to know what my solution is:
$data = $oQuery->getArrayResult();

return array(
    'data'  => array_slice($data, $iStart, $iLimit),
    'count' => count($data)
);
If you have a better and faster solution, please share it with me.
Since your query is absent from the question, you have only two options: either use raw SQL with all the necessary LIMITs, or upgrade your Doctrine version.
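For the raw-SQL route, here is a minimal sketch using the DBAL connection underneath the entity manager; $em is assumed to be your EntityManager, and the table, columns, and the SQL_CALC_FOUND_ROWS trick are illustrative, not your actual query:

$iLimit = (int) $iLimit;
$iStart = (int) $iStart;

$conn = $em->getConnection();

// MySQL applies LIMIT/OFFSET itself; SQL_CALC_FOUND_ROWS + FOUND_ROWS()
// returns the total count without issuing a second full query.
$rows = $conn->fetchAll(
    "SELECT SQL_CALC_FOUND_ROWS a.*
     FROM article a
     LEFT JOIN author au ON au.id = a.author_id
     WHERE a.status = ?
     LIMIT $iLimit OFFSET $iStart",
    array('published')
);
$total = (int) $conn->fetchColumn('SELECT FOUND_ROWS()');

Note that MySQL still has to examine every matching row to produce the count, but at least only one page of rows crosses the wire.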
Related
I have a table with several million records.
The table has a long string as a key, and searching for records, especially new ones, is very slow.
I have tried narrowing the search with whereDate >= this week, or whereID >= 999999, but this doesn't give a big performance boost.
We are using Laravel. Is there some kind of feature to search a table from the bottom up or in reverse, instead of searching incrementally from the top down? Something like Model::findBackwards($key) that would start looking at the most recent record and process backwards would be ideal in this situation.
How would you approach this problem?
Edit: here is our current query. It's a job running on Horizon.
// Try to bump the counter on a recent row first; if nothing was updated,
// fall back to matching on the hash alone across the whole table.
$updated = DB::table('big_table_with_a_stupid_key')
    ->whereDate('created_at', '>=', '2019-11-10')
    ->where('hash', $this->hash)
    ->increment('counter');

if (! $updated) {
    DB::table('big_table_with_a_stupid_key')
        ->where('hash', $this->hash)
        ->increment('counter');
}
If you want to reverse the order of the results, you can use the latest() function.
The Laravel documentation (https://laravel.com/docs/6.x/queries) says:
latest / oldest
The latest and oldest methods allow you to easily order results by date. By default, result will be ordered by the created_at column. Or, you may pass the column name that you wish to sort by:
$user = DB::table('users')->latest()->first(); // order by created_at
$user = DB::table('users')->latest('id')->first(); // order by id
You will want to do this by using the ORDER BY clause in your underlying database query. Laravel (of course) exposes this with ->orderBy()
https://www.tutorialsplane.com/laravel-order-by-asc-desc/
The SQL engine will place the result rows in the requested sequence before returning them to you.
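A minimal sketch against the table from the question; picking id (or created_at) as the ordering column is an assumption, and any indexed column works:

// Effectively the "findBackwards" behaviour you describe: start from the
// most recent row by ordering on an indexed column in descending order.
$latestMatch = DB::table('big_table_with_a_stupid_key')
    ->where('hash', $this->hash)
    ->orderBy('id', 'desc')        // or ->orderBy('created_at', 'desc')
    ->first();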
Then, to address your performance issues, make sure that any column you routinely search on has an index. This is the secret to maintaining good performance as the number of rows increases.
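For example, a migration along these lines (the class name is made up; the table and columns come from the job above) would add an index covering the lookup:

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class AddHashCreatedAtIndexToBigTable extends Migration
{
    public function up()
    {
        Schema::table('big_table_with_a_stupid_key', function (Blueprint $table) {
            // Composite index matching the WHERE clauses in the job.
            $table->index(['hash', 'created_at']);
        });
    }

    public function down()
    {
        Schema::table('big_table_with_a_stupid_key', function (Blueprint $table) {
            $table->dropIndex(['hash', 'created_at']);
        });
    }
}

Also note that whereDate() compares DATE(created_at), which MySQL generally cannot serve from a plain index on created_at; comparing the raw timestamp with where('created_at', '>=', ...) is usually more index-friendly.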
I am trying to select a million rows with Doctrine but am having some problems.
First I tried doing it with an ORM query, but then I found out that a native query is faster (I don't need the ORM mapping for this).
I am already using an array hydrator (creating objects would be pointless since I only need to read the data).
I also heard about PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, but I get an error if I turn it off, since I am working with multiple result sets (cursors) at the same time.
So the memory usage is damn high: around 100 MB of RAM just for one integer column on about 1.3 million rows.
Example part of my code (using the non-custom HYDRATE_ARRAY):
function getResult()
{
    $rsm = new ResultSetMapping;

    $q = $this->getEntityManager()->createNativeQuery(
        "select {$this->getTableIdColName()} from {$this->getTableName()}",
        $rsm
    );

    return $q->iterate(null, \Doctrine\ORM\Query::HYDRATE_ARRAY);
}
Upon calling that function, the memory is consumed even if I don't iterate the result at all.
I've also written a custom hydrator which does pretty much the same as the default one but uses less memory (since it doesn't map column names), but the result isn't good either way.
Am I missing something, or is it normal for a query to take 100 MB of RAM without even using the result?
The problem was with query buffering, for anyone interested:
http://php.net/manual/en/mysqlinfo.concepts.buffering.php
By default it's enabled, hence the whole result set is kept in memory.
The line below basically fixes the problem, but it imposes a limitation: it requires you to consume or close all results before making new queries to the database (Doctrine has no internal way of closing them, by the way).
$em->getConnection()->getWrappedConnection()->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
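With buffering turned off as above, the rows can then be streamed instead of materialised up front. A minimal usage sketch (getResult() is the method from the question):

foreach ($this->getResult() as $row) {
    // iterate() yields one hydrated row per step, wrapped in a single-element array
    // ... process $row[0], keeping nothing else around ...
}
// Every row must be consumed (or the statement closed) before this
// connection can run another query in unbuffered mode.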
I have a MongoDB collection containing lots of books with many fields. Some key fields relevant to my question are:
{
    book_id: 1,
    book_title: "Hackers & Painters",
    category_id: "12",
    related_topics: [
        { topic_id: "8",  topic_name: "Computers" },
        { topic_id: "11", topic_name: "IT" }
    ],
    ...
    ... (at least 20 more fields)
    ...
}
We have a form for filtering results (with many inputs/select boxes) on our search page, and of course there is also pagination. Alongside the filtered results, we show all categories on the page, and for each category the number of results found in that category.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concern for this process.
Now the question is:
I can easily filter results by feeding the find() function all the filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collect the categories, and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the MongoDB::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method, but the command function doesn't seem to accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way to send filter parameters with the MongoDB::command() function, it will be another query in the process, and I think it will take approximately the same time as (maybe more than) the previous query. That will be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query, and yet another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at the problems from the wrong point of view. Can you help me solve them and give your opinion on the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count  = $cursor->count();
$data   = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load and check each document. If you do skip() and limit() with no sort(), the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
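On the asker's Problem 1: the distinct database command does accept a query field, so the same filter can be applied while collecting the distinct values. A sketch with the legacy PHP driver, assuming $db is the MongoDB database object and the collection is called "books":

// One distinct command per field; the "query" key carries the same filter
// as the find() used for the paginated results.
$result = $db->command(array(
    'distinct' => 'books',
    'key'      => 'category_id',
    'query'    => $filter_params,
));
$categoryIds = $result['values'];

// Per-category counts still need extra work, e.g. one count() per
// category id (or a pre-computed map/reduce job, as mentioned above).
$countsPerCategory = array();
foreach ($categoryIds as $catId) {
    $countsPerCategory[$catId] = $lib_collection->count(
        array_merge($filter_params, array('category_id' => $catId))
    );
}

Whether the extra count() round-trips are acceptable depends on how many distinct categories a typical filter matches.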
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit(), no matter whether you use an index or not, will have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (a B-tree) that can compare values to each other, so it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based indexing: a B-tree has no way to tell you quickly what the 15,000th element is; it has to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the
server to walk from the beginning of the collection, or index, to get
to the offset/skip position before it can start returning the page of
data (limit). As the page number increases skip will become slower and
more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow
you to easily jump to a specific page.
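A sketch of range-based paging with the legacy PHP driver, reusing $lib_collection and $filter_params from the question; sorting on _id and keeping a $lastId bookmark are assumptions, and any indexed, strictly increasing field works:

// Instead of skip(), remember the last _id of the previous page and
// continue from there, so the index does the positioning.
$query = $filter_params;
if ($lastId !== null) {
    $query['_id'] = array('$gt' => $lastId);
}

$cursor = $lib_collection->find($query)
    ->sort(array('_id' => 1))
    ->limit(20);

foreach ($cursor as $doc) {
    $lastId = $doc['_id'];   // bookmark for the "next page" link
    // ... render $doc ...
}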
Make sure you really need this feature: typically, nobody cares about the 42,436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem, and you hate ORMs as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writes then need to go to both databases.
As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
This means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible, as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records; let's say their ids are 5, 8, 9, ..., 55, 64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be:
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, and only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse to the next/previous page. An index on "id" keeps the performance good, no matter how deep you page.
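A minimal PHP sketch of the glue code, assuming a PDO connection in $pdo and the id of the previous page passed in the query string as above:

// Keyset ("seek") pagination: continue after the last id seen, never OFFSET.
$lastId = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$stmt = $pdo->prepare(
    'SELECT * FROM my_table WHERE id > :last_id ORDER BY id LIMIT 20'
);
$stmt->execute(array(':last_id' => $lastId));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// The id of the last row becomes the bookmark for the "next page" link.
$nextId = $rows ? end($rows)['id'] : null;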
A solution might be to not use the LIMIT clause and use a join instead, joining on a table used as some kind of sequence.
For more information: on SO I found this question/answer, which gives an example that might help you ;-)
There are basically 3 approaches to this, each of which has its own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT, and grab only the records you need with each request, completely stateless. The benefit is that it only sends the records for the page currently requested, so requests are small; the downsides are that a) it requires a server request for each page, and b) it gets slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
SELECT * FROM my_table LIMIT 10000, 20;
This means: show 20 records, starting from record #10000 of the result. If you use primary keys in the WHERE clause, there will not be a heavy load on MySQL.
Any other method of pagination, such as a join-based one, will put a really heavy load on the server.
I'm not aware of the performance decrease that you've mentioned, and I don't know of any other solution for pagination; however, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and increment it for every newly inserted row. After that, use WHERE your_index_field BETWEEN 10000 AND 10020.
It will be much faster.
Some other options:
Partition the table per page so the limit can be ignored.
Store the results in a session/cache (a good idea would be to create an md5 hash of that query, then use it to cache the result set across multiple users); see the sketch below.
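A sketch of the second option, assuming a PDO connection and APCu as the shared cache (the function and key names are made up for illustration):

// Cache the full (unpaginated) result keyed by an md5 of the query,
// then slice pages out of the cached copy in PHP.
function fetchPage(PDO $pdo, $sql, array $params, $page, $perPage = 20)
{
    $key  = 'resultset_' . md5($sql . serialize($params));
    $rows = apcu_fetch($key, $hit);

    if (!$hit) {
        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        apcu_store($key, $rows, 300);   // keep for 5 minutes
    }

    return array(
        'count' => count($rows),
        'data'  => array_slice($rows, ($page - 1) * $perPage, $perPage),
    );
}

This only pays off when the full result set is small enough to keep in cache memory and the underlying data can be a few minutes stale.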
When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. WHERE clause (and JOIN conditions, though none in your example) filter qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
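For instance, a minimal PHP sketch with the pgsql extension (the connection string, table, and columns are placeholders):

// Run the page query once; pg_num_rows() reports how many rows came back,
// and full_count (from the window function above) carries the total.
$conn   = pg_connect('dbname=mydb');
$result = pg_query_params(
    $conn,
    'SELECT foo, count(*) OVER() AS full_count
       FROM bar
      ORDER BY foo
      LIMIT $1 OFFSET $2',
    array($pageSize, $offset)
);

$rowsOnPage = pg_num_rows($result);
$totalRows  = $rowsOnPage ? (int) pg_fetch_result($result, 0, 'full_count') : 0;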
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to run the query in its entirety, even if the LIMIT clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the ordering column for the last row of the previous page (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say, 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time; we have timers built into our DB layer, so I have seen the evidence.
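A minimal sketch of the five-minute count cache suggested above, assuming the pgsql extension and APCu (the key, TTL, $conn, and $countSql are placeholders):

// Cache only the expensive COUNT(*) for a few minutes; the LIMIT'ed
// page query itself still runs on every request.
$key   = 'page_count_' . md5($countSql);
$total = apcu_fetch($key, $hit);

if (!$hit) {
    $res   = pg_query($conn, $countSql);    // e.g. "SELECT count(*) FROM bar WHERE ..."
    $total = (int) pg_fetch_result($res, 0, 0);
    apcu_store($key, $total, 300);          // re-count at most every 5 minutes
}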
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.