I'm using Laravel's built-in paginate method in a query that runs a fulltext search against a large dataset (about 100K rows, each containing a huge amount of text).
Everything works fine, except that I don't understand the logic behind how Laravel counts the results: why must it execute the same query twice (the select count(*) as aggregate) to retrieve the total count, instead of using PHP's count() function, which works great in this scenario?
With that approach I could literally halve the time of this search, which can sometimes take up to 10 seconds!
Is it really necessary to use two queries, or is there some way to override this logic?
Or am I missing something behind this logic?
The query is executed twice: once to get the total number of records matched by the final query, and a second time to return only the required subset.
So, if you have a table of 100,000 records and, say, 8,900 of them match your criteria, the first query counts the matching records and returns the integer 8,900.
The second query then takes the page number you want, multiplies it by the count per page, and returns the relevant 15 or so records for just that page; these become the LIMIT and OFFSET values in your SQL query.
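To make that concrete, for a hypothetical articles table with page 2 and 15 results per page, paginate() ends up issuing roughly these two statements (a simplified sketch; the exact SQL depends on your builder chain and fulltext setup):

select count(*) as aggregate from articles where match(body) against ('search term');
select * from articles where match(body) against ('search term') limit 15 offset 15;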
It is worth noting that GROUP BY paginated results are not handled the same way. If you add GROUP BY to the end of any SQL statement within Eloquent, it runs a single SQL query. That query grabs all the matching rows, counts how many were returned, and then slices the array to return just the 15 or so records you require.
The difference between the two methods is that the first returns two tiny query responses: first the total count, and then the 15 or so records from your table.
The GROUP BY option returns a row for EVERY record which matches your SQL requirements; if that is a total of 8,900 records, you get 8,900 Eloquent model objects.
As you can see, if you have a database with a good number of records in it, the second method, while it may execute the SQL statement quicker, will tie up a lot more resources.
If your SQL statements are taking too long to execute twice, you may need to consider optimising your table, or adding a further INDEX. Just a thought.
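If the count query really is the bottleneck, here are two possible workarounds, shown only as a sketch (Article and $term are placeholders, and whereFullText() assumes a recent Laravel version):

use Illuminate\Pagination\LengthAwarePaginator;

// Option 1: simplePaginate() never issues the COUNT query; it fetches
// pageSize + 1 rows and only offers next/previous links, no total.
$results = Article::whereFullText('body', $term)->simplePaginate(15);

// Option 2: run the search once, then build the paginator yourself,
// taking the total from PHP's count on the collection. Only sensible
// if the matched subset is small enough to hold in memory.
$all     = Article::whereFullText('body', $term)->get();
$page    = LengthAwarePaginator::resolveCurrentPage();
$perPage = 15;

$paginator = new LengthAwarePaginator(
    $all->forPage($page, $perPage),  // items for the current page
    $all->count(),                   // total counted in PHP, not via SQL
    $perPage,
    $page
);

Option 2 trades the second SQL round trip for pulling every matching row into PHP, so it only pays off when matches are relatively few; for large fulltext result sets the separate COUNT (or a cached count) is usually the better deal.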
I have two tables employee and attendance.
employee : empID, empName
attendance: attendanceID, empID, date, inTime, outTime
I need to show these data in a grid with the employee name on the left followed by the dates, so the column headers would be Emp Name, 1, 2, 3, 4, ..., 30. Every day of the month needs to be printed, with or without attendance data.
I can think of a few ways to do this.
One way is to get the attendance and employee data in a single join query ordered by empID, then loop through the data and print each attendance record when it matches the current date; this continues until the empID changes in the current loop.
Another way is to loop through the employees, then loop over the days of the month, and on every iteration fetch the attendance from the database for that particular employee and date:
foreach ($employees as $emp)
{
    $empID = $emp['empID'];
    for ($day = 1; $day <= $maxDaysInTheMonth; $day++)
    {
        // One database call per employee per day
        $attendance = getAttendanceFromDatabase($empID, $day);
    }
}
To improve performance we try to minimise database connections and unnecessary loops. I would like to implement the second way, as it has the fewest conditions and loops and the code is clean. But it queries the database for every employee, for every day. Can someone point out some facts about the performance, please?
Fetching the records in a single query and looping through them is better, as it only has to call the database server once. The second way has to call the database server multiple times, which is far more costly.
Then build an associative array from the data, with empID as the index.
After generating the array you can use it however you want.
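A minimal sketch of that approach with PDO (the join query, the $pdo connection and the array shape are illustrative, based on the columns from the question):

// One round trip: every employee, with any attendance rows they have
$sql = "SELECT e.empID, e.empName, a.date, a.inTime, a.outTime
        FROM employee e
        LEFT JOIN attendance a ON a.empID = e.empID";
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

// Index by empID, then by day of month, so building the grid only needs
// array lookups instead of further queries.
$grid = [];
foreach ($rows as $row) {
    $empID = $row['empID'];
    $grid[$empID]['name'] = $row['empName'];
    if ($row['date'] !== null) {
        $day = (int) date('j', strtotime($row['date']));
        $grid[$empID]['days'][$day] = [$row['inTime'], $row['outTime']];
    }
}

// Rendering: for each employee print the name, then loop $day from 1 to
// $maxDaysInTheMonth and show $grid[$empID]['days'][$day] if it is set.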
Try this query
$sql="SELECT employee.empName AS empName, attendance.date AS date FROM employee,attendance WHERE employee.empID=attendance.empID";
As @Sougata suggests, fetching the records in a single query and looping through them is better. But keep in mind that the query's performance should also be improved, as follows:
Avoid Multiple Joins in a Single Query
Try to avoid writing a SQL query with multiple joins that include outer joins, cross apply, outer apply and other complex subqueries. This reduces the optimizer's choices when deciding the join order and join type. Sometimes the optimizer is forced to use nested loop joins, irrespective of the performance consequences, for queries with excessively complex cross applies or subqueries.
Avoid Use of Non-correlated Scalar Sub Query
You can rewrite your query to run a non-correlated scalar subquery as a separate query instead of as part of the main query, and store the output in a variable that can be referred to in the main query or a later part of the batch. This gives the optimizer better options, which may help it produce accurate cardinality estimates along with a better plan.
Creation and Use of Indexes
We are aware of the fact that an index can magically reduce data retrieval time, but it has a reverse effect on DML operations, which may degrade query performance. Because of this, indexing is a challenging task, but it can help improve SQL query performance and give you the best query response time.
Create a Highly Selective Index
Selectivity defines the percentage of qualifying rows in the table (qualifying number of rows / total number of rows). If the ratio of qualifying rows to total rows is low, the index is highly selective and is most useful. A non-clustered index is most useful when the ratio is around 5% or less, meaning the index can eliminate 95% of the rows from consideration. If an index returns more than 5% of the rows in a table, it probably will not be used; either a different index will be chosen or created, or the table will be scanned.
Position a Column in an Index
The order or position of a column in an index also plays a vital role in improving SQL query performance. An index can help improve query performance if the criteria of the query match the leftmost columns in the index key. As a best practice, the most selective columns should be placed leftmost in the key of a non-clustered index.
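Applied to the attendance table from the question, a composite index along these lines (a sketch; whether it helps depends on your actual queries and data selectivity) keeps the most heavily filtered column leftmost:

-- empID first: the grid is built per employee, so every lookup filters on it.
-- date second: day lookups within one employee are then covered as well.
CREATE INDEX idx_attendance_emp_date ON attendance (empID, date);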
I have a dilemma that I'm trying to solve right now. I have a table called "generic_pricing" that has over a million rows. It looks like this....
I have a list of 25000 parts that I need to get generic_pricing data for. Some parts have a CLEI, some have a partNumber, and some have both. For each of the 25000 parts, I need to search the generic_pricing table to find all rows that match either clei or partNumber.
Making matters more difficult is that I have to do matches based on substring searches. For example, one of my parts may have a CLEI of "IDX100AB01", but I need the results of a query like....
SELECT * FROM generic_pricing WHERE clei LIKE 'IDX100AB%';
Currently, my lengthy PHP code for finding these matches uses the following logic: loop through the 25000 items; for each item, run the query above on clei; if a match is found, use that row for my calculations; if not, execute a similar query on partNumber to try to find the matches.
As you can imagine, this is very time consuming. And this has to be done for about 10 other tables similar to generic_pricing to run all of the calculations. The system is now bogging down and timing out trying to crunch all of this data. So now I'm trying to find a better way.
One thought I have is to just query the database one time to get all rows, and then use loops to find matches. But for 25000 items each having to compare against over a million rows, that just seems like it would take even longer.
Another thought I have is to get 2 associative arrays of all of the generic_pricing data. i.e. one array of all rows indexed by clei, and another all indexed by partNumber. But since I am looking for substrings, that won't work.
I'm at a loss here for an efficient way to handle this task. Is there anything that I'm overlooking to simplify this?
Do not query the db for all rows and sort them in your app. That will cause a lot more headaches.
Here are a few suggestions:
Use parameterized queries. This allows your db engine to compile the query once and use it multiple times. Otherwise it will have to optimize and compile the query each time.
Figure out a way to make IN work. Instead of using LIKE, try ... left(clei,8) in ('IDX100AB','IDX100AC','IDX101AB'...) (a sketch follows after these suggestions).
Do the calculations/math on the db side. Build a stored proc which takes a list of part/clei numbers and outputs the same list with the computed prices. You'll have a lot more control of execution and a lot less network overhead. If not a stored proc, build a view.
Paginate. If this data is being displayed somewhere, switch to processing in batches of 100 or less.
Build a cheat sheet. If speed is an issue try precomputing prices into a separate table nightly, include some partial clei/part numbers if needed. Then use the precomputed lookup table.
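A rough sketch of the IN suggestion combined with parameterized queries, assuming PDO, a 500-item batch size, and that each part's CLEI prefix is its first 8 characters ($pdo, $partsWithClei and everything except the generic_pricing table and clei column from the question are assumptions):

// Collect the distinct 8-character prefixes once, then query in batches
// instead of issuing 25,000 individual lookups.
$prefixes = array_unique(array_map(
    function ($part) { return substr($part['clei'], 0, 8); },
    $partsWithClei
));

$matches = [];
foreach (array_chunk($prefixes, 500) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare(
        "SELECT * FROM generic_pricing WHERE LEFT(clei, 8) IN ($placeholders)"
    );
    $stmt->execute($chunk);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        // Group pricing rows by prefix so later lookups are array accesses.
        $matches[substr($row['clei'], 0, 8)][] = $row;
    }
}

Note that LEFT(clei, 8) on the column side prevents an ordinary index on clei from being used for this predicate; if the prefixes really are prefixes, a precomputed prefix column (as in the cheat-sheet suggestion) or clei LIKE 'prefix%' comparisons can make better use of an index.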
To get the number of rows in a result set there are two ways:
One is to use a query to get the count:
$query="Select count(*) as count from some_table where type='t1'";
and then retrieve the value of count.
The other is to get the count via num_rows() in PHP.
So which one is better performance-wise?
If your goal is to actually count the rows, use COUNT(*). num_rows is ordinarily (in my experience) only used to confirm that more than zero rows were returned, and to continue on in that case. Even if the query itself takes the same amount of time, it will probably take MySQL longer to send many selected rows back to the client than to do the aggregation with COUNT.
There are a few differences between the two:
num_rows is the number of result rows (records) received.
count(*) is the number of records in the database matching the query.
The database may be configured to limit the number of returned results (MySQL allows this for instance), in which case the two may differ in value if the limit is lower than the number of matching records. Note that limits may be configured by the DBA, so it may not be obvious from the SQL query code itself what limits apply.
Using num_rows to count records implies "transmitting" each record, so if you only want a total number (which is a single record/row) you are far better off getting the count instead.
Additionally, count can be used in more complex query scenarios to do things like subtotals, which is not easily done with num_rows.
count is much more efficient, both performance-wise and memory-wise, as you don't have to retrieve as much data from the database server. If you count a single column, such as a unique id, you can make it slightly more efficient still.
It depends on your implementation. If you're dealing with a lot of rows, count(*) is better because it doesn't have to pass all of those rows to PHP. If, on the other hand, you're dealing with a small amount of rows, the difference is negligible.
num_rows() is fine if you have a small number of rows, but count(*) gives better performance when there is a large number of rows and you only have to select one value and send it to PHP.
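For illustration, the two approaches look roughly like this with mysqli (the table and WHERE clause are taken from the question; the connection details are placeholders):

$mysqli = new mysqli('localhost', 'user', 'pass', 'db');

// 1. Let the server do the counting: only one row travels over the wire.
$result = $mysqli->query("SELECT COUNT(*) AS count FROM some_table WHERE type = 't1'");
$total  = (int) $result->fetch_assoc()['count'];

// 2. Fetch every matching row and count them client-side: the whole
//    result set has to be transferred even though only the number is used.
$result = $mysqli->query("SELECT * FROM some_table WHERE type = 't1'");
$total  = $result->num_rows;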
In my program I launch an SQL query and get back a result resource. I then iterate through the rows of this result resource using the mysql_fetch_array() function and use the contents of the fields of each row to construct a further SQL query.
The result of launching this second query is the first set of results that I want. However, because this does not produce many results, I want to make the search less specific by dropping the last record used to make the query.
e.g. the query which produces the first set of results I want could be:
SELECT uid FROM users WHERE (gender='male' AND relationship_status='single'
AND shoe_size=10)
I would then want to drop the last record so that my query becomes:
SELECT uid FROM users WHERE (gender='male' AND relationship_status='single')
I have already written code to produce the first query, but as I mentioned above I use the mysql_fetch_array function to iterate through ALL of the records. In subsequent "rounds" I only want to iterate through successively fewer records so that my query is less specific. How can I do this?
This seems like a very inefficient method anyway, so I'm open to any simple ideas which might make it more efficient.
EDIT: Thanks for the reply. Yes, I am actually doing this in my program. I am basically trying to implement a basic search algorithm by taking all the preferences a user has specified in the DB and using them to form a query that looks for people with those preferences. So the first time I search using all the criteria, then on successive attempts I search using one less criterion and negate the user ids which were previously returned. At the moment I am constructing the query from scratch for each "round", but I want to find a way to do this using the last query.
Using the queries above, you could do:
SELECT uid
FROM users
WHERE uid NOT IN (
    SELECT uid
    FROM users
    WHERE
        (gender='male'
        AND relationship_status='single'
        AND shoe_size=10)
)
This will essentially turn your first query into a subquery and use it to negate the results returned, i.e. it will return all the rows NOT IN the first query.
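Building on the NOT IN idea, here is a sketch of how the successive "rounds" from the edit could reuse one criteria list instead of rebuilding the query from scratch each time (the preferences table, its field/value columns and $currentUid are hypothetical; the old mysql_* API is kept only to match the question, and uids are assumed to be numeric):

// Build the criteria once from the rows of the first query.
$criteria = [];
$prefResult = mysql_query("SELECT field, value FROM preferences WHERE uid = $currentUid");
while ($pref = mysql_fetch_array($prefResult)) {
    $criteria[] = $pref['field'] . "='" . mysql_real_escape_string($pref['value']) . "'";
}

$found = [];  // uids already returned by more specific rounds

for ($n = count($criteria); $n >= 1; $n--) {
    $where = implode(' AND ', array_slice($criteria, 0, $n));
    if ($found) {
        // Negate users matched in earlier, more specific rounds.
        $where .= ' AND uid NOT IN (' . implode(',', array_map('intval', $found)) . ')';
    }
    $result = mysql_query("SELECT uid FROM users WHERE ($where)");
    while ($row = mysql_fetch_array($result)) {
        $found[] = $row['uid'];
    }
}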
When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
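Since the question mentions PHP and Postgres, here is a minimal sketch of consuming that windowed query with the pgsql extension ($conn, the column and table names, and the filter are placeholders standing in for the answer's <some condition> and <some col>):

$result = pg_query_params($conn,
    'SELECT foo, count(*) OVER() AS full_count
     FROM   bar
     WHERE  category = $1        -- placeholder condition
     ORDER  BY created_at        -- placeholder sort column
     LIMIT  $2 OFFSET $3',
    [$category, $pageSize, $offset]
);

$rows      = pg_fetch_all($result) ?: [];
$fullCount = $rows ? (int) $rows[0]['full_count'] : 0;  // 0 rows: see the corner case above
$pages     = (int) ceil($fullCount / $pageSize);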
Sequence of events in a SELECT query
0. (CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. WHERE clause (and JOIN conditions, though none in your example) filters qualifying rows from the base table(s). The rest is based on the filtered subset.
2. (GROUP BY and aggregate functions would go here.) Not here.
3. (Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY.
6. (DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select the rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
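For example, with the pgsql extension the number of rows actually returned by the limited query can be read straight from the result (this is the count after LIMIT/OFFSET was applied, not the full count):

$result = pg_query($conn, 'SELECT foo FROM bar ORDER BY foo LIMIT 20 OFFSET 40');
$rowsOnThisPage = pg_num_rows($result);  // rows on the current page only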
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to run the query in its entirety, even if the LIMIT clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing LIMIT is used for, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the row you are ordering by for the last row of the previous page (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
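A concrete sketch of that rewrite, using a hypothetical items table ordered by an indexed id column (1000 stands in for the last id seen on the previous page):

-- Classic OFFSET pagination: the DB still has to read past the first 1000 rows.
SELECT * FROM items ORDER BY id LIMIT 10 OFFSET 1000;

-- Keyset pagination: seek directly from the last row of the previous page.
SELECT * FROM items WHERE id > 1000 ORDER BY id LIMIT 10;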
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
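A minimal sketch of that caching idea using APCu (the cache backend, key scheme and the 5-minute TTL are assumptions; any cache with expiry works the same way):

$cacheKey = 'page_count_' . md5($whereSql);      // one entry per distinct filter

$total = apcu_fetch($cacheKey, $hit);
if (!$hit) {
    // Pay for the COUNT query at most once per 5-minute window.
    // $whereSql is assumed to be built from trusted, already-escaped values.
    $res   = pg_query($conn, "SELECT count(*) FROM bar WHERE $whereSql");
    $total = (int) pg_fetch_result($res, 0, 0);
    apcu_store($cacheKey, $total, 300);
}
$pages = (int) ceil($total / $pageSize);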
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.