MySQL + PHP: new query for each row of a query?

This is a subject that pops up all the time where I am. For a type of query that returns a list of rows, we often want to perform a further query that gathers more information about each specific row; this often includes queries that themselves return a list of rows. An example would be an orders system that returns a list of customers, where each customer 'row' may also show a list of their orders (perhaps in a pop-up dialog).
Is it generally "better" to:
Perform one single query, using GROUP_CONCAT where possible, and split out the results programmatically (there are limitations on the length of a returned concatenation)
Perform 'child queries' for each row while looping through the results of a 'parent query'
Perform one 'parent query' to return the customer list and one 'orders' query using the SQL IN keyword to match the customer_IDs returned from the previous query. Looping through the results of the customer query, we can see if the customer_ID exists in the orders query and show the orders that match.
Perform the second query as and when it's needed, the reasoning being that we don't always want to see the child results for every parent result (in a web app, we could use AJAX to grab the child results)
Something else?
I have been leaning towards #2, as conceptually it seems like the cleanest solution, but I can't help thinking that it is a resource hog. Doing our own benchmarks for a particular set of results, #3 comes out quickest. #4 seems like it should be the quickest, as some applications don't need to show all results; however, the intention might be to have the result ready and waiting rather than to need another roundtrip to retrieve that row's child data. I'm not entirely sure how the mechanics of FETCH_ASSOC etc. work, but any recommendations are very welcome!

I think #3 is better.
I suggest getting all customers, then a list of all the orders for these customers (customer_ID IN (...)), then dispatching each order to the correct customer on PHP's side if needed.
This way, you only run two queries to get all the information, and the dispatching part may even be avoided (depending on the logic you need after this query).
Remember that most of the overhead of a query is the roundtrip itself (transferring the query, then getting the data back).
Databases are highly optimised for things like searches and joins, so selecting the data itself isn't the bottleneck (until you reach very high row counts).
Additionally, if the IN clause matches against an indexed column, the database won't even have to search for the values; it will just look them up in the index and go directly to each row.
Depending on your application, #4 is better if the user is only going to look at one or two order lists out of, say, 100 customers displayed.
Either way, making an SQL query inside a loop is generally considered bad practice/bad design/bad logic.
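As a rough illustration of that two-query approach with mysqli (a sketch only: the orders table and its order_ID/total columns are assumptions based on the example, not a known schema):
// Sketch: one query for the customers, one IN (...) query for all their orders
$customers = array();
$result = $mysqli->query("SELECT customer_ID, name FROM customers");
while ($row = $result->fetch_assoc()) {
    $row['orders'] = array();               // will hold this customer's orders
    $customers[$row['customer_ID']] = $row; // index by ID for easy dispatching
}

if ($customers) {
    $ids = implode(',', array_map('intval', array_keys($customers)));
    $result = $mysqli->query("SELECT order_ID, customer_ID, total FROM orders WHERE customer_ID IN ($ids)");
    while ($row = $result->fetch_assoc()) {
        $customers[$row['customer_ID']]['orders'][] = $row; // dispatch to its customer
    }
}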

Related

How to avoid calling DB queries inside of loops for child data

I've been struggling with trying to figure out what is the best, most efficient way to handle displaying child data. In my specific case I'm using PHP and MySQL, but I feel that this is more of a "generally in any language" sort of deal.
My two thoughts are (for this example I'll be listing invoices and their line items)
Joining the child data (invoice items) to the main data (invoices) so as to only have a single query. My problem with this is that, say I have 500 line items on an invoice (probably not realistic, but things happen), then I would have sent the overall invoice data 500 times from the MySQL server to my PHP script, and that just sounds ridiculous since I only need it the one time.
And the second option would be to, while looping through the invoices and displaying the overall invoice data, select each invoice's line items. And this, of course, is now contacting the database 500 more times.
Are there any other options for dealing with this data that makes logical sense (with the given schema)? I'm almost 100% sure there are, since I can't believe that I'm the first person to think about this issue, but I think I'm just having difficulty finding the right way to search for more information on this topic.
Joining the child data (invoice items) to the main data (invoices) as to only have a single query
That's the conventional way of handling this requirement. It does, indeed, involve redundant data, and there is some overhead.
But.
That's the reason it's possible to specify a compressed connection to an RDBMS from a client ... compression mitigates the network overhead of the redundant data.
The redundant data in a single result set costs much less than the repeated queries.
Most folks just retrieve the redundant data in this kind of application. Program products like Crystal Reports do this.
If that just won't work for you, retrieve and save one result set for your master records ... maybe something like this.
SELECT master_id, name, address, whatever
FROM master m
WHERE m.whatever = whatever
ORDER BY whatever
Then, put those into an associative array by master_id.
Then, retrieve the detail records.
SELECT d.master_id, d.detail_id, d.item, d.unit, d.quantity
FROM detail d
JOIN master m ON d.master_id = m.master_id
WHERE m.whatever = whatever
ORDER BY d.master_id, d.detail_id, whatever
You'll get a result set with all your relevant detail records (invoice items) tagged with the master_id value. You can then match them up to the master records in your PHP program. You're basically doing the join in your application code.
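In PHP, that matching step might look roughly like this ($masterResult and $detailResult are stand-ins for the result handles of the two queries above; the column names follow those queries):
// Index the master rows by master_id (first query above)
$masters = array();
while ($row = $masterResult->fetch_assoc()) {
    $row['details'] = array();
    $masters[$row['master_id']] = $row;
}

// Attach each detail row to its master (second query above)
while ($row = $detailResult->fetch_assoc()) {
    $masters[$row['master_id']]['details'][] = $row;
}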
If all that sounds like too much of a pain in the neck... just go for the redundant data and get your project done. You can always optimize later if you must.

Efficient mysql query for product catalogue

So I have a website with a product catalog. The catalog page has a few product sliders: one for recent products, another for bestsellers, and a third one for special offers.
Is it better to create a query for each type of slider, or should I get all products and then have PHP sort them out and separate them into distinct arrays, one for each slider?
Currently I am just doing
SELECT * FROM products WHERE deleted = 0
For testing.
It's almost always best to refine the query so that it returns only the records you actually need, and only the columns you really need. So, in your case this would look like
SELECT id, description, col3
FROM products
WHERE deleted = 0
AND -- conditions that make it fit in the proper slider
The reason is that it also costs resources (time and bandwidth) to transport a result set over to the processing program, plus processing time to evaluate the individual returned records, and thus the "cost" of the query will grow with the table size, not with the size of the dataset you actually need.
Just an example, but let's suppose you want to create a top 10 seller list, and you have 100 items in store. You'll retrieve 100 records and retain 10. No big deal. Now your shop grows and you have 100000 items in store. For your top 10 you'll have to plow through all of these and throw away 99990 records: 99990 records you'll have the db server read and transfer to you, and 99990 records you have to individually inspect to see whether they're a top 10 item.
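As a rough sketch of such a refined query for, say, the bestseller slider ($mysqli stands for your database connection, and units_sold is a hypothetical ranking column, not something from your actual schema):
// Sketch: let the server do the ranking and only return the 10 rows the slider needs
$bestsellers = $mysqli->query(
    "SELECT id, description, price
       FROM products
      WHERE deleted = 0
      ORDER BY units_sold DESC
      LIMIT 10"
);
while ($row = $bestsellers->fetch_assoc()) {
    // render one bestseller slider entry
}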
As this type of query is going to be executed often, it's also a good idea to optimize it by indexing the search columns, as indexed searches in the db server are much faster.
I said "almost always" because there are rare cases where you have a hard time expressing in SQL what you actually need, or where you need to use fairly exotic techniques like query hints to force the database engine to execute your query in a reasonably efficient way. But these cases are fairly rare, and when a query doesn't perform as expected, with some analysis you'll manage to improve its performance in most cases - have a look at the literally thousands of questions regarding query optimization here on SO.

What's faster, db calls or resorting an array?

In a site I maintain I have a need to query the same table (articles) twice, once for each category of article. AFAICT there are basically two ways of doing this (maybe someone can suggest a better, third way?):
Perform the db query twice, meaning the db server has to sort through the entire table twice. After each query, I iterate over the cursor to generate html for a list entry on the page.
Perform the query just once and pull out all the records, then sort them into two separate arrays. After this, I have to iterate over each array separately in order to generate the HTML.
So it's this:
$newsQuery = $mysqli->query("SELECT * FROM articles WHERE type='news' ");
while($newRow = $newsQuery->fetch_assoc()){
    // generate article summary in html
}
// repeat for informational articles
vs this:
$query = $mysqli->query("SELECT * FROM articles ");
$news = Array();
$info = Array();
while($row = $query->fetch_assoc()){
    if($row['type'] == "news"){
        $news[] = $row;
    }else{
        $info[] = $row;
    }
}
// iterate over each array separate to generate article summaries
The recordset is not very large, currently <200, and will probably grow to 1000-2000. Is there a significant difference in the times between the two approaches, and if so, which one is faster?
(I know this whole thing seems awfully inefficient, but it's a poorly coded site I inherited and have to take care of without a budget for refactoring the whole thing...)
I'm writing in PHP, no framework :( , on a MySql db.
Edit
I just realized I left out one major detail. On a given page in the site, we will display (and thus retrieve from the db) no more than 30 records at once - but here's the catch: 15 info articles, and 15 news articles. On each page we pull the next 15 of each kind.
You know you can sort in the DB right?
SELECT * FROM articles ORDER BY type
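For instance, something along these lines (just a sketch) would let you output the articles section by section without building intermediate arrays:
// Sketch: one ordered query; start a new section whenever the type changes
$query = $mysqli->query("SELECT * FROM articles ORDER BY type");
$currentType = null;
while ($row = $query->fetch_assoc()) {
    if ($row['type'] !== $currentType) {
        $currentType = $row['type'];
        echo "<h2>" . htmlspecialchars($currentType) . "</h2>";
    }
    // generate article summary in html
}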
EDIT
Due to the change made to the question, I'm updating my answer to address the newly revealed requirement: 15 rows for 'news' and 15 rows for not-'news'.
The gist of the question is the same: "which is faster... one query or two separate queries?". The gist of the answer remains the same: each database roundtrip incurs overhead (extra time, especially over a network connection to a separate database server), so with all else being equal, reducing the number of database roundtrips can improve performance.
The new requirement really doesn't impact that. What the newly revealed requirement really impacts is the actual query to return the specified resultset.
For example:
( SELECT n.*
FROM articles n
WHERE n.type='news'
LIMIT 15
)
UNION ALL
( SELECT o.*
FROM articles o
WHERE NOT (o.type<=>'news')
LIMIT 15
)
Running that statement as a single query is going to require fewer database resources, and be faster than running two separate statements, and retrieving two disparate resultsets.
We weren't provided any indication of what the other values for type can be, so the statement offered here simply addresses two general categories of rows: rows that have type='news', and all other rows that have some other value for type.
That query assumes that type allows for NULL values, and we want to return rows that have a NULL for type. If that's not the case, we can adjust the predicate to be just
WHERE o.type <> 'news'
Or, if there are specific values for type we're interested in, we can specify that in the predicate instead
WHERE o.type IN ('alert','info','weather')
If "paging" is a requirement... "next 15", the typical pattern we see applied, LIMIT 30,15 can be inefficient. But this question isn't asking about improving efficiency of "paging" queries, it's asking whether running a single statement or running two separate statements is faster.
And the answer to that question is still the same.
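As a rough sketch, running that combined statement from PHP and splitting the rows by type might look like this (NULL and other non-'news' types all land in the second list, consistent with the predicate above):
// Sketch: one UNION ALL roundtrip, split in PHP
$sql = "( SELECT n.* FROM articles n WHERE n.type='news' LIMIT 15 )
        UNION ALL
        ( SELECT o.* FROM articles o WHERE NOT (o.type<=>'news') LIMIT 15 )";
$result = $mysqli->query($sql);
$news  = array();
$other = array();
while ($row = $result->fetch_assoc()) {
    if ($row['type'] === 'news') {
        $news[] = $row;
    } else {
        $other[] = $row;
    }
}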
ORIGINAL ANSWER below
There's overhead for every database roundtrip. In terms of database performance, for small sets (like you describe) you're better off with a single database query.
The downside is that you're fetching all of those rows and materializing an array. (But, that looks like that's the approach you're using in either case.)
Given the choice between the two options you've shown, go with the single query. That's going to be faster.
As far as a different approach, it really depends on what you are doing with those arrays.
You could actually have the database return the rows in a specified sequence, using an ORDER BY clause.
To get all of the 'news' rows first, followed by everything that isn't 'news', you could
ORDER BY type<=>'news' DESC
That's MySQL short hand for the more ANSI standards compliant:
ORDER BY CASE WHEN type = 'news' THEN 1 ELSE 0 END DESC
Rather than fetch every single row and store it in an array, you could just fetch from the cursor as you output each row, e.g.
while($row = $query->fetch_assoc()) {
echo "<br>Title: " . htmlspecialchars($row['title']);
echo "<br>byline: " . htmlspecialchars($row['byline']);
echo "<hr>";
}
The best way of dealing with a situation like this is to test it for yourself. It doesn't matter how many records you have at the moment; you can simulate whatever amount you'd like, that's never a problem. Also, 1000-2000 is really a small set of data.
I somewhat don't understand why you'd have to iterate over all the records twice. You should never retrieve all the records in a query either way, but only the small subset you need to work with. In a typical site where you manage articles it's usually about 10 records per page MAX. No user will ever go through 2000 articles in a way that would force you to pull all the records at once. Utilize paging and smart querying.
// iterate over each array separate to generate article summaries
Not really sure what you mean by this, but something tells me this data should be stored in the database as well. I really hope you're not generating article excerpts on the fly for every page hit.
It all sounds to me more like a bad architecture design than anything else...
PS: I believe sorting/ordering/filtering of database data should be done on the database server, not in the application itself. You may save some traffic by doing a single query, but it won't help much if you transfer too much data at once that you won't be using anyway.

Performance issues with mongo + PHP with pagination, distinct values

I have a MongoDB collection containing lots of books with many fields. Some key fields that are relevant to my question are:
{
    book_id : 1,
    book_title : "Hackers & Painters",
    category_id : "12",
    related_topics : [ { topic_id : "8", topic_name : "Computers" },
                       { topic_id : "11", topic_name : "IT" } ],
    ...
    ... (at least 20 fields more)
    ...
}
We have a form for filtering results (with many inputs/selectbox) on our search page. And of course there is also pagination. With the filtered results, we show all categories on the page. For each category, number of results found in that category is also shown on the page.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding the "find" function all the filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collect the categories and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the mongodb::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method, but the command function doesn't accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way of sending filter parameters with the mongodb::command function, it will be another query in the process and will take approximately the same time (maybe more) as the previous query, I think. And this will be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query, and yet another speed penalty :(
I am new to working with MongoDB. Maybe I'm looking at the problems from the wrong point of view. Can you help me solve the problems and give your opinions about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array())
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to "count()" is to load each document and check. If you do skip() and limit() with no sort(), then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
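For the distinct part, the distinct database command does take a query document, so the same filter can be passed along with it. A rough sketch with the legacy PHP driver ($db standing in for your MongoDB database object, and 'books' for the collection name):
// Sketch: distinct values restricted to the current filter
$result = $db->command(array(
    'distinct' => 'books',          // collection name (assumption)
    'key'      => 'category_id',
    'query'    => $filter_params,   // same filters used for find()
));
$categoryIds = $result['values'];   // distinct category_id values matching the filter
Note that this only gives the distinct values; per-category counts would still need group/aggregation or map/reduce, as described above.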
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() (no matter whether you use an index or not) will have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-Tree) that can compare values to each other: it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based indexing: a B-Tree has no way to tell you quickly what the 15,000th element is; it will have to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections. Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
Make sure you really need this feature: Typically, nobody cares for the 42436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
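If you do go for the range-based paging mentioned in the quote above, a rough sketch with the legacy PHP driver would be to remember the last _id of the previous page and filter past it, instead of skipping:
// Sketch of range-based paging: filter past the last _id shown, no skip()
$filter = $filter_params;
if (isset($lastId)) {                        // $lastId = _id of the last doc on the previous page
    $filter['_id'] = array('$gt' => $lastId);
}
$page = $lib_collection->find($filter)
                       ->sort(array('_id' => 1))
                       ->limit(20);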
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem, and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writes then need to go to both databases.

Tracking a total count of items over a series of paged results

What is the ideal way to keep track of the total count of items when dealing with paged results?
This seems like a simple question at first but it is slightly more complicated (to me... just bail now if you find this too stupid for words) when I actually start thinking about how to do it efficiently.
I need to get a count of items from the database. This is simple enough. I can then store this count in some variable (a $_SESSION variable, for instance). I can check to see if this variable is set and, if it isn't, get the count again. The tricky part is deciding the best way to determine when I need to get a new count. It seems I would need a new count if I have added/deleted items, or if I am reloading or revisiting the grid.
So, how would I decide when to clear this $_SESSION variable? I can see clearing it and getting a new count after an update/delete (or even adding to or subtracting from it to avoid the potentially expensive database hit), but (here comes the part I find tricky) what about when someone navigates away from the page, waits a variable amount of time before going to the next page of results, or reloads the page?
Since we may be dealing with tens or hundreds of thousands of results, getting a count of them from the database could be quite expensive (right? Or is my assumption incorrect?). Since I need the total count to handle the total number of pages in the paged results... what's the most efficient way to handle this sort of situation and to persist it for... as long as might be needed?
BTW, I would get the count with an SQL query like:
SELECT COUNT(id) FROM foo;
I never use a session variable to store the total found by a query. I include SQL_CALC_FOUND_ROWS in the regular query when I get the information, and the count itself comes from a second query:
// first query
SELECT SQL_CALC_FOUND_ROWS * FROM table LIMIT 0, 20;
// I don't actually use * but just select the columns I need...
// second query
SELECT FOUND_ROWS();
I've never noticed any performance degradation because of the second query, but I guess you will have to measure that if you want to be sure.
By the way, I use this with PDO; I haven't tried it in plain MySQL.
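With PDO, that might look roughly like this (a sketch only; the column names and page size are placeholders):
// Sketch: the two queries above, run through an existing PDO connection $pdo
$stmt = $pdo->query("SELECT SQL_CALC_FOUND_ROWS id, title FROM foo LIMIT 0, 20");
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

$total = (int) $pdo->query("SELECT FOUND_ROWS()")->fetchColumn();
$pages = (int) ceil($total / 20);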
Why store it in a session variable? Will the result change per user? I'd rather store it in a user cache like APC or memcached, choose the cache key wisely, and then clear it when inserting or deleting a record related to the query.
A good way to do this would be to use an ORM that does it for you, like Doctrine, which has a result cache.
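If you manage the cache by hand instead of through an ORM, the memcached idea above might look roughly like this (the key name and TTL are arbitrary, and $pdo stands in for your database connection):
// Sketch: cache the total count, and clear it whenever foo gains or loses rows
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key = 'foo_total_count';
$total = $cache->get($key);
if ($total === false) {                        // not cached yet (or expired)
    $total = (int) $pdo->query("SELECT COUNT(id) FROM foo")->fetchColumn();
    $cache->set($key, $total, 300);            // cache for 5 minutes
}

// ...and after any successful INSERT or DELETE on foo:
$cache->delete($key);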
To get the count, I know that using COUNT(*) is not worse than using COUNT(id). (question: Is it even better?)
EDIT: interesting article about this on the MySQL performance blog
Most likely foo has a PRIMARY KEY index defined on the id column. Indexed COUNT() queries are usually quite easy on the DB.
However, if you want to go the extra mile, another option would be to insert a special hook into the code that deals with inserting and deleting rows in foo. Have it write the total number of records into a protected file after each insert/delete and read the count from there. If every successful insert/delete gets accounted for, the number in the protected file is always up to date.
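A minimal sketch of that hook idea (the file path is arbitrary, a real implementation would need to think harder about concurrent writers, and you could increment/decrement instead of recounting):
// Sketch: keep the row count of foo in a protected file, refreshed on insert/delete
define('COUNT_FILE', __DIR__ . '/protected/foo.count');

function refreshRowCount(PDO $pdo) {
    $total = (int) $pdo->query("SELECT COUNT(id) FROM foo")->fetchColumn();
    file_put_contents(COUNT_FILE, $total, LOCK_EX);  // call after each insert/delete
    return $total;
}

function readRowCount() {
    return (int) file_get_contents(COUNT_FILE);
}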
