I need to pull several rows from a table and process them in two ways:
aggregated on a key
row-by-row, sorted by the same key
The table looks roughly like this:
table (
key,
string_data,
numeric_data
)
So I'm looking at two approaches to the function I'm writing.
The first would pull the aggregate data with one query, and then query again inside a loop for each set of row-by-row data (the following is PHP-like pseudocode):
$rows = query(
    "SELECT key, SUM(numeric_data)
     FROM table
     GROUP BY key"
);
foreach ($rows as $row) {
    // <process aggregate data in $row>
    $key = $row['key'];
    $row_by_row_data = handle_individual_rows($key);
}

function handle_individual_rows($key)
{
    $rows = query(
        "SELECT string_data
         FROM table WHERE key=?",
        $key
    );
    // <process $rows one row at a time>
    return $processed_data;
}
Or, I could do one big query and let the code do all the work:
$rows = query(
    "SELECT key, string_data, numeric_data
     FROM table
     ORDER BY key"
);
foreach ($rows as $row) {
    // <process rows individually and calculate aggregates as I go>
}
Performance is not a practical concern in this application; I'm just looking to write sensible and maintainable code.
I like the first option because it's more modular -- and I like the second option because it seems structurally simple. Is one option better than the other or is it really just a matter of style?
One SQL query, for sure.
This will
Save you lots of round trips to the database
Allow the use of more efficient GROUP BY methods
Since your aggregates may be performed equally well by the database, it will also be better for maintainability: you have all your result-set logic in one place.
Here is an example of a query that returns every row and calculates a SUM:
SELECT string_data, numeric_data, SUM(numeric_data) OVER (PARTITION BY key)
FROM table
Note that this will most probably use parallel access to calculate the SUMs for different keys, which is hard to implement in PHP.
The same query in MySQL (which, in older versions, lacks window functions) can be written with a correlated subquery:
SELECT  key, string_data, numeric_data,
        (
        SELECT  SUM(numeric_data)
        FROM    table ti
        WHERE   ti.key = t.key
        ) AS key_sum
FROM    table t
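Either form gives you one result row per source row with that key's sum attached, so the PHP side stays a single loop. Here is a minimal, hypothetical sketch of consuming it, reusing the question's query() pseudo-helper and column names, with an ORDER BY added to match the row-by-row requirement:

// Hypothetical sketch: one result set carries both the row data and its key's sum.
// query() is the question's pseudo-helper; key_sum comes from either query above.
$rows = query(
    "SELECT key, string_data, numeric_data,
            SUM(numeric_data) OVER (PARTITION BY key) AS key_sum
     FROM table
     ORDER BY key"
);
foreach ($rows as $row) {
    // row-by-row work uses $row['string_data'] and $row['numeric_data'];
    // the aggregate for this row's key is already available in $row['key_sum']
}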
If performance isn't a concern, I'd go with the second. Seems the tiniest bit friendlier.
If performance were a concern, my answer would be "don't think, profile". :)
The second option is by far the clearer, more sensible and more maintainable one. You're saying the same thing with less code, which is usually better.
And I know you said performance is not a concern, but why fetch the data more times than you have to?
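For what it's worth, here is a rough sketch of that second option (my own illustration, reusing the question's query() pseudo-helper and column names, with an ORDER BY added so the rows arrive sorted by key): the aggregates are accumulated per key while each row is processed.

// Sketch: single query, aggregates accumulated in PHP while each row is handled.
$rows = query(
    "SELECT key, string_data, numeric_data
     FROM table
     ORDER BY key"
);

$sums = array();
foreach ($rows as $row) {
    // ... process the individual row here (e.g. $row['string_data']) ...
    if (!isset($sums[$row['key']])) {
        $sums[$row['key']] = 0;
    }
    $sums[$row['key']] += $row['numeric_data'];
}
// $sums now holds SUM(numeric_data) per key, matching the GROUP BY query.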
I can't be certain from the example here, but I'd like to know if there's a chance to do the aggregation and other processing right in the SQL query itself. In this case, you'd have to evaluate "more maintainable" with respect to your relative comfort level expressing that processing in SQL code vs. PHP code.
Is there something about the additional processing you need to do on each row that would prevent you from expressing everything in the SQL query itself?
I don't think you'll find many situations at all where doing a query-per-iteration of a loop is the better choice. In fact, I'd say it's probably a good rule of thumb to never do that.
In other words, the fewer round trips to the database, the better.
Depending on your data and actual tables, you might be able to let SQL do the aggregation work and select all the rows you need with one query.
One SQL query is probably the better idea.
It saves you from re-implementing relational operations in your own code.
I think you've somewhat answered your own question, because you say you have two different kinds of processing: one aggregated and one row by row.
If you want to keep everything readable and maintainable, mixing both into a single query doesn't sound right; the query would be answering two different needs, so it wouldn't be very readable.
Even if performance is not an issue, it's faster to do the aggregation on the DB server than in code.
With only one query, the code that handles the result would mix two kinds of processing, handling rows and computing aggregates at the same time, so over time that code will tend to get confusing and buggy.
The same code might evolve over time; for instance, the row-by-row part can get complex and could introduce bugs into the aggregation part, or the other way around.
If in the future you need to split these two treatments, it will be harder to disentangle code that, by then, somebody else wrote ages ago...
Performance considerations aside, in terms of maintainability and readability I'd recommend using two queries.
But keep in mind that the performance factor might not be an issue at the moment, yet it can become one once the DB volume grows; it's never a negligible factor in the long term...
Even if performance is not an issue, your mind is. When a musician practices, every movement is intended to improve the musician's skill. As a developer, you should write every procedure in a way that improves your skill. Iterative loops through data are sloppy and ugly; SQL queries are elegant. Do you want to develop more elegant code or sloppier code?
Related
Is there any advantages to having nested queries instead of separating them?
I'm using PHP to query MySQL frequently, and I would like to separate the queries for better organization. For example:
Is:
$query = "SELECT words.unique_attribute
FROM words
LEFT JOIN adjectives ON adjectives.word_id = words.id
WHERE adjectives = 'confused'";
return $con->query($query);
Faster/Better than saying:
$query = "SELECT word_id
FROM adjectives
WHERE adjectives = 'confused';";
$id = getID($con->query($query));
$query = "SELECT unique_attribute
FROM words
WHERE id = $id;";
return $con->query($query);
The second option would give me a way to make a select function so I wouldn't have to repeat so much query-string code, but if making so many additional calls (these can get very deeply nested) would be very bad for performance, I might keep the combined query. Or at least look out for it.
Like most questions containing 'faster' or 'better', it's a trade-off and it depends on which part you want to speed up and what your definition of 'better' is.
Compared with the two separate queries, the combined query has the advantages of:
speed: you only need to send one query to the database system, the database only needs to parse one query string, only needs to compose one query plan, only needs to push one result back up and through the connection to PHP. The difference (when not executing these queries thousands of times) is very minimal, however.
atomicity: the query in two parts may deliver a different result from the combined query if the words table changes between the first and second query (although in this specific example this is probably not a constantly-changing table...); see the sketch after this list for one way to get a consistent view with split queries.
At the same time the combined query also has the disadvantage of (as you already imply):
re-usability: the split queries might come in handy when you can re-use the first one and replace the second one with something that selects a different column from the words table or something from another table entirely. This disadvantage can be mitigated by using something like a query builder (not to be confused with an ORM!) to dynamically compose your queries, adding where clauses and joins as needed. For an example of a query builder, check out Zend\Db\Sql.
locking: depending on the storage engine and storage engine version you are using, tables might get locked. Most select statements do not lock tables however, and the InnoDB engine definitely doesn't. Nevertheless, if you are working with an old version of MySQL on the MyISAM storage engine and your tables are under heavy load, this may be a factor. Note that even if the combined statement locks the table, the combined query will offer faster average completion time because it is faster in total while the split queries will offer faster initial response (to the first query) while still needing a higher total time (due to the extra round trips et cetera).
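To illustrate the atomicity point above, here is a rough sketch (my own, not from the original post) of the split queries wrapped in a single transaction so that both reads come from one consistent snapshot. It assumes InnoDB with its default REPEATABLE READ isolation, a hypothetical PDO connection $pdo, and keeps the question's column names as-is:

// Hypothetical sketch: both SELECTs run inside one transaction, so (with InnoDB's
// default REPEATABLE READ isolation) they read from a single consistent snapshot.
$pdo->beginTransaction();

$stmt = $pdo->prepare("SELECT word_id FROM adjectives WHERE adjectives = ?");
$stmt->execute(['confused']);
$id = $stmt->fetchColumn();

$stmt = $pdo->prepare("SELECT unique_attribute FROM words WHERE id = ?");
$stmt->execute([$id]);
$uniqueAttribute = $stmt->fetchColumn();

$pdo->commit();

Under REPEATABLE READ the snapshot is established at the first read in the transaction, so the second SELECT cannot see rows committed in between, which is the property the combined query gives you for free.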
It would depend on the size of those tables and where you want to place the load. If those tables are large and seeing a lot of activity, then the second version with two separate queries would minimise the lock time you might see as a result of the join. However if you've got a beefy db server with fast SSD storage, you'd be best off avoiding the overhead of dipping into the database twice.
All things being equal I'd probably go with the former - it's a database problem so it should be resolved there. I imagine those tables wouldn't be written to particularly often so I'd ensure there's plenty of MySQL cache available and keep an eye on the slow query log.
I have these 2 MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I wanted to have this multidimensional array format
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1'  => value,
        'ColumnA2'  => value,
        'TableB'    => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1'  => value,
                'ColumnB2'  => value
            )
        )
    )
);
so that I can loop it in this way
foreach ($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1'  . $TableA['ColumnA1'];
    echo 'ColumnA2'  . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach ($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}
My problem is: what is the best or proper way of querying the MySQL database so that I can achieve this goal?
Solution 1 --- the one I'm using
$array = array();
$rs = mysqli_query("SELECT * FROM TableA", $con);
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
    // $array = result in array
    $row['TableB'] = $array2;
}
I have doubts about my code because it's always querying the database.
Solution 2
$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
// Code
}
The second solution only queries once, but if I have thousands of rows in TableA and a thousand rows in TableB for each TableA.ColumnAId (1 TableA.ColumnAId = 1000 TableB rows), will solution 2 take much more time than solution 1?
Neither of the two proposed solutions is likely optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase the overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
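For example, here is a rough sketch (my own, not part of the original answer) of building the nested array the question asks for from that single JOIN, assuming procedural mysqli, an existing connection $con, and the column names given in the question:

// Sketch: one JOIN query, then group the flat result rows by ColumnAId in PHP.
$sql = "SELECT TableA.ColumnAId, TableA.ColumnA1, TableA.ColumnA2,
               TableB.ColumnBId, TableB.ColumnB1, TableB.ColumnB2
        FROM TableA
        JOIN TableB ON TableA.ColumnAId = TableB.ColumnAId";
$rs = mysqli_query($con, $sql);

$array = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $aId = $row['ColumnAId'];
    if (!isset($array[$aId])) {
        $array[$aId] = array(
            'ColumnAId' => $row['ColumnAId'],
            'ColumnA1'  => $row['ColumnA1'],
            'ColumnA2'  => $row['ColumnA2'],
            'TableB'    => array(),
        );
    }
    $array[$aId]['TableB'][] = array(
        'ColumnBId' => $row['ColumnBId'],
        'ColumnAId' => $row['ColumnAId'],
        'ColumnB1'  => $row['ColumnB1'],
        'ColumnB2'  => $row['ColumnB2'],
    );
}
$array = array_values($array); // re-index to match the question's format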
However, speed should NOT be your only consideration here; there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query on a MyISAM table (the older default engine for MySQL), the table is read-locked: WRITE operations on it have to wait until the read-lock is surrendered (either our SELECT query completes or fails). On InnoDB (the current default), the SELECT instead reads from a consistent snapshot, which has the same effect for our purposes: the data our query sees cannot change underneath it. Other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work.
This consistency is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
Oh, and finally, USE PREPARED STATEMENTS!
You should never have a statement that looks like this:
mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
mysqli has support for prepared statements. Read about them and use them; they will help you avoid something terrible happening to your database.
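As a minimal sketch (mine, not from the original post) of the per-key lookup rewritten as a prepared statement with procedural mysqli: it assumes the table is TableB (the question's snippet says Table2), that $columnAId stands in for $row['ColumnAId'] from the outer loop, and it uses mysqli_stmt_get_result(), which additionally requires the mysqlnd driver.

// Prepared-statement sketch: the ID is bound as a parameter, never concatenated.
$stmt = mysqli_prepare($con, "SELECT ColumnBId, ColumnB1, ColumnB2 FROM TableB WHERE ColumnAId = ?");
mysqli_stmt_bind_param($stmt, 'i', $columnAId);
mysqli_stmt_execute($stmt);
$result = mysqli_stmt_get_result($stmt);
while ($rowB = mysqli_fetch_assoc($result)) {
    // ... use $rowB ...
}
mysqli_stmt_close($stmt);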
Definitely the second way. A nested query is an ugly thing, since you pay all the query overheads (execution, network, etc.) every time for every nested query, while a single JOIN query is executed once, i.e. all overheads are paid only once.
The simple rule is not to run queries in loops, in general. There can be exceptions, if a single query would be too complex and should be split for performance reasons, but whether that is the case in a given situation can only be shown by benchmarks and measurements.
If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed data, which means that if you just use it to store data and do the computations in your code, you're doing it wrong anyway.
In such a case you should prefer a different storage/retrieval option, like a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface for InnoDB as well, but that is still using a relational database for key-value storage, a.k.a. the wrong tool for the job).
If you STILL want to use your solution:
Benchmark.
I've often found that issuing multiple queries can be faster than a single query, because MySQL has to parse less query text, the optimizer has less work to do, and more often than not the MySQL optimizer just fails (that's the reason things like STRAIGHT JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - this only applies when update queries are mixed in, though; neither MyISAM's table-level read locks nor InnoDB's row-level locking block plain SELECT queries from each other by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries if you use multiple queries versus a single one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrently running threads (requests) accessing and updating your tables (lock granularity is better with split-up queries, as is cacheability of the queries); if your database is on a different network, i.e. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).
The situation may also change depending on how often the loop is actually executed.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you use to join, it could be worse for either of the two. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on PHP's hash-map algorithm (arrays in PHP are effectively hash maps) and the likelihood of a collision, while MySQL string indexes are normally fixed-length and thus, depending on your data, might not be applicable.
For memory efficiency, the multi-query version is certainly better, as you have the PHP array (which is very inefficient in terms of memory!) in both solutions anyway, but the JOIN might use a temporary table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN on your queries).
In some cases, you should use LIMIT for best performance.
Say you want to show 1000 rows together with some master data that comes from a single query: run the main query in batches, with a LIMIT of somewhere between 10 and 100 rows.
Then, for each batch, fetch the master data with a single query using WHERE ... IN with the foreign keys you collected; de-duplicate the keys first and use the count of unique keys as the LIMIT on the master-data query.
Example:
SELECT productID, date FROM transaction_product LIMIT 100
Collect all the productID values and make them unique, then:
SELECT price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4   -- 4 = count of unique IDs
foreach (transaction)
    master_product[productID]
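A rough PHP sketch of that pattern (my own illustration, assuming a hypothetical PDO connection $pdo, integer productID values, and the table names from the example above):

// Batch of detail rows, then one WHERE ... IN query for the master data.
$transactions = $pdo->query(
    "SELECT productID, date FROM transaction_product LIMIT 100"
)->fetchAll(PDO::FETCH_ASSOC);

// Collect and de-duplicate the foreign keys (assumes at least one row was returned).
$ids = array_unique(array_column($transactions, 'productID'));

// Fetch all master rows in one query using positional placeholders.
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT productID, price FROM master_product WHERE productID IN ($placeholders)");
$stmt->execute(array_values($ids));

// Index master data by productID for O(1) lookups per transaction.
$masterProduct = array();
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $product) {
    $masterProduct[$product['productID']] = $product;
}

foreach ($transactions as $transaction) {
    $price = $masterProduct[$transaction['productID']]['price'];
    // ... use $price and $transaction['date'] ...
}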
Quite simply, is such a query/statement inefficient or bad?
<?
$strSql = "SELECT * FROM clients, projects
           WHERE clients.clientID = $intClientId
           AND projects.clientID = $intClientId LIMIT 1";
$objResult = mysql_query($strSql);
if (mysql_num_rows($objResult) == 0) {
    echo("No data");
}
while ($arrRow = mysql_fetch_array($objResult))
{
?>
    <h1>Sub Project(s) for: <span><?=$arrRow[clientName]?></span></h1>
<?
} ?>
In general, you should avoid using SELECT * and select only the fields you need, unless you genuinely need all of them. Whether or not this is efficient depends on how your tables are indexed. I assume in this case that clientID is the primary key of the clients table. If you have an index on clientID in the projects table, this query should be quite fast.
There are several things that are off here:
As Michael Mior mentions, you should avoid SELECT *. This can be an efficiency issue. It can also break applications in some cases if your application makes assumptions about the columns that are in the table and then the table changes in the database.
You have a LIMIT 1 in your query, but then you loop over the results. This doesn't make sense because LIMIT 1 means you'll only get one row of results, regardless of how many rows matched your query.
You are not escaping your inputs. This may be OK in this case if, in earlier code, you have already verified that those variables definitely contain integer values. I generally just use prepared statements and avoid this problem altogether.
And I also want to add one little piece of advice: try not to use PHP short tags (use <?php and <?php echo instead of <? and <?=). They depend on server configuration, so they can create code errors and other portability difficulties for you in the future.
When writing, or should I say developing, new features, one should always debug and run a performance profile on the query to ensure you're using the proper index(es). Always use EXPLAIN (or EXPLAIN EXTENDED if you're into specifics) on a query to determine its performance.
EXPLAIN SELECT * FROM clients, projects WHERE clients.clientID = 1 AND projects.clientID = 1
I also noted that you're doing a while() loop, which is unnecessary if you are only fetching a single row.
I prefer writing projects.clientID = clients.clientID so it is easier to see how the tables are joined.
The other thing I will avoid (mostly for clarity) is the while loop. Since you expect one record, no need to loop over the result set.
And finally, it is not a good idea to use the short tags.
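Putting the advice from the answers above together, here is a hedged sketch of how the snippet might look (my own rewrite, assuming a PDO connection $pdo, that $intClientId is an integer, and keeping the join condition and clientName column from the question):

<?php
// Explicit columns, an explicit join, a prepared statement, a single fetch
// instead of LIMIT 1 + while loop, and no short tags.
$stmt = $pdo->prepare(
    "SELECT clients.clientName
     FROM clients
     JOIN projects ON projects.clientID = clients.clientID
     WHERE clients.clientID = ?
     LIMIT 1"
);
$stmt->execute([$intClientId]);
$arrRow = $stmt->fetch(PDO::FETCH_ASSOC);

if ($arrRow === false) {
    echo "No data";
} else {
    echo '<h1>Sub Project(s) for: <span>' . htmlspecialchars($arrRow['clientName']) . '</span></h1>';
}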
Is there an appreciable performance difference between having one SELECT foo, bar FROM users query that returns 500 rows, and 500 SELECT foo, bar FROM users WHERE id = x queries coming all at once?
In a PHP application I'm writing, I'm trying to choose between writing a clear, readable section of code that would produce about 500 SELECT statements, or writing it in an obscure, complex way that would use only one SELECT that returns 500 rows.
I would prefer the way that uses clear, maintainable code, but I'm concerned that the connection overhead for each of the SELECTs will cause performance problems.
Background info, in case it's relevant:
1) This is a Drupal module, coded in PHP
2) The tables in question get very few INSERTs and UPDATEs, and are rarely locked
3) SQL JOINs aren't possible for reasons not relevant to the question
Thanks!
It's almost always faster to do one big batch SELECT and parse the results in your application code than doing a massive amount of SELECTs for one row. I would recommend that you implement both and profile them, though. Always strive to minimize the number of assumptions you have to make.
I would not worry about the connection overhead of mysql queries too much, especially if you are not closing the connection between every query. Consider that if your query creates a temporary table, you've already spent more time in the query than the overhead of the query took.
I love doing a complex SQL query, personally, but I have found that the size of the tables, mysql query cache and query performance of queries that need to do range checking (even against an index) all make a difference.
I suggest this:
1) Establish the simple, correct baseline. I suspect this is the zillion-query approach. This is not wrong, and very likely helpfully correct. Run it a few times and watch your query cache and application performance. The ability to keep your app maintainable is very important, especially if you work with other code maintainers. Also, if you're querying really large tables, small queries will maintain scalability.
2) Code the complex query. Compare the results for accuracy, and then the time. Then use EXPLAIN on the query to see which rows are scanned. I have often found that if I have a JOIN, or a WHERE x != y, or a condition that creates a temporary table, the query performance could get pretty bad, especially if I'm in a table that's always getting updated. However, I've also found that a complex query might not be correct, and also that a complex query can more easily break as an application grows. Complex queries typically scan larger sets of rows, often creating temporary tables and invoking "Using where" scans. The larger the table, the more expensive these get. Also, you might have team considerations where complex queries don't suit your team's strengths.
3) Share the results with your team.
Complex queries are less likely to hit the MySQL query cache, and if they are large enough, don't cache them. (You want to save the MySQL query cache for frequently hit queries.) Also, WHERE predicates that have to scan the index (x != y, x > y, x < y) will not do as well. Queries like SELECT foo, bar FROM users WHERE foo != 'g' AND mumble < '360' end up doing scans. (The cost of query overhead could be negligible in that case.)
Small queries can often complete without creating temporary tables just by getting all values from the index, so long as the fields you're selecting and predicating on are indexed. So the query performance of SELECT foo, bar FROM users WHERE id = x is really great (especially if columns foo and bar are covered by an index, e.g. ALTER TABLE users ADD INDEX ix_a (foo, bar);).
Other good ways to increase performance in your application would be to cache those small query results in the application (if appropriate), or to materialize the query's results with a periodic batch job. Also, consider memcached or some of the features found in XCache.
It seems like you know what the 500 id values are, so why not do something like this:
// Assuming you have already validated that this array contains only integers,
// so there is no risk of SQL injection
$ids = join(',', $arrayOfIds);
$sql = "SELECT `foo`, `bar` FROM `users` WHERE `id` IN ($ids)";
This is a really broad question, but I have come across it a couple of times in the last few weeks and I was wondering what the general consensus is regarding good practice and efficiency.
1)
SELECT COUNT(*) FROM table WHERE id='$id' AND name='$name' AND owner='$owner_id'
and then, based on whether there is exactly one result, decide that the record matches.
2)
SELECT * FROM table WHERE id='$id'
and then a series of if commands to check the results match.
Now obviously there are advantages to the second solution, as it allows for accurate error reports about which field does not match... but if that is not required, which is more efficient and considered better practice, and is there a difference in the load on the MySQL server between the two?
Option 1 by a long shot. Let SQL do what it is designed to do best, and better than procedural code. That is, filtering and sorting data.
Also, it is a much more efficient use of resources (bandwidth, DB utilization, etc) to pull down only the data you need from the server.
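For instance, a minimal sketch of option 1 with a prepared statement (my own illustration, assuming a PDO connection $pdo; the table and column names are the ones from the question, with `table` back-quoted only because TABLE is a reserved word in MySQL):

$stmt = $pdo->prepare(
    "SELECT COUNT(*) FROM `table` WHERE id = ? AND name = ? AND owner = ?"
);
$stmt->execute([$id, $name, $owner_id]);

// Exactly one matching row means the record matches.
$recordMatches = ((int) $stmt->fetchColumn() === 1);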
Use 1). MySQL is very efficient at selecting data based on certain conditions.
A large query can take 0.1 to 5.1 seconds or more; you need to run it and measure it. Usually multiple if statements are way better, as PHP is very fast. I did that when I was working with 5 joins on a table with 5 billion products: I removed one join and used an if statement in PHP to compensate. The query had been taking 4.2 seconds; after removing the join it took 3.8s, and as you know, PHP handles the rest much faster.