Auto index, repair and optimize MySQL table on every page load - php

I am in a debate with a guy who tells me there is no performance hit for using his function that...
auto-indexes, repairs and optimizes MySQL tables via a PHP class __destruct() on every single page load, for every user who runs the page.
He is asking me why I think that is bad for performance, but I do not really know. Can someone tell me why such a thing isn't good?
UPDATE: His reasoning...
Optimizing & repairing the database tables eliminates the bytes of overhead that can slow down additional queries when multiple connections and heavy table use are involved, even with a performance-tuned database schema and indexing enabled.
Not to mention that the execution time needed to perform these operations is slim to none in terms of memory and processor threading.
Opening, reading, writing, updating and then cleaning up after oneself makes more sense to me than performing the same operations and leaving unnecessary overhead behind, waiting for a cron entry to clean it up.

Instead of arguing, why not measure? Use a toolkit to profile where you're spending time, such as Instrumentation for PHP. Prove that the optimize step of your PHP request is taking a long time.
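If you want a quick way to measure it without a full profiler, a minimal timing sketch is below. It is only an illustration: $mysqli and 'some_table' are placeholders, not names from the code under discussion.
// Rough timing sketch: compare the cost of the maintenance step
// to the rest of the request. $mysqli and 'some_table' are assumptions.
$start = microtime(true);
$mysqli->query('OPTIMIZE TABLE some_table');
$maintenance = microtime(true) - $start;

$start = microtime(true);
// ... the normal work of the page request ...
$request = microtime(true) - $start;

error_log(sprintf('maintenance: %.4fs, rest of request: %.4fs', $maintenance, $request));
Run that on a production-sized table and the argument usually settles itself.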
Reindexing is an expensive process, at least as costly as doing a table-scan as if you did not have an index. You should build indexes infrequently, so that you serve many PHP requests with the aid of the index for every one time you build the index. If you're building the index on every PHP request, you might as well not define indexes at all, and just run table-scans all the time.
REPAIR TABLE is only relevant for MyISAM tables (and Archive tables). I don't recommend using MyISAM tables. You should just use InnoDB tables. Not only for the sake of performance, but also data safety. MyISAM is very susceptible to data corruption, whereas InnoDB protects against that in most cases by maintaining internal checksums per page.
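If you do decide to move off MyISAM, the conversion is a one-off statement per table. A minimal sketch follows; the table name and the $mysqli connection are placeholders, and since ALTER TABLE rewrites the table, run it offline rather than during a request:
// One-off conversion sketch; 'mytable' and $mysqli are assumptions.
// ALTER TABLE rebuilds the table, so do this during a maintenance window.
$mysqli->query('ALTER TABLE mytable ENGINE=InnoDB');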
OPTIMIZE TABLE for an InnoDB table rebuilds all the data and index pages. This is going to be immensely expensive once your table grows to a non-trivial size. Certainly not something you would want to do on every page load. I would even say you should not do OPTIMIZE TABLE during any PHP web request -- do it offline via some script or admin interface.
A table restructure also locks the table. You will queue up all other PHP requests that access the same table for a long time (i.e. minutes or even hours, depending on the size of the table). When each PHP request gets its chance, it'll run another table restructure. It's ridiculous to incur this amount of overhead on every PHP request.
You can also use an analogy: you don't rebuild or optimize an entire table or index during every PHP request for the same reason you don't give your car a tune-up and oil change every time you start it:
It would be expensive and inconvenient to do so, and it would give no extra benefit compared to performing engine maintenance on an appropriate schedule.

Because every single operation (index, repair and optimize) takes considerable time; in fact they are VERY expensive (table locks, disk I/O, risk of data loss) if the tables are even slightly big.
Doing this on every page load is definitely not recommended. It should be done only when needed.

REPAIR TABLE could cause data loss, as stated in the documentation, so it requires a prior backup to avoid further problems. It is also intended to be run only in case of disaster (something HAS failed).
OPTIMIZE TABLE locks the table under maintenance, so it can cause problems for concurrent users.
My $0.02: database management operations should not be part of common user transactions, as they become expensive in time and resources as your tables grow.

I have put the following code into our scheduled job, which runs in the early morning when users don't access our site frequently (I read that OPTIMIZE locks the affected tables during optimization).
The advantage of this approach is that a single query is composed with all table names comma-separated, instead of executing a lot of queries, one for each table to optimize.
It assumes you already have a database connection open and a database selected, so the function can be used without specifying the connection, database name, etc.
$q = "SHOW TABLE STATUS WHERE Data_Free > '0'";
$res = mysql_query($q); $TOOPT = mysql_num_rows($res);
$N = 0; // number of optimized tables
if(mysql_num_rows($res) > 0)
{
$N = 1;
while($t = mysql_fetch_array($res))
{
$TNAME = $t['Name']; $TSPACE += $t['Data_free'];
if($N < 2)
{
$Q = "OPTIMIZE TABLE ".$TNAME."";
}
else
{
$Q .= ", ".$TNAME."";
}
$N++;
} // endwhile tables
mysql_query($Q);
} // endif tables found (to optimize)
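Note that the old mysql_* extension used above is deprecated and removed in PHP 7; a rough equivalent with mysqli might look like the sketch below (assuming $mysqli is an open connection with the target database already selected):
// Sketch of the same job with mysqli; $mysqli is an assumption.
$tables = array();
$res = $mysqli->query("SHOW TABLE STATUS WHERE Data_free > 0");
while ($t = $res->fetch_assoc()) {
    $tables[] = $t['Name'];
}
if (count($tables) > 0) {
    $mysqli->query("OPTIMIZE TABLE " . implode(", ", $tables));
}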

The docs state...
(optimize reference)
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
So after extensive operations have been performed on a table, performance is enhanced by using the OPTIMIZE command.
(flush reference)
FLUSH TABLES has several variant forms. FLUSH TABLE is a synonym for
FLUSH TABLES, except that TABLE does not work with the WITH READ LOCK
variant.
In other words, using the 'FLUSH TABLE' form (rather than 'FLUSH TABLES') means no READ LOCK is performed.
(repair reference)
Normally, you should never have to run REPAIR TABLE. However, if
disaster strikes, this statement is very likely to get back all your
data from a MyISAM table. If your tables become corrupted often, you
should try to find the reason for it, to eliminate the need to use
REPAIR TABLE. See Section C.5.4.2, “What to Do If MySQL Keeps
Crashing”, and Section 13.5.4, “MyISAM Table Problems”.
It is my understanding here that if the 'REPAIR TABLE' command is run consistently, the condition concerning large record overhead would be eliminated, since constant maintenance is performed. If I am wrong I would like to see benchmarks, as my own attempts have not shown anything too detrimental, although the record sets have been under the 10k mark.
Here is the piece of code that is being used and that #codedev is asking about...
class db
{
    protected static $dbconn;

    // rest of database class

    public function index($link, $database)
    {
        $obj = $this->query('SHOW TABLES');
        $results = $this->results($obj);
        foreach ($results as $key => $value) {
            if (isset($value['Tables_in_'.$database])) {
                $this->query('REPAIR TABLE '.$value['Tables_in_'.$database]);
                $this->query('OPTIMIZE TABLE '.$value['Tables_in_'.$database]);
                $this->query('FLUSH TABLE '.$value['Tables_in_'.$database]);
            }
        }
    }

    public function __destruct()
    {
        $this->index($this->dbconn, $this->configuration['database']);
        $this->close();
    }
}

Related

Nested queries performance on MySQL vs Multiple calls. (PHP)

Are there any advantages to having nested queries instead of separating them?
I'm using PHP to frequently query from MySQL and would like to separate them for better organization. For example:
Is:
$query = "SELECT words.unique_attribute
FROM words
LEFT JOIN adjectives ON adjectives.word_id = words.id
WHERE adjectives = 'confused'";
return $con->query($query);
Faster/Better than saying:
$query = "SELECT word_id
FROM adjectives
WHERE adjectives = 'confused';";
$id = getID($con->query($query));
$query = "SELECT unique_attribute
FROM words
WHERE id = $id;";
return $con->query($query);
The second option would give me a way to make a select function, where I wouldn't have to repeat so much query-string code, but if making so many additional calls (these can get very deeply nested) will be very bad for performance, I might keep it. Or at least look out for it.
Like most questions containing 'faster' or 'better', it's a trade-off and it depends on which part you want to speed up and what your definition of 'better' is.
Compared with the two separate queries, the combined query has the advantages of:
speed: you only need to send one query to the database system, the database only needs to parse one query string, only needs to compose one query plan, only needs to push one result back up and through the connection to PHP. The difference (when not executing these queries thousands of times) is very minimal, however.
atomicity: the query in two parts may deliver a different result from the combined query if the words table changes between the first and second query (although in this specific example this is probably not a constantly-changing table...)
At the same time the combined query also has the disadvantage of (as you already imply):
re-usability: the split queries might come in handy when you can re-use the first one and replace the second one with something that selects a different column from the words table or something from another table entirely. This disadvantage can be mitigated by using something like a query builder (not to be confused with an ORM!) to dynamically compose your queries, adding where clauses and joins as needed (a rough sketch follows after this list). For an example of a query builder, check out Zend\Db\Sql.
locking: depending on the storage engine and storage engine version you are using, tables might get locked. Most select statements do not lock tables however, and the InnoDB engine definitely doesn't. Nevertheless, if you are working with an old version of MySQL on the MyISAM storage engine and your tables are under heavy load, this may be a factor. Note that even if the combined statement locks the table, the combined query will offer faster average completion time because it is faster in total while the split queries will offer faster initial response (to the first query) while still needing a higher total time (due to the extra round trips et cetera).
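To illustrate the query-composition idea without pulling in a library, here is a minimal sketch. The helper name and the column whitelist are made up, the table names and the WHERE clause are taken verbatim from the question, $mysqli is assumed to be a mysqli connection, and get_result() needs the mysqlnd driver:
function findWordAttribute(mysqli $mysqli, $adjective, $column = 'unique_attribute')
{
    // Identifiers can't be bound as parameters, so whitelist them.
    $allowed = array('unique_attribute', 'id');
    if (!in_array($column, $allowed, true)) {
        throw new InvalidArgumentException('Unknown column: ' . $column);
    }
    // Schema and WHERE clause as given in the question.
    $sql = "SELECT words.$column
            FROM words
            LEFT JOIN adjectives ON adjectives.word_id = words.id
            WHERE adjectives = ?";
    $stmt = $mysqli->prepare($sql);
    $stmt->bind_param('s', $adjective);
    $stmt->execute();
    return $stmt->get_result()->fetch_all(MYSQLI_ASSOC); // get_result() needs mysqlnd
}
This keeps the single round trip while still letting you vary which column you pull back.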
It would depend on the size of those tables and where you want to place the load. If those tables are large and seeing a lot of activity, then the second version with two separate queries would minimise the lock time you might see as a result of the join. However if you've got a beefy db server with fast SSD storage, you'd be best off avoiding the overhead of dipping into the database twice.
All things being equal I'd probably go with the former - it's a database problem so it should be resolved there. I imagine those tables wouldn't be written to particularly often so I'd ensure there's plenty of MySQL cache available and keep an eye on the slow query log.

Should I use a JOIN function or run several queries in a loop structure?

I have these 2 MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I wanted to have this multidimensional array format
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1'  => value,
        'ColumnA2'  => value,
        'TableB'    => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1'  => value,
                'ColumnB2'  => value
            )
        )
    )
);
so that I can loop it in this way
foreach ($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1'  . $TableA['ColumnA1'];
    echo 'ColumnA2'  . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach ($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}
My problem is that, what is the best way or the proper way of querying MySQL database so that I can achieve this goal?
Solution1 --- The one I'm using
$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query($con, "SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId']);
    // $array2 = inner result set fetched into an array
    $row['TableB'] = $array2;
    $array[] = $row;
}
I'm doubting my code cause its always querying the database.
Solution2
$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
// Code
}
The second solution queries the database only once, but if I have thousands of rows in TableA and a thousand rows in TableB for each TableA.ColumnAId (1 TableA.ColumnAId = 1000 TableB rows), will solution 2 take much more time than solution 1?
Neither of the two proposed solutions is likely optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase the overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
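As a rough sketch of how the asker's nested array can be built from the single JOIN (treat it as an outline: $con is assumed to be a mysqli connection and the column names come from the question's table definitions):
// Build the nested TableA/TableB structure from one JOIN result.
// Select only the columns you actually need instead of SELECT *.
$sql = "SELECT a.ColumnAId, a.ColumnA1, a.ColumnA2,
               b.ColumnBId, b.ColumnB1, b.ColumnB2
        FROM TableA a
        JOIN TableB b ON b.ColumnAId = a.ColumnAId";
$rs = mysqli_query($con, $sql);

$array = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $aId = $row['ColumnAId'];
    if (!isset($array[$aId])) {
        $array[$aId] = array(
            'ColumnAId' => $aId,
            'ColumnA1'  => $row['ColumnA1'],
            'ColumnA2'  => $row['ColumnA2'],
            'TableB'    => array(),
        );
    }
    $array[$aId]['TableB'][] = array(
        'ColumnBId' => $row['ColumnBId'],
        'ColumnAId' => $aId,
        'ColumnB1'  => $row['ColumnB1'],
        'ColumnB2'  => $row['ColumnB2'],
    );
}
$array = array_values($array); // re-index to match the desired format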
However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query on a MyISAM or InnoDB database (default systems for MySQL), what happens is that the table is read-locked. This prevents any WRITE operations from happening on the table until the read-lock is surrendered (either our SELECT query completes or fails). Other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work.
This delay is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
Oh, and finally, USE PREPARED STATEMENTS!
You should never have a statement that looks like this:
mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
mysqli has support for prepared statements. Read about them and use them, they will help you to avoid something terrible happening to your database.
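A minimal prepared-statement version of that query, assuming $con is a mysqli connection and sticking to the question's table/column names, might look like this (mysqli_stmt_get_result() needs the mysqlnd driver):
// Prepared-statement sketch: the value is bound, never concatenated.
$id = $row['ColumnAId'];
$stmt = mysqli_prepare($con, "SELECT * FROM Table2 WHERE ColumnAId = ?");
mysqli_stmt_bind_param($stmt, 'i', $id);
mysqli_stmt_execute($stmt);
$rs2 = mysqli_stmt_get_result($stmt); // requires mysqlnd
while ($row2 = mysqli_fetch_assoc($rs2)) {
    // ... use $row2 ...
}
mysqli_stmt_close($stmt);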
Definitely the second way. A nested query is an ugly thing, since you pay all the query overheads (execution, network, etc.) every time for every nested query, while a single JOIN query is executed once, i.e. all the overhead is incurred only once.
A simple rule is not to run queries in loops, in general. There can be exceptions, e.g. if a single query would be too complex and should be split for performance reasons, but whether that applies in a particular case can only be shown by benchmarks and measurements.
If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask for computationally derived data from a relational database, which means that if you just use it to store data and do the computations in your code, you're doing it wrong anyway.
In such a case you should prefer a different storage/retrieval option, like a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface for InnoDB as well, but that is still using a relational database for key-value storage, aka the wrong tool for the job).
If you STILL want to use your solution:
Benchmark.
I've often found that issuing multiple queries can be faster than a single query, because MySQL has less to parse per query, the optimizer has less work to do, and more often than not the MySQL optimizer simply fails (that's the reason things like STRAIGHT_JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - this only applies when update queries are mixed in, though - neither MyISAM nor InnoDB locks the whole table for SELECT queries by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries when you use multiple queries instead of a single one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrently running threads (requests) accessing and updating your tables (lock granularity is better with split-up queries, as is cacheability of the queries); if your database is on a different network, i.e. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).
Situation may also change depending on how often the for loop is actually executed.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you join on, it could be worse in either of the two. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on PHP's hash-map algorithm (arrays in PHP are effectively hash maps) and the likelihood of collisions, while MySQL string indexes are normally fixed length and thus, depending on your data, might not be applicable.
For memory efficiency, the multi-query version is certainly better, as you have the PHP array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might use a temp table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN for your queries).
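To check that from PHP, a quick sketch (assuming $con is a mysqli connection; the JOIN is the one from the question, and the output columns are those of a traditional EXPLAIN):
// Run EXPLAIN on the join and look at the Extra column for
// "Using temporary" / "Using filesort".
$sql = "EXPLAIN SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId = TableB.ColumnAId";
$rs = mysqli_query($con, $sql);
while ($row = mysqli_fetch_assoc($rs)) {
    printf("table=%s type=%s key=%s rows=%s extra=%s\n",
        $row['table'], $row['type'], $row['key'], $row['rows'], $row['Extra']);
}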
In some cases, you should use LIMIT for best performance.
If you want to show 1000 rows along with their master data, run the main query with a LIMIT of around 10-100 rows at a time. Then get the foreign keys to the master data and fetch it with a single query using WHERE ... IN, limited by the count of unique keys.
Example:
SELECT productID, date FROM transaction_product LIMIT 100
Get all productID values and make them unique, then:
SELECT price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4  -- 4 = count of unique IDs
foreach (transaction row) {
    // look up master_product[productID]
}
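A hedged PHP sketch of that pattern (assuming a mysqli connection $con, integer IDs, and the table/column names from the example above):
// Fetch a page of transactions, then pull their master rows in one query.
$rs = mysqli_query($con, "SELECT productID, date FROM transaction_product LIMIT 100");
$transactions = array();
$ids = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $transactions[] = $row;
    $ids[$row['productID']] = true; // de-duplicate
}

$prices = array();
if ($ids) {
    $in  = implode(',', array_map('intval', array_keys($ids))); // ints only, no injection risk
    $rs2 = mysqli_query($con, "SELECT productID, price FROM master_product WHERE productID IN ($in)");
    while ($row = mysqli_fetch_assoc($rs2)) {
        $prices[$row['productID']] = $row['price'];
    }
}

foreach ($transactions as $t) {
    // ... use $t['date'] and $prices[$t['productID']] ...
}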

Scaling with a MySQL Huge Update

We've just built a system that rolls up its data at midnight. It must iterate through several combinations of tables in order to rollup the data it needs. Unfortunately the UPDATE queries are taking forever. We have 1/1000th of our forecasted userbase and it already takes 28 minutes to rollup our data daily with just our beta users.
Since the main lag is UPDATE queries, it may be hard to delegate servers to handle the data processing. What are some other options for optimizing millions of UPDATE queries? Is my scaling issue in the code below?:
$sql = "SELECT ab_id, persistence, count(*) as no_x FROM $query_table ftbl
WHERE ftbl.$query_col > '$date_before' AND ftbl.$query_col <= '$date_end'
GROUP BY ab_id, persistence";
$data_list = DatabaseManager::getResults($sql);
if (isset($data_list)){
foreach($data_list as $data){
$ab_id = $data['ab_id'];
$no_x = $data['no_x'];
$measure = $data['persistence'];
$sql = "SELECT ab_id FROM $rollup_table WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
if (DatabaseManager::getVar($sql)){
$sql = "UPDATE $rollup_table SET $rollup_col = $no_x WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
DatabaseManager::update($sql);
} else {
$sql = "INSERT INTO $rollup_table (ab_id, rollup_key, $rollup_col, rollup_date) VALUES ($ab_id, '$measure', $no_x, '$day_date')";
DatabaseManager::insert($sql);
}
}
}
When addressing SQL scaling issues, it is always best to benchmark your problematic SQL. Benchmarking at the PHP level is fine in this case, since you're running your queries from PHP.
If your first query could potentially return millions of records, you may be better served running that query as a MySQL stored procedure. That will minimize the amount of data that has to be transferred between database server and PHP application server. Even if both are the same machine, you can still realize a significant performance improvement.
Some questions to consider that may help to resolve your issue follow:
How long do your SELECT queries take to process without the UPDATE or INSERT statements?
What is the percentage breakdown of your queries - by both SQL selects, and the INSERT and UPDATE? It will be easier to help identify solutions with that info.
Is it possible that there may be larger bottlenecks with those that may resolve your performance issues?
Is it necessary to iterate through your data at the PHP source-code level rather than the MySQL stored procedure level?
Is there a necessity to iterate procedurally through your records, or is it possible to accomplish the same thing through set-based operations?
Does your rollup_table have an index that covers the columns from the UPDATE query?
Also, the SELECT query run right before your UPDATE query appears to have an identical WHERE clause, which seems redundant. If you can get away with evaluating that WHERE clause only once, you will shave a lot of time off your largest bottleneck.
If you're unfamiliar with writing MySQL stored procedures, the process is quite simple. See http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx for an example. MySQL has good documentation on this as well. A stored procedure is a program that runs within the MySQL database process, which may help to improve performance when dealing with queries that potentially return millions of rows.
Set-based database operations are often faster than procedural operations. SQL is a set-based language. You can update all rows in a database table with a single UPDATE statement, i.e. UPDATE customers SET total_owing_to_us = 1000000 updates all rows in the customers table, without the need to create a programmatic loop like you've created in your sample code. If you have 100,000,000 customer entries, the set-based update will be significantly faster than the procedural update. There are lots of useful resources online that you can read up about this. Here's a SO link to get started: Why are relational set-based queries better than cursors?.
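For the rollup in the question, one set-based pattern worth benchmarking is a single INSERT ... SELECT with ON DUPLICATE KEY UPDATE. This is only a sketch: it assumes $rollup_table has a unique key on (ab_id, rollup_key, rollup_date), and the DatabaseManager call stands in for whatever method executes a write in your wrapper.
// Hypothetical set-based rewrite: one statement instead of a PHP loop.
// Assumes a UNIQUE KEY (ab_id, rollup_key, rollup_date) on $rollup_table.
$sql = "INSERT INTO $rollup_table (ab_id, rollup_key, $rollup_col, rollup_date)
        SELECT ab_id, persistence, COUNT(*), '$day_date'
        FROM $query_table
        WHERE $query_col > '$date_before' AND $query_col <= '$date_end'
        GROUP BY ab_id, persistence
        ON DUPLICATE KEY UPDATE $rollup_col = VALUES($rollup_col)";
DatabaseManager::insert($sql); // or whichever method executes a write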
Seems like you are doing one insert or update at a time. Have you checked how much faster it would be to do one big insert or update, or to batch the queries as much as possible? Here is an example: http://www.stackoverflow.com/questions/3432/multiple-updates-in-mysql

MYSQL table becoming large

I have a table to which approx 100,000 rows are added every day. I am supposed to generate reports from this table, and I am using PHP to generate them. Recently the script which does this has been taking too long to complete. How can I improve performance, possibly by shifting to something other than MySQL that is scalable in the long run?
MySQL is very scalable, that's for sure.
The key is not to change the db from MySQL to something else; instead you should:
Optimize your queries (this can sound silly to others, but I remember, for instance, that a huge improvement I made some time ago was to change SELECT * into selecting only the column(s) I need. It's a frequent issue I see in other people's code too)
Optimize your table(s) design (normalization etc).
Add indexes on the column(s) you are using frequently in the queries.
Similar advice here
For generating reports or file downloads with large chunks of data, you should consider using flush() and increasing the time limit and memory limit.
I doubt the problem lies in the number of rows, since MySQL can support a LOT of rows. But you can of course fetch x rows at a time and process them in chunks.
I do assume your MySQL server is properly tuned for performance.
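A minimal sketch of that chunked approach (the table/column names and the $db mysqli connection are placeholders, not from the question):
// Process the report in chunks of 1000 rows so memory stays bounded.
set_time_limit(0);            // long-running report job
$chunk  = 1000;
$offset = 0;
do {
    $rs = $db->query("SELECT id, amount FROM report_rows ORDER BY id LIMIT $chunk OFFSET $offset");
    $rows = 0;
    while ($row = $rs->fetch_assoc()) {
        // ... aggregate / write the row to the report output ...
        $rows++;
    }
    flush();                  // push any output produced so far to the client
    $offset += $chunk;
} while ($rows === $chunk);
For very large tables, seeking by the last seen id (WHERE id > ?) scales better than a growing OFFSET.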
First analyse why (or: whether) your queries are slow: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
You should read the following and learn a little bit about the advantages of a well-designed InnoDB table and how best to use clustered indexes - only available with InnoDB!
The example includes a table with 500 million rows with query times of 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
Hope you find this of interest.
Another thought is to move records beyond a certain age to a historical database for archiving, reporting, etc. If you don't need that large volume for transactional processing it might make sense to extract them from the transactional data store.
It's common to separate transactional and reporting databases.
I am going to make some assumptions
Your 100k rows added every day have timestamps which are either real-time, or are offset by a relatively short amount of time (hours at most); your 100k rows are added either throughout the day or in a few big batches.
The data are never updated
You are using the InnoDB engine (frankly, you would be insane to use MyISAM for large tables, because in the event of a crash the index rebuild takes a prohibitively long time)
You haven't explained what kind of reports you're trying to generate, but I'm assuming that your table looks like this:
CREATE TABLE logdata (
    dateandtime some_timestamp_type NOT NULL,
    property1 some_type_1 NOT NULL,
    property2 some_type_2 NOT NULL,
    some_quantity some_numerical_type NOT NULL,
    ... some other columns not required for reports ...
    ... some indexes ...
);
And that your reports look like
SELECT count(*), SUM(some_quantity), property1 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property1;
SELECT count(*), SUM(some_quantity), property2 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property2;
Now, as we can see, both of these reports are doing a scan of a large amount of the table, because you are reporting on a lot of rows.
The bigger the time range becomes the slower the reports will be. Moreover, if you have a lot of OTHER columns (say some varchars or blobs) which you aren't interested in reporting on, then they slow your report down too (because the server still needs to inspect the rows).
You can use several possible techniques for speeding this up:
Add a covering index for each type of report, to support the columns you need and omit columns you don't. This may help a lot, but will slow inserts down.
Summarise data according to the dimension(s) that you want to report on. In this fictitious case, all your reports are either counting rows or SUM()ing some_quantity (a sketch of this approach follows at the end of this answer).
Build mirror tables (containing the same data) which have appropriate primary keys / indexes/ columns to make the reports faster.
Use a column engine (e.g. Infobright)
Summarisation is usually an attractive option if your use-case supports it.
You may wish to ask a more detailed question with an explanation of your use-case.
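As a hedged sketch of the summarisation idea mentioned above (the summary table name, the daily granularity and the $mysqli connection are assumptions layered on the fictitious logdata schema):
// Hypothetical daily summary table for the logdata example:
//   CREATE TABLE logdata_daily_property1 (
//       day DATE NOT NULL,
//       property1 some_type_1 NOT NULL,
//       row_count BIGINT NOT NULL,
//       quantity_sum some_numerical_type NOT NULL,
//       PRIMARY KEY (day, property1)
//   );
// Refresh yesterday's slice from a nightly cron, then run the reports
// against the small summary table instead of scanning logdata.
$sql = "REPLACE INTO logdata_daily_property1 (day, property1, row_count, quantity_sum)
        SELECT DATE(dateandtime), property1, COUNT(*), SUM(some_quantity)
        FROM logdata
        WHERE dateandtime >= CURDATE() - INTERVAL 1 DAY AND dateandtime < CURDATE()
        GROUP BY DATE(dateandtime), property1";
$mysqli->query($sql); // $mysqli: an open connection, also an assumption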
The time limit can be temporarily turned off for a particular file, if you know that it is going to potentially run over the time limit, by calling set_time_limit(0); at the start of your script.
Other considerations such as indexing or archiving very old data to a different table should also be looked at.
Your best bet is something like MongoDB or CouchDB, both of which are non-relational databases oriented toward storing massive amounts of data. This is assuming that you've already tweaked your MySQL installation for performance and that your situation wouldn't benefit from parallelization.

Overhead for MySQL SELECTS - Better to Use One, or Many In Sequence

Is there an appreciable performance difference between having one SELECT foo, bar FROM users query that returns 500 rows, and 500 SELECT foo, bar FROM users WHERE id = x queries coming all at once?
In a PHP application I'm writing, I'm trying to choose between writing a clear, readable section of code that would produce about 500 SELECT statements, or writing it in an obscure, complex way that would use only one SELECT that returns 500 rows.
I would prefer the way that uses clear, maintainable code, but I'm concerned that the connection overhead for each of the SELECTs will cause performance problems.
Background info, in case it's relevant:
1) This is a Drupal module, coded in PHP
2) The tables in question get very few INSERTs and UPDATEs, and are rarely locked
3) SQL JOINs aren't possible for reasons not relevant to the question
Thanks!
It's almost always faster to do one big batch SELECT and parse the results in your application code than doing a massive amount of SELECTs for one row. I would recommend that you implement both and profile them, though. Always strive to minimize the number of assumptions you have to make.
I would not worry about the connection overhead of mysql queries too much, especially if you are not closing the connection between every query. Consider that if your query creates a temporary table, you've already spent more time in the query than the overhead of the query took.
I love doing a complex SQL query, personally, but I have found that the size of the tables, mysql query cache and query performance of queries that need to do range checking (even against an index) all make a difference.
I suggest this:
1) Establish the simple, correct baseline. I suspect this is the zillion-query approach. This is not wrong, and very likely correct. Run it a few times and watch your query cache and application performance. The ability to keep your app maintainable is very important, especially if you work with other code maintainers. Also, if you're querying really large tables, small queries will maintain scalability.
2) Code the complex query. Compare the results for accuracy, and then the time. Then use EXPLAIN on the query to see what rows are scanned. I have often found that if I have a JOIN, or a WHERE x != y, or a condition that creates a temporary table, the query performance can get pretty bad, especially if I'm in a table that's always getting updated. However, I've also found that a complex query might not be correct, and also that a complex query can more easily break as an application grows. Complex queries typically scan larger sets of rows, often create temporary tables and end up doing 'Using where' scans. The larger the table, the more expensive these get. Also, you might have team considerations where complex queries don't suit your team's strengths.
3) Share the results with your team.
Complex queries are less likely to hit the MySQL query cache, and if they are large enough, don't cache them. (You want to save the MySQL query cache for frequently hit queries.) Also, WHERE predicates that have to scan the index will not do as well (x != y, x > y, x < y). Queries like SELECT foo, bar FROM users WHERE foo != 'g' AND mumble < '360' end up doing scans. (The cost of query overhead could be negligible in that case.)
Small queries can often complete without creating temporary tables just by getting all values from the index, as long as the fields you're selecting and predicating on are indexed. So the query performance of SELECT foo, bar FROM users WHERE id = x is really great (especially if columns foo and bar are indexed, e.g. ALTER TABLE users ADD INDEX ix_a (foo, bar);).
Other good ways to increase performance in your application would be to cache those small query results in the application (if appropriate), or doing batch jobs of a materialized view query. Also, consider memcached or some features found in XCache.
It seems like you know what the 500 id values are, so why not do something like this:
// Assuming you have already validated that this array contains only integers,
// so there is no risk of SQL injection
$ids = join(',', $arrayOfIds);
$sql = "SELECT `foo`, `bar` FROM `users` WHERE `id` IN ($ids)";
