We've just built a system that rolls up its data at midnight. It has to iterate through several combinations of tables to roll up the data it needs, and unfortunately the UPDATE queries are taking forever. We have 1/1000th of our forecasted userbase, and the daily rollup already takes 28 minutes with just our beta users.
Since the main lag is in the UPDATE queries, it may be hard to delegate the data processing to other servers. What are some other options for optimizing millions of UPDATE queries? Is my scaling issue in the code below?
$sql = "SELECT ab_id, persistence, count(*) as no_x FROM $query_table ftbl
WHERE ftbl.$query_col > '$date_before' AND ftbl.$query_col <= '$date_end'
GROUP BY ab_id, persistence";
$data_list = DatabaseManager::getResults($sql);
if (isset($data_list)){
foreach($data_list as $data){
$ab_id = $data['ab_id'];
$no_x = $data['no_x'];
$measure = $data['persistence'];
$sql = "SELECT ab_id FROM $rollup_table WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
if (DatabaseManager::getVar($sql)){
$sql = "UPDATE $rollup_table SET $rollup_col = $no_x WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
DatabaseManager::update($sql);
} else {
$sql = "INSERT INTO $rollup_table (ab_id, rollup_key, $rollup_col, rollup_date) VALUES ($ab_id, '$measure', $no_x, '$day_date')";
DatabaseManager::insert($sql);
}
}
}
When addressing SQL scaling issues, it is always best to benchmark the problematic SQL first. Benchmarking at the PHP level is fine in this case, since you're issuing the queries from PHP anyway.
If your first query could potentially return millions of records, you may be better served running the whole rollup as a MySQL stored procedure. That minimizes the amount of data that has to be transferred between the database server and the PHP application server. Even if both are on the same machine, you can still realize a significant performance improvement.
Some questions to consider that may help to resolve your issue follow:
How long do your SELECT queries take to process without the UPDATE or INSERT statements?
What is the percentage breakdown of time across your queries, between the SELECTs on one hand and the INSERTs and UPDATEs on the other? It will be easier to identify solutions with that info.
Could one of those be a larger bottleneck whose removal would resolve your performance issue?
Is it necessary to iterate through your data at the PHP source-code level rather than the MySQL stored procedure level?
Is there a necessity to iterate procedurally through your records, or is it possible to accomplish the same thing through set-based operations?
Does your rollup_table have an index that covers the columns from the UPDATE query?
Also, the SELECT query run right before your UPDATE has an identical WHERE clause, which makes it redundant. If you can get away with evaluating that WHERE clause only once per row, you will shave a lot of time off your largest bottleneck.
If you're unfamiliar with writing MySQL stored procedures, the process is quite simple. See http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx for an example; MySQL's own documentation covers this well too. A stored procedure is a program that runs within the MySQL database process, which can improve performance when a query would otherwise ship millions of rows to the application.
Set-based database operations are often faster than procedural ones; SQL is a set-based language. You can update all rows in a table with a single UPDATE statement, e.g. UPDATE customers SET total_owing_to_us = 1000000 updates every row in the customers table without the programmatic loop you've created in your sample code. If you have 100,000,000 customer rows, the set-based update will be dramatically faster than the procedural one. There are lots of useful resources online about this; here's an SO link to get started: Why are relational set-based queries better than cursors?
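To make that concrete for your rollup: assuming $rollup_table has (or can be given) a unique key over (ab_id, rollup_key, rollup_date), the whole SELECT-then-UPDATE-or-INSERT loop collapses into one INSERT ... SELECT with ON DUPLICATE KEY UPDATE. A sketch, with literal table, column, and date names standing in for your PHP variables ($query_table, $rollup_col, $date_before, $date_end, $day_date):

-- One-time setup: the upsert below needs a unique key to detect existing rows.
ALTER TABLE rollup_table
    ADD UNIQUE KEY uk_rollup (ab_id, rollup_key, rollup_date);

-- One statement replaces the per-row existence check, UPDATE, and INSERT.
INSERT INTO rollup_table (ab_id, rollup_key, rollup_col, rollup_date)
SELECT ab_id, persistence, COUNT(*), '2015-01-01'
FROM fact_table
WHERE event_date > '2014-12-31' AND event_date <= '2015-01-01'
GROUP BY ab_id, persistence
ON DUPLICATE KEY UPDATE rollup_col = VALUES(rollup_col);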
It looks like you are doing one INSERT or UPDATE at a time. Have you measured how much faster one big INSERT or UPDATE would be, or batching the queries as much as possible? Here is an example: http://www.stackoverflow.com/questions/3432/multiple-updates-in-mysql
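If the counts have to be computed in PHP, batching still helps: accumulate a few hundred rows and emit them as one multi-row upsert instead of one statement per row. A sketch with made-up values, again assuming the unique key described above:

INSERT INTO rollup_table (ab_id, rollup_key, rollup_col, rollup_date)
VALUES
    (1, 'persist_7d', 42, '2015-01-01'),
    (2, 'persist_7d', 17, '2015-01-01'),
    (3, 'persist_30d', 9, '2015-01-01')
ON DUPLICATE KEY UPDATE rollup_col = VALUES(rollup_col);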
I'm trying to figure out the most efficient way to send multiple queries to a MySQL database with PHP. Right now I'm doing two separate queries but I know there are more efficient methods, like using mysqli_multi_query. Is mysqli_multi_query the most efficient method or are there other means?
For example, I could just write a query that puts ALL the data from ALL the tables in the database into a PHP array. Then I could sort the data using PHP, resulting in having only one query no matter what data I needed... and I could put that PHP array into a session variable so the user would never query the database again during that session. Makes sense right? Why not just do that rather than create a new query each time the page is reloaded?
It's really difficult to find resources on this so I'm just looking for advice. I plan to have massive traffic on the site that I am building so I need the code to put as little stress on the server as possible. As far as table size is concerned, we're talking about, let's say 3,000 rows in the largest table. Is it feasible to store that into one big PHP array (advantage being the client would query the database only ONCE on page load)?
$Table1Array = array();
$Table1_result = mysqli_query($con, "SELECT * FROM Table1 WHERE column1 = '" . $somevariable . "'");
while ($row = mysqli_fetch_array($Table1_result)) {
    $Table1Array[] = $row;
}
// query 2
$Table2Array = array();
$Table2_result = mysqli_query($con, "SELECT * FROM Table2 LIMIT 5");
while ($row = mysqli_fetch_array($Table2_result)) {
    $Table2Array[] = $row;
}
There are a few things to address here, hopefully this will make sense / be constructive...
Is mysqli_multi_query the most efficient method or are there other means?
It depends on the specifics of what you are trying to do for a given page / query. Generally speaking, though, using mysqli_multi_query won't gain you much performance, as MySQL will still execute the queries you give it one after the other. mysqli_multi_query's performance gain comes from making fewer round trips between PHP and MySQL. That is a good thing if the two are on different servers, or if you are performing thousands of queries one after the other.
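If you do use it, note that every result set has to be drained before the connection can be reused. A minimal sketch, assuming $con is an open mysqli connection and the two example queries from the question:

// Both statements go to the server in a single round trip.
mysqli_multi_query($con,
    "SELECT * FROM Table1 WHERE column1 = 'x';
     SELECT * FROM Table2 LIMIT 5");

do {
    // Fetch the current result set (false for statements with no rows).
    if ($result = mysqli_store_result($con)) {
        while ($row = mysqli_fetch_assoc($result)) {
            // ... use $row ...
        }
        mysqli_free_result($result);
    }
    // Advance to the next result set, if there is one.
} while (mysqli_more_results($con) && mysqli_next_result($con));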
For example, I could just write a query that puts ALL the data from ALL the tables in the database into a PHP array.
Just. No. In theory you could, but unless you had one page that displayed all of the database contents at once, there would simply be no need.
Then I could sort the data using PHP
If you can sort / filter the data into the correct form using MySQL, do that. Manipulating datasets is one of the things MySQL is very good at.
Why not just [load everything into the session] rather than create a new query each time the page is reloaded?
Because the dataset would be huge, and PHP would have to load and unserialize all of it from session storage on every request that user makes (only the session ID travels in the cookie, but the server pays the cost every time). Apart from that needless overhead, what about the other challenges this approach would raise? What would you do if extra data had been added to the db since you created the session-based cache for this particular user? What if the data got too big for a user's session? What experience would I have as a user if I refused your session cookie and thereby forced the monster query to execute on every request?
I plan to have massive traffic on the site that I am building
Don't we all! As the comments above suggest, premature optimization is a Bad Thing. At this stage you should concentrate on getting your domain logic nailed down and building a good, maintainable OO platform on which to base further development.
If I wanted to execute multiple queries on a MySQL database, I would use a stored procedure; then all you have to do is issue a simple CALL from PHP. A basic example of a procedure would be:
DELIMITER $$
CREATE PROCEDURE multiple_queries()
BEGIN
    SELECT * FROM TBL1;
    SELECT * FROM TBL2;
    -- join column is assumed here; use whatever actually links TBL3 and TBL4
    SELECT * FROM TBL3 LEFT JOIN TBL4 ON TBL3.id = TBL4.tbl3_id WHERE TBL3.id = '121';
END $$
DELIMITER ;
and in PHP you simply call the procedure, passing any parameters it takes inside the parentheses:
CALL multiple_queries()
Why not use the DB engine as much as possible? It is well capable of handling complex solutions, and we tend to under-utilize it.
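A sketch of the PHP side, assuming $con is a mysqli connection. A procedure that returns several result sets has to be drained with next_result(), just like a multi-statement query:

$con->multi_query("CALL multiple_queries()");
do {
    if ($result = $con->store_result()) {
        while ($row = $result->fetch_assoc()) {
            // ... rows from TBL1, then TBL2, then the join, in order ...
        }
        $result->free();
    }
} while ($con->more_results() && $con->next_result());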
For example, I could just write a query that puts ALL the data from ALL the tables in the database into a PHP array. Then I could sort the data using PHP, resulting in having only one query no matter what data I needed...
I would think this would be inefficient, since you've thrown away the value of the database. When it comes to optimized retrieval and sorting, MySQL is superior to any PHP code you could write.
Additionally, you say that running one query and pushing the data into session variables would decrease resource usage, but is that really true? If you have massive traffic and this data sits in session variables, then with 1000 users currently logged in you will have 1000 duplicates of the entire database on your PHP server! Are you sure the server has enough memory for that?
There are 2 ways I use to run multiple queries:
// Note: the mysql_* API shown here is deprecated (and removed in PHP 7);
// prefer mysqli or PDO in new code.
$conn = mysql_connect("host", "dbuser", "password");
mysql_select_db("dbname", $conn); // select the database to query

$query1 = "select.......";
$result1 = mysql_query($query1) or die(mysql_error()); // execute the query
while ($row1 = mysql_fetch_assoc($result1)) {
    // fetch the results from the query
}

$query2 = "select.......";
$result2 = mysql_query($query2) or die(mysql_error()); // execute the query
while ($row2 = mysql_fetch_assoc($result2)) {
    // fetch the results from the query, i.e. $row2['']
}

mysql_close($conn); // close the database connection
The other way is to employ transactions when there is more than one query that must either all execute or not at all.
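A sketch of that with mysqli ($con an open mysqli connection; the accounts table is made up for the example):

// Make mysqli throw exceptions so a failed query reaches the catch block.
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

$con->begin_transaction();
try {
    $con->query("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
    $con->query("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
    $con->commit();   // both updates become visible together
} catch (mysqli_sql_exception $e) {
    $con->rollback(); // neither update is applied
    throw $e;
}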
You could try it. But if the only reason is to have one query in the belief that it will be faster, I would think otherwise. The optimizations inside a database, especially MySQL, are hard to beat from PHP.
Is there any advantages to having nested queries instead of separating them?
I'm using PHP to frequently query from MySQL and would like to separate them for better organization. For example:
Is:
$query = "SELECT words.unique_attribute
FROM words
LEFT JOIN adjectives ON adjectives.word_id = words.id
WHERE adjectives = 'confused'";
return $con->query($query);
Faster/Better than saying:
$query = "SELECT word_id
FROM adjectives
WHERE adjectives = 'confused';";
$id = getID($con->query($query));
$query = "SELECT unique_attribute
FROM words
WHERE id = $id;";
return $con->query($query);
The second option would give me a way to write a generic select function, so I wouldn't have to repeat so much query-string code, but if making so many additional calls (these can get very deeply nested) is very bad for performance, I might keep the combined query, or at least look out for the cost.
Like most questions containing 'faster' or 'better', it's a trade-off and it depends on which part you want to speed up and what your definition of 'better' is.
Compared with the two separate queries, the combined query has the advantages of:
speed: you only need to send one query to the database system, the database only needs to parse one query string, only needs to compose one query plan, only needs to push one result back up and through the connection to PHP. The difference (when not executing these queries thousands of times) is very minimal, however.
atomicity: the query in two parts may deliver a different result from the combined query if the words table changes between the first and second query (although in this specific example this is probably not a constantly-changing table...)
At the same time the combined query also has the disadvantage of (as you already imply):
re-usability: the split queries might come in handy when you can re-use the first one and replace the second one with something that selects a different column from the words table or something from another table entirely. This disadvantage can be mitigated by using something like a query builder (not to be confused with an ORM!) to dynamically compose your queries, adding where clauses and joins as needed. For an example of a query builder, check out Zend\Db\Sql.
locking: depending on the storage engine and storage engine version you are using, tables might get locked. Most select statements do not lock tables however, and the InnoDB engine definitely doesn't. Nevertheless, if you are working with an old version of MySQL on the MyISAM storage engine and your tables are under heavy load, this may be a factor. Note that even if the combined statement locks the table, the combined query will offer faster average completion time because it is faster in total while the split queries will offer faster initial response (to the first query) while still needing a higher total time (due to the extra round trips et cetera).
It would depend on the size of those tables and where you want to place the load. If those tables are large and seeing a lot of activity, then the second version with two separate queries would minimise the lock time you might see as a result of the join. However if you've got a beefy db server with fast SSD storage, you'd be best off avoiding the overhead of dipping into the database twice.
All things being equal I'd probably go with the former - it's a database problem so it should be resolved there. I imagine those tables wouldn't be written to particularly often so I'd ensure there's plenty of MySQL cache available and keep an eye on the slow query log.
We're having an internal debate at my company about looping queries in this manner:
$sql = "
SELECT foreign_key
FROM t1";
foreach(fetchAll($sql) as $row)
{
$sub_sql = "
SELECT *
FROM t2
WHERE t2.id = " . $row['foreign_key'];
foreach(fetchAll($sub_sql) as $sub_row)
{
// ...
}
}
Instead of using an SQL join like this:
$sql = "
SELECT t2.*
FROM t2
JOIN t1
ON t1.foreign_key = t2.id";
foreach(fetchAll($sql) as $row)
{
// ...
}
Additional information about this, the database is huge, millions of rows.
I have of course searched for an answer to this question, but nobody has answered it well, with enough up-votes to make me certain that one way is better than the other.
Question
Can somebody explain to me why one of these methods is better than the other?
The join method is generally considered better, if only because it reduces the overhead of sending queries back and forth to the database.
If you have appropriate indexes on the tables, then the underlying performance of the two methods will be similar. That is, both methods will use appropriate indexes to fetch the results.
From a database perspective, the join method is far superior. It consolidates the data logic in one place, making the code more transparent. It also allows the database to make optimizations that might not be apparent in application code.
Because of driver overhead, a loop is far less efficient
This is similar to another question I answered, but different enough not to close-vote. My full answer is here, but I'll summarize the main points:
Whenever you make a connection to a database, there are three steps taken:
A connection to the database is established.
A query, or multiple queries, to the database is executed.
Data is returned for processing.
Using a loop structure, you will end up generating additional overhead with driver requests, where you will have a request and a return per loop cycle rather than a single request and single return. Even if the looped queries do not take any longer than the single large query (this is very unlikely as MySQL internals have a lot of shortcuts built in to prevent using a full repetitive loop), you will still find that the single query is faster on driver overhead.
Using a loop without TRANSACTIONS, you will also run into relational data integrity issues, where other operations change the data you're iterating over between loop cycles. Using transactions avoids that but, again, increases overhead, because the database has to maintain a consistent snapshot for the duration of the loop.
I have these 2 MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I want to end up with this multidimensional array format:
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1'  => value,
        'ColumnA2'  => value,
        'TableB'    => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1'  => value,
                'ColumnB2'  => value
            )
        )
    )
);
so that I can loop it in this way
foreach ($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach ($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...'];
        echo $TableB['...'];
    }
}
My problem is: what is the best or proper way of querying the MySQL database to achieve this?
Solution1 --- The one I'm using
$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
    $array2 = array();
    while ($row2 = mysqli_fetch_assoc($rs2)) {
        $array2[] = $row2; // collect this row's TableB children
    }
    $row['TableB'] = $array2;
    $array[] = $row;
}
I'm doubting my code because it queries the database inside the loop, once per TableA row.
Solution2
$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
// Code
}
The second solution queries only once, but if I have a thousand rows in TableA and a thousand rows in TableB for each TableA row (1 TableA.ColumnAId = 1000 TableB rows), won't solution 2 take much more time than solution 1?
Neither of the two proposed solutions is likely optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
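On the third rule, a quick EXPLAIN would show whether that join uses an index on TableB.ColumnAId or falls back to a full scan. A sketch against the tables from the question:

EXPLAIN SELECT TableA.ColumnAId, TableA.ColumnA1, TableB.ColumnB1, TableB.ColumnB2
FROM TableA
JOIN TableB ON TableA.ColumnAId = TableB.ColumnAId;

-- In the output, check the 'key' column for TableB: NULL means a full
-- table scan. If no suitable index exists, add one:
ALTER TABLE TableB ADD INDEX ix_columnaid (ColumnAId);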
However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query on a MyISAM table (the default engine in older MySQL versions), the table is read-locked. This prevents any WRITE operations from happening on the table until the read lock is surrendered (our SELECT query completes or fails). Other SELECT queries are not affected, so a multi-threaded application keeps working. InnoDB (the current default) behaves differently: a plain SELECT doesn't block writers, but it reads from a consistent snapshot, so for our purposes the effect is the same. Either way, the query sees the table as of a single point in time.
That consistency is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
Oh, and finally, USE PREPARED STATEMENTS!
You should never have a statement that looks like this:
mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
mysqli has support for prepared statements. Read about them and use them; they will help you avoid something terrible happening to your database.
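A sketch of that same lookup as a prepared statement (assuming $con is a mysqli connection):

// The placeholder keeps the value out of the SQL string entirely.
$stmt = $con->prepare("SELECT ColumnBId, ColumnB1, ColumnB2 FROM TableB WHERE ColumnAId = ?");
$stmt->bind_param('i', $row['ColumnAId']); // 'i' = bind as integer
$stmt->execute();
$result = $stmt->get_result(); // get_result() requires the mysqlnd driver
while ($b = $result->fetch_assoc()) {
    // ... use $b ...
}
$stmt->close();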
Definitely the second way. A nested query is an ugly thing, since you pay all the query overheads (parsing, network round trip, etc.) once per nested query, while a single JOIN query is executed once, i.e. all overheads are paid only once.
A simple rule is not to run queries in loops, in general. There can be exceptions: if one query would be too complex, performance may favor splitting it, but that can only be shown by benchmarks and measurements in the specific case.
If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed results; if you just use it to store data and do the computations in your code, you're doing it wrong anyway.
In that case you should prefer a different storage/retrieval option, like a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface to InnoDB as well, but that is still a relational database doing key-value storage, aka the wrong tool for the job).
If you STILL want to use your solution:
Benchmark.
I've often found that issuing multiple queries can be faster than a single query, because MySQL has less query text to parse, the optimizer has less work to do, and more often than not the MySQL optimizer simply fails (that's the reason things like STRAIGHT_JOIN and index hints exist). Even when it does not fail, multiple queries might still be faster depending on the underlying storage engine and on how many threads try to access the data at once (lock granularity; this mostly matters when update queries are mixed in: InnoDB SELECTs don't lock the table, while MyISAM SELECTs take a shared read lock that blocks writers). Then again, you might even get different results from the two solutions if you don't use transactions, as data may change between queries when you issue several instead of one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where (a) data changes are common and/or (b) many concurrent threads (requests) access and update your tables, since lock granularity is better with split-up queries, as is their cacheability. If your database is on a different network, so that latency is an issue, you're probably better off with the second solution (though most people I know have MySQL on the same server, using local socket connections, which add little latency).
The situation may also change depending on how many times the loop actually executes.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of key you join on, it could be worse in either. E.g. if you use strings to join (you really shouldn't, but you don't say), performance of the PHP-heavy solution also depends on PHP's hash-map algorithm (PHP arrays are effectively hash maps) and the likelihood of collisions, while MySQL string indexes are normally fixed-length and thus, depending on your data, might not be applicable.
For memory efficiency, the multi-query version is certainly better: you end up with the PHP array (which is very inefficient in terms of memory!) in either solution, but the join might additionally use a temporary table under several circumstances (normally it shouldn't, but there ARE cases where it does; you can check using EXPLAIN on your queries).
In some cases you get the best performance by paginating with LIMIT. If you want to show, say, 1000 rows from a transaction table plus master data for each row, fetch the transactions in pages of 10-100 rows, collect the distinct foreign keys from the page, and then fetch all the matching master rows with a single WHERE ... IN query.
For example, first fetch one page of transactions:
SELECT productID, date FROM transaction_product LIMIT 100;
Collect the productID values from the result and de-duplicate them, then fetch the referenced master rows in one go:
SELECT productID, price FROM master_product WHERE productID IN (1, 2, 3, 4);
Finally, loop over the transactions in PHP and look up each row's product in the master data by productID, as in the sketch below.
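A sketch of that pattern in PHP (table and column names as above, $con an open mysqli connection):

// Query 1: one page of transactions.
$rs = mysqli_query($con, "SELECT productID, date FROM transaction_product LIMIT 100");
$transactions = [];
while ($row = mysqli_fetch_assoc($rs)) {
    $transactions[] = $row;
}

// The distinct product ids referenced by this page.
$ids = array_unique(array_column($transactions, 'productID'));
$ids = array_map('intval', $ids); // force integers: no injection risk

// Query 2: all referenced master rows at once (guard against an empty page,
// since "IN ()" is a syntax error).
$prices = [];
if ($ids) {
    $rs = mysqli_query($con,
        "SELECT productID, price FROM master_product
         WHERE productID IN (" . implode(',', $ids) . ")");
    while ($row = mysqli_fetch_assoc($rs)) {
        $prices[$row['productID']] = $row['price']; // index by id for O(1) lookup
    }
}

// Each transaction can now find its product without another query.
foreach ($transactions as $t) {
    $price = $prices[$t['productID']];
    // ...
}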
Is there an appreciable performance difference between having one SELECT foo, bar FROM users query that returns 500 rows, and 500 SELECT foo, bar FROM users WHERE id = x queries coming all at once?
In a PHP application I'm writing, I'm trying to choose between writing a clear, readable section of code that would produce about 500 SELECT statements, and writing it in an obscure, complex way that would use only one SELECT returning 500 rows.
I would prefer the way that uses clear, maintainable code, but I'm concerned that the connection overhead for each of the SELECTs will cause performance problems.
Background info, in case it's relevant:
1) This is a Drupal module, coded in PHP
2) The tables in question get very few INSERTs and UPDATEs, and are rarely locked
3) SQL JOINs aren't possible for reasons not relevant to the question
Thanks!
It's almost always faster to do one big batch SELECT and parse the results in your application code than doing a massive amount of SELECTs for one row. I would recommend that you implement both and profile them, though. Always strive to minimize the number of assumptions you have to make.
I would not worry too much about the connection overhead of MySQL queries, especially if you are not closing the connection between queries. Consider that if your query creates a temporary table, you've already spent more time in the query itself than the per-query overhead costs.
I love doing complex SQL queries, personally, but I have found that table size, the MySQL query cache, and the performance of queries that need range checks (even against an index) all make a difference.
I suggest this:
1) Establish the simple, correct baseline. I suspect this is the zillion-query approach. This is not wrong, and very likely correct. Run it a few times and watch your query cache and application performance. The ability to keep your app maintainable is very important, especially if you work with other code maintainers. Also, if you're querying really large tables, small queries will maintain scalability.
2) Code the complex query. Compare the results for accuracy, and then the time. Then use EXPLAIN on the query to see how many rows are scanned. I have often found that if I have a JOIN, or a WHERE x != y, or a condition that creates a temporary table, query performance can get pretty bad, especially on a table that's always being updated. However, I've also found that a complex query might not be correct, and that a complex query can more easily break as an application grows. Complex queries typically scan larger sets of rows, often create temporary tables, and use WHERE scans. The larger the table, the more expensive these get. Also, you might have team considerations where complex queries don't suit your team's strengths.
3) Share the results with your team.
Complex queries are less likely to hit the MySQL query cache, and if they are large enough, don't cache them (you want to save the query cache for frequently-hit queries). Also, WHERE predicates that have to scan the index will not do as well (x != y, x > y, x < y). Queries like SELECT foo, bar FROM users WHERE foo != 'g' AND mumble < '360' end up doing scans (though the per-query overhead may be negligible in that case).
Small queries can often complete without creating temporary tables, just by getting all values from the index, so long as the fields you select and predicate on are indexed. So the query performance of SELECT foo, bar FROM users WHERE id = x is really great, especially if foo and bar are covered by an index, e.g. ALTER TABLE users ADD INDEX ix_a (foo, bar);.
Other good ways to increase performance in your application would be to cache those small query results in the application (if appropriate), or to compute a materialized-view-style summary in a batch job. Also consider memcached, or some of the features found in XCache.
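For the application-side caching idea, a minimal sketch using the Memcached extension (the key name and 60-second TTL are arbitrary choices here; $con is assumed to be a mysqli connection):

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'users_foo_bar_' . (int)$id;
$rows = $mc->get($key);
if ($rows === false) {
    // Cache miss: hit the database, then cache the result briefly.
    $result = mysqli_query($con, "SELECT foo, bar FROM users WHERE id = " . (int)$id);
    $rows = mysqli_fetch_all($result, MYSQLI_ASSOC); // requires mysqlnd
    $mc->set($key, $rows, 60); // expire after 60 seconds
}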
It seems like you know what the 500 id values are, so why not do something like this:
// Assuming you have already validated that this array contains only integers,
// so there is no risk of SQL injection
$ids = join(',', $arrayOfIds);
$sql = "SELECT `foo`, `bar` FROM `users` WHERE `id` IN ($ids)";