I have a question.
I have a big query that gets all the products of a web store; I then process the data into a CSV to synchronize with an external server, in this case DooFinder.
Right now I do the processing in the query. Example:
- round((p.p_price*(SELECT tax_rate FROM tax_rates WHERE tax_rates.tax_rates_id=p.p_tax_class_id)/100)+p.p_price,2)
- concat('https://www.domain.de/pimg/',p.p_image) AS image, p.manufacturers_id
And the question is: which will be more efficient, doing the operations in the query or in PHP? Right now I have about 20 products on a test site and it works perfectly, but the goal is to have 1,000+ products.
This is the query ($i is for each language, so 1,000+ products * number of languages):
SELECT
pd.p_name,
p.p_quantity,
concat('https://www.domain.de/pinf.php?products_id=',p.products_id) AS p_link,
pd.p_description,
p.p_id,
p.p_tax_class_id,
p.p_date_available,
round((p.p_price*(SELECT tax_rate FROM tax_rates WHERE tax_rates.tax_rates_id=p.p_tax_class_id)/100)+p.p_price,2) AS price,
concat('https://www.domain.de/pimg/',p.p_image) AS image, p.manufacturers_id
FROM
products p,
p_description pd,
p_to_categories ptc
WHERE
p.p_id = pd.p_id
AND
pd.language_id = ".$i."
AND
p.p_status=1
AND
ptc.p_id = p.p_id
AND
ptc.categories_id != 218
GROUP by p.p_id
When deciding whether to perform the computations on the client or on the database server, you should consider the following:
Which one has more spare CPU cycles?
Will making the client do the computation require additional data transfer from the server? If doing it on the server reduces the data transfer, it may be worthwhile to make the database CPU work a bit harder.
Can you reduce the data transfer by making the client compute the result? E.g. suppose the client displays 100 computations based on one column. In that case it makes sense to fetch just that column instead of all of the computations based on it.
In your specific example, the overhead of extra computations is going to be marginal relative to the rest of the query, so if you already have it coded to do it on the server, I would leave it like that. At the same time, if you are doing it on the client, it is not that bad either, so I would still not bother refactoring.
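For illustration, the client-side version is just a couple of lines per row while you write the CSV (a sketch; it assumes the query also returns the raw tax_rate, e.g. via the join suggested below):
// Client-side computation per fetched row (sketch).
$price = round($row['p_price'] * (1 + $row['tax_rate'] / 100), 2);
$image = 'https://www.domain.de/pimg/' . $row['p_image'];
$link  = 'https://www.domain.de/pinf.php?products_id=' . $row['products_id'];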
However, there is one thing that I would fix - rewrite the tax rate sub-select as a join. This one would have more overhead than any of the computations even if properly optimized by the MySQL optimizer.
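For instance, the sub-select could be rewritten roughly like this (a sketch based on the columns in your query, assuming one tax_rates row per tax class; not tested):
SELECT
    pd.p_name,
    p.p_quantity,
    CONCAT('https://www.domain.de/pinf.php?products_id=', p.products_id) AS p_link,
    pd.p_description,
    p.p_id,
    p.p_tax_class_id,
    p.p_date_available,
    ROUND(p.p_price * (1 + tr.tax_rate / 100), 2) AS price,
    CONCAT('https://www.domain.de/pimg/', p.p_image) AS image,
    p.manufacturers_id
FROM products p
JOIN p_description pd ON pd.p_id = p.p_id
JOIN p_to_categories ptc ON ptc.p_id = p.p_id
LEFT JOIN tax_rates tr ON tr.tax_rates_id = p.p_tax_class_id
WHERE pd.language_id = ".$i."
  AND p.p_status = 1
  AND ptc.categories_id != 218
GROUP BY p.p_id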
And, of course, as was suggested in the comments, benchmark your performance and make decisions based on those measurements. Aside from the obvious, proper benchmarking has the side benefit of helping you find odd performance and even functionality bugs that would otherwise be found by your users at the worst possible time.
Related
I've been struggling with trying to figure out what is the best, most efficient way to handle displaying child data. In my specific case I'm using PHP and MySQL, but I feel that this is more of a "generally in any language" sort of deal.
My two thoughts are (for this example I'll be listing invoices and their line items):
Joining the child data (invoice items) to the main data (invoices) so as to have only a single query. My problem with this is that, say I have 500 line items on an invoice (probably not realistic, but things happen), then I would have sent the overall invoice data 500 times from the MySQL server to my PHP script, and that sounds ridiculous since I only need it once.
The second option would be to select each invoice's line items while looping through the invoices and displaying the overall invoice data. And this, of course, means contacting the database 500 more times.
Are there any other options for dealing with this data that makes logical sense (with the given schema)? I'm almost 100% sure there are, since I can't believe that I'm the first person to think about this issue, but I think I'm just having difficulty finding the right way to search for more information on this topic.
Joining the child data (invoice items) to the main data (invoices) as to only have a single query
That's the conventional way of handling this requirement. It does, indeed, involve redundant data, and there is some overhead.
But.
That's the reason it's possible to specify a compressed connection to an RDBMS from a client ... compression mitigates the network overhead of the redundant data (see the sketch below).
The redundant data in a single result set costs much less than the repeated queries.
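For example, with mysqli the compressed connection can be requested like this (a sketch; the host and credentials are placeholders):
// Request a compressed client/server connection (sketch).
$con = mysqli_init();
mysqli_real_connect($con, 'localhost', 'dbuser', 'dbpass', 'shop',
                    3306, null, MYSQLI_CLIENT_COMPRESS);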
Most folks just retrieve the redundant data in this kind of application. Program products like Crystal Reports do this.
If it just won't work for you, you retrieve and save one result set for your master records ... maybe something like this.
SELECT master_id, name, address, whatever
FROM master m
WHERE m.whatever = whatever
ORDER BY whatever
Then, put those into an associative array by master_id.
Then, retrieve the detail records.
SELECT d.master_id, d.detail_id, d.item, d.unit, d.quantity
FROM detail d
JOIN master m ON d.master_id = m.master_id
WHERE m.whatever = whatever
ORDER BY d.master_id, d.detail_id, whatever
You'll get a result set with all your relevant detail records (invoice items) tagged with the master_id value. You can then match them up in your php program to the master records. You're basically doing the join in your application code.
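A minimal sketch of that application-side matching, assuming a mysqli connection in $con and the placeholder WHERE clauses from the queries above filled in:
// Sketch: fetch masters keyed by master_id, then attach detail rows.
$masters = array();
$rs = mysqli_query($con, "SELECT master_id, name, address FROM master m WHERE m.whatever = whatever");
while ($row = mysqli_fetch_assoc($rs)) {
    $row['details'] = array();               // will hold this master's detail rows
    $masters[$row['master_id']] = $row;      // key the array by master_id
}

$rs = mysqli_query($con, "SELECT d.master_id, d.detail_id, d.item, d.unit, d.quantity
                            FROM detail d
                            JOIN master m ON d.master_id = m.master_id
                           WHERE m.whatever = whatever
                           ORDER BY d.master_id, d.detail_id");
while ($row = mysqli_fetch_assoc($rs)) {
    $masters[$row['master_id']]['details'][] = $row;   // the "join" in application code
}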
If all that sounds like too much of a pain in the neck... just go for the redundant data and get your project done. You can always optimize later if you must.
I have these two MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I wanted to have this multidimensional array format
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1'  => value,
        'ColumnA2'  => value,
        'TableB' => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1'  => value,
                'ColumnB2'  => value
            )
        )
    )
);
so that I can loop it in this way
foreach ($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach ($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...'];
        echo $TableB['...'];
    }
}
My problem is: what is the best or proper way of querying the MySQL database so that I can achieve this?
Solution1 --- The one I'm using
$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
    // $array2 = result of $rs2 fetched into an array
    $row['TableB'] = $array2;
    $array[] = $row;
}
I'm doubting my code because it's always querying the database.
Solution2
$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
// Code
}
The second solution queries only once, but if I have thousands of rows in TableA and a thousand rows in TableB for each TableA.ColumnAId (1 TableA.ColumnAId = 1000 TableB rows), will Solution 2 take much more time than Solution 1?
Neither of the two proposed solutions is likely optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
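Roughly like this (a sketch, using the column names from the question):
SELECT a.ColumnAId, a.ColumnA1, a.ColumnA2,
       b.ColumnBId, b.ColumnB1, b.ColumnB2
  FROM TableA a
  JOIN TableB b ON b.ColumnAId = a.ColumnAId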
However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query against a MyISAM table (MyISAM and InnoDB being MySQL's default storage engines), the table is read-locked. This prevents any WRITE operations from happening on the table until the read lock is surrendered (either our SELECT query completes or fails). With InnoDB, a plain SELECT does not block writers; instead it reads from a consistent snapshot (MVCC), which has the same effect for our purposes: the single query sees the data as of one point in time. Other SELECT queries are not affected in either case, so if you're running a multi-threaded application, they will continue to work.
This consistency is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
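Under InnoDB that would look roughly like this (a sketch; strictly speaking this gives a consistent snapshot rather than an explicit lock):
START TRANSACTION WITH CONSISTENT SNAPSHOT;  -- every SELECT in the loop reads the same snapshot
-- ... run the per-user SELECT queries here ...
COMMIT;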
Oh, and finally, USE PREPARED STATEMENTS!
You should never have a statement that looks like this:
mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
mysqli has support for prepared statements. Read about them and use them, they will help you to avoid something terrible happening to your database.
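A minimal sketch of the per-row lookup from solution 1 rewritten as a prepared statement (procedural mysqli; $tableARows is a placeholder for the rows already fetched from TableA, and mysqli_stmt_get_result() requires the mysqlnd driver):
// Prepare once, execute many times with a bound parameter.
$stmt = mysqli_prepare($con, "SELECT ColumnBId, ColumnAId, ColumnB1, ColumnB2
                                FROM TableB WHERE ColumnAId = ?");
mysqli_stmt_bind_param($stmt, 'i', $columnAId);   // 'i' = integer parameter, bound by reference
foreach ($tableARows as $row) {
    $columnAId = $row['ColumnAId'];
    mysqli_stmt_execute($stmt);
    $result = mysqli_stmt_get_result($stmt);
    while ($b = mysqli_fetch_assoc($result)) {
        // ... use $b ...
    }
}
Even with prepared statements, the single-JOIN version is still the preferred shape; this just removes the injection risk when you do have per-row queries.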
Definitely the second way. A nested query is an ugly thing, since you pay all the query overheads (execution, network, etc.) every time for every nested query, while the single JOIN query is executed once, i.e. all overheads are paid only once.
The simple rule is not to run queries in loops, in general. There can be exceptions: if a single query becomes too complex, it may need to be split for performance, but whether that is the case can only be shown by benchmarks and measurements.
If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed data; if you just use it to store data and do the computations in your code, you're arguably doing it wrong anyway.
In such a case you should prefer a different storage/retrieval option such as a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface for InnoDB as well, but that is still using a relational database for key-value storage, i.e. the wrong tool for the job).
If you STILL want to use your solution:
Benchmark.
I've often found that issuing multiple queries can be faster than a single query, because each individual query is simpler to parse, the optimizer has less work to do per query, and more often than not the MySQL optimizer simply fails (that's the reason things like STRAIGHT_JOIN and index hints exist). Even when it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - though this only matters when update queries are mixed in; neither MyISAM nor InnoDB locks the whole table for SELECT queries by default). Then again, you might even get different results from the two solutions if you don't use transactions, as data might change between queries when you use multiple queries instead of a single one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrent threads (requests) accessing and updating your tables (lock granularity is better with split-up queries, as is cacheability of the queries). If your database is on a different network, e.g. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).
Situation may also change depending on how often the for loop is actually executed.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you use to join, it could be worse for either. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on PHP's hash map algorithm (arrays in PHP are effectively hash maps) and the likelihood of collisions, while MySQL string indexes are normally fixed length and thus, depending on your data, might not be applicable.
For memory efficiency, the multi-query version is certainly better: you have the PHP array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might additionally use a temporary table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN on your queries).
In some cases, you should use LIMIT for the best performance.
If you want to show 1,000 rows plus some master data for each, run the detail query with a LIMIT of 10-100 rows at a time.
Then collect the foreign keys to the master data, make them unique, and fetch the master rows with a single query using WHERE ... IN, limited to the count of unique keys.
Example
SELECT productID, date FROM transaction_product LIMIT 100
Get all the productID values and make them unique.
Then:
SELECT price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4 (4 = count of unique IDs)
foreach (transaction)
    master_product[productID]
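A sketch of that pattern in PHP (assuming a mysqli connection in $con and integer productID values):
// Fetch a limited page of transactions and collect the unique productIDs.
$rs = mysqli_query($con, "SELECT productID, date FROM transaction_product LIMIT 100");
$transactions = array();
$productIds = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $transactions[] = $row;
    $productIds[$row['productID']] = true;
}

// One query for all the master rows, then index them by productID.
$idList = implode(',', array_map('intval', array_keys($productIds)));
$rs = mysqli_query($con, "SELECT productID, price FROM master_product WHERE productID IN ($idList)");
$masterProduct = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $masterProduct[$row['productID']] = $row;
}

foreach ($transactions as $t) {
    $price = $masterProduct[$t['productID']]['price'];
    // ... display the transaction with its price ...
}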
I know this has been asked before at least in this thread:
is php sort better than mysql "order by"?
However, I'm still not sure about the right option here, since doing the sorting on the PHP side is almost 40 times faster.
This MySQL query runs in about 350-400ms
SELECT
keywords as id,
SUM(impressions) as impressions,
SUM(clicks) as clicks,
SUM(conversions) as conversions,
SUM(not_ctr) as not_ctr,
SUM(revenue) as revenue,
SUM(cost) as cost
FROM visits WHERE campaign_id = 104 GROUP BY keywords DESC (keywords is an integer)
Keywords and campaign_id columns are indexed.
It works over about 150k rows and returns around 1,500 rows in total.
The results are then recalculated (we calculate click through rates, conversion rates, ROI etc, as well as the totals for the whole result set). The calculations are done in PHP.
Now my idea was to store the results with PHP APC for quick retrieval, however we need to be able to order these results by the columns as well as the calculated values, therefore if I wanted to order by click-through rate I'd have to use
SUM(clicks) / (SUM(impressions) - SUM(not_ctr)) within the query, which makes it around 40ms slower, and the initial 400ms is already a really long time.
In addition we paginate these results, but adding LIMIT 0,200 doesn't really affect the performance.
While testing the APC approach I executed the query, did the additional calculations and stored the array in memory so it would only be executed once during the initial request and that worked like a charm. Fetching and sorting the array from memory only took around 10ms, however the script memory usage was about 25mb. Maybe it's worth loading the results into a memory table and then querying that table directly?
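If you try the memory-table idea, it might look roughly like this (a sketch; the table name is made up, and MEMORY tables are lost on server restart):
CREATE TABLE keyword_stats ENGINE=MEMORY AS
SELECT keywords AS id,
       SUM(impressions) AS impressions,
       SUM(clicks)      AS clicks,
       SUM(conversions) AS conversions,
       SUM(not_ctr)     AS not_ctr,
       SUM(revenue)     AS revenue,
       SUM(cost)        AS cost
  FROM visits
 WHERE campaign_id = 104
 GROUP BY keywords;

-- then sort/paginate the small table directly, e.g. by click-through rate:
SELECT *, clicks / (impressions - not_ctr) AS ctr
  FROM keyword_stats
 ORDER BY ctr DESC
 LIMIT 0, 200;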
This is all done on my local machine(i7, 8gb ram) which has the default MySQL install and the production server is a 512MB box on Rackspace on which I haven't tested yet, so if possible ignore the server setup.
So the real question is: Is it worth using memory tables or should I just use the PHP sorting and ignore the RAM usage since I can always upgrade the RAM? What other options would you consider in optimizing the performance?
In general, you want to do sorting on the database server and not in the application. One good reason is that the database should be implementing parallel sorts and it has access to indexes. A general rule may not be applicable in all circumstances.
I'm wondering if your indexes are helping you. I would recommend that you try the query (see the sketch after this list):
With no indexes
With an index only on campaign_id
With both indexes
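Something along these lines (the index names are made up):
EXPLAIN SELECT keywords, SUM(clicks) FROM visits WHERE campaign_id = 104 GROUP BY keywords;

ALTER TABLE visits ADD INDEX ix_campaign (campaign_id);
-- or a composite index that also covers the grouping column:
ALTER TABLE visits ADD INDEX ix_campaign_keywords (campaign_id, keywords);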
Indexes are not always useful. One particularly important factor is called "selectivity". If you only have two campaigns in the table, then you are probably better off doing a full-table scan rather than searching indirectly through an index. This becomes particularly important when the table does not fit into memory (where every row lookup may require loading a page into the cache).
Finally, if this is going to be an application that expands beyond your single server, be careful. What is optimal on a single machine may not be optimal in a different environment.
I tried to calculate Euclidean distance in PHP using the following code, but the time it takes is very long. I want to test whether performing the same operation in C would be faster. The input data is passed from PHP, whereas all the other data is stored in the MySQL database. How can I make the operation fast, given that I have to calculate the distance for 30,000+ images with about 900 attributes each? So how can I make this calculation faster in C than in PHP? I have not programmed much in C, so any suggestion will be highly appreciated.
The query used in PHP for the distance calculation can be summarized as below:
SELECT tbl_img.img_id,
tbl_img.img_path,
((pow(($r[9]-coarsewt_1),2))+(pow(($r[11]-coarsewt_2),2))+ ... +(pow(($r[31]-coarsewt_12),2))+
(pow(($r[36]-finewt_$wt1),2))+(pow(($r[38]-finewt_$wt2),2))+(pow(($r[40]-finewt_$wt3),2))+
(pow(($r[43]-shape_1),2))+(pow(($r[44]-shape_2),2))+ ... +(pow(($r[462]-shape_420),2))+
(pow(($r[465]-texture_1),2))+(pow(($r[466]-texture_2),2))+ ... +(pow(($r[883]-texture_419),2))+(pow(($r[884]-texture_420),2)))
as distance
FROM tbl_img
INNER JOIN tbl_coarsewt
ON tbl_img.img_id=tbl_coarsewt.img_id
INNER JOIN tbl_finewt
ON tbl_img.img_id=tbl_finewt.img_id
INNER JOIN tbl_shape
ON tbl_img.img_id=tbl_shape.img_id
INNER JOIN tbl_texture
ON tbl_img.img_id=tbl_texture.img_id
WHERE tbl_img.img_id>=1 AND tbl_img.img_id<=31930
ORDER BY distance ASC LIMIT 6
Your problem is not with the language, as Arash Kordi put it. That SQL is going to be executed by your SQL server, and thanks to the algorithm used, that server is going to be your bottleneck, not the language your script is written in. If you switch to C, you won't gain any significant speed unless you also change your strategy.
Basic rules of thumb for optimization:
Don't use the database for your calculations. Use the database to fetch the relevant data, then carry out the calculations in PHP or C (see the sketch after this list).
(Pre-calculated?) Look-up arrays: analyze your data and see if you can build a look-up array of -- say -- pow() results instead of recalculating each value every time. This is helpful if you have a lot of repetitive data.
Avoid serialization -- could you run multiple instances of your script in parallel on different sections of your data to maximize throughput?
Consider using server-side prepared statements -- they may speed things up a little.
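As a rough illustration of the first rule: fetch the raw attribute rows and do the arithmetic in PHP (a sketch; $con and $query are placeholders, and it assumes $query selects img_id followed by the attribute columns, with the query vector $r re-indexed to line up with them):
// Compute squared Euclidean distance per image in PHP, keep the 6 closest.
$rs = mysqli_query($con, $query);
$dist = array();
while ($row = mysqli_fetch_row($rs)) {
    $imgId = array_shift($row);        // first column: img_id
    $d = 0.0;
    foreach ($row as $i => $value) {
        $diff = $r[$i] - $value;
        $d += $diff * $diff;           // sum of squared differences
    }
    $dist[$imgId] = $d;
}
asort($dist);                                   // ascending by distance
$closest = array_slice($dist, 0, 6, true);      // equivalent of ORDER BY distance ASC LIMIT 6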
Is there an appreciable performance difference between having one SELECT foo, bar FROM users query that returns 500 rows, and 500 SELECT foo, bar FROM users WHERE id = x queries coming all at once?
In a PHP application I'm writing, I'm trying to choose between writing a clear, readable section of code that would produce about 500 SELECT statements, or writing it in an obscure, complex way that would use only one SELECT that returns 500 rows.
I would prefer the way that uses clear, maintainable code, but I'm concerned that the connection overhead for each of the SELECTs will cause performance problems.
Background info, in case it's relevant:
1) This is a Drupal module, coded in PHP
2) The tables in question get very few INSERTs and UPDATEs, and are rarely locked
3) SQL JOINs aren't possible for reasons not relevant to the question
Thanks!
It's almost always faster to do one big batch SELECT and parse the results in your application code than doing a massive amount of SELECTs for one row. I would recommend that you implement both and profile them, though. Always strive to minimize the number of assumptions you have to make.
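Profiling can be as crude as timing both variants with microtime() (a sketch; the two wrapper functions are hypothetical):
// Rough timing of the two approaches.
$t0 = microtime(true);
run_batch_select();        // hypothetical wrapper around the single 500-row SELECT
$batch = microtime(true) - $t0;

$t0 = microtime(true);
run_per_row_selects();     // hypothetical wrapper around the 500 single-row SELECTs
$perRow = microtime(true) - $t0;

printf("batch: %.4fs, per-row: %.4fs\n", $batch, $perRow);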
I would not worry about the connection overhead of mysql queries too much, especially if you are not closing the connection between every query. Consider that if your query creates a temporary table, you've already spent more time in the query than the overhead of the query took.
I love doing a complex SQL query, personally, but I have found that the size of the tables, mysql query cache and query performance of queries that need to do range checking (even against an index) all make a difference.
I suggest this:
1) Establish the simple, correct baseline. I suspect this is the zillion-query approach. This is not wrong, and very likely helpfully correct. Run it a few times and watch your query cache and application performance. The ability to keep your app maintainable is very important, especially if you work with other code maintainers. Also, if you're querying really large tables, small queries will maintain scalability.
2) Code the complex query. Compare the results for accuracy, and then the time. Then use EXPLAIN on the query to see which rows are scanned. I have often found that if I have a JOIN, or a WHERE x != y, or a condition that creates a temporary table, the query performance can get pretty bad, especially if the table is constantly being updated. However, I've also found that a complex query might not be correct, and that a complex query can more easily break as an application grows. Complex queries typically scan larger sets of rows, often creating temporary tables and "Using where" scans. The larger the table, the more expensive these get. Also, you might have team considerations where complex queries don't suit your team's strengths.
3) Share the results with your team.
Complex queries are less likely to hit the MySQL query cache, and if they are large enough, don't cache them. (You want to save the MySQL query cache for frequently hit queries.) Also, WHERE predicates that have to scan the index will not do as well (x != y, x > y, x < y). Queries like SELECT foo, bar FROM users WHERE foo != 'g' AND mumble < '360' end up doing scans. (The cost of query overhead could be negligible in that case.)
Small queries can often complete without creating temporary tables just by getting all values from the index, as long as the fields you're selecting and predicating on are indexed. So the query performance of SELECT foo, bar FROM users WHERE id = x is really great (especially if columns foo and bar are covered by an index, e.g. ALTER TABLE users ADD INDEX ix_a (foo, bar);).
Other good ways to increase performance in your application would be to cache those small query results in the application (if appropriate), or doing batch jobs of a materialized view query. Also, consider memcached or some features found in XCache.
It seems like you know what the 500 id values are, so why not do something like this:
// Assuming you have already validated that this array contains only integers
// so there is no risk of SQL injection
$ids = join(',', $arrayOfIds);
$sql = "SELECT `foo`, `bar` FROM `users` WHERE `id` IN ($ids)";