I've got two MySQL queries that both insert data into a table. Both have the following format:
CREATE TABLE IF NOT EXISTS `data` (
`id` BIGINT NOT NULL AUTO_INCREMENT UNIQUE,
PRIMARY KEY (`id`)
)
SELECT `field1`, `field2`
WHERE `active` = 1
The only differences between the two queries are how field1 and field2 are determined, plus some minor differences in the WHERE conditions. Each query handles around 12K records or more.
Now, what will be more efficient:
A. Run both queries separately:
if (mysql_query($query1)) {
return mysql_query($query2);
}
return false;
B. OR combine the two queries with a UNION, and run once:
$query = 'SELECT `field1`, `field2` WHERE `active` = 1
UNION
SELECT DO_ONE(`field1`), DO_TWO(`field2`) WHERE `active` = 1
ORDER BY `field1`';
return mysql_query('CREATE TABLE IF NOT EXISTS `data` (
`id` BIGINT NOT NULL AUTO_INCREMENT UNIQUE,
PRIMARY KEY (`id`)
) ' . $query)
The data from the one query is useless without the data from the other, so both need to succeed. DO_ONE and DO_TWO are user defined MySQL functions that change the field data according to some specs.
Aaronmccall's answer is probably the best in general -- the UNION approach does it all in one SQL call. In general that will be the most "efficient", but there are side issues that could come into play and change what "efficient" means for your particular application.
Specifically, if the UNION requires a temporary table to gather the intermediate results and you are working with very large sets of data, then doing two separate straight SELECTs into the new table might turn out being more efficient in your particular case. This would depend on the internal workings, optimizations done, etc within the database engine (which could change depending on the version of the database engine you are using).
Ultimately, the only way to answer a question this specific might be to do timings for your particular application and environment.
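If it helps, one quick way to get rough numbers straight from the mysql client is the session profiler (a sketch only; the profiler is deprecated in newer MySQL versions, and it measures server-side time, so the round-trip cost of the extra mysql_query() call still has to be measured from PHP):
SET profiling = 1;
-- run approach A (the two separate statements) and approach B (the UNION version) here
SHOW PROFILES;   -- lists the elapsed server-side time of each statement run in this session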
You also might want to consider that the difference between the time required for two separate queries vs an "all in one" query might be insignificant in the grand scheme of things... you are probably talking about a difference of a few milliseconds (or even microseconds?) unless your mysql database is on a separate server with huge latency issues. If you are doing thousands of these calls in one shot, then the difference might be significant, but if you are only doing one or two of these calls and your application is spending 99.99% of its time executing other things, then the difference between the two probably won't even be noticed.
---Lawrence
The UNION approach should definitely be faster due to the expense of making two mysql api calls from php vs. one.
Your options do different things. The first one returns the results of the second query if the first query executes correctly (which, by the way, is independent of the rows it returns -- it could return an empty rowset). The second one returns the results of the first query and the second query together. The first option seems pretty useless to me; what you probably want to achieve is what you did with the UNION (unless I misunderstood you).
EDIT: After reading your comment, I think you are after something like this:
SELECT true WHERE (EXISTS(SELECT field1, field2 ...) AND EXISTS(SELECT field1, field2 ...)).
That way you will have only one query to the DB, which scales better, takes fewer resources from the connection pool and doesn't double the impact of latency if your DB engine is on a different server, but you will still cut the query short if the first condition fails, which is the performance improvement you were looking for with the nested separate queries.
As an optimization, put the condition that executes faster first, in case they are not equally expensive. I assume the one that requires those field calculations will be the slower of the two.
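As a rough sketch of that combined existence check (the question never names the source table, so source_table and the second condition are placeholders):
SELECT (EXISTS(SELECT 1 FROM source_table WHERE active = 1)
    AND EXISTS(SELECT 1 FROM source_table WHERE active = 1 AND other_flag = 1)) AS both_present;
-- AND can short-circuit: when the first EXISTS is false, the second subquery need not be evaluated.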
These two queries take too long to produce a result (sometimes a minute, and sometimes they even end in an error) and put a really heavy load on the server:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
The table has about 4 million rows. The only index I have is on traffic_id.
If you select anything from the traffic_stats table it takes forever; inserting into the table, however, is normal.
Is it possible to reduce the time spent on executing this query? I use PDO and I am new to all this.
The ORDER BY takes a lot of time, and since you only need aggregate data (adding or counting numbers is commutative), the ORDER BY does a lot of useless sorting, costing you time and server power.
You will need to make sure your indexing is right; you will probably need an index on userid and one on (userid, created).
Is userid numeric? If not, consider converting it to a numeric type, int for example.
These are improving your query and structure. But let's improve the concept as well. Are insertions and modifications very frequent? Do you absolutely need real-time data, or you can do with quasi-realtime data as well?
If insertions/modifications are not very frequent, or you can live with older data, or the problem is causing huge trouble, then you could periodically run a cron job that calculates these values and caches them. The application would then read them from the cache.
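A rough sketch of that caching idea, with an assumed cache table (all names and types here are illustrative, and the PREFIX from the PHP code is omitted):
CREATE TABLE IF NOT EXISTS user_traffic_cache (
  userid INT NOT NULL PRIMARY KEY,
  total_clicks BIGINT NOT NULL,
  today_earned DECIMAL(12,2) NOT NULL
);

-- Run periodically from cron; recomputes the aggregates for every user in one pass.
REPLACE INTO user_traffic_cache (userid, total_clicks, today_earned)
SELECT userid,
       COUNT(*),
       SUM(CASE WHEN from_unixtime(created) > CURRENT_DATE THEN rate ELSE 0 END)
FROM traffic_stats
GROUP BY userid;
REPLACE INTO is used so the job can simply be re-run; INSERT ... ON DUPLICATE KEY UPDATE would work just as well.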
I'm not sure why you accepted an answer, when you really didn't get to the heart of your problem.
I also want to clarify that this is a mysql question, and the fact that you are using PDO or PHP for that matter is not important.
People advised you to utilize EXPLAIN. I would go one further and tell you that you need to use EXPLAIN EXTENDED possibly with the format=json option to get a full picture of what is going on. Looking at your screen shot of the explain, what should jump out at you is that the query looked at over 1m rows to get an answer. This is why your queries are taking so long!
At the end of the day, if you have properly indexed your tables, your goal should be in a large table like this, to have number of rows examined be fairly close to the final result set.
So let's look at the 2nd query, which is quite simple:
("SELECT COUNT(`userid`) AS `total_clicks` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i", $user->data->userid)
In this case the only thing that is really important is that you have an index on traffic_stats.userid.
I would recommend that, if you are uncertain at this point, you drop all indexes other than the original primary key (traffic_id) index and start with only an index on the userid column. Run your query. What is the result, and how long does it take? Look at the EXPLAIN EXTENDED output. Given the simplicity of the query, you should see that only the index is being used and the rows should match the result.
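A sketch of that check (123 is just a placeholder userid, and the PREFIX from the PHP code is left out):
ALTER TABLE traffic_stats ADD INDEX idx_userid (userid);   -- the single index to start with

EXPLAIN EXTENDED
SELECT COUNT(userid) AS total_clicks
FROM traffic_stats
WHERE userid = 123;

SHOW WARNINGS;   -- shows the query as the optimizer actually rewrote it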
Now to your first query:
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND from_unixtime(created) > CURRENT_DATE ORDER BY created DESC", $user->data->userid)
Looking at the WHERE clause there are these criteria:
userid =
from_unixtime(created) > CURRENT_DATE
You already have an index on userid. Despite the advice given previously, it is not necessarily correct to have an index on userid, created, and in your case it is of no value whatsoever.
The reason for this is that you are utilizing a mysql function from_unixtime(created) to transform the raw value of the created column.
Whenever you do this, an index can't be used. You would not have any concerns in doing a comparison with the CURRENT_DATE if you were using the native TIMESTAMP type but in this case, to handle the mismatch, you simply need to convert CURRENT_DATE rather than the created column.
You can do this by passing CURRENT_DATE as a parameter to UNIX_TIMESTAMP.
mysql> select UNIX_TIMESTAMP(), UNIX_TIMESTAMP(CURRENT_DATE);
+------------------+------------------------------+
| UNIX_TIMESTAMP() | UNIX_TIMESTAMP(CURRENT_DATE) |
+------------------+------------------------------+
|       1490059767 |                   1490054400 |
+------------------+------------------------------+
1 row in set (0.00 sec)
As you can see from this quick example, UNIX_TIMESTAMP by itself is going to be the current time, but CURRENT_DATE is essentially the start of day, which is apparently what you are looking for.
I'm willing to bet that the number of rows for the current date are going to be fewer in number than the total rows for a user over the history of the system, so this is why you would not want an index on user, created as previously advised in the accepted answer. You might benefit from an index on created, userid.
My advice would be to start with an individual index on each of the columns separately.
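In DDL terms that starting point might look like this (index names are illustrative; idx_userid on userid already exists from the earlier step):
ALTER TABLE traffic_stats ADD INDEX idx_created (created);
-- and, only if testing shows it helps, the composite mentioned above:
-- ALTER TABLE traffic_stats ADD INDEX idx_created_userid (created, userid);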
("SELECT SUM(`rate`) AS `today_earned` FROM `".PREFIX."traffic_stats` WHERE `userid` = ?i AND created > UNIX_TIMESTAMP(CURRENT_DATE)", $user->data->userid)
And with your re-written query, again assuming that the result set is relatively small, you should see a clean EXPLAIN with rows matching your final result set.
As for whether or not you should apply an ORDER BY, this shouldn't be something you eliminate for performance reasons, but rather because it isn't relevant to your desired result. If you need or want the results ordered by user, then leave it. Unless you are producing a large result set, it shouldn't be a major problem.
In the case of that particular query, since you are doing a SUM(), there is no value of ORDERING the data, because you are only going to get one row back, so in that case I agree with Lajos, but there are many times when you might be utilizing a GROUP BY, and in that case, you might want the final results ordered.
I am working on converting a prototype web application into something that can be deployed. There are some locations where the prototype has queries that select all the fields from a table although only one field is needed or the query is just being used for checking the existence of the record. Most of the cases are single row queries.
I'm considering changing these queries to queries that only get what is really relevant, i.e.:
select * from users_table where <some condition>
vs
select name from users_table where <some condition>
I have a few questions:
Is this a worthy optimization in general?
In which kind of queries might this change be particularly good? For example, would this improve queries where joins are involved?
Besides the SQL impact, would this change be good at the PHP level? For example, the returned array will be smaller (a single column vs multiple columns with data).
Thanks for your comments.
If I were to answer all of your three questions in a single word, I would definitely say YES.
You probably wanted more than just "Yes"...
SELECT * is "bad practice": if you read the results into a non-associative PHP array and later add a column to the table, the array subscripts may all shift.
If the WHERE is complex enough, or you have GROUP BY or ORDER BY, and the optimizer decides to build a tmp table, then * may lead to several inefficiencies: having to use MyISAM instead of MEMORY; the tmp table will be bulkier; etc.
EXISTS (SELECT * FROM ...) comes back with 0 or 1 -- even simpler.
You may be able to combine EXISTS (or a suitable equivalent JOIN) to other queries, thereby avoiding an extra roundtrip to the server.
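For an existence check against the users_table from the question, that pattern might look like this (the name value is just an example):
SELECT EXISTS(
    SELECT 1 FROM users_table WHERE name = 'alice'
) AS user_exists;   -- returns 1 or 0; the engine can stop at the first matching row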
I realized that the response to a MySQL query becomes much faster when you create an index for the column you use in "ORDER BY", e.g.
SELECT username FROM table ORDER BY registration_date DESC
Now I'm wondering which indices I should create to optimize the request time.
For example I frequently use the following queries:
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
&& status='active'
SELECT username FROM table WHERE
status='active'
SELECT username FROM table ORDER BY registration_date DESC
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
&& status='active'
ORDER BY birth_date DESC
Question 1:
Should I set up separate indices for the first three request types? (i.e. one index for the column "registration_date", one index for the column "status", and another index for the combination of both?)
Question 2:
Are different indices independently used for "WHERE" and for "ORDER BY"? Say, I have a combined index for the columns "status" and "registration_date", and another index only for the column "birth_date". Should I setup another combined index for the three columns ("status", "registration_date" and "birth_date")?
There are no hard-and-fast rules for indices or query optimization. Each case needs to be considered and examined.
Generally speaking, however, you can and should add indices to columns that you frequently sort by or use in WHERE statements. (Answer to Question 2 -- No, the same indices are potentially used for ORDER BY and WHERE) Whether to do a multi-column index or a single-column one depends on the frequency of queries. Also, you should note that single-column indices may be combined by mySQL using the Index Merge Optimization:
The Index Merge method is used to retrieve rows with several range
scans and to merge their results into one. The merge can produce
unions, intersections, or unions-of-intersections of its underlying
scans. This access method merges index scans from a single table; it
does not merge scans across multiple tables.
(more reading: http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html)
Multi-column indices also require that you take care to structure your queries in such a way that your use of indexed columns matches the column order in the index:
MySQL cannot use an index if the columns do not form a leftmost
prefix of the index. Suppose that you have the SELECT statements shown
here:
SELECT * FROM tbl_name WHERE col1=val1;
SELECT * FROM tbl_name WHERE col1=val1 AND col2=val2;
SELECT * FROM tbl_name WHERE col2=val2;
SELECT * FROM tbl_name WHERE col2=val2 AND col3=val3;
If an index exists on (col1, col2, col3), only the first two queries
use the index. The third and fourth queries do involve indexed
columns, but (col2) and (col2, col3) are not leftmost prefixes of
(col1, col2, col3).
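Applied to the example queries above, a single composite index like this sketch (column names taken from the question) would be usable for the status-only filter and for the status + registration_date filter, but not for a filter on registration_date alone:
ALTER TABLE `table` ADD INDEX idx_status_regdate (status, registration_date);
A separate single-column index on registration_date would still be needed to help the date-only filter and the ORDER BY registration_date query.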
Bear in mind that indices DO have a performance consideration of their own -- it is possible to "over-index" a table. Each time a record is inserted or an indexed column is modified, the index/indices will have to be rebuilt. This does demand resources, and depending on the size and structure of your table, it may cause a decrease in responsiveness while the index building operations are active.
Use EXPLAIN to find out exactly what is happening in your queries. Analyze, experiment, and don't over-do it. The shotgun approach is not appropriate for database optimization.
Documentation
MySQL EXPLAIN - http://dev.mysql.com/doc/refman/5.0/en/explain.html
How MySQL uses indices - http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
Index Merge Optimization - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
To quote this page:
[Indices] will slow down your updates and inserts.
That's the tradeoff you have to calculate. To optimize your table, you should put indices only in the columns you are most likely to apply conditions to - the more indices you have, the slower your data-changing operations become. In that sense, I personally don't see much merit in creating combined indices - if you create all 7 possible permutations of indices for 3 columns, you are most definitely putting more drag on your updates and inserts than just using 3 indices for 3 columns (and even that can be debatable). On the other hand, if the data is being edited much, much less than it is being SELECTed, then indices can really help you speed things up.
Something else to take into consideration (again quoting the above page):
If your table is very small [...] it's worse to use an index than to leave it out and just let it do a table scan. Indexes really only come in handy with tables that have a lot of rows.
Yes, it is a good idea to have indexes on the columns you often use, both in ORDER BY and in your WHERE clauses.
But be aware: UPDATEs, INSERTs and DELETEs slow down if you have indexes.
That is because after such an operation, the index must be updated too.
So, as a rule-of-thumb: If your application is read-intensive, use the indexes where you think they help.
If your application is often updating the data, be careful, because that may get slow because of the indexes.
When in doubt, simply get your hands dirty and study the output of EXPLAIN.
http://dev.mysql.com/doc/refman/5.6/en/explain.html
As for the first two examples, you can satisfy them with one index: {registration_date, status}. Such an index can support filters on the first item (registration_date) or on both.
It does not work for status alone, however. The question for status is how selective it is -- that is, what proportion of records have status = "active". If the proportion is high (so that, on average, every database page contains an active record), an index may not help very much.
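A quick way to gauge that selectivity (table and column names as given in the question):
SELECT status, COUNT(*) AS rows_per_status
FROM `table`
GROUP BY status;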
The ORDER BYs are trickier. I don't know whether MySQL uses indexes for this purpose. Often, using an index to sort entire records is less efficient than just sorting the records: using the index causes a random access pattern to the records in the pages, which can cause major performance problems for tables larger than the page cache.
Use the explain function on your select statements to determine where your joins are slowing down (the more rows that are referenced, the slower it will be). Then apply your indices to those columns.
EXPLAIN SELECT * FROM table1 JOIN table2 ON a = b WHERE conditions;
I'm building a webpage in PHP using MySQL as my database.
Which way is faster?
Two requests to MySQL, with the following queries:
SELECT points FROM data;
SELECT sum(points) FROM data;
One request to MySQL. Hold the result in a temporary array and calculate the sum in PHP.
$data = SELECT points FROM data;
EDIT -- the data is about 200-500 rows
It's really going to depend on a lot of different factors. I would recommend trying both methods and seeing which one is faster.
Since Phill and Kibbee have answered this pretty effectively, I'd like to point out that premature optimization is a Bad Thing (TM). Write what's simplest for you and profile, profile, profile.
How much data are we talking about? I'd say MySQL is probably faster at doing those kind of operations in the majority of cases.
Edit: with the kind of data that you're talking about, it probably won't make masses of difference. But databases tend to be optimised for those kind of queries, whereas PHP isn't. I think the second DB query is probably worth it.
If you want to do it in one line, use a running total like this:
SET @total = 0;
SELECT points, @total := @total + points AS RunningTotal FROM data;
I wouldn't worry about it until I had an issue with performance.
If you go with two separate queries, you need to watch out for the possibility of the data changing between getting the rows & getting their sum. Until there's an observable performance problem, I'd stick to doing my own summation to keep the page consistent.
The general rule of thumb for efficiency with mySQL is to try to minimize the number of SQL requests. Every call to the database adds overhead and is "expensive" in terms of time required.
The optimization done by mySQL is quite good. It can take very complex requests with many joins, nestings and computations, and make it run efficiently.
But it can only optimize individual requests. It cannot check the relationship between two different SQL statements and optimize between them.
In your example 1, the two statements will make two requests to the database and the table will be scanned twice.
Your example 2 where you save the result and compute the sum yourself would be faster than 1. This would only be one database call, and looping through the data in PHP to get the sum is faster than a second call to the database.
Just for the fun of it.
SELECT COUNT(points) FROM `data`
UNION
SELECT points FROM `data`
The first row will be the total, the next rows will be the data.
NOTE: UNION can be slow, but it's an option.
You could also have more fun with it; this version supports sorting the rows as well.
SELECT 'total' AS name, COUNT(points) AS points FROM `data`
UNION ALL
SELECT 'points' AS name, points FROM `data`
Then loop through the results in PHP:
while($row = mysql_fetch_assoc($query))
{
if($row["name"] == "points")
{
echo $row["points"];
}
if($row["name"] == "total")
{
echo "Total is: ".$row["points"];
}
}
You can use union like this:
(select points, null as total from data) union all (select null, sum(points) from data);
The result will look something like this:
points   total
2        null
5        null
...
null     7
You can figure out how to handle it from there.
Do it the MySQL way: let the database manager do its work.
MySQL is optimized for such tasks.
I have a page that is taking 37 seconds to load. While it is loading it pegs MySQL's CPU usage through the roof. I did not write the code for this page and it is rather convoluted so the reason for the bottleneck is not readily apparent to me.
I profiled it (using kcachegrind) and find that the bulk of the time on the page is spent doing MySQL queries (90% of the time is spent in 25 different mysql_query calls).
The queries take the form of the following with the tag_id changing on each of the 25 different calls:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
Each query is taking around 0.8 seconds to complete with a few longer delays thrown in for good measure... thus the 37 seconds to completely load the page.
My question is, is it the way the query is formatted with that nested select that is causing the problem? Or could it be any one of a million other things? Any advice on how to approach tackling this slowness is appreciated.
Running EXPLAIN on the query gives me this (but I'm not clear on the impact of these results... the NULL on primary key looks like it would be bad, yes? The number of results returned seems high to me as well as only a handful of results are returned in the end):
id  select_type         table              type  possible_keys      key                key_len  ref    rows  Extra
1   PRIMARY             tbl_news           ALL   NULL               NULL               NULL     NULL   1318  Using where
2   DEPENDENT SUBQUERY  tbl_tag_relations  ref   FK_tbl_tag_tags_1  FK_tbl_tag_tags_1  4        const  179   Using where
I've addressed this point in Database Development Mistakes Made by Application Developers. Basically, favour joins over aggregation. IN isn't aggregation as such, but the same principle applies. A good optimizer will make these two queries equivalent in performance:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
and
SELECT tn.*
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id = 20
as I believe Oracle and SQL Server both do but MySQL doesn't. The second version is basically instantaneous. With hundreds of thousands of rows I did a test on my machine and got the first version to sub-second performance by adding appropriate indexes. The join version with indexes is basically instantaneous but even without indexes performs OK.
By the way, the above syntax I use is the one you should prefer for doing joins. It's clearer than putting them in the WHERE clause (as others have suggested) and the above can do certain things in an ANSI SQL way with left outer joins that WHERE conditions can't.
So I would add indexes on the following:
tbl_news (news_id)
tbl_tag_relations (news_id)
tbl_tag_relations (tag_id)
and the query will execute almost instantaneously.
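In DDL form that would be something like the following sketch (index names are illustrative; if news_id is already the primary key of tbl_news, the first one is unnecessary):
ALTER TABLE tbl_news ADD INDEX idx_news_id (news_id);
ALTER TABLE tbl_tag_relations ADD INDEX idx_ttr_news_id (news_id);
ALTER TABLE tbl_tag_relations ADD INDEX idx_ttr_tag_id (tag_id);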
Lastly, don't use * to select all the columns you want. Name them explicitly. You'll get into less trouble as you add columns later.
The SQL Query itself is definitely your bottleneck. The query has a sub-query in it, which is the IN(...) portion of the code. This is essentially running two queries at once. You can likely halve (or more!) your SQL times with a JOIN (similar to what d03boy mentions above) or a more targeted SQL query. An example might be:
SELECT *
FROM tbl_news, tbl_tag_relations
WHERE tbl_tag_relations.tag_id = 20 AND
tbl_news.news_id = tbl_tag_relations.news_id
To help SQL run faster you also want to avoid using SELECT * and select only the information you need; also put a limiting statement at the end, e.g.:
SELECT news_title, news_body
...
LIMIT 5;
You also will want to look into the database schema itself. Make sure you are indexing all of the commonly referred to columns so that the queries will run faster. In this case, you probably want to check your news_id and tag_id fields.
Finally, you will want to take a look at the PHP code and see if you can make one single all-encompassing SQL query instead of iterating through several separate queries. If you post more code we can help with that, and it will probably be the single greatest time savings for your posted problem. :)
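For example, assuming the 25 tag ids are known up front, a single query along these lines could replace the 25 separate calls (the id list and column names are illustrative):
SELECT DISTINCT tn.news_id, tn.news_title, tn.news_body
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id IN (20, 21, 22);   -- list all 25 tag ids here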
If I understand correctly, this is just listing the news stories for a specific set of tags.
First of all, you really shouldn't ever SELECT *.
Second, this can probably be accomplished within a single query, thus reducing the overhead cost of multiple queries. It seems like it is getting fairly trivial data, so it could be retrieved within a single call instead of 20.
A better approach than using IN might be a JOIN with a WHERE condition instead. An IN is basically expanded into a long list of OR comparisons.
Your tbl_tag_relations should definitely have an index on tag_id
select *
from tbl_news, tbl_tag_relations
where
tbl_tag_relations.tag_id = 20 and
tbl_news.news_id = tbl_tag_relations.news_id
limit 20
I think this gives the same results, but I'm not 100% sure. Sometimes simply limiting the results helps.
Unfortunately MySQL doesn't do very well with uncorrelated subqueries like your case shows. The plan is basically saying that for every row in the outer query, the inner query will be performed. This will get out of hand quickly. Rewriting as a plain old join as others have mentioned will work around the problem but may then cause the undesired effect of duplicate rows.
For instance the original query would return 1 row for each qualifying row in the tbl_news table but this query:
SELECT n.news_id, n.name, n.blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22)
would return 1 row for each matching tag. You could stick DISTINCT on there which should only have a minimal performance impact depending on the size of the dataset.
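With DISTINCT added, that query would read:
SELECT DISTINCT n.news_id, n.name, n.blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22);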
Not to troll too badly, but most other databases (PostgreSQL, Firebird, Microsoft, Oracle, DB2, etc) would handle the original query as an efficient semi-join. Personally I find the subquery syntax to be much more readable and easier to write, especially for larger queries.