Working in Drupal 6, PHP 5.3, and MySQL, I'm building a query that looks roughly like this:
SELECT val from table [and some other tables joined in below]
where [a bunch of clauses, including getting all the tables joined up]
and ('foo' not in (select ...))
and (('bar' in (select...) and x = y)
or ('baz' in (select ...) and p = q))
That's not a great representation of what I'm trying to do, but hopefully it's enough. The point is that, in the middle of the query, there is an embedded SELECT that is used a number of times. It's always the same, but it's not completely self-contained -- it relies on a value pulled from one of the tables at the top level of the query.
I'm feeling a little guilty/unclean for just repeating the query every time it's needed, but I don't see any other way to compute the value once and reuse it as needed. Since it refers to the value from a top level table, I can't compute it once outside the query and just insert the value into the query, either through a MySQL variable or by monkeying around with the query string. Or, so I think, anyway.
Is there anything I can do about this? Or maybe it's a non-issue from a performance perspective: the code might be nasty, but perhaps MySQL is smart enough to cache the value itself and avoid executing the subquery over and over again? Any advice? Thanks!
You should be able to alias the result by doing SELECT ... AS alias, and then use the alias in the other parts of the query, since a SELECT is really just a table.
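For example, here is a hedged sketch of that idea with entirely made-up names (main_table, tag_table, corr_key), since the real query isn't shown: each repeated IN (SELECT ...) test becomes an existence flag computed once in a derived table keyed on the correlating column, and the outer query joins to it and reuses the flags.
SELECT t.val
FROM main_table t
JOIN (
    SELECT corr_key,
           MAX(tag = 'foo') AS has_foo,  -- MAX over a boolean gives 0/1
           MAX(tag = 'bar') AS has_bar,
           MAX(tag = 'baz') AS has_baz
    FROM tag_table
    GROUP BY corr_key
) flags ON flags.corr_key = t.corr_key
WHERE flags.has_foo = 0                     -- 'foo' NOT IN (SELECT ...)
  AND ((flags.has_bar = 1 AND t.x = t.y)    -- 'bar' IN (SELECT ...) AND x = y
    OR (flags.has_baz = 1 AND t.p = t.q));  -- 'baz' IN (SELECT ...) AND p = q
This is a sketch, not a drop-in rewrite: the inner SELECT now runs once instead of once per reference.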
Related
I am working on converting a prototype web application into something that can be deployed. There are some places where the prototype has queries that select all the fields from a table although only one field is needed, or where the query is just being used to check that a record exists. Most of the cases are single-row queries.
I'm considering changing these queries to queries that only get what is really relevant, i.e.:
select * from users_table where <some condition>
vs
select name from users_table where <some condition>
I have a few questions:
Is this a worthy optimization in general?
In which kind of queries might this change be particularly good? For example, would this improve queries where joins are involved?
Besides the SQL impact, would this change be good at the PHP level? For example, the returned array will be smaller (a single column vs multiple columns with data).
Thanks for your comments.
If I were to answer all of your three questions in a single word, I would definitely say YES.
You probably wanted more than just "Yes"...
SELECT * is "bad practice": if you read the results into a non-associative PHP array and later add a column to the table, the array subscripts may all shift.
If the WHERE is complex enough, or you have GROUP BY or ORDER BY, and the optimizer decides to build a temporary table, then * may lead to several inefficiencies: the temp table may have to be MyISAM on disk instead of MEMORY (for instance when a TEXT or BLOB column gets dragged along); the temp table will be bulkier; etc.
EXISTS (SELECT * FROM ...) comes back with just 0 or 1 -- even simpler.
You may be able to combine an EXISTS (or a suitable equivalent JOIN) into other queries, thereby avoiding an extra round trip to the server.
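For instance, a minimal sketch of that shape (the column and the literal are made up; users_table is from the question):
SELECT EXISTS (SELECT 1 FROM users_table WHERE name = 'alice') AS user_exists;
It's run standalone here, but the same EXISTS (...) expression can be embedded in a larger query's SELECT list or WHERE clause to save the extra round trip.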
I have searched but can't find an answer that suits the exact needs of this MySQL query.
I have the following queries on multiple tables to generate "stats" for an application:
SELECT COUNT(id) as count FROM `mod_**` WHERE `published`='1';
SELECT COUNT(id) as count FROM `mod_***` WHERE `published`='1';
SELECT COUNT(id) as count FROM `mod_****`;
SELECT COUNT(id) as count FROM `mod_*****`;
Pretty simple: it just counts the rows, sometimes based on a status.
However, in the pursuit of performance I would love to get this into one query to save resources.
I'm using PHP to fetch this data with a simple mysql_fetch_assoc and retrieving $res['count'], if it makes a difference (PDO isn't guaranteed, so plain old mysql here).
The overhead of sending a query and getting a single-row response is very small.
There is nothing to gain here by combining the queries.
If you don't have indexes yet, an INDEX on the published column will greatly speed up the first two queries.
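For example (the real table names are masked in the question, so mod_a stands in here):
ALTER TABLE mod_a ADD INDEX idx_published (published);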
You can use something like
SELECT SUM(published=1)
for some of that. MySQL will take the boolean result of published=1 and translate it to an integer 0 or 1, which can be summed up.
But it looks like you're dealing with MULTIPLE tables (if that's what the **, *** etc... are), in which case you can't really. You could use a UNION query, e.g.:
SELECT ...
UNION ALL
SELECT ...
UNION ALL
SELECT ...
etc...
That can be fired off as one single query to the DB, but it'll still execute each sub-query as its own query, and simply aggregate the individual result sets into one larger set.
Disagreeing with #Halcyon, I think there is an appreciable difference, especially if the MySQL server is on a different machine, as every single query uses at least one network packet.
I recommend you UNION the queries with a marker field to protect against the unexpected.
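A sketch of that, built from the four counts in the question (the real table names are masked there, so mod_a through mod_d stand in; the first column is the marker field):
SELECT 'a_published' AS stat, COUNT(id) AS count FROM mod_a WHERE published = '1'
UNION ALL
SELECT 'b_published', COUNT(id) FROM mod_b WHERE published = '1'
UNION ALL
SELECT 'c_total', COUNT(id) FROM mod_c
UNION ALL
SELECT 'd_total', COUNT(id) FROM mod_d;
PHP then reads four labelled rows from one result set instead of issuing four separate queries.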
As #Halcyon said, there is not much to gain here. You can in any case UNION the queries to get all the results back in one round trip.
I have a quite interesting task, but I don't know what to call it in one word in order to search for related topics; even this topic title might not reflect what I need. So, if somebody has a better title, suggestions are welcome.
I'll try to explain my problem.
I have about 100,000 rows in a MySQL table, and I need to "compare" entries from that table.
"Compare" doesn't mean just testing for equality. There is an algorithm for calculating a comparison level: I have a weight coefficient for each table column, meaning that if entry #1's column1 equals entry #2's column1, I give, say, 5 points to the pair. And so on for each column.
The most straightforward way to do this is to apply the calculation rules to every pair of entries. Why am I afraid of this? 100,000 entries means about 5 billion "compare" operations. For sure, I can calculate this on demand and store the results somewhere in a cache, but I believe the most obvious way is not the most effective one.
So, my first question is: is there any better way to achieve my goal other than brute force?
My second question is about which tool is better for the calculations:
1. The application language is PHP, so load the whole table into memory and iterate over the data.
2. Create a stored procedure in MySQL.
3. Use MongoDB's aggregation framework or MapReduce.
I like the first option least of all, and the last most of all.
I'm looking for any suggestion or advice from people who have experience in such sort of cases.
Since I don't know how to ask Google for help, any links will be appreciated.
UPDATE:
The calculation rules are a bit more complicated than I described...
The table has a set of related columns which are to be used together as a group (not one by one).
Let's assume:
the table has fields, say, tag_1, tag_2, ..., tag_n;
row_1 and row_2 are entries in the table.
The rule(pseudo-code):
if (row_1.tag_1 == row_2.tag_1)
{
    // gives 10 points
}
elseif (row_1.tag_1 is in row_2.tags && row_1.tag_1 != row_2.tag_1)
{
    // gives 5 points
}
// ... and so on for the remaining tag columns
Basically, I need to find the intersection of the two rows' tag arrays. If it is not empty, points are given; if the indexes of the matching tags also coincide, additional points are given.
I'm wondering how this can be accomplished in MySQL's stored procedure language, because it can be done pretty easily in any general-purpose programming language.
If a stored procedure can do this, then it is my choice.
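To make the rule concrete, here is a sketch of how the scoring could be done set-wise in plain SQL if the tags were stored in a child table -- a hypothetical layout, just for illustration, not my actual schema:
-- hypothetical layout: entry_tags(entry_id, pos, tag), one row per tag slot
SELECT a.entry_id AS id_1,
       b.entry_id AS id_2,
       SUM(CASE WHEN a.pos = b.pos THEN 10 ELSE 5 END) AS score
FROM entry_tags a
JOIN entry_tags b
  ON b.tag = a.tag
 AND b.entry_id > a.entry_id  -- score each unordered pair once
GROUP BY a.entry_id, b.entry_id;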
If you have a static table, then it doesn't make a difference which you choose, so long as you store the results somewhere (presumably back in the database).
If your data is changing, then you need to compare each new row to all rows, which is essentially a full-table scan. This is probably best done in a database.
If the data fits into memory (and 500,000 rows should fit into memory), then (2) will probably be faster than (3) on equivalent hardware. "Equivalent hardware" is a very important consideration.
In most cases, I would opt for (2). It sounds like the query is something like:
select t.id, t2.id,
       ((case when t.col1 = t2.col1 then 5 else 0 end) +
        (case when t.col2 = t2.col2 then 7 else 0 end) +
        . . .
       ) as score
from t cross join t t2
If you are much more comfortable with map-reduce, then you might find it easier to code there. I know both languages and prefer SQL for something like this.
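One caveat on the sketch above: a bare cross join scores every ordered pair, so each couple is computed twice and every row is also compared to itself. A filter such as this (same made-up table and column names) cuts the work roughly in half:
select t.id, t2.id,
       ((case when t.col1 = t2.col1 then 5 else 0 end) +
        (case when t.col2 = t2.col2 then 7 else 0 end)
       ) as score
from t cross join t t2
where t.id < t2.id  -- each unordered pair once, no self-comparison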
Can't you do something like this:
UPDATE table SET points = points+5 WHERE column1 = column2
If you have to check for a specific value, you could try something like this:
UPDATE table SET points = points+5 WHERE column1 = 'somevalue' AND column2 = 'somevalue'
The query I'd like to speed up (or replace with another process):
UPDATE en_pages, keywords
SET en_pages.keyword = keywords.keyword
WHERE en_pages.keyword_id = keywords.id
Table en_pages has the proper structure, but only has non-unique page_ids and keyword_ids in it. I'm trying to add the actual keywords (strings) to this table where they match keyword_ids. There are 25 million rows in en_pages that need updating.
I'm adding the keywords so that this one table can be queried in real time and return keywords (the join is obviously too slow for "real time").
We apply this query (and some others) to sub-units of our larger dataset. We do this frequently to create custom interfaces for specific sub-units of our data for different user groups (sorry if that's confusing).
This all works fine if you give it an hour to run, but I'm trying to speed it up.
Is there a better way to do this that would be faster using PHP and/or MySQL?
I actually don't think you can speed up the process itself.
You can still add raw power to your database by clustering in new servers.
Maybe I'm wrong or misunderstood the question, but...
Couldn't you use TRIGGERS?
Like... when a new INSERT is detected on en_pages, do an UPDATE on that same row right after?
(I don't know how frequent INSERTs are in that table.)
This is just an idea.
How often do "en_pages.keyword" and "en_pages.keyword_id" change after being inserted?
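If triggers fit your write path, a minimal sketch could look like this (the trigger name is invented; the tables and columns are the ones from the question):
CREATE TRIGGER en_pages_fill_keyword
BEFORE INSERT ON en_pages
FOR EACH ROW
SET NEW.keyword = (SELECT keyword FROM keywords WHERE id = NEW.keyword_id);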
I don't know about MySQL, but this sort of thing usually runs faster in SQL Server if you process a limited batch of records (say 1,000) at a time in a loop.
You might also consider a WHERE clause (I don't know what MySQL uses for "not equal to", so I used the SQL Server version):
WHERE en_pages.keyword <> keywords.keyword
That way you are only updating the records that have a difference in the field you are updating, not all of them.
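In MySQL terms, one pass of that batch loop might look like the following (the page_id range bounds are illustrative, page_id being the non-unique id mentioned in the question; the IS NULL check is there because <> never matches a freshly added, still-NULL column):
UPDATE en_pages
JOIN keywords ON en_pages.keyword_id = keywords.id
SET en_pages.keyword = keywords.keyword
WHERE en_pages.page_id BETWEEN 1 AND 1000000
  AND (en_pages.keyword IS NULL OR en_pages.keyword <> keywords.keyword);
-- then repeat with the next page_id range until the table is covered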
I have a page that is taking 37 seconds to load. While it is loading it pegs MySQL's CPU usage through the roof. I did not write the code for this page and it is rather convoluted so the reason for the bottleneck is not readily apparent to me.
I profiled it (using kcachegrind) and find that the bulk of the time on the page is spent doing MySQL queries (90% of the time is spent in 25 different mysql_query calls).
The queries take the form of the following with the tag_id changing on each of the 25 different calls:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
Each query is taking around 0.8 seconds to complete with a few longer delays thrown in for good measure... thus the 37 seconds to completely load the page.
My question is, is it the way the query is formatted with that nested select that is causing the problem? Or could it be any one of a million other things? Any advice on how to approach tackling this slowness is appreciated.
Running EXPLAIN on the query gives me this (but I'm not clear on the impact of these results... the NULL on the primary key looks like it would be bad, yes? The number of rows examined seems high to me as well, since only a handful of results are returned in the end):
id  select_type         table              type  possible_keys      key                key_len  ref    rows  Extra
1   PRIMARY             tbl_news           ALL   NULL               NULL               NULL     NULL   1318  Using where
2   DEPENDENT SUBQUERY  tbl_tag_relations  ref   FK_tbl_tag_tags_1  FK_tbl_tag_tags_1  4        const  179   Using where
I've addressed this point in Database Development Mistakes Made by App Developers. Basically, favour joins over aggregation. IN isn't aggregation as such, but the same principle applies. A good optimizer will make these two queries equivalent in performance:
SELECT * FROM tbl_news WHERE news_id
IN (select news_id from
tbl_tag_relations WHERE tag_id = 20)
and
SELECT tn.*
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id = 20
as I believe Oracle and SQL Server both do, but MySQL doesn't. The second version is basically instantaneous. With hundreds of thousands of rows, I did a test on my machine and got the first version to sub-second performance by adding appropriate indexes. The join version with indexes is basically instantaneous, and even without indexes it performs OK.
By the way, the syntax I use above is the one you should prefer for doing joins. It's clearer than putting the join conditions in the WHERE clause (as others have suggested), and ANSI join syntax can do certain things, such as left outer joins, that WHERE conditions can't.
So I would add indexes on the following:
tbl_news (news_id)
tbl_tag_relations (news_id)
tbl_tag_relations (tag_id)
and the query will execute almost instantaneously.
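In DDL form, that would be something like this (the index names are invented; skip any column already covered by a primary or foreign key):
CREATE INDEX idx_tn_news_id ON tbl_news (news_id);
CREATE INDEX idx_ttr_news_id ON tbl_tag_relations (news_id);
CREATE INDEX idx_ttr_tag_id ON tbl_tag_relations (tag_id);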
Lastly, don't use * to select all the columns you want. Name them explicitly. You'll get into less trouble as you add columns later.
The SQL query itself is definitely your bottleneck. The query has a sub-query in it, which is the IN (...) portion of the code; this is essentially running two queries at once. You can likely halve (or more!) your SQL times with a JOIN (similar to what d03boy mentions above) or a more targeted SQL query. An example might be:
SELECT *
FROM tbl_news, tbl_tag_relations
WHERE tbl_tag_relations.tag_id = 20 AND
tbl_news.news_id = tbl_tag_relations.news_id
To help SQL run faster, you also want to avoid SELECT * and only select the information you need; also, put a limiting statement at the end, e.g.:
SELECT news_title, news_body
...
LIMIT 5;
You also will want to look into the database schema itself. Make sure you are indexing all of the commonly referred to columns so that the queries will run faster. In this case, you probably want to check your news_id and tag_id fields.
Finally, you will want to take a look at the PHP code and see if you can make one single all-encompassing SQL query instead of iterating through several separate queries. If you post more code we can help with that, and it will probably be the single greatest time savings for your posted problem. :)
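For instance, a single query along these lines (the tag ids beyond 20 are placeholders, and the column names are borrowed from the snippet above) fetches the stories for every tag at once, and the PHP side can then group the rows by tag_id:
SELECT ttr.tag_id, tn.news_title, tn.news_body
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id IN (20, 21, 22)  -- all 25 tag ids would go here
ORDER BY ttr.tag_id;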
If I understand correctly, this is just listing the news stories for a specific set of tags.
First of all, you really shouldn't ever SELECT *.
Second, this can probably be accomplished within a single query, thus reducing the overhead cost of multiple queries. It seems like it is getting fairly trivial data, so it could be retrieved within a single call instead of 20.
A better approach than using IN might be a JOIN with a WHERE condition instead; an IN basically expands into a lot of OR statements.
Your tbl_tag_relations should definitely have an index on tag_id.
select *
from tbl_news, tbl_tag_relations
where
tbl_tag_relations.tag_id = 20 and
tbl_news.news_id = tbl_tag_relations.news_id
limit 20
I think this gives the same results, but I'm not 100% sure. Sometimes simply limiting the results helps.
Unfortunately MySQL doesn't do very well with uncorrelated subqueries like the one your case shows. The plan is basically saying that for every row of the outer query, the inner query will be performed. This gets out of hand quickly. Rewriting it as a plain old join, as others have mentioned, will work around the problem, but may then cause the undesired effect of duplicate rows.
For instance the original query would return 1 row for each qualifying row in the tbl_news table but this query:
SELECT news_id, name, blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22)
would return 1 row for each matching tag. You could stick a DISTINCT on there, which should have only a minimal performance impact depending on the size of the dataset.
Not to troll too badly, but most other databases (PostgreSQL, Firebird, Microsoft, Oracle, DB2, etc) would handle the original query as an efficient semi-join. Personally I find the subquery syntax to be much more readable and easier to write, especially for larger queries.