The creator of NotORM shows that it is possible to decompose a join of 3 tables into 3 faster queries: http://www.notorm.com/#performance.
Do you think it is possible to avoid joins and use multiple queries by putting the IDs in an IN clause?
That library (NotORM) does not support joins for the above-mentioned reason. Do you think I could give up using joins and just use that library? It seems strange to me that it is so easy to avoid joins.
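To make the question concrete, here is a minimal sketch of the two approaches. It is written in Python with the standard-library sqlite3 module standing in for PHP and MySQL, and the `author`/`article` schema and both function names are made up for illustration; it is not how NotORM itself is implemented.

```python
import sqlite3

# Hypothetical schema for illustration: authors and their articles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO article VALUES (10, 1, 'Joins'), (11, 2, 'Indexes'), (12, 1, 'Caching');
""")

def titles_with_join(conn):
    """One round trip: the database matches articles to authors."""
    rows = conn.execute(
        "SELECT article.title, author.name FROM article "
        "JOIN author ON author.id = article.author_id ORDER BY article.id"
    )
    return rows.fetchall()

def titles_with_in(conn):
    """Two round trips: fetch articles, then fetch their authors with IN (...)."""
    articles = conn.execute(
        "SELECT title, author_id FROM article ORDER BY id"
    ).fetchall()
    author_ids = sorted({author_id for _, author_id in articles})
    if not author_ids:
        return []
    placeholders = ",".join("?" * len(author_ids))  # one bound parameter per ID
    names = dict(conn.execute(
        f"SELECT id, name FROM author WHERE id IN ({placeholders})", author_ids
    ))
    # The matching that the JOIN did is now done in application code.
    return [(title, names[author_id]) for title, author_id in articles]

join_rows = titles_with_join(conn)
in_rows = titles_with_in(conn)
```

Both functions return the same rows here. The second one binds one parameter per distinct ID, so the size of the IN list grows with the data.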
For a relatively small-scale application, this can be feasible. But in situations where you might be passing many arguments (that is, your join condition matches many rows), this will fail you. The best example I can think of is a limitation in Informix (I don't know if this is true of the latest versions), where prepared statements did not allow more than 20 or so arguments (the figure is not exact) to be passed.
One reason to use joins over individual queries is that it allows the database's optimizer to come up with the best plan for the tables involved, based on the existing data. If you break a joined query out into individual queries, you're effectively hard-coding the execution plan. Among other things, that assumes that you're going to spend some time thinking about the selectivity of your queries when you write your code and that the selectivity will remain the same for the lifetime of your application. Those are pretty big assumptions, in my opinion.
Also, if you're using your database as more than a dumb receptacle, you'll likely find that there are queries that don't break down into individual queries quite so easily (e.g. just about anytime you use aggregate functions).
Related
I'm building a PHP app which will get a lot of traffic. I have always learned that I should limit the number of SQL queries (right now it's about 15 per visitor). I could limit this to about half, using PHP to filter and sort data, but would this be faster? Alternatively, I could use "JOIN" in the query to reduce the number of queries, but is this really faster than executing multiple queries? Thanks!
If you have 15 queries per visitor, you most likely did something wrong, unless your application is pretty big.
Using PHP to sort and filter data instead of MySQL
Doing the sorting and filtering in PHP will not make your application faster; it will make it slower for sure. MySQL is optimized to be very fast given the right indexes, so you should certainly use it when you can.
Learn about database indexing and it will make your life easier and increase your application's performance. Here is a link:
http://www.mysqltutorial.org/mysql-create-drop-index.aspx
Joining versus multiple queries
You ask if using joins would be faster than executing multiple queries.
The answer is yes, it will almost always be faster to use joins given the right indexes. Which also comes back to the first topic, indexing.
However, keep in mind that a join is only as efficient as you make it. It will be extremely efficient if done right and extremely slow if done wrong, so you must know the basics to do it right.
You can look here for an introduction to joins:
http://blog.sqlauthority.com/2009/04/13/sql-server-introduction-to-joins-basic-of-joins/
I am writing backend code in PHP where there are client requests and I have to show them a collection of users based on priority amongst many thousand users.
My problem is this: the queries are heavy, requiring a lot of inner joins and comparisons, which will make the code very slow, and the response will take a long time.
Can you suggest a method by which I can accumulate all the users, perform all queries and calculations without affecting the request and response times?
Queries with a lot of JOINs and conditions will have an effect on performance, but it is minimal as long as you have a good database design (database normalization). What mostly affects the performance of your code is the size of the tables (number of records) used in the query, but there are still indexes (learn how to use indexes) and other tools that will help in optimizing your query.
I personally suggest that you use stored procedures for your queries, which adds encapsulation and, depending on the implementation, helps to avoid SQL injection.
You can try a technique called materialized views, i.e. precomputing aggregates into separate tables and using those aggregates to speed up your online query. On the other hand, if the data your computation is based on changes often, this technique won't be of much help.
Look here for some more details on the subject: http://www.fromdual.com/mysql-materialized-views
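The precomputation idea can be sketched as a summary table that is refreshed offline and read online. This is a minimal illustration in Python with sqlite3 standing in for MySQL; the `orders` and `order_totals` tables and both function names are assumptions, not taken from the linked article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    -- Summary table playing the role of the materialized view.
    CREATE TABLE order_totals (customer_id INTEGER PRIMARY KEY, n INTEGER, total REAL);
    INSERT INTO orders (customer_id, amount) VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

def refresh_order_totals(conn):
    """Recompute the aggregates offline (for example from a scheduled job)."""
    with conn:  # one transaction: the delete and the re-insert commit together
        conn.execute("DELETE FROM order_totals")
        conn.execute(
            "INSERT INTO order_totals "
            "SELECT customer_id, COUNT(*), SUM(amount) FROM orders GROUP BY customer_id"
        )

def total_for(conn, customer_id):
    """The online query reads one precomputed row instead of aggregating."""
    return conn.execute(
        "SELECT n, total FROM order_totals WHERE customer_id = ?", (customer_id,)
    ).fetchone()

refresh_order_totals(conn)
totals_customer_1 = total_for(conn, 1)
```

The trade-off mentioned above shows up directly: rows inserted into `orders` after a refresh are not reflected in `order_totals` until the next refresh runs.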
Use MySQL views to avoid complex queries and long run times!
In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, I could just do a simple join to pull in the data, such as:
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
$query = mysql_query("SELECT sub_id FROM table_1");
while ($rs = mysql_fetch_assoc($query)) {
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    //blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records on different tables, and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in case a poorly executed query locked up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report, and if they go crazy and build a big report, it could cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd throw up an update to this one and what has worked more long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive result numbers, but I thought I'd share what the result was.
I think I went a little overkill, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, join a value to a primary key, so they all run really, really fast. Previously, if the report had, let's say, 30 columns of data to pull and it pulled 2000 records, every single field ran its own query to fetch the data (because that piece of data could be on a different field). 30 x 2000 = 60000, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I have rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record returned runs between 0 and 4 extra queries depending on the data that's needed (it may not need any if it can fetch everything in the joins, which happens 75% of the time). So the same 2000 records would run an additional 0-8000 queries (much better than 60000).
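The difference in query counts can be reproduced on a small scale. The sketch below is in Python with sqlite3 standing in for PHP and MySQL; it reuses the `table_1`/`table_2` names from the question, while the `label` column, the row counts, and the function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_2 (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table_1 (id INTEGER PRIMARY KEY, sub_id INTEGER);
""")
conn.executemany("INSERT INTO table_2 VALUES (?, ?)",
                 [(i, f"label-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO table_1 (sub_id) VALUES (?)",
                 [(1 + i % 50,) for i in range(2000)])

def fetch_with_loop(conn):
    """One query for the parent rows, then one more query per row."""
    queries = 1
    result = []
    parents = conn.execute("SELECT id, sub_id FROM table_1 ORDER BY id").fetchall()
    for row_id, sub_id in parents:
        label = conn.execute(
            "SELECT label FROM table_2 WHERE id = ?", (sub_id,)
        ).fetchone()
        queries += 1
        result.append((row_id, label[0] if label else None))
    return result, queries

def fetch_with_join(conn):
    """The same rows from a single LEFT JOIN on the primary key."""
    rows = conn.execute(
        "SELECT table_1.id, table_2.label FROM table_1 "
        "LEFT JOIN table_2 ON table_1.sub_id = table_2.id ORDER BY table_1.id"
    ).fetchall()
    return rows, 1

loop_rows, loop_queries = fetch_with_loop(conn)
join_rows, join_queries = fetch_with_join(conn)
```

For 2000 parent rows the loop issues 2001 statements and the join issues one, with identical results. How much wall-clock time that saves depends on per-query overhead, which an in-memory SQLite database does not model.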
I would say that the while statement is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site, a while statement is more useful. In one instance I have a report where a client could request several categories to pull by and only return data for those categories. What happened was that I had a category_id IN(...,...,..,.., etc etc etc) with 50-500 IDs, and the index would choke and die in my arms as I was holding it in its final moments. So I split the IDs into groups of 10 and ran the same query x / 10 times, and my results were fetched way faster than before, because the index likes dealing with 10 IDs, not 500. In that case the while statement gave me a great improvement.
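Splitting a long IN list into batches might look like the following sketch. It is Python with sqlite3 rather than PHP and MySQL, the `item` table and the helper names are hypothetical, and whether batches of 10 are actually faster is something to benchmark on the real database.

```python
import sqlite3

def chunked(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_by_category(conn, category_ids, batch_size=10):
    """Run the same IN (...) query once per batch and concatenate the rows."""
    rows = []
    for batch in chunked(list(category_ids), batch_size):
        placeholders = ",".join("?" * len(batch))  # one bound parameter per ID
        rows.extend(conn.execute(
            f"SELECT id, category_id FROM item WHERE category_id IN ({placeholders})",
            batch,
        ))
    return rows

# Hypothetical data: 1000 items spread over 500 categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, category_id INTEGER)")
conn.execute("CREATE INDEX item_category ON item (category_id)")
conn.executemany("INSERT INTO item (category_id) VALUES (?)",
                 [(i % 500,) for i in range(1000)])

wanted = list(range(0, 500, 2))   # 250 category IDs, queried 10 at a time
rows = fetch_by_category(conn, wanted)
```

The list of IDs should not contain duplicates, otherwise the same rows can come back from more than one batch.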
If the indexes are properly used, then it is almost always more efficient to use a JOIN. I stress "efficient" because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
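As a small illustration of checking a plan before and after adding an index, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` from Python, as a stand-in for MySQL's `EXPLAIN`, whose output format is different. The `kind` column, the index name, and the `plan` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER PRIMARY KEY, sub_id INTEGER, kind TEXT)")

QUERY = "SELECT id, sub_id FROM table_1 WHERE kind = ?"

def plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's query plan."""
    # The last column of each plan row is the detail text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

plan_before = plan(conn, QUERY, ("a",))   # no usable index: a full table scan
conn.execute("CREATE INDEX table_1_kind ON table_1 (kind)")
plan_after = plan(conn, QUERY, ("a",))    # the plan now names the new index

print(plan_before)
print(plan_after)
```

The same habit applies to a JOIN: run the plan for the joined query and look for any table that is scanned rather than searched through an index.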
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, and build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly, and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design where you should accept poor performance in exchange for adherence to arbitrary rules about what you should or should not do. The best-performing method is the best method.
It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot work if you do a while.
As stated in previous answers, check with EXPLAIN whether something isn't using an index when you use the JOIN. Also, you should check the memory which is given to the InnoDB cache, and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database goes slower when doing the JOINs.
I would say the answer is, it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practise, however, it depends entirely on what is being done.
Is it the case for you? Without detailed table structures and info on indexes as well as use of foreign keys etc, we can't say for sure. Best idea if you want to check, is try it and see. Get their queries, EXPLAIN them, write your own, and do an EXPLAIN on that, see which is more efficient.
I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use hard drive access and (if not on the same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.
Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join by the database will use more resources than setting up a new connection and performing the exact same operation (after all, you're still connecting the data in the same way as a join, even if it is done externally). If it did, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for differences between those two scenarios, as the same indices (or lack of) would apply in both cases.
I am not a professional programmer, so I cannot be sure about this. How many MySQL queries do your scripts send per page, and what is your optimal number of queries? For example, Stack Overflow's homepage lists questions and shows the authors of these questions. Does Stack Overflow send a MySQL query for each question to get the author's information, or does it send one query, get all the user data, and match it with the questions?
I like to keep mine under 8.
Seriously though, that's pretty meaningless. If hypothetically there was a reason for you to have 800 queries in a page, then you could go ahead and do it. You'll probably find that the number of queries per page will simply be dependent on what you're doing, though in normal circumstances I'd be surprised to see over 50 (though these days, it can be hard to realise just how many you're doing if you are abstracting your DB calls away).
Slow queries matter more
I used to be frustrated at a certain PHP based forum software which had 35 queries in a page and ran really slow, but that was a long time ago and I know now that the reason that particular installation ran slow had nothing to do with having 35 queries in a page. For example, only one or two of those queries took most of the time. It just had a couple of really slow queries, that were fixed by well-placed indexes.
I think that identifying and fixing slow queries should come before identifying and eliminating unnecessary queries, as it can potentially make a lot more difference.
Consider even that three fast queries might be significantly quicker than one slow query - number of queries does not necessarily relate to speed.
I have one page (which is actually kind of a test case/diagnostic tool designed to be run only by an admin) which has over 800 queries but it runs in a matter of seconds. I guess they are all really simple queries.
Try caching
There are various ways to cache parts of your application which can really cut down on the number of queries you do, without reducing functionality. Libraries like memcached make this trivially easy these days and yet run really fast. This can also help improve performance a lot more than reducing the number of queries.
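The idea can be sketched with a tiny in-process cache that expires entries after a fixed time. This is Python, the `TTLCache` class and `load_front_page` function are hypothetical, and it only imitates what a shared cache such as memcached provides across processes and servers.

```python
import time

class TTLCache:
    """Tiny in-process cache; a stand-in for a shared cache such as memcached."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # fresh entry: skip the query entirely
        value = compute()              # miss or expired: run the query once
        self.store[key] = (now + self.ttl, value)
        return value

calls = []

def load_front_page():
    """Hypothetical expensive query; here it only records that it ran."""
    calls.append(1)
    return ["question 1", "question 2"]

cache = TTLCache(ttl_seconds=5)
first = cache.get_or_compute("front_page", load_front_page)
second = cache.get_or_compute("front_page", load_front_page)
```

The second call is served from the cache, so `load_front_page` runs only once within the five-second window.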
If queries are really unnecessary, and the performance really is making a difference, then remove/combine them
Just consider looking for slow queries and optimizing them, or caching their results, first.
Don't focus on the number of queries. This is not a useful metric. Instead, you need to look at a few other things:
how many queries are duplicated?
how many queries have intersecting datasets? or are a subset of another?
how long do they take to run? have you profiled the common ones to check indices?
how many are unnecessarily complex?
Numerous times I've seen three simpler queries together execute in a tenth of the time of one complex one that returned the same information. By the same token, SQL is powerful, but don't go mad trying to do something in SQL that would be easier and simpler in a loop in PHP.
how much progressive processing are you doing?
If you can't avoid longer queries with large datasets, try to re-arrange the algorithm so that you can process the dataset as it comes from the database. This lets you use an unbuffered query in MySQL, and that improves your memory usage. And if you can provide output whilst you're doing this, you can improve your page's perceived speed by providing first output sooner.
how much can you cache some of this data? Even caching it for a few seconds can help immensely.
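The progressive-processing point above can be sketched with generators that hand each row on as soon as it is read. This is Python with sqlite3 standing in for an unbuffered MySQL query in PHP; the `question` table and the function names are made up for illustration.

```python
import sqlite3

def stream_rows(conn, sql, params=()):
    """Yield rows one at a time instead of building the full list first."""
    cursor = conn.execute(sql, params)
    for row in cursor:       # rows are fetched as the loop advances
        yield row

def render(rows):
    """Produce one output line per row as soon as the row is available."""
    for question_id, title in rows:
        yield f"<li>{question_id}: {title}</li>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE question (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO question (title) VALUES (?)",
                 [(f"title {i}",) for i in range(1, 1001)])

lines = render(stream_rows(conn, "SELECT id, title FROM question ORDER BY id"))
first_line = next(lines)   # available before the remaining rows are read
```

Because nothing calls `fetchall()`, only one row at a time is held in application memory, and the first line of output can be sent while the rest of the result set is still being read.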
There really is no optimal number of queries. Obviously the fewer queries you make the better.
If you are using some kind of ORM like Hibernate, Propel, Doctrine, etc., they will generate queries differently than if you were to write the SQL by hand. So if StackOverflow uses an ORM they might have more than one query accessing the questions and the users that created the questions. Or they might just use a join with straight SQL.
It really depends on the technology you are using and what it actually does behind the scenes to generate the SQL.
Things you should be researching to understand this better:
Lazy loading
Object Relational Mapping
I recently started refactoring some older code of mine and I realised that I had used a lot of queries inside loops because back then I didn't know how to write SQL queries with subqueries and joins, etc. So I went and integrated these nested queries into one query so I could retrieve all the data at once and then loop over it in a nested way.
In some cases this made the page load significantly faster.
Ergo: It's definitely worth learning about the possibilities of SQL so you can start doing more with SQL and less with PHP.
I would not say that there is an optimal number of queries to be on any given script, but rather you have a goal when optimising; ordinarily time is the main concern, among other things.
If time is the only concern, you could optimise your queries so that several queries together execute in less time than a single query would.
This is how I view optimisation: I have an objective, so how best do I achieve it? Is there any information that you can cache? Based on your indexes, would a particular order of filters in your query perform better?
My point: optimisation is best done on both the DB end and the application end.
You may want to read more on database optimisation.
As few as you need and no more. There is no rule of thumb here. Some websites require a lot of db access and others don't.
SO actually has only a few db calls, if it's written as I think. On a page like this one, an answer to a question, there would be:
1) session verification, if you are logged in.
2) current user info, to get the user bar at the top of the screen and your medal count.
3) get the question info as well as the questioner's/last editor's info.
4) retrieve a count of tags used in this question.
5) select all responses and responder data in one shot.
And that's about it. The fun part is how much is keyed off the question:
-- this returns one row per revision
select q.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from questions q left outer join users u on q.u_id = u.u_id
where q.q_id = :q_id;
-- this is used to display the tags below the question and the tag counts on the right
select t.name, count(*)
from tags t left join tags q on q.tagid = t.tagid
where t.q_id = :q_id
group by t.name;
-- this can also get multiple revisions
select a.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from answers a left outer join users u on a.u_id = u.u_id
where a.q_id = :q_id;
This assumes that the various counts (vote-ups, favored question) are cached on the table as well as stored separately.
The optimal number is as many as you need to display the information the user expects. I always try to keep it in the single digits. For information that takes a few queries, but rarely changes, I cache the results in a generic cache table so it only takes one query. Store it as a serialized array to retain an easy to access structure.
When I first installed WordPress, I was appalled that the base install did over 20 queries! Plugins would increase that number (some by quite a bit). But with caching, that could be reduced to zero (SuperCache). If your content changes every 10 minutes, why generate it dynamically every hit?
At the very extreme is a platform like Facebook, where every page is unique content, customized to the user viewing it. You have to query every time.
But regardless, I rarely see the need to hit double digits query counts.
0 would be optimal if you are prioritizing speed.
I'm building a PHP page with data sent from MySQL.
Is it better to have
1 SELECT query with 4 table joins, or
4 small SELECT queries with no table join; I do select from an ID
Which is faster and what is the pro/con of each method? I only need one row from each tables.
You should run a profiling tool if you're truly worried, because it depends on many things and can vary, but as a rule it's better to have fewer queries being compiled and fewer round trips to the database.
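A rough way to measure the two options is to time them side by side. The sketch below is Python with an in-memory SQLite database, so it has no network round trips and its numbers will not carry over to a real MySQL server; the four tables `a` to `d` and the function names are hypothetical.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema: four tables, one row needed from each.
for name in ("a", "b", "c", "d"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)",
                     [(i, f"{name}{i}") for i in range(1, 1001)])

def one_join(conn, key):
    """One SELECT joining the four tables on their primary keys."""
    return conn.execute(
        "SELECT a.val, b.val, c.val, d.val FROM a "
        "JOIN b ON b.id = a.id JOIN c ON c.id = a.id JOIN d ON d.id = a.id "
        "WHERE a.id = ?", (key,)
    ).fetchone()

def four_selects(conn, key):
    """Four small SELECTs by ID, combined in application code."""
    return tuple(
        conn.execute(f"SELECT val FROM {name} WHERE id = ?", (key,)).fetchone()[0]
        for name in ("a", "b", "c", "d")
    )

def time_it(fn, repeats=2000):
    """Total wall-clock seconds for `repeats` calls of fn(conn, key)."""
    start = time.perf_counter()
    for i in range(repeats):
        fn(conn, 1 + i % 1000)
    return time.perf_counter() - start

join_seconds = time_it(one_join)
split_seconds = time_it(four_selects)
print(f"1 join: {join_seconds:.4f}s  4 selects: {split_seconds:.4f}s")
```

Run the same comparison against the real server and data before drawing conclusions, since connection latency and the optimizer's choices dominate there.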
Make sure you filter things as well as you can using your where and join on clauses.
But honestly, it usually doesn't matter, since you're probably not going to be hit all that hard compared to what the database can do. So unless optimization is part of your spec, you should not do it prematurely; do what's simplest.
Generally, it's better to have one SELECT statement. One of the main reasons to have databases is that they are fast at processing information, particularly if it is in the form of a query.
If there is any drawback to this approach, it's that there are some kinds of analysis that you can't do with one big SELECT statement. RDBMS purists will insist that this is a database design problem, in which case you are back to my original suggestion.
When you use JOINs instead of multiple queries, you allow the database to apply its optimizations. If you replace an INNER JOIN with multiple SELECTs, you are also potentially retrieving rows that you don't need, which increases the network traffic between your app server and database server. Even if they're on the same box, this matters.
It might depend on what you do with the data after you fetch it from the DB. If you use each of the four results independently, then it would be more logical and clear to have four separate SELECT statements. On the other hand, if you use all the data together, like to create a unified row in a table or something, then I would go with the single SELECT and JOINs.
I've done a bit of PHP/MySQL work, and I find that even for queries on huge tables with tons of JOINs, the database is pretty good at optimizing - if you have smart indexes. So if you are serious about performance, start reading up on query optimization and indexing.
I would say 1 query with the join. This way you need to hit the server only once. And if your tables are joined with indexes, it should be fast.
Well under Oracle you'd want to take advantage of the query caching, and if you have a lot of small queries you are doing in your sequential processing, it would suck if the last query pushed the first one out of the cache...just in time for you to loop around and run that first query again (with different parameter values obviously) on the next pass.
We were building an XML output file using Java stored procedures and definitely found the round trip times for each individual query were eating us alive. We found it was much faster to get all the data in as few queries as possible, then plug those values into the XML DOM as needed.
The only downside is that the Java code was a bit less elegant, as the data fetch was now remote from its usage. But we had to generate a large complex XML file in as close to zero time as possible, so we had to optimize for speed.
Be careful when dealing with a merge table however. It has been my experience that although a single join can be good in most situations, when merge tables are involved you can run into strange situations.