JOINS vs. while statements - php

In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, I could just do a simple join to pull in the data, such as:
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
$query = mysql_query("SELECT sub_id FROM table_1");
while ($rs = mysql_fetch_assoc($query)) {
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    // blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records across different tables, and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in case a poorly executed query could lock up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report, and if they go crazy and build a big report, it could cause some havoc.
I was confused, so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while-loop approach (one larger query to pull a lot of rows, followed by lots of tiny sub-queries, if you will) or to do a join (one larger query that pulls all the data you need at once)? As long as the indexes are set up properly, does it matter? One other thing to consider is that the current DB uses the InnoDB storage engine.
Thanks!
Update 8/28/14
So I thought I'd post an update on this one and on what has worked longer term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive numbers, but I thought I'd share the result.
I think I went a little overboard, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, join a value to a primary key, so they all run really fast. Previously, if the report had, let's say, 30 columns of data to pull and it pulled 2,000 records, every single field ran its own query to fetch the data (because each piece of data could come from a different place). 30 x 2,000 = 60,000 queries, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of pure query time (which is pretty much what I remember it being). Now that I've rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record that returns runs between 0 and 4 extra queries depending on the data that's needed (it may not need any extra queries if everything can be fetched in the joins, which happens 75% of the time). So the same 2,000 records trigger an additional 0-8,000 queries (much better than 60,000).
I would say that the while-loop approach is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site a while loop is more useful. In one instance I have a report where a client can request several categories to pull from and return data only for those categories. What happened was that I had a category_id IN (..., ..., ...) clause with 50-500 IDs, and the index would choke and die in my arms as I was holding it in its final moments. So what I did was split the IDs into groups of 10 and run the same query x / 10 times, and my results were fetched way faster than before, because the index is happy dealing with 10 IDs rather than 500. So in that case I saw a great improvement from doing the while-loop approach.
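For reference, here is a minimal sketch of that chunking approach; the table and column names are made up, it assumes a PDO connection in $pdo, and it uses prepared statements rather than the old mysql_* calls shown above:
<?php
// Hypothetical sketch: fetch report rows for a long, client-chosen list of
// category IDs in chunks of 10, so the index only ever sees a small IN() list.
$categoryIds = array(/* 50-500 IDs picked by the client */);
$rows = array();
foreach (array_chunk($categoryIds, 10) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare("SELECT * FROM report_data WHERE category_id IN ($placeholders)");
    $stmt->execute($chunk);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $rows[] = $row;
    }
}
?>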

If the indexes are properly used, then it is almost always more efficient to use a JOIN. Note the word "almost": best efficiency does not always equal best performance.
There isn't really a one-size-fits-all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, and so on. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
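As a quick illustration, checking the join from the question might look like this (the exact plan depends entirely on your schema and indexes):
EXPLAIN SELECT table_1.id, table_2.id
FROM table_1
LEFT JOIN table_2 ON table_1.sub_id = table_2.id;
-- Watch the "type", "key" and "rows" columns: type ALL with key NULL on table_2
-- means a full scan for every joined row, while an index on table_2.id (e.g. its
-- primary key) should show up as eq_ref with only one row examined per join.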
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, and make sure that these things have been considered; build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly, and if all the optimizations and corrections still produce a join that is slower than the separate query fragments, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design where you should accept poor performance just to adhere to arbitrary rules about what you should or should not do. The best-performing method is the best method.

It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, extra I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot be applied if you run the queries one at a time in a while loop.
As stated in previous answers, check with EXPLAIN whether something isn't using an index when you use the JOIN. Also, you should check the memory given to the InnoDB cache, and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database is slower when doing the JOINs.
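For example, those settings and the cache hit rate can be inspected straight from the MySQL client (the values these commands show depend entirely on your server configuration):
-- Memory InnoDB may use for caching data and indexes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Per-connection buffers that affect sorting and joining large result sets
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'join_buffer_size';
-- Rough check of how often InnoDB has to hit disk instead of the buffer pool
SHOW STATUS LIKE 'Innodb_buffer_pool_read%';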

I would say the answer is: it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practice; however, it depends entirely on what is being done.
Is that the case for you? Without detailed table structures and info on indexes, as well as the use of foreign keys etc., we can't say for sure. The best idea, if you want to check, is to try it and see: get their queries, EXPLAIN them, write your own, do an EXPLAIN on those too, and see which is more efficient.

I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use hard drive access and (if not on the same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.

Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join done by the database will use more resources than setting up new queries and performing the exact same operation externally (after all, you're still connecting the data in the same way as a join, even if it is done outside the database): if it did, the engine could simply be rewritten to use that external route to improve performance.
When joins do use more resources (apart from indexing problems), it mostly comes from the downside of retrieving the data row by row, which means that information from the parent table is duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice, of course, but at first sight I don't think it accounts for differences between those two scenarios, as the same indices (or lack thereof) would apply in both cases.
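For what it's worth, a minimal sketch of that kind of split, assuming a hypothetical wide parent table, a narrow child table, and a PDO connection in $pdo:
<?php
// One query for the wide parent row...
$stmt = $pdo->prepare("SELECT * FROM parent WHERE id = ?");
$stmt->execute(array(42));
$parent = $stmt->fetch(PDO::FETCH_ASSOC);

// ...and one query for the many narrow child rows, so the parent's large
// columns are not duplicated once per child row on the wire.
$stmt = $pdo->prepare("SELECT id, value FROM child WHERE parent_id = ?");
$stmt->execute(array(42));
$children = $stmt->fetchAll(PDO::FETCH_ASSOC);
?>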

Related

Is querying a MySQL database dozens of times in order to allow effective caching better than a few large queries with less caching?

In my application, I try to grab all the data I need in as few queries as possible. This usually leads to large queries with many joins. This places limits on what you can cache using software like Memcache or Redis (as far as I know). With large queries, you don't know what parts might already be cached. It seems like you have to query everything in smaller parts so that these small parts can be cached individually. The idea would be that you only have to do dozens of small queries in order to populate caches and that most of the time you would hit the caches rather than query. Is this how high traffic PHP/MySQL websites handle this? Is there a good way to cache effectively even if you have large queries with many joins?
Example:
SELECT user.name, user.birthday
FROM follower
INNER JOIN user ON (user.id = follower.user)
WHERE follower.following = '1'
The results of this query include the names and birthdays of any users following user 1. Those results could be cached, but the cached copy would only be useful when getting the followers of user 1.
The alternative:
SELECT follower.user
FROM follower
WHERE follower.following = '1'
For each result with ? populated by follower.user from the previous query:
SELECT name, birthday FROM user where user.id = ?
In this case, we can check whether user ?'s name and birthday are cached before querying MySQL for them. If they aren't cached, or some are cached and some are not, then grab the missing ones and cache them. You could also cache the list of follower IDs, and then none of the queries would need to be run the next time. The difference is that the names and birthdays of the users will be useful to any other user that ends up needing information about these followers in any other context.
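A rough sketch of that second approach, assuming a Memcached server and a PDO connection; the cache key names and the 5-minute TTL are made up:
<?php
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

// Step 1: the follower IDs (this list could itself be cached, e.g. under "followers:1").
$followerIds = $pdo->query("SELECT user FROM follower WHERE following = 1")
                   ->fetchAll(PDO::FETCH_COLUMN);

// Step 2: try the cache for every user, collecting the misses.
$keys = array();
foreach ($followerIds as $id) {
    $keys[] = "user:" . $id;
}
$cached = $memcached->getMulti($keys) ?: array();
$missing = array();
foreach ($followerIds as $id) {
    if (!isset($cached["user:" . $id])) {
        $missing[] = $id;
    }
}

// Step 3: fetch only the missing users from MySQL and back-fill the cache.
if ($missing) {
    $in = implode(',', array_fill(0, count($missing), '?'));
    $stmt = $pdo->prepare("SELECT id, name, birthday FROM user WHERE id IN ($in)");
    $stmt->execute($missing);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $cached["user:" . $row['id']] = $row;
        $memcached->set("user:" . $row['id'], $row, 300); // arbitrary 5-minute TTL
    }
}
?>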
Am I missing something on caching with larger queries? Or is the second way the right way?
The correct answer is: It depends.
Caching is a way of optimizing a recognized usage pattern: instead of repeatedly producing expensive data, you re-use the data from a previous run.
So the first question you should answer is: Is there an observed, repeated usage pattern that has a noticeably expensive step of producing data? If not: don't add caching that you do not yet need; wait until you can observe something.
The second question you should be able to answer is: Can you measure how long it takes with and without the cache, and is the difference noticeable?
And the third important question to answer is: How can you clean the cache from outdated information if the original data gets changed, and you want that new data to be displayed instantly?
So in your case you are asking if using a cache for plenty of small, but seemingly more universal queries that then get combined is more beneficial than caching one big query. There is no theoretical answer, because it depends on how much faster a cache hit for a big result is compared to multiple cache hits for the combined result. Making multiple requests to the cache may very well be SLOWER than fetching the data from the original source, and combining the data into the needed complex result might also be slower than fetching ONE complex result directly from the cache.
Also, if using multiple cache entries for a combined result, you'll now have to deal with plenty of cases where only parts of the information are outdated, while others are not. So the result just gets more unreliable - you cannot really be sure if every part of the result is up to date, or how old it is.
@Sven, you make the point! I'll add a few more rough suggestions.
@Barakat, big queries are usually not a big deal for MySQL; a well-designed DB, good indexes, and tuned engine parameters usually give high performance.
Doing many little queries induces a lot of overhead (cached or not), so I usually avoid that.
If your big query returns big results (hundreds/thousands of rows), maybe you can avoid that by paging the results or limiting the answers to the best scores.
A very simple and effective tool for tuning your MySQL server is MysqlTuner.pl, and it lets you use the MySQL internal cache without worrying about coherence!

JS, PHP, and MySQL to get large data

I am using Ajax to send a query to a PHP server, which then runs the SQL query to get the data. Because the query involves three tables (two large ones), JOINing the three tables is very slow.
So I split the SQL query into three queries. That improves efficiency (for small datasets). But for a large dataset, because the PHP program runs the three queries one by one and processes the result after each, it hits the 30-second timeout (the default). I don't want to remove this default setting.
To avoid the timeout, I am also considering running the three queries, returning the results to JS, and letting the client side do the processing.
Is there another way to do this?
Added:
Basically, I want three outputs, title, extviews, and allviews, for each item WHERE extviews > somevalue. title comes from one small table; extviews and allviews are aggregated from two different large tables. I have all the fields indexed, but joining the two big tables still takes a long time.
So I first aggregate one table to get extviews for each item, along with a list of item IDs. The results are organized as an array for JSON output to JS. Then, using the list of IDs, I get the title for each item and aggregate the other table to get allviews. Then I update the array with the new results.
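For comparison, the single-query version can often be written by joining two pre-aggregated derived tables instead of the raw large tables; the table and column names below are guesses, since the real schema isn't shown:
SELECT i.id, i.title, e.extviews, a.allviews
FROM item i
JOIN (SELECT item_id, COUNT(*) AS extviews
      FROM external_view_log GROUP BY item_id) e ON e.item_id = i.id
JOIN (SELECT item_id, COUNT(*) AS allviews
      FROM all_view_log GROUP BY item_id) a ON a.item_id = i.id
WHERE e.extviews > :somevalue;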
Unless your MySQL server is really overloaded, it's usually quicker to use joins. I guess you've already defined indexes on your tables (for fields used in join conditions and WHERE clauses)?
Doing the processing on the client side might also be a problem, since you'll have to send a lot of data in order to do the join...
Edit:
If all "easy" optimisation is done, then you have 2 choices... The one you just described (doing it on client size, if it's possible - what is the size (in bytes) of the json arrays you send to the client?)
Your other choice is to do the processing in the background (via cron) & cache somehow the results.
As already indicated by other people responding to your post, you should give us an idea of the structure of your three tables and the intent of each. Based upon that information, you may be able to get significant performance improvements by optimizing your database structure. To make it easier to understand, let's assume that someone had a website running off an intelligently designed database. I could easily make that application perform ten times worse solely by modifying the structure of the database.
Now, maybe there's some reason why you need to have three distinct tables, but I can't make that judgment without knowing what the fields in the database are, what you're aggregating, and what your web application is doing in the first place. Is it read heavy or write heavy? The solution may be as simple as denormalizing your database so that you don't need to use any joins.
I can say from a cursory glance at your description of what you're doing that this application can't possibly scale efficiently and that you really need to reconsider your design. The first warning sign for me is the fact that you stated that one of the joins is just to link the title to two other tables. To me, being forced to do a join just to get the title of an object seems indicative of over-normalization. Some data redundancy is not necessarily a bad thing, and in some situations it's absolutely mandatory. Also, you say that you have two large tables that you use aggregate functions on and then join everything together. I can tell you right now that you're going to run into some serious performance issues if every hit to your application involves a triple join and two aggregate functions (COUNT, I'm assuming).
Ultimately, we'll be able to give you a better response once you provide more information as to what you're trying to accomplish, and the general structure of the database you set up for it.

Will more MySQL tables slow down searches on a MySQL database?

I have a classifieds website, and I am thinking about redesigning the database a bit.
Currently I have 7 tables in the db. One table for each "MAIN CATEGORY".
For example, I have a "VEHICLES" table which holds all information about the following categories of classifieds:
cars
mc
mopeds/scooters
trucks
boats
etc etc
However, users on the website usually search in specific categories. For example, the user chooses the "cars" category to search in, and enters a keyword.
My code today, will search the entire VEHICLES table for all records with the field "category" equal to "cars", and then get their details:
"SELECT * IN vehicles WHERE category='cars' AND alot of other conditions" // just for example, not tested
I am thinking about making a table now, for each of these "sub-categories".
Ie, one for cars, one for mc, one for trucks etc, so that search isn't done through information which isn't needed.
Will this increase search speed? I have calculated that I will need at least 30 or so tables for this.
Thanks
With a properly indexed table and a "reasonable" number of rows, you will not gain much speed from this approach. Anything you gain in speed of execution you will lose in time-to-market because your programming will become more complicated.
Do not perform this optimization unless and until you encounter a performance problem in testing with a representative set of data.
It will increase the speed of a search within the same category. It will potentially slow down queries where you need aggregate information from the different categories. You need to decide which is the best option for your site.
How many records do you have in total in the vehicles table? It's quite likely that adding proper indexes will greatly increase the speed of your searches.
Check out the 'EXPLAIN' query option in MySQL. Understanding this will help you optimize your database a lot with indices.
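For example, before splitting anything, it's worth checking whether the category filter is even using an index (the extra price condition and the index name here are just stand-ins for your real columns):
-- Does the current query use an index at all?
EXPLAIN SELECT * FROM vehicles WHERE category = 'cars' AND price < 10000;

-- If "possible_keys" comes back NULL, add an index that matches the common
-- filters, e.g. a composite index led by category:
ALTER TABLE vehicles ADD INDEX idx_category_price (category, price);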
Performance optimization is as much art as science, and to really understand what's the best option requires that you do some benchmarking; anyone offering a definitive answer given the available information is just wrong. That said, a few thoughts on your situation:
You don't say what type your category column is now, but if it's a string type, it's probably using more space than other options, thus making the table larger. Proper indexing can help tremendously with speed, but a larger table with larger indexes will always work to do just the opposite.
As already mentioned by someone else, your queries within a category will be faster in the simple case of a category search. How much faster depends on how much data you have in your current table, and the increases may be negated if you have to join in other tables to satisfy the need for all the other conditions to which you alluded. OTOH, it may actually speed things up in certain join cases (e.g., if you were doing self-joins with your all-encompassing table).
If you're working with a lot of data, splitting into multiple tables can greatly ease backups.
Splitting into multiple tables may also make it easier to shard your data across multiple servers for performance reasons. Similarly, it may make replication setups easier to keep running.
If you're tracking data that's category-specific, separate tables enable you to better normalize your database and likely reap some nice performance as a result of using much smaller tables.
Splitting obviously means modifying your code. If your code is of the old, creaky type, you may very well achieve a performance gain from the clean-up. Of course, there's also the risk that you'll break something....
Check your indexes. Bad indexes are a very common cause of poor performance but are relatively easy to fix with a bit of quality time spent on self-education. MySQL's EXPLAIN can tell you whether your queries are using the indexes, and the index stats (look in the docs) can tell you how efficiently your indexes are working.
Finally, speaking of code, check yours. Try experimenting with a few approaches, regardless of how the database is set up. For example, it may be quicker to do a couple of separate queries and join the results in code than to do the join in the database. Likewise, it's often quicker to do things like sorts in code, particularly in cases where a join or something means the database would have to create a temporary file/table. Again, check the EXPLAIN output, and if you can't eliminate a problem area in your queries, see if it helps to simplify the queries and do more work in the code. This can be particularly beneficial in the common case where the web server has more resources to spare than the database server.
There are many more factors to consider. Ultimately, though, the best way to make these decisions is not to spend time pondering theories but to put both methods to the test. Create some test databases and benchmark the sort of queries you'd run most often, with and without simulated load. You'll get your answer.
If you are using PHP, try something like:
$query = mysql_query($sql);
$tempvalue = array();
while ($row = mysql_fetch_assoc($query)) {
    $tempvalue[] = $row;
}
Then loop over the data with a foreach to build the output:
foreach ($tempvalue as $key => $value) {
    // write the table row here .....
}
Maybe MySQL isn't slow and the problem is in the code.
Testing doesn't kill anyone =)

What is the optimal MySQL query number in a PHP script?

I am not a professional programmer, so I can't be sure about this. How many MySQL queries do your scripts send per page, and what is your optimal query number? For example, Stack Overflow's homepage lists questions and shows the authors of those questions. Does Stack Overflow send a MySQL query for each question to get the author's information, or does it send one query to get all the user data and match it with the questions?
I like to keep mine under 8.
Seriously though, that's pretty meaningless. If hypothetically there were a reason for you to have 800 queries in a page, then you could go ahead and do it. You'll probably find that the number of queries per page will simply be dependent on what you're doing, though in normal circumstances I'd be surprised to see over 50 (though these days, it can be hard to realise just how many you're doing if you are abstracting your DB calls away).
Slow queries matter more
I used to be frustrated at a certain PHP-based forum software which had 35 queries in a page and ran really slowly, but that was a long time ago, and I know now that the reason that particular installation ran slowly had nothing to do with having 35 queries in a page. Only one or two of those queries took most of the time. It just had a couple of really slow queries that were fixed by well-placed indexes.
I think that identifying and fixing slow queries should come before identifying and eliminating unnecessary queries, as it can potentially make a lot more difference.
Consider even that three fast queries might be significantly quicker than one slow query - number of queries does not necessarily relate to speed.
I have one page (which is actually kind of a test case/diagnostic tool designed to be run only by an admin) which has over 800 queries but it runs in a matter of seconds. I guess they are all really simple queries.
Try caching
There are various ways to cache parts of your application which can really cut down on the number of queries you do, without reducing functionality. Libraries like memcached make this trivially easy these days and yet run really fast. This can also help improve performance a lot more than reducing the number of queries.
If queries are really unnecessary, and the performance really is making a difference, then remove/combine them
Just consider looking for slow queries and optimizing them, or caching their results, first.
Don't focus on the number of queries. This is not a useful metric. Instead, you need to look at a few other things:
how many queries are duplicated?
how many queries have intersecting datasets? or are a subset of another?
how long do they take to run? have you profiled the common ones to check indices?
how many are unnecessarily complex?
Numerous times I've seen three simpler queries together execute in a tenth of the time of one complex one that returned the same information. By the same token, SQL is powerful, but don't go mad trying to do something in SQL that would be easier and simpler in a loop in PHP.
how much progressive processing are you doing?
If you can't avoid longer queries with large datasets, try to re-arrange the algorithm so that you can process the dataset as it comes from the database. This lets you use an unbuffered query in MySQL, which improves your memory usage (see the sketch after this list). And if you can produce output whilst you're doing this, you can improve your page's perceived speed by providing the first output sooner.
how much can you cache some of this data? Even caching it for a few seconds can help immensely.
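As mentioned a couple of points up, in PHP the unbuffered approach looks roughly like the sketch below with mysqli; the table name and the render_row() helper are placeholders:
<?php
// MYSQLI_USE_RESULT streams rows from the server instead of buffering the
// whole result set in PHP memory, so huge datasets can be processed as they arrive.
$result = $mysqli->query("SELECT id, payload FROM big_table", MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
    echo render_row($row); // render_row() stands in for your own output code
}
$result->free();
// Note: the connection can't run another query until this result is freed.
?>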
There really is no optimal number of queries. Obviously, the fewer queries you make, the better.
If you are using some kind of ORM like Hibernate, Propel, Doctrine, etc they will generate queries differently than if you were to write the SQL by hand. So if StackOverflow uses an ORM they might have more than one query accessing the questions and the users that created the questions. Or they might just use a join with straight SQL.
It really depends on the technology you are using and what it actually does behind the scenes to generate the SQL.
Things you should be researching to understand this better:
Lazy loading
Object Relational Mapping
I recently started refactoring some older code of mine and I realised that I had used a lot of queries inside loops because back then I didn't know how to write SQL queries with subqueries and joins, etc. So I went and integrated these nested queries into one query so I could retrieve all the data at once and then loop over it in a nested way.
In some cases this made the page load significantly faster.
Ergo: It's definitely worth learning about the possibilities of SQL so you can start doing more with SQL and less with PHP.
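A small sketch of that kind of refactor, using hypothetical post and comment tables and a PDO connection: one JOIN instead of a query per loop iteration, with the rows regrouped in PHP afterwards.
<?php
// Before: SELECT * FROM comment WHERE post_id = ? run inside a loop over posts.
// After: one joined query, grouped by post as the rows are read.
$sql = "SELECT p.id AS post_id, p.title, c.id AS comment_id, c.body
        FROM post p
        LEFT JOIN comment c ON c.post_id = p.id";
$posts = array();
foreach ($pdo->query($sql, PDO::FETCH_ASSOC) as $row) {
    $posts[$row['post_id']]['title'] = $row['title'];
    if ($row['comment_id'] !== null) {
        $posts[$row['post_id']]['comments'][] = $row['body'];
    }
}
?>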
I would not say that there is an optimal number of queries for any given script; rather, you have a goal when optimising, and ordinarily time is the main concern, among other things.
If time is the only concern, you could optimise your queries such that several of them together execute in less time than one other query would.
This is how I view optimisation: I have an objective; how best do I achieve it? Is there any information that you can cache? Based on your indexes, would a particular order of filters in your query perform better?
My point: optimisation is best done on both the DB end and the application end.
You may want to read more on database optimisation.
As few as you need and no more. There is no rule of thumb here. Some websites require a lot of db access and others don't.
SO actually has only a few DB calls, if it's written the way I think it is. On a page like this one (an answer to a question), there would be:
1) session verification, if you are logged in.
2) current user info, to get the user bar at the top of the screen and your medal count.
3) get the question info as well as the questioner's/last editor's info.
4) retrieve a count of tags used in this question.
5) select all responses and responder data in one shot.
And that's about it. The fun part is how much is keyed off the question:
// this returns one row per revision
select q.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from questions q left outer join users u on q.u_id = u.u_id
where q_id = :q_id;
// this is used to display the tags below the question and the tag counts on the right
select t.name, count(*)
from tags t left join tags q on q.tagid = t.tagid
where t.q_id = :q_id
// this can also get multiple revisions
select a.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from answers a left outer join users u on a.u_id = u.u_id
where a.q_id = :q_id
This assumes that the various counts (vote-ups, favored question) are cached on the table as well as stored separately.
The optimal number is as many as you need to display the information the user expects. I always try to keep it in the single digits. For information that takes a few queries to assemble but rarely changes, I cache the results in a generic cache table so it only takes one query. Store it as a serialized array to retain an easy-to-access structure.
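A bare-bones version of that generic cache table pattern; the schema, key name, 10-minute freshness window, and build_sidebar_stats() helper are all invented for illustration:
<?php
// Assumed schema:
// CREATE TABLE cache (cache_key VARCHAR(100) PRIMARY KEY, data MEDIUMTEXT, updated_at DATETIME);
function cached_sidebar_stats(PDO $pdo) {
    $stmt = $pdo->prepare("SELECT data FROM cache
                           WHERE cache_key = ? AND updated_at > NOW() - INTERVAL 10 MINUTE");
    $stmt->execute(array('sidebar_stats'));
    if ($data = $stmt->fetchColumn()) {
        return unserialize($data);       // cache hit: one query instead of several
    }
    $stats = build_sidebar_stats($pdo);  // stand-in for the handful of expensive queries
    $save = $pdo->prepare("REPLACE INTO cache (cache_key, data, updated_at) VALUES (?, ?, NOW())");
    $save->execute(array('sidebar_stats', serialize($stats)));
    return $stats;
}
?>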
When I first installed WordPress, I was appalled that the base install did over 20 queries! Plugins would increase that number (some by quite a bit). But with caching, that could be reduced to zero (SuperCache). If your content changes every 10 minutes, why generate it dynamically every hit?
At the very extreme is a platform like Facebook, where every page is unique content, customized to the user viewing it. You have to query every time.
But regardless, I rarely see the need to hit double-digit query counts.
0 would be optimal if you are prioritizing speed.

Should I use one big SQL Select statement or several small ones?

I'm building a PHP page with data sent from MySQL.
Is it better to have
1 SELECT query with 4 table joins, or
4 small SELECT queries with no table joins; I select by an ID
Which is faster, and what are the pros/cons of each method? I only need one row from each table.
You should run a profiling tool if you're truly worried, because it depends on many things and it can vary, but as a rule it's better to have fewer queries being compiled and fewer round trips to the database.
Make sure you filter things as well as you can using your WHERE and JOIN ON clauses.
But honestly, it usually doesn't matter, since you're probably not going to be hit all that hard compared to what the database can handle, so unless optimization is your spec, you should not do it prematurely; do what's simplest.
Generally, it's better to have one SELECT statement. One of the main reasons to have databases is that they are fast at processing information, particularly if it is in the format of query.
If there is any drawback to this approach, it's that there are some kinds of analysis that you can't do with one big SELECT statement. RDBMS purists will insist that this is a database design problem, in which case you are back to my original suggestion.
When you use JOINs instead of multiple queries, you allow the database to apply its optimizations. You also are potentially retrieving rows that you don't need (if you were to replace an INNER join with multiple selects), which increases the network traffic between your app server and database server. Even if they're on the same box, this matters.
It might depend on what you do with the data after you fetch it from the DB. If you use each of the four results independently, then it would be more logical and clear to have four separate SELECT statements. On the other hand, if you use all the data together, like to create a unified row in a table or something, then I would go with the single SELECT and JOINs.
I've done a bit of PHP/MySQL work, and I find that even for queries on huge tables with tons of JOINs, the database is pretty good at optimizing - if you have smart indexes. So if you are serious about performance, start reading up on query optimization and indexing.
I would say 1 query with the join. This way you need to hit the server only once. And if your tables are joined with indexes, it should be fast.
Well under Oracle you'd want to take advantage of the query caching, and if you have a lot of small queries you are doing in your sequential processing, it would suck if the last query pushed the first one out of the cache...just in time for you to loop around and run that first query again (with different parameter values obviously) on the next pass.
We were building an XML output file using Java stored procedures and definitely found the round trip times for each individual query were eating us alive. We found it was much faster to get all the data in as few queries as possible, then plug those values into the XML DOM as needed.
The only downside is that the Java code was a bit less elegant, as the data fetch was now remote from its usage. But we had to generate a large complex XML file in as close to zero time as possible, so we had to optimize for speed.
Be careful when dealing with a MERGE table, however. It has been my experience that although a single join is good in most situations, when MERGE tables are involved you can run into strange situations.
