I have a query which lists 10 different items in a database.
Within this query, I have a nested query which, for each of items 1-10 listed above, finds related subcategories in another table.
So ultimately, 11 queries occur. 1 to iterate through the major categories, and the other 10 to query and output for each of those categories.
Problem is, collectively, they output duplicated values.
Since it is done over 10 queries, I can't use DISTINCT, because even if the output is distinct within its own query, it is not distinct in the overall group.
So how can I make sure that a multi-query list like this is unique? Does JS or PHP have a built-in function that can do this?
Your code is not really scalable. You're already reaching large numbers of queries; imagine if you had 100 items...
Instead, consider creating a subquery within the original query, since this would allow you to only run one query and the MySQL engine can do all the crunch work more easily (since it knows what you're actually asking for).
Make use of JOINs if possible, and pay close attention to indexes. I can't really help more without seeing some code, but this should help because DISTINCT would suddenly be usable again.
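For illustration, a single query along these lines lets DISTINCT apply across the whole result set at once (the table and column names are made up, since the original code isn't shown):

SELECT DISTINCT s.name
FROM categories c
JOIN subcategories s ON s.category_id = c.id;

One query instead of eleven, and the duplicates disappear in the database rather than in application code.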
Actually, I found a way to solve it without reworking my code. I just put the entire output from all queries into the same array using a non-resetting incremental $array[$i] and then used PHP's array_unique. :) PHP is a fantastically versatile language. My humble (noob) opinion.
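A minimal sketch of that approach, assuming a PDO connection in $pdo and placeholder table/column names:

<?php
$all = array();
foreach ($categoryIds as $id) {
    $stmt = $pdo->prepare("SELECT name FROM subcategories WHERE category_id = ?");
    $stmt->execute(array($id));
    foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $name) {
        $all[] = $name; // one ever-growing array across all queries
    }
}
$unique = array_unique($all); // de-duplicates across the whole group
?>

Note that array_unique compares values as strings by default, which is fine for a list of names but worth keeping in mind for other data types.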
Alright, so I've looked EVERYWHERE for how to sort/order hierarchical data by some arbitrary value, like alphabetically, highest votes, etc. Do any of you have any solutions using a nested set or something else?
Here's a couple of example queries that a guy on Reddit made in my thread and that I added to (unioning a lot of SELECTs... help?):
http://sqlfiddle.com/#!9/6dd457/10 - alphabetically and still in order
http://sqlfiddle.com/#!9/786d89/36 - in order of highest votes first
These are adjacency lists at heart, and the MySQL query creates a materialized path based on your sorting parameter on the fly. This would be generated every time someone goes to viewtopic.php, as each user has their own custom sort.
Now there are a few issues with this approach. First, seeing all of those SELECTs and JOINs makes me a bit worried. I assume there's a much more optimized way to do all of this.
PHP can be used to work out the highest parent node value, how many layers down we want to generate from that, and the formatting of some other things, but there's also the problem of this expression in there:
(1000 - l1.TotalVotes)+(2 * l1.Negatives) - having to be repeated like 5 times in the whole query
I tried to do ((1000 - l1.TotalVotes)+(2 * l1.Negatives)) AS l1.Votes in its respective SELECT, but I got an error :/ That would be one step toward making it cleaner.
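For what it's worth, that error is likely because a column alias can't be qualified with a table name. One sketch of a fix is a derived table, so the expression is written once and the alias can be reused ("comments" is a placeholder for whatever table l1 refers to):

-- compute the weighted-vote expression once, then sort by the alias
SELECT x.*
FROM (
    SELECT l1.*, (1000 - l1.TotalVotes) + (2 * l1.Negatives) AS weighted_votes
    FROM comments l1
) AS x
ORDER BY x.weighted_votes;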
My ultimate goal is making a query that is able to sort in the order that I want, so I can just use templating with depth information to generate the HTML without having to use PHP any more than I have to, seeing as how SQL is more efficient and meant for these sorts of things. I'm mostly worried that a query looking like that will be bad for larger sites.
The end goal is to have something as fast as Reddit: nested comments with the ability to sort by post date, vote total, and some other parameters, all using as much MySQL and as little PHP as possible. If you have any solutions to this using closure tables, nested sets, or just plain materialized paths, I'd love to hear them. The solution I have here looks optimized and scary.
Suppose I have 50 objects in an array and each object has the property cost. Also, all of the objects are in a MySQL database.
Would it be more efficient to loop through all of the objects and add up their cost members, or to make a query such as SELECT SUM(cost) FROM ...?
EDIT: A PDO database object has already been established.
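For reference, the two options being compared look roughly like this (a sketch; the table/column names items and cost are assumptions, and $pdo is the existing PDO object):

<?php
// Option 1: let MySQL do the arithmetic
$total = $pdo->query("SELECT SUM(cost) FROM items")->fetchColumn();

// Option 2: add up the already-loaded objects in PHP
$total = 0;
foreach ($objects as $object) {
    $total += $object->cost;
}
?>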
Depends on the array size, whether you have already established a database connection, etc. I'd go for the looping approach in general, since it's only 50 items. However, the ultimate answer is a benchmark you can do for yourself.
In general, it's far more efficient to have MySQL do it for you. However, if it really is just 50 objects, other factors such as network connection to the database may come into play. However, it's never a good idea to write code that assumes the application will not scale; so use MySQL :-)
SUM query. Hands-down. If it's not way more efficient than looping through an array in your app code, then you have some database tuning to do. :)
Sum query. It's going to be way more efficient than you trying to do it yourself.
In any case, they are going to be similar enough that it won't matter, which means you would want to use SUM, as it is sometimes faster and sometimes about the same. No reason to risk it by doing the calculation yourself.
I tried this with a table of 74,601 records in PHP and found that the SQL SUM query is approximately 10 times faster than looping in PHP.
Check result here
In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, I could just do a simple join to pull in the data, such as....
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
$query = mysql_query("SELECT sub_id FROM table_1");
while ($rs = mysql_fetch_assoc($query)) {
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    // blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records in different tables, and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in case a poorly executed query could lock up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report, and if they go crazy and build a big report, it could cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd throw up an update to this one and what has worked more long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive result numbers, but I thought I'd share what the result was.
I think I went a little overkill, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, are joining a value to a primary key, so they all run really, really fast. Previously, if the report had, let's say, 30 columns of data to pull and it pulled 2000 records, every single field was running a query to fetch the data (because that piece of data could be in a different field). 30 x 2000 = 60,000 queries, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record that returns runs between 0 and 4 extra queries depending on the data that's needed (it may not need any if it can fetch everything in the joins, which happens 75% of the time). So the same 2000 records now generate an additional 0-8000 queries (much better than 60,000).
I would say that the while statement is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site a while statement is more useful. In one instance I have a report where a client could request several categories to pull by and only return data for those categories. What happened was I had a category_id IN(...,...,..,.., etc etc etc) with 50-500 IDs, and the index would choke and die in my arms as I was holding it in its final moments. So what I did was spread out the IDs in groups of 10 and run the same query x/10 times, and my results were fetched way faster than before, because the index likes dealing with 10 IDs, not 500. So I saw a great improvement on those queries because of doing the while statement.
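A rough sketch of that chunking approach (names are illustrative, and $pdo is assumed to be an existing PDO connection):

<?php
$rows = array();
foreach (array_chunk($categoryIds, 10) as $chunk) {
    // build a placeholder list the same size as the chunk
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare("SELECT * FROM report_data WHERE category_id IN ($placeholders)");
    $stmt->execute($chunk);
    $rows = array_merge($rows, $stmt->fetchAll(PDO::FETCH_ASSOC));
}
?>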
If the indexes are properly used, then it is almost always more efficient to use a JOIN. I emphasize "efficient" because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design that you should trade poor performance for adhesion to arbitrary rules about what you should or should not do. The best-performing method is the best method.
It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, N rounds of I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot work if you do a while loop instead.
As stated in previous answers, check with EXPLAIN if there is something which isn't using an index when you use the JOIN. Also, you should check the memory which is given to the InnoDB cache and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database goes slower when doing the JOINs.
I would say the answer is: it depends. Normally, I'd say joins are the answer and doing multiple queries in a loop is bad practice; however, it depends entirely on what is being done.
Is that the case for you? Without detailed table structures and info on indexes, as well as use of foreign keys, etc., we can't say for sure. The best way to check is to try it and see: get their queries, EXPLAIN them, write your own, do an EXPLAIN on that, and see which is more efficient.
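For example, using the join from the question, you would just prefix each candidate query with EXPLAIN and compare the plans (rows examined, index used, temporary tables, filesorts):

EXPLAIN SELECT table_1.id, table_2.id
FROM table_1
LEFT JOIN table_2 ON table_1.sub_id = table_2.id;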
I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use hard drive access and (if not on the same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.
Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join done by the database will use more resources than setting up a new connection and performing the exact same operation (after all, you're still connecting the data in the same way as a join, even if it is done externally): if it did, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for differences between those two scenarios, as the same indices (or lack of) would apply in both cases.
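As a sketch of the split described above (table and column names are hypothetical), the wide parent row is fetched once instead of being repeated on every joined child row:

-- the parent, including its large columns, fetched exactly once
SELECT id, title, large_text FROM parents WHERE id = 42;

-- the children fetched separately, without dragging the parent data along
SELECT id, parent_id, value FROM children WHERE parent_id = 42;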
This question may seem too basic to some, but please bear with me; it's been a while since I dealt with decent database programming.
I have an algorithm that I need to program in PHP/MySQL to work on a website. It performs some computations iteratively on an array of objects (it ranks the objects based on their properties). In each iteration the algorithm runs through the whole collection a couple of times, accessing various data from different places in the collection. The algorithm needs several hundred iterations to complete. The array comes from a database.
The straightforward solution that I see is to take the results of a database query and create an object for each row of the query, put the objects to an array and pass the array to my algorithm.
However, I'm concerned about the efficiency of such a solution when I have to work with an array of several thousand items, because what I do is essentially mirror the results of a query into memory.
On the other hand, making database query a couple of times on each iteration of the algorithm also seems wrong.
So, my question is - what is the correct architectural solution for a problem like this? Is it OK to mirror the query results to memory? If not, which is the best way to work with query results in such an algorithm?
Thanks!
UPDATE: The closest problem that I can think of is ranking of search results by a search engine - I need to do something similar to that. Each result is represented as a row of a database and all results of the set are regarded when the rank is computed.
Don't forget, premature optimization is the root of all evil. Give it a shot copying everything to memory. If that uses too much memory, then optimize for memory.
Memory seems like the best way to go - iff you can scale up to meet it. Otherwise you'll have to revise your algorithm to maybe use a divide and conquer type of approach - do something like a merge sort.
It really depends on the situation at hand. It's probably rarely required to do such a thing, but it's very difficult to tell based off of the information you've given.
Try to isolate the data as much as possible. For instance, if you need to perform some independent action on the data that doesn't have data dependencies amongst iterations of the loop, you can write a query to update the affected rows rather than loading them all into memory, only to write them back.
In short, it is probably avoidable but it's hard to tell until you give us more information :)
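As a hypothetical example of pushing such an independent, per-row operation into the database rather than looping in PHP (the table, column, and decay rule are made up for illustration):

-- decay every item's score in one statement instead of fetch-modify-write per row
UPDATE items
SET score = score * 0.95
WHERE last_voted < NOW() - INTERVAL 7 DAY;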
If you are doing a query to the database, when the results come back, they are already "mirrored to memory". When you get your results using mysql_fetch_assoc (or equiv) you have your copy. Just use that as the cache.
Is the computation of one object dependent on another, or are they all independent? If they are independent, you could load just a small number of rows from the database, converting them to objects as you describe. Then run your hundreds of iterations on these, and then output the result for that block. You then proceed to the next block of items.
This keeps memory usage down, since you are only dealing with a small number of items rather than the whole data set, and avoids running multiple queries on the database.
The SQL keywords LIMIT and OFFSET can help you step through the data block by block.
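A minimal sketch of that block-by-block approach, assuming the items really are independent, a PDO connection in $pdo, and a placeholder items table:

<?php
$blockSize = 500;
for ($offset = 0; ; $offset += $blockSize) {
    $stmt = $pdo->prepare("SELECT * FROM items ORDER BY id LIMIT ? OFFSET ?");
    $stmt->bindValue(1, $blockSize, PDO::PARAM_INT);
    $stmt->bindValue(2, $offset, PDO::PARAM_INT);
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break; // no more blocks
    }
    // run the ranking iterations on this block, then output its results
}
?>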
Writing ranking queries with MySQL is possible as well; you just need to play with user-defined variables a bit. If you provide some input data and the result you want to achieve, the replies will be more detailed.
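As a hedged sketch of that user-variable technique (items and score are placeholder names):

SELECT s.*, @rownum := @rownum + 1 AS ranking
FROM (SELECT * FROM items ORDER BY score DESC) AS s,
     (SELECT @rownum := 0) AS init;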
Can you use a cron job to do your ranking, say once per day, hour, or whatever you need, and then save the item's ranking to a field in its row?
That way, when you call your rows up, you could just order them by the ranking field.
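One way that precomputation could look, run from the cron script (a sketch; the ranking column and score ordering are assumptions):

-- renumber the ranking column in score order
SET @rownum := 0;
UPDATE items
SET ranking = (@rownum := @rownum + 1)
ORDER BY score DESC;

-- page views then just do:
SELECT * FROM items ORDER BY ranking;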
I am not a professional programmer, so I can't be sure about this. How many MySQL queries do your scripts send per page, and what is your optimal query count? For example, Stack Overflow's homepage lists questions and shows the authors of those questions. Does Stack Overflow send a MySQL query for each question to get the author's information, or does it send one query to get all the user data and match it with the questions?
I like to keep mine under 8.
Seriously though, that's pretty meaningless. If hypothetically there was a reason for you to have 800 queries in a page, then you could go ahead and do it. You'll probably find that the number of queries per page will simply be dependent on what you're doing, though in normal circumstances I'd be surprised to see over 50 (though these days, it can be hard to realise just how many you're doing if you are abstracting your DB calls away).
Slow queries matter more
I used to be frustrated at a certain PHP based forum software which had 35 queries in a page and ran really slow, but that was a long time ago and I know now that the reason that particular installation ran slow had nothing to do with having 35 queries in a page. For example, only one or two of those queries took most of the time. It just had a couple of really slow queries, that were fixed by well-placed indexes.
I think that identifying and fixing slow queries should come before identifying and eliminating unnecessary queries, as it can potentially make a lot more difference.
Consider even that three fast queries might be significantly quicker than one slow query - number of queries does not necessarily relate to speed.
I have one page (which is actually kind of a test case/diagnostic tool designed to be run only by an admin) which has over 800 queries but it runs in a matter of seconds. I guess they are all really simple queries.
Try caching
There are various ways to cache parts of your application which can really cut down on the number of queries you do, without reducing functionality. Libraries like memcached make this trivially easy these days and yet run really fast. This can also help improve performance a lot more than reducing the number of queries.
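For instance, with PHP's Memcached extension, caching one query's result for a minute takes only a few lines (the key name, TTL, and query here are arbitrary, and $pdo is an existing connection):

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$rows = $cache->get('popular_tags');
if ($rows === false) { // cache miss: hit the database once
    $rows = $pdo->query("SELECT name, uses FROM tags ORDER BY uses DESC LIMIT 20")
                ->fetchAll(PDO::FETCH_ASSOC);
    $cache->set('popular_tags', $rows, 60); // keep it for 60 seconds
}
?>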
If queries are really unnecessary, and the performance really is making a difference, then remove/combine them
Just consider looking for slow queries and optimizing them, or caching their results, first.
Don't focus on the number of queries. This is not a useful metric. Instead, you need to look at a few other things:
how many queries are duplicated?
how many queries have intersecting datasets? or are a subset of another?
how long do they take to run? have you profiled the common ones to check indices?
how many are unnecessarily complex?
Numerous times I've seen three simpler queries together execute in a tenth of the time of one complex one that returned the same information. By the same token, SQL is powerful, but don't go mad trying to do something in SQL that would be easier and simpler in a loop in PHP.
how much progressive processing are you doing?
If you can't avoid longer queries with large datasets, try to re-arrange the algorithm so that you can process the dataset as it comes from the database. This lets you use an unbuffered query in MySQL, which improves your memory usage (see the sketch after this list). And if you can produce output while you're doing this, you can improve your page's perceived speed by providing first output sooner.
how much can you cache some of this data? Even caching it for a few seconds can help immensely.
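A small sketch of the unbuffered-query point from the list above, using PDO with the MySQL driver (the table name is a placeholder):

<?php
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query("SELECT * FROM big_table");
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process and emit each row as it streams in,
    // instead of buffering the whole result set in memory first
}
?>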
There really is no optimal number of queries. Obviously, the fewer queries you make, the better.
If you are using some kind of ORM like Hibernate, Propel, Doctrine, etc they will generate queries differently than if you were to write the SQL by hand. So if StackOverflow uses an ORM they might have more than one query accessing the questions and the users that created the questions. Or they might just use a join with straight SQL.
It really depends on the technology you are using and what it actually does behind the scenes to generate the SQL.
Things you should be researching to understand this better:
Lazy loading
Object Relational Mapping
I recently started refactoring some older code of mine and I realised that I had used a lot of queries inside loops because back then I didn't know how to write SQL queries with subqueries and joins, etc. So I went and integrated these nested queries into one query so I could retrieve all the data at once and then loop over it in a nested way.
In some cases this made the page load significantly faster.
Ergo: It's definitely worth learning about the possibilities of SQL so you can start doing more with SQL and less with PHP.
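A sketch of that pattern with placeholder names: one joined query, with the rows grouped by parent in PHP so they can still be looped over in a nested way ($pdo is assumed to be an existing PDO connection):

<?php
$sql = "SELECT c.name AS category, s.name AS subcategory
        FROM categories c
        LEFT JOIN subcategories s ON s.category_id = c.id
        ORDER BY c.name, s.name";
$grouped = array();
foreach ($pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $grouped[$row['category']][] = $row['subcategory'];
}
foreach ($grouped as $category => $subcategories) {
    // output the category once, then its subcategories nested underneath
}
?>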
I would not say that there is an optimal number of queries for any given script; rather, you have a goal when optimising, and ordinarily time is the main concern, among other things.
If time is the only concern, you could optimise your queries so that several of them together execute in less time than one slower query would.
This is how I view optimisation: I have an objective; how best do I achieve it? Is there any information that you can cache? Based on your indexes, would a particular order of filters in your query perform better?
My point: optimisation is best done on both the DB end and the application end.
You may want to read more on database optimisation.
As few as you need and no more. There is no rule of thumb here. Some websites require a lot of db access and others don't.
SO actually has only a few DB calls, if it's written as I think. On a page like this one, an answer to a question, there would be:
1) session verification, if you are logged in.
2) current user info, to get the user bar at the top of the screen and your medal count.
3) get the question info as well as the questioner's/last editor's info.
4) retrieve a count of tags used in this question.
5) select all responses and responder data in one shot.
And that's about it. The fun part is how much is keyed off the question:
// this returns one row per revision
select q.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from questions q left outer join users u on q.u_id = u.u_id
where q.q_id = :q_id;

// this is used to display the tags below the question and the tag counts on the right
select t.name, count(*)
from tags t left join tags q on q.tagid = t.tagid
where t.q_id = :q_id
group by t.name;

// this can also get multiple revisions
select a.*, u.name, u.u_id, u.points, u.gmedal, u.smedals, u.bmedals
from answers a left outer join users u on a.u_id = u.u_id
where a.q_id = :q_id;
This assumes that the various counts (vote-ups, favored question) are cached on the table as well as stored separately.
The optimal number is as many as you need to display the information the user expects. I always try to keep it in the single digits. For information that takes a few queries but rarely changes, I cache the results in a generic cache table so it only takes one query. Store it as a serialized array to retain an easy-to-access structure.
When I first installed WordPress, I was appalled that the base install did over 20 queries! Plugins would increase that number (some by quite a bit). But with caching, that could be reduced to zero (SuperCache). If your content changes every 10 minutes, why generate it dynamically every hit?
At the very extreme is a platform like Facebook, where every page is unique content, customized to the user viewing it. You have to query every time.
But regardless, I rarely see the need to hit double digits query counts.
0 would be optimal if you are prioritizing speed.