Optimize MySQL Query: GROUP_CONCAT with LEFT JOIN

Here is my query:
SELECT image_tree.id, image_tree.parent_id, image_tree.url,
    CONCAT(
        '[',
        GROUP_CONCAT(
            JSON_OBJECT(
                'lang', image_tree_title.lang,
                'title', image_tree_title.title
            )
        ),
        ']'
    ) AS lang
FROM `image_tree`
LEFT JOIN `image_tree_title` ON `image_tree_title`.`image_tree_id` = `image_tree`.`id`
GROUP BY `image_tree`.`url`, `image_tree`.`id`
ORDER BY url + 0 ASC
I was advised to remove GROUP_CONCAT and sort the data in PHP instead. Will GROUP_CONCAT have a significant impact on performance? What is the better approach? I think it is much better to prepare all the data in MySQL, so that no additional loops are needed in PHP.

You really are better off just querying for the fields you need and assembling them client side. Your current GROUP_CONCAT and JSON_OBJECT approach puts more work on the database server, locks the tables longer, and increases the amount of data that has to be packaged up and sent from the database to the client code. Unless MySQL has some magic compression going on within the connection or special packing for JSON objects, it is going to pass back numerous redundant copies of your column names (along with formatting) for each data value you retrieve.
I wouldn't say never use GROUP_CONCAT; it can even be used to do work that would be too expensive on particularly weak clients. But it does come with overhead: a raw INT like 10000 can basically be sent in 4 bytes, while the string 10000 takes 5 bytes in the best case. JSON adds even more overhead: GROUP_CONCAT expands smaller data types and adds a little more for separators, and JSON does that and adds redundant field-naming syntax on top.
Sorting in SQL is fine though; it's what it was built for (though I think even that is avoided when possible in extremely high-demand systems).
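As a rough sketch of what "assembling them client side" could look like: assume the query is reduced to a plain LEFT JOIN that selects id, parent_id, url, lang and title (still with ORDER BY url + 0), and that the rows arrive as associative arrays. The sample rows and the function name groupTitles are made up for illustration, not taken from the question.

```php
<?php
// Sample rows as a plain LEFT JOIN might return them (invented data).
$rows = [
    ['id' => 1, 'parent_id' => null, 'url' => '10', 'lang' => 'en', 'title' => 'Home'],
    ['id' => 1, 'parent_id' => null, 'url' => '10', 'lang' => 'de', 'title' => 'Start'],
    ['id' => 2, 'parent_id' => 1,    'url' => '20', 'lang' => null, 'title' => null],
];

// Collapse the flat rows into one entry per image with a list of titles.
function groupTitles(array $rows): array
{
    $images = [];
    foreach ($rows as $row) {
        $id = $row['id'];
        if (!isset($images[$id])) {
            $images[$id] = [
                'id'        => $id,
                'parent_id' => $row['parent_id'],
                'url'       => $row['url'],
                'lang'      => [],
            ];
        }
        // A LEFT JOIN yields NULLs when an image has no title rows.
        if ($row['lang'] !== null) {
            $images[$id]['lang'][] = ['lang' => $row['lang'], 'title' => $row['title']];
        }
    }
    // Drop the id keys; insertion order (the SQL ORDER BY) is preserved.
    return array_values($images);
}

$images = groupTitles($rows);
echo json_encode($images), PHP_EOL;
```

The loop is a single pass over the result set, so the extra PHP work stays proportional to the number of rows returned.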

Related

The maximum number of items an IN clause can handle

What are the limits on the data I can pass to a database from a programming language (like PHP)?
Suppose I have 1 million records in my database and 1 million values in hand that I want to check for existence. If I use a query like
select id from table where id in (array of 1 million data)
what will happen? Will this request even reach the database?
If it does, what are the possibilities? Will it return data faster than a million queries to the database searching for IDs, or than a full select followed by millions of loop iterations?
Just out of curiosity!

There isn't a specific number; however, the documentation says you're likely to have problems once you have "thousands" of values. From IN (Transact-SQL) - Remarks:
Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.
Error 8623:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
Error 8632:
Internal error: An expression services limit has been reached. Please look for potentially complex expressions in your query, and try to simplify them.
To quote a comment I made:
If you need to pass a large number of values to a query, I suggest a Table-Type parameter. But if you really need to pass 1M+ values then it sounds like something is wrong with your design. You may even be better off listing the values you don't want.
Edit: To add to my comment, many (including myself) prefer to use EXISTS instead of IN. So instead of a query like:
SELECT YT.*
FROM YourTable YT
WHERE YT.YourColumn IN (SELECT OT.YourColumn
                        FROM OtherTable OT)
You would have the query:
SELECT YT.*
FROM YourTable YT
WHERE EXISTS (SELECT 1
              FROM OtherTable OT
              WHERE OT.YourColumn = YT.YourColumn)
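Since the question mentions PHP, here is a minimal sketch of the "store the items in a table" workaround from the PHP side: the ID list is split into batches, and each batch becomes one parameterised INSERT into a lookup table. The table name tmp_ids, the batch size and the function name are assumptions for illustration; the statements would still have to be executed with whatever database API you use.

```php
<?php
// Build batched, parameterised INSERT statements for a (hypothetical)
// lookup table, instead of one IN (...) list with a million literals.
function buildBatchInserts(array $ids, int $batchSize = 1000, string $table = 'tmp_ids'): array
{
    $statements = [];
    foreach (array_chunk($ids, $batchSize) as $chunk) {
        // One "(?)" row placeholder per value in this batch.
        $placeholders = implode(', ', array_fill(0, count($chunk), '(?)'));
        $statements[] = [
            'sql'    => "INSERT INTO {$table} (id) VALUES {$placeholders}",
            'params' => $chunk,
        ];
    }
    return $statements;
}

$batches = buildBatchInserts(range(1, 2500));
echo count($batches), " statements\n";
```

Once the table is loaded, the existence check can be written against it with either the IN (SELECT ...) or the EXISTS form shown above.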

MySQL+PHP: How to paginate data from complex query with ORDER BY on user-selected column

I have a table with currently ~1500 rows, which is expected to grow over time (I can't say by how much). The website is read-only and lets users run complex queries through some forms; the search query is completely URL-encoded since it's a public database. It's important to know that users can select which column the data is sorted by.
I'm not concerned about adding some indexes and slowing down INSERTs and UPDATEs (performed only occasionally by admins) since the workload is basically read-heavy, but I need to paginate results: some popular queries can return 900+ results, and that takes up too much space and RAM on the client side (results are further processed to create a quite rich <div> HTML element with an <img> for each result, by the way).
I'm aware of LIMIT {$n} OFFSET {$m}, but would like to avoid it.
I'm also aware of this query:
SELECT *
FROM table
WHERE {$filters} AND id > {$last_id}
ORDER BY id ASC
LIMIT {$results_per_page}
and that's what I'd like to use, but it requires rows to be sorted only by their ID!
I've come up with (what I think is) a very similar query that sorts results by a custom column and still allows efficient pagination:
SELECT *
FROM table
WHERE {$filters} AND {$column_id} > {$last_column_id}
ORDER BY {$column} ASC
LIMIT {$results_per_page}
but that unfortunately requires a {$last_column_id} value to be passed between pages!
As I understand it, indexes (especially unique indexes) are basically automatically-updated integer-based columns that "rank" a table by the values of a column (be it integer, varchar etc.), but I really don't know how to make MySQL return the $last_column_id needed for that query to work!
The only thing I can come up with is to put an additional "XYZ_id" integer column next to every "XYZ" column users can sort results by, then update the values periodically through some scripts. Is that the only way to make it work? Please help.

(Too many comments to fit into a 'comment'.)
Is the query I/O bound? Or CPU bound? It seems like a mere 1500 rows would lead to being CPU-bound and fast enough.
What engine are you using? How much RAM? What are the settings of key_buffer_size and innodb_buffer_pool_size?
Let's see SHOW CREATE TABLE. If the table is full of big BLOBs or TEXT fields, we need to code the query to avoid fetching those bulky fields only to throw them away because of OFFSET. Hint: Fetch the LIMIT IDs, then reach back into the table to get the bulky columns.
The only way for this to be efficient:
SELECT ...
WHERE x = ...
ORDER BY y
LIMIT 100,20
is to have INDEX(x,y). But even that will still have to step over 100 cow paddies (the skipped rows).
You have implied that there are many possible WHERE and ORDER BY clauses? That would imply that adding enough indexes to cover all cases is probably impractical?
"Remembering where you left off" is much better than using OFFSET, so try to do that. That avoids the already-discussed problem with OFFSET.
Do not use WHERE (a,b) > (x,y); that construct used not to be optimized well. (Perhaps 5.7 has fixed it, but I don't know.)
My blog on OFFSET discusses your problem. (However, it may or may not help your specific case.)
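One way to "remember where you left off" on a user-selected column is to carry both the last row's sort value and its id between pages, and use the id as a tie-breaker. The sketch below only builds the SQL text; the table name items, the column whitelist, the placeholder names and the assumption that the sort column is NOT NULL are all mine, not from the question. It spells the comparison out with OR/AND rather than the (a,b) > (x,y) row constructor mentioned above.

```php
<?php
// Minimal sketch: keyset pagination on a user-selected column, with the
// primary key `id` as tie-breaker. `items` is a hypothetical table name.
function buildKeysetQuery(string $column, array $allowedColumns, string $filterSql, bool $hasCursor, int $perPage): string
{
    // Column names cannot be bound as parameters, so only accept known ones.
    if (!in_array($column, $allowedColumns, true)) {
        throw new InvalidArgumentException("Unsupported sort column: {$column}");
    }
    // $filterSql is assumed to be already-parameterised filter conditions.
    $sql = "SELECT * FROM items WHERE {$filterSql}";
    if ($hasCursor) {
        // Only rows strictly after the last row of the previous page.
        $sql .= " AND (`{$column}` > :last_value"
              . " OR (`{$column}` = :last_value_tie AND id > :last_id))";
    }
    return $sql . " ORDER BY `{$column}` ASC, id ASC LIMIT " . max(1, $perPage);
}

echo buildKeysetQuery('title', ['title', 'year'], '1=1', true, 20), PHP_EOL;
```

The last value appears under two placeholder names because some drivers do not allow the same named placeholder twice in one statement. For the first page, call it with $hasCursor set to false; for later pages, bind the sort value and id of the last row shown.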

MySQL ENUM vs INT

I have a few tables with columns that could be either ENUM or INT. I tend to always use an integer type, assuming that searches on it will be faster.
For example, one of my tables has a column StatusType which can have only 4 possible values: Completed, In Progress, Failed, Todo.
Instead of storing these as ENUM strings I store them as 1, 2, 3, 4 respectively. Then in my PHP code I have constants that define these values, like this:
define('COMPLETED', 1);
define('IN_PROGRESS', 2);
define('FAILED', 3);
define('TODO', 4);
Now my question is: am I doing it the right way, or should I change the column to ENUM and compare against strings in queries? I have many other columns that can only hold one of at most 4-5 possible values.

Enum values look really cool in MySQL, yet I am not a fan of them. Their value list is fixed in the column definition (MySQL allows at most 65,535 distinct elements), so adding more values later means altering the table. Also, as you describe, you need to synchronize the values in your application code with the values in the database -- something that seems potentially dangerous.
In addition, they make certain future changes more difficult. For instance, not all databases support enums. And, if you want to add multi-lingual support, having codes embedded in data type definitions in the database is a bit hard to deal with.
The more standard method is one or more reference tables, where you use a join to get the values. You can also use a hybrid approach: keep the reference table in the database, then load it into the application to get the mapping from numbers to strings, so you can avoid the joins in your queries.
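The hybrid approach could be sketched like this in PHP: the rows of a reference table are loaded once and turned into lookup maps, so the application no longer needs hard-coded constants. The table name status_types and the sample rows are hypothetical stand-ins for whatever the reference table actually contains.

```php
<?php
// Rows as they might come back from "SELECT id, name FROM status_types"
// (hypothetical reference table; the values mirror the question).
$statusRows = [
    ['id' => 1, 'name' => 'Completed'],
    ['id' => 2, 'name' => 'In Progress'],
    ['id' => 3, 'name' => 'Failed'],
    ['id' => 4, 'name' => 'Todo'],
];

// id => name, for displaying rows fetched without a JOIN.
$statusNames = array_column($statusRows, 'name', 'id');

// name => id, for building WHERE clauses from a readable name.
$statusIds = array_flip($statusNames);

echo $statusNames[2], PHP_EOL;
echo $statusIds['Failed'], PHP_EOL;
```

Because the map is read from the database, adding a fifth status is a single INSERT into the reference table rather than a code change.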

You are half-correct. Enum is very bad from a performance perspective; see: MySQL Enum performance advantage?
That said, binding the definitions of the INTs to your code is also not a great thing. Ideally, if you were to follow the correct Data Normalization patterns, you would define the values of the INTs in the database as well, in another table, and use the ID of the definition row as the value for the assignment.
See: http://en.wikipedia.org/wiki/Database_normalization#Normal_forms
The reason for this is so the data is portable, and useful without requiring the codebase to read it (you can easily dump a CSV for Excel by executing a join).
God Speed.
Example SQL:
SELECT students.*, states.name AS state FROM students
JOIN states ON students.state_id = states.id
Just to get state names.
Or to filter:
SELECT students.* FROM students
JOIN states ON students.state_id = states.id
WHERE states.name = 'Maine' OR states.code = 'ME'
Yeah, strange example, but the idea is that INTs are TINY, and VARCHARs are... variable. Storing 'Maine' as opposed to 16 adds up over millions of rows. Further, indexing on INT is MUCH faster than on VARCHAR, so your look-ups are going to be much faster, particularly if you know the number ahead of time and build your query without the JOIN. That is not advisable as a common practice, but it could be done if you wanted to make something even faster and can guarantee the validity of the assumed value.

Reasons not to use GROUP_CONCAT?

I just discovered this amazingly useful MySQL function GROUP_CONCAT. It appears so useful and over-simplifying for me that I'm actually afraid of using it. Mainly because it's been quite some time since I started in web-programming and I've never seen it anywhere. A sample of awesome usage would be the following
Table clients holds clients ( you don't say... ) one row per client with unique IDs.
Table currencies has 3 columns client_id, currency and amount.
Now if I wanted to get user 15's name from the clients table and his balances, with the "old" method of array overwriting I would have to use the following SQL
SELECT id, name, currency, amount
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
Then in php I would have to loop through the result set and do an array overwrite ( which I'm really not a big fan of, especially in massive result sets ) like
$result = array();
foreach($stmt->fetchAll() as $row){
    $result[$row['id']]['name'] = $row['name'];
    $result[$row['id']]['currencies'][$row['currency']] = $row['amount'];
}
However with the newly discovered function I can use this
SELECT id, name, GROUP_CONCAT(currency) AS currencies, GROUP_CONCAT(amount) AS amounts
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
GROUP BY clients.id
Then on application level things are so awesome and pretty
$results = $stmt->fetchAll();
foreach($results as $k => $v){
    $results[$k]['currencies'] = array_combine(explode(',', $v['currencies']), explode(',', $v['amounts']));
}
The question I would like to ask is are there any drawbacks to using this function in performance or anything at all, because to me it just looks like pure awesomeness, which makes me think that there must be a reason for people not to be using it quite often.
EDIT:
I want to ask, eventually, what are the other options besides array overwriting to end up with a multidimensional array from a MySQL result set, because if I'm selecting 15 columns it's a really big pain in the neck to write that beast..

Using GROUP_CONCAT() usually invokes the group-by logic and creates temporary tables, which are usually a big negative for performance. Sometimes you can add the right index to avoid the temp table in a group-by query, but not in every case.
As @MarcB points out, the default length limit of a group-concatenated string is pretty short (1024 bytes), and many people have been confused by truncated lists. You can increase the limit with group_concat_max_len.
Exploding a string into an array in PHP does not come for free. Just because you can do it in one function call in PHP doesn't mean it's the best for performance. I haven't benchmarked the difference, but I doubt you have either.
GROUP_CONCAT() is a MySQLism. It is not supported widely by other SQL products. In some cases (e.g. SQLite), they have a GROUP_CONCAT() function, but it doesn't work exactly the same as in MySQL, so this can lead to confusing bugs if you have to support multiple RDBMS back-ends. Of course, if you don't need to worry about porting, this is not an issue.
If you want to fetch multiple columns from your currencies table, then you need multiple GROUP_CONCAT() expressions. Are the lists guaranteed to be in the same order? That is, does the third field in one list correspond to the third field in the next list? The answer is no -- not unless you specify the order with an ORDER BY clause inside the GROUP_CONCAT().
I usually favor your first code format: use a conventional result set and loop over the results, saving to a new array indexed by client id and appending the currencies to an array. This is a straightforward solution, keeps the SQL simple and easier to optimize, and works better if you have multiple columns to fetch.
I'm not trying to say GROUP_CONCAT() is bad! It's really useful in many cases. But trying to make any one-size-fits-all rule to use (or to avoid) any function or language feature is simplistic.
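If you do go the GROUP_CONCAT() route, a small guard around the array_combine() step can at least catch lists that no longer line up, for example after one of them was truncated. This is only a sketch: it assumes both lists were built with the same ordering, such as GROUP_CONCAT(currency ORDER BY currency) and GROUP_CONCAT(amount ORDER BY currency), and that no value contains the separator. The function name is made up.

```php
<?php
// Combine two GROUP_CONCAT() strings into key => value pairs, refusing to
// guess when the element counts differ.
function combineConcatLists(string $keys, string $values, string $separator = ','): array
{
    $keyList   = explode($separator, $keys);
    $valueList = explode($separator, $values);
    if (count($keyList) !== count($valueList)) {
        // Typical causes: truncation by group_concat_max_len, or a
        // separator character inside one of the values.
        throw new LengthException('GROUP_CONCAT lists differ in length');
    }
    return array_combine($keyList, $valueList);
}

$balances = combineConcatLists('EUR,USD', '7.25,10.50');
echo $balances['USD'], PHP_EOL;
```

A count check cannot detect two lists that are the same length but in different orders, which is why the ORDER BY inside each GROUP_CONCAT() still matters.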

The biggest problem that I see with GROUP_CONCAT is that it is highly specific to MySQL: if you want to port your code to run against any other platform, you would have to rewrite all queries that use GROUP_CONCAT. For example, your first query is a lot more portable; you can probably run it against any major RDBMS engine without changing a single character in it.
If you are fine with working only with MySQL (say, because you are writing a tool that is meant to be specific to MySQL), the queries with GROUP_CONCAT would probably go faster, because the RDBMS would do more work for you, saving on the size of the data transfer.

Should I use a JOIN function or run several queries in a loop structure?

I have these two MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I wanted to have this multidimensional array format
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1' => value,
        'ColumnA2' => value,
        'TableB' => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1' => value,
                'ColumnB2' => value
            )
        )
    )
);
so that I can loop it in this way
foreach($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}
My question is: what is the best or proper way of querying the MySQL database to achieve this?
Solution1 --- The one I'm using
$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    // One extra query per TableA row
    $rs2 = mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
    $row['TableB'] = mysqli_fetch_all($rs2, MYSQLI_ASSOC);
    $array[] = $row;
}
I doubt my code because it is constantly querying the database.
Solution2
$rs = mysqli_query($con, "SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
// Code
}
The second solution only queries once, but what if I have thousands of rows in TableA and thousands of rows in TableB for each ColumnAId (1 TableA.ColumnAId = 1000 TableB.ColumnAId)? Would solution 2 then take more time than solution 1?

Neither of the two proposed solutions is likely to be optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query on a MyISAM table, the table is read-locked, which prevents any WRITE operations on it until the read-lock is surrendered (our SELECT query either completes or fails). On InnoDB (the default engine in current MySQL), a plain SELECT does not block writers but reads from a consistent snapshot instead. Either way, the single query sees the data as of one point in time, and other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work.
This is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
Oh, and finally, USE PREPARED STATEMENTS!
You should never have a statement that looks like this:
mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
mysqli has support for prepared statements. Read about them and use them, they will help you to avoid something terrible happening to your database.

Definitely the second way. A nested query is an ugly thing, since you pay all the query overheads (execution, network, etc.) again for every nested query, while a single JOIN query is executed once, i.e. all the overheads are paid only once.
The simple rule is not to run queries in loops, in general. There can be exceptions: if one query would be too complex, it may have to be split for performance reasons, but whether that applies in a particular case can only be shown by benchmarks and measurements.

If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed data, which means that if you just use it to store data and do the computations in your code, you're doing it wrong anyway.
In such a case you should prefer a different storage/retrieval option like a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface for InnoDB as well, but that is still using a relational database for key-value storage, aka the wrong tool for the job).
If you STILL want to use your solution:
Benchmark.
I've often found that issuing multiple queries can be faster than a single query, because MySQL has less query text to parse, the optimizer has less work to do, and more often than not the MySQL optimizer just fails (that's the reason things like STRAIGHT JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity; this only applies when update queries are mixed in, though, as neither MyISAM nor InnoDB lock the whole table for SELECT queries by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries if you use multiple queries versus a single one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrently running threads (requests) accessing and updating your tables (lock granularity is better with split-up queries, as is cacheability of the queries); if your database is on a different network, i.e. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).
Situation may also change depending on how often the for loop is actually executed.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you join on, it could be worse in either of the two. E.g. if you join on strings (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on PHP's hash map algorithm (arrays in PHP are effectively hash maps) and the likelihood of a collision, while MySQL string indexes are normally fixed length and thus, depending on your data, might not be applicable.
For memory efficiency, the multi query version is certainly better, as you have the php array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might use a temp table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN for your queries)

In some cases you should use LIMIT for best performance.
If you want to show 1000 rows together with some master data, fetch the rows in batches of 10-100 using LIMIT.
Then collect the foreign keys from each batch, make them unique, and fetch the master data in a single query using WHERE ... IN, limited to the number of unique keys.
Example:
SELECT productID, date FROM transaction_product LIMIT 100
Collect all productID values and make them unique.
Then:
SELECT productID, price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4
(4 is the count of unique IDs.)
foreach (transaction)
    look up master_product[productID]
