MySQL IN ordering DESC or random? - php

Is the ordering of the numbers in the WHERE xxx IN xxx important?
I mean, is (ordered lowest to highest):
SELECT this FROM table1 WHERE id IN (1,3,54,778,98456)
Faster than (random ordering):
SELECT this FROM table1 WHERE id IN (3,778,54,98456,1)
The id is the primary key of table1, and is int(11).

It makes no difference.
An IN with a list of numbers is evaluated like:
id IN (3,778,54,98456,1)
becomes
id = 3 OR id = 778 OR id = 54 OR id = 98456 OR id = 1
If the IN is a subquery, then indexes matter because that is a form of JOIN (a semi-join), whereas a static IN list is simply shorthand for a series of OR filters.
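To illustrate the distinction (table2 and its table1_id column are made up for the example):

-- Static list: expanded into OR comparisons against the primary key
SELECT this FROM table1 WHERE id IN (1, 3, 54, 778, 98456);

-- Subquery form: executed as a semi-join, so indexes on the joined columns matter
SELECT this FROM table1 WHERE id IN (SELECT table1_id FROM table2);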

MySQL is smart enough to figure out the "best" way to use your IN list; it may sort the values itself if that helps the optimizer, and doing so is cheap (not as cheap as doing nothing, but close, unless the list is huge). As far as the SQL standard is concerned, the order has no significance at all (it does not change the result), so you can list the values in any order.
I generally like to keep them sorted in ascending order, more for style than for performance: I find it makes the list easier to maintain.

Theoretically it may save a few cycles if you pass the values in the same order the index stores them. However, you'd have to sort them first, which takes extra cycles :-) I'd say that not using sort() is faster.

Related

MySQL+PHP: How to paginate data from complex query with ORDER BY on user-selected column

I have a table with currently ~1500 rows, which is expected to grow over time (I can't say by how much). The website is read-only and lets users run complex queries through some forms; the resulting search query is completely URL-encoded, since it's a public database. It's important to know that users can choose which column the data is sorted by.
I'm not concerned about adding indexes and slowing down INSERTs and UPDATEs (performed only occasionally by admins) since the workload is basically read-heavy, but I need to paginate results, as some popular queries can return 900+ rows and that takes up too much space and RAM client-side (each result is further processed into a fairly rich <div> HTML element with an <img>, by the way).
I'm aware of LIMIT {$n} OFFSET {$m} pagination but would like to avoid it.
I'm also aware of this approach:
Query:
SELECT *
FROM table
WHERE {$filters} AND id > {$last_id}
ORDER BY id ASC
LIMIT {$results_per_page}
and that's what I'd like to use, but that requires rows to be sorted only by their ID!
I've come up with (what I think is) a very similar query to custom sort results and allow efficient pagination.
Query:
SELECT *
FROM table
WHERE {$filters} AND {$column_id} > {$last_column_id}
ORDER BY {$column} ASC
LIMIT {$results_per_page}
but that unfortunately requires having a {$last_column_id} value to pass between pages!
I know indexes (especially unique indexes) are basically automatically-updated integer-based columns that "rank" a table by values of a column (be it integer, varchar etc.), but I really don't know how to make MySQL return the needed $last_column_id for that query to work!
The only thing I can come up with is to put an additional "XYZ_id" integer column next to every "XYZ" column users can sort results by, then update values periodically through some scripts, but is it the only way to make it work? Please help.
(Too many comments to fit into a 'comment'.)
Is the query I/O bound? Or CPU bound? It seems like a mere 1500 rows would lead to being CPU-bound and fast enough.
What engine are you using? How much RAM? What are the settings of key_buffer_size and innodb_buffer_pool_size?
Let's see SHOW CREATE TABLE. If the table is full of big BLOBs or TEXT fields, we need to code the query to avoid fetching those bulky fields only to throw them away because of OFFSET. Hint: Fetch the LIMIT IDs, then reach back into the table to get the bulky columns.
The only way for this to be efficient:
SELECT ...
WHERE x = ...
ORDER BY y
LIMIT 100,20
is to have INDEX(x,y). But even that will still have to step over 100 cow paddies.
You have implied that there are many possible WHERE and ORDER BY clauses? That would imply that adding enough indexes to cover all cases is probably impractical?
"Remembering where you left off" is much better than using OFFSET, so try to do that. That avoids the already-discussed problem with OFFSET.
Do not use WHERE (a,b) > (x,y); that construct used not to be optimized well. (Perhaps 5.7 has fixed it, but I don't know.)
My blog on OFFSET discusses your problem. (However, it may or may not help your specific case.)
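A minimal sketch of the "remember where you left off" approach for a user-chosen sort column, assuming id is unique and can serve as the tie-breaker ({$column}, {$last_value} and {$last_id} are illustrative placeholders in the question's own style):

SELECT *
FROM table
WHERE {$filters}
  AND ({$column} > {$last_value}
       OR ({$column} = {$last_value} AND id > {$last_id}))
ORDER BY {$column} ASC, id ASC
LIMIT {$results_per_page}

The last row of each page supplies {$last_value} and {$last_id} for the next request, and an index on ({$column}, id) lets MySQL seek directly to that point instead of stepping over OFFSET rows. If {$column} comes from user input, whitelist it against the known sortable columns before interpolating it into the SQL.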

MySQL Select efficient first and last row

I want to get two rows from my table in a MySQL database. These two rows must be the first one and the last one after I have ordered them. To achieve this I made two queries, these two:
SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin ASC LIMIT 1;
SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin DESC LIMIT 1;
I decided not to fetch the entire set and pick the first and last in PHP, to avoid possibly very large arrays. My problem is that I have two queries and I do not really know how efficient this is. I wanted to combine them, for example with UNION, but then I would still have to order an unsorted list twice, which I also want to avoid because the second sort does exactly the same as the first.
I would like to order once and then select the first and last value of this ordered list, but I do not know a more efficient way than the one with two queries. I know the performance benefit will not be gigantic, but the lists are growing, and as they get bigger and bigger, and I execute this part for several tables, I need the most efficient way to do this.
I found a couple of similar topics, but none of them addressed this particular performance question.
Any help is highly appreciated.
(This is both an "answer" and a rebuttal to errors in some of the comments.)
INDEX(dateTimeBegin)
will facilitate SELECT ... ORDER BY dateTimeBegin ASC LIMIT 1 and the corresponding row from the other end, using DESC.
MAX(dateTimeBegin) will find only the max value for that column; it will not directly find the rest of the columns in that row. That would require a subquery or JOIN.
INDEX(... DESC) -- The DESC is ignored by MySQL. This is almost never a drawback, since the optimizer is willing to go either direction through an index. The case where it does matter is ORDER BY x ASC, y DESC cannot use INDEX(x, y), nor INDEX(x ASC, y DESC). This is a MySQL deficiency. (Other than that, I agree with Gordon's 'answer'.)
( SELECT ... ASC )
UNION ALL
( SELECT ... DESC )
won't provide much, if any, performance advantage over two separate selects. Pick the technique that keeps your code simpler.
You are almost always better off having a single DATETIME (or TIMESTAMP) field than splitting out the DATE and/or TIME. SELECT DATE(dateTimeBegin), dateTimeBegin ... works simply, and "fast enough". See also the function DATE_FORMAT(). I recommend dropping the dateBegin column and adjusting the code accordingly. Note that shrinking the table may actually speed up the processing more than the cost of DATE(). (The diff will be infinitesimal.)
Without an index starting with dateTimeBegin, any of the techniques would be slow, and get slower as the table grows in size. (I'm pretty sure it can find both the MIN() and MAX() in only one full pass, and do it without sorting. The pair of ORDER BYs would take two full passes, plus two sorts; 5.6 may have an optimization that almost eliminates the sorts.)
If there are two rows with exactly the same min dateTimeBegin, which one you get will be unpredictable.
Your queries are fine. What you want is an index on worktime(dateTimeBegin). MySQL should be smart enough to use this index for both the ASC and DESC sorts. If you test it out, and it is not, then you'll want two indexes: worktime(dateTimeBegin asc) and worktime(dateTimeBegin desc).
Whether you run one query or two is up to you. One query (connected by UNION ALL) is slightly more efficient, because you have only one round-trip to the database. However, two might fit more easily into your code, and the difference in performance is unimportant for most purposes.
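For reference, the single-round-trip version looks something like this (a sketch; with an index on dateTimeBegin each branch is a cheap index probe rather than a sort):

(SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin ASC LIMIT 1)
UNION ALL
(SELECT dateBegin, dateTimeBegin FROM worktime ORDER BY dateTimeBegin DESC LIMIT 1);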

Count rows until value is reached

I am trying to find a way to count the number of users until the number is reached. Here's somewhat of how my table is setup.
ID Quantity
1 10
2 30
3 20
4 28
Basically, I want to order the rows by quantity from greatest to least. Then I want to count how many rows it takes to go from the highest quantity down to whatever ID you supply. So for example, if I was looking for ID #4, it would go through the quantities from greatest to least, then tell me that it is row #2, because it took only 2 rows to reach it since it contains the 2nd highest quantity.
There is another way I can code this, but I feel it is too demanding of a resource and involves PHP. I can do a loop on my database based on the greatest to least, and every time it goes through another loop, I add +1. So, that way, I could do an IF statement to determine when it reaches my value. However, when I have thousands of values it would have to go through, I feel like that would be too resource demanding.
Overall, this is a simple sort problem. Any data structure can give you the row of an item, with minor modifications in some cases.
If you are planning on using this operation multiple times, it is possible to beat the theoretical O(n log(n)) running time with an amortized O(log(n)) by maintaining a separate sorted copy of your table sorted by quantity. This reduces the problem to a binary search.
A third alternative is to maintain a virtual linked list of table entries in the new sort order. This would increase insert times into the table to O(n), but would reduce this problem to O(1).
A fourth solution would be to maintain a virtual balanced tree, however, despite the good theoretical performance, this solution is likely to be extremely hard to implement.
It might not be the answer you are expecting, but: you can't "stop" the execution of a query after you reach a certain value. MySQL always generates the full result set before you can analyse it, because in order to sort the results by Quantity, MySQL needs to have all the rows.
So if you want to do this in pure MySQL, you need to count the row numbers (as explained here: MySQL - Get row number on select) in a temporary table and then select your ID from there.
Example:
SET @rank = 0;
SELECT *
FROM (
    SELECT Id, Quantity, @rank := @rank + 1 AS `rank`
    FROM table
    ORDER BY Quantity DESC
) AS ordered_table
WHERE Id = 4;
If performance is an issue, you could probably speed this up a bit with an index on Quantity (to be tested). Otherwise the best way is to store the "rank" value in a separate table (containing only 2 columns: Id and Rank), possibly with a trigger to refresh the table on insert/update.
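Another option worth testing (a sketch, not taken from the answers above): with an index on Quantity, the rank of a single ID can be computed directly by counting how many rows have a larger quantity:

SELECT COUNT(*) + 1 AS ranking
FROM table
WHERE Quantity > (SELECT Quantity FROM table WHERE Id = 4);

For the sample data this returns 2 for Id 4. Note that rows tied on Quantity share the same rank with this formulation.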

Compare table rows, big data amount

I have a quite interesting task. But I don't know how to call it in one word in order to search for related topics. Even this topic title might not reflect what I need. So, if somebody has better title - welcome.
I'll try to explain my problem.
I have about 100,000 rows in MySQL db table. And I need to "compare" entries from the table.
"compare" doesn't mean just equal. There is an algorithm for calculation comparison level. I have weight coefficient for each table column. Means that if entry#1's column1 equals to entry#2's column2 then I give, say, 5 point to this pair. And so on for each column.
The most straightforward way to do this is to apply the calculation rules to each pair of entries. Why am I afraid of this? 100,000 entries means about 5 billion "compare" operations. Sure, I can calculate this on demand and store the result somewhere in a cache. But I believe that the most obvious way is not the most efficient.
So, my first question is: is there any better way to achieve my goal other than brute force?
My second question is about which tool is better suited for the calculations:
1. The application language is PHP: load the whole table into memory and iterate over the data.
2. Create a stored procedure in MySQL.
3. Use MongoDB's aggregation framework or MapReduce.
I like the first way least of all, and the last most of all.
I'm looking for any suggestion or advice from people who have experience with this sort of problem.
Since I don't know how to ask Google for help, any links will be appreciated.
UPDATE:
The calculation rules are a bit more complicated than I described...
The table has a set of related columns which are to be used together as a group (not one by one).
Let's assume:
the table has fields, say, tag_1, tag_2, ..., tag_n;
row_1 and row_2 are entries in the table.
The rule (pseudo-code, PHP-style; $row2_tags stands for all the tag_* values of row_2):
if ($row_1['tag_1'] == $row_2['tag_1']) {
    // same tag in the same position: 10 points
    $points += 10;
} elseif (in_array($row_1['tag_1'], $row2_tags) && $row_1['tag_1'] != $row_2['tag_1']) {
    // tag present somewhere in row_2, but not in the same position: 5 points
    $points += 5;
}
// ... and so on for tag_2 .. tag_n
Basically, I need to find the intersection of the two tag arrays. If it is not empty, points are given. If the positions of the tags in the two rows also match, additional points are given.
I'm wondering how this can be accomplished in the stored procedure language, because it can be done pretty easily in any programming language.
If a stored procedure can do this, then it is my choice.
If you have a static table, then it doesn't make a difference which you choose, so long as you store the results somewhere (presumably back in the database).
If your data is changing, then you need to compare each new row to all rows, which is essentially a full-table scan. This is probably best done in a database.
If the data fits into memory (and 500,000 rows should fit into memory), then (2) will probably be faster than (3) on equivalent hardware. "Equivalent hardware" is a very important consideration.
In most cases, I would opt for (2). It sounds like the query is something like:
select t1.id, t2.id,
       ((case when t1.col1 = t2.col1 then 5 else 0 end) +
        (case when t1.col2 = t2.col2 then 7 else 0 end) +
        . . .
       ) as score
from t t1 cross join t t2
If you are much more comfortable with map-reduce, then you might find it easier to code there. I know both languages and prefer SQL for something like this.
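As a rough illustration, the tag rule from the question's update could be folded into the same kind of cross join with CASE expressions (a sketch only; tag_1..tag_3 stand in for the real tag_* columns, and t is the placeholder table name used above):

select t1.id, t2.id,
       ((case when t1.tag_1 = t2.tag_1 then 10
              when t1.tag_1 in (t2.tag_1, t2.tag_2, t2.tag_3) then 5
              else 0 end) +
        (case when t1.tag_2 = t2.tag_2 then 10
              when t1.tag_2 in (t2.tag_1, t2.tag_2, t2.tag_3) then 5
              else 0 end)
        -- ... repeat for the remaining tag columns
       ) as score
from t t1 cross join t t2
where t1.id < t2.id;

The first WHEN of each CASE catches a match in the same position (10 points); the second catches a tag that appears anywhere else in the other row (5 points). The WHERE t1.id < t2.id clause scores each pair only once.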
Can't you do something like this:
UPDATE table SET points = points+5 WHERE column1 = column2
If you have to check for a specific value, you could try something like this:
UPDATE table SET points = points+5 WHERE column1 = 'somevalue' AND column2 = 'somevalue'

My(SQL) selecting random rows, novel way, help to evaluate how good is it?

I have again run into the problem of selecting a random subset of rows. And as many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters and orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that as many random numbers are generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query such as:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course, the simplest and easiest way to do what I want would be the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now, after some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now I'm not exactly sure how MySQL handles this, but my guess is that it first selects rows matching <some conditions> (using index[es]?), then generates a random value for each and decides whether to return or discard the row. But what do I know :) that's why I'm asking here. So, some observations about this method:
first, it does not guarantee that the requested number of rows will be returned, even if there are many more matching rows than needed; this is especially true for x values close to 1.0 (or close to 0.0 if we use <)
the returned objects don't really have a random ordering; they are just objects selected randomly, ordered by the index used or by the way they are stored (?) (of course, this might not matter at all in some cases)
we probably need to choose x according to the size of the result set, since with a large result set and x of, let's say, 0.1, it is very likely that mostly rows from the start of the set will be returned; on the other hand, with a small result set and a large x, it is likely that we get fewer objects than we want, even though there are enough of them
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, so that is 200 requests per second, requesting said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having an effect. Average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9; as expected, a larger x is slower since it has to discard more rows
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and found that it did not work out? Maybe you can better explain how MySQL handles such a query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough. The point is that I need random rows of a query, not simply of the table; that's why I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id (not that it matters much, since this is about random rows from a query, not from the table), and there will be quite a lot of such queries, so suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that different requests will have different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory, then MySQL would create a temporary table on disk and then perform a filesort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the size of the data sent back to the client.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
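If you want to verify this against your own table (a sketch, using the placeholders from the question), EXPLAIN will show whether the RAND() ordering needs a temporary table and a filesort:

EXPLAIN SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10;
-- look for "Using temporary; Using filesort" in the Extra column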
An alternative way which probably would not be faster, but who knows, it might :)
Either use a table status query to determine the next auto_increment, or the row count, or use (select count(*)). Then you can decide ahead of time the auto_increment value of a random item and then select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method in the case of zero rows returned. Best case might be a big savings, worst case would be comparable to your current method.
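A rough sketch of that idea, assuming {$rand_id} was picked in PHP between the minimum id and the next auto_increment value; the >= comparison makes it tolerant of gaps, at the cost of making rows that follow large gaps more likely to be chosen:

SELECT id FROM <table> WHERE <some conditions> AND id >= {$rand_id} ORDER BY id LIMIT 1;

Run it once per row you need (with distinct {$rand_id} values), and fall back to another method if it returns nothing.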
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n — something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id),max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id),min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts should sum to b-a-1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.
