I have about 100,000 merchants stored in my database; the table contains their Name, Lat, and Lon.
Right now my search function takes a lat/lon input and finds the stores within a 10-mile radius. However, many of the 100,000 merchants are franchises (i.e. groups of associated stores). For franchises, I'd like to show only the closest store and hide the remaining stores of the same franchise (if there are multiple locations within the 10-mile radius).
I've tried doing:
SELECT SQL_CALC_FOUND_ROWS *, xxxxx as Distance
FROM table
WHERE isActive = 1
GROUP BY Name
HAVING Distance < 10
ORDER BY Distance ASC
xxxxx is the function that calculates distance based on input lat/lon:
((ACOS(SIN('$lat'*PI()/180)*SIN(`Latitude`*PI()/180) + COS('$lat'*PI()/180)*COS(`Latitude`*PI()/180)*COS(('$long'-`Longitude`)*PI()/180))*180/PI())*60*1.1515)
However, it's not returning the correct results. I'm getting significantly fewer results, for both franchised and unfranchised stores, compared with the same query without the GROUP BY clause. What's the problem?
Also, the speed is really slow. I have the Name column indexed, but I think the GROUP BY Name is the bottleneck, since MySQL must be doing a lot of string comparisons. Assuming the GROUP BY issue can be fixed, what are my options to make this faster? Is it worthwhile to set up a Business_Group column and pre-process the stores so that franchised stores are assigned a Business_Group ID? That way GROUP BY would be faster, since it would compare ints.
Make a view that exposes the calculated column xxxxx as Distance,
and join the base table against that view.
That will be faster and better optimized.
Try grouping a derived table:
SELECT * FROM (
    SELECT *, xxxxx as Distance FROM table WHERE isActive = 1
    HAVING Distance < 10 ORDER BY Distance ASC
) AS nearest -- "all" is a reserved word in MySQL, so it cannot be used as the alias
GROUP BY Name
In a non-grouped query, HAVING applies to each row as a direct filter; with GROUP BY, it applies to whole groups, so the groups are first filled with all matching rows and only then filtered by HAVING.
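If you add the numeric Business_Group column proposed in the question, the same derived-table trick applies. A sketch, assuming a hypothetical pre-populated Business_Group id shared by all stores of a franchise (and relying on the same nonstandard MySQL grouping behavior as above):
SELECT *
FROM (
    SELECT *, xxxxx AS Distance
    FROM table
    WHERE isActive = 1
    HAVING Distance < 10
    ORDER BY Distance ASC
) AS nearest
GROUP BY Business_Group -- integer comparison instead of string comparison on Name
ORDER BY Distance ASC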
Related
Let's start by saying that I can't use INDEXING, as I need INSERT, DELETE and UPDATE on this table to be super fast, which they are.
I have a page that displays a summary of order units collected in a database table. To populate the table, an order number is created and then the individual units associated with that order are scanned in, to record which units belong to each order.
For the purposes of this example the table has the following columns.
id, UID, order, originator, receiver, datetime
The individual unit quantities can be in the thousands per order, and the entire table is growing to hundreds of thousands of units.
The summary page displays the number of units per order and the first and last unit number for each order. I limit the number of orders to be displayed to the last 30 order numbers.
For example:
Order 10 has 200 units. first UID 1510 last UID 1756
Order 11 has 300 units. first UID 1922 last UID 2831
..........
..........
Currently the response time for the query is about 3 seconds, as the code performs the following:
Look up the last 30 orders by id and sort by order number
While looking at each order number in the array
-- Count the number of database rows that have that order number
-- Select the first UID from all the rows as first
-- Select the last UID from all the rows as last
Display the result
I've determined that the majority of the time is taken by counting the number of units in each order (~1.8 seconds) and then determining the first and last numbers in each order (~1 second).
I am really interested in whether there is a way to speed up these queries without INDEXING. Here is the code with the queries.
The first request selects the last 30 orders processed, ordered by id and grouped by order number. This gives the last 30 unique order numbers.
$result = mysqli_query($con, "SELECT `order`, ANY_VALUE(receiver) AS receiver, ANY_VALUE(originator) AS originator, ANY_VALUE(id) AS id
FROM scandb
GROUP BY `order`
ORDER BY id DESC
LIMIT 30");
While fetching the last 30 order numbers, count the number of units and find the first and last UID for each order.
while ($row = mysqli_fetch_array($result)) {
    // Count the units scanned for this order
    $count = mysqli_fetch_array(mysqli_query($con, "SELECT COUNT(*) AS count FROM scandb WHERE `order` = '".$row['order']."'"));
    // First and last UID for this order
    $firstLast = mysqli_fetch_array(mysqli_query($con, "SELECT (SELECT UID FROM scandb WHERE `order` = '".$row['order']."' ORDER BY UID ASC LIMIT 1) AS first, (SELECT UID FROM scandb WHERE `order` = '".$row['order']."' ORDER BY UID DESC LIMIT 1) AS last"));
    echo "<td align=center>".$count['count']."</td>";
    echo "<td align=center>".$firstLast['first']."</td>";
    echo "<td align=center>".$firstLast['last']."</td>";
}
With 100K rows in the database, this whole page takes about 3 seconds, and the majority of the time is spent in the $count and $firstLast queries. I'd like to know if there is a more efficient way to get the same data in less time without indexing the table. Any special tricks anyone has would be greatly appreciated.
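One grouped query can usually replace all 61 round trips above (1 for the order list plus 2 per order). A sketch, assuming the highest id within each order marks the most recently processed orders; without an index it is still a full table scan, but it is a single pass:
-- counts plus first/last UID per order, for the newest 30 orders
SELECT `order`,
       ANY_VALUE(receiver)   AS receiver,
       ANY_VALUE(originator) AS originator,
       COUNT(*)              AS unit_count,
       MIN(UID)              AS first_uid,
       MAX(UID)              AS last_uid
FROM scandb
GROUP BY `order`
ORDER BY MAX(id) DESC
LIMIT 30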
Design your database with caution
This first tip may seem obvious, but the fact is that most database problems come from badly designed table structures.
For example, I have seen people storing information such as client info and payment info in the same database column. For both the database system and developers who will have to work on it, this is not a good thing.
When creating a database, always spread information across separate tables, use clear naming standards, and make use of primary keys.
Know what you should optimize
If you want to optimize a specific query, it is extremely useful to see how MySQL will execute it. Using the EXPLAIN statement, you get the query's execution plan: which indexes are used, how the tables are joined, and how many rows are examined, as shown in the example below:
EXPLAIN SELECT * FROM ref_table,other_table WHERE ref_table.key_column=other_table.column;
Don’t select what you don’t need
A very common way to get the desired data is to use the * symbol, which will get all fields from the desired table:
SELECT * FROM wp_posts;
Instead, you should definitely select only the desired fields as shown in the example below. On a very small site with, let’s say, one visitor per minute, that wouldn’t make a difference. But on a site such as Cats Who Code, it saves a lot of work for the database.
SELECT title, excerpt, author FROM wp_posts;
Avoid queries in loops
When using SQL along with a programming language such as PHP, it can be tempting to use SQL queries inside a loop. But doing so is like hammering your database with queries.
This example illustrates the whole “queries in loops” problem:
foreach ($display_order as $id => $ordinal) {
$sql = "UPDATE categories SET display_order = $ordinal WHERE id = $id";
mysql_query($sql);
}
Here is what you should do instead:
UPDATE categories
SET display_order = CASE id
WHEN 1 THEN 3
WHEN 2 THEN 4
WHEN 3 THEN 5
END
WHERE id IN (1,2,3)
Use join instead of subqueries
As a programmer, you can be tempted to use and abuse subqueries. Subqueries, as shown below, can be very useful:
SELECT a.id,
(SELECT MAX(created)
FROM posts
WHERE author_id = a.id)
AS latest_post FROM authors a
Although subqueries are useful, they can often be replaced by a join, which is usually faster to execute.
SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p
ON (a.id = p.author_id)
GROUP BY a.id
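One caveat: the INNER JOIN version drops authors who have no posts, while the subquery version returns them with a NULL latest_post. If you need those rows, a LEFT JOIN preserves the original behavior:
SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
LEFT JOIN posts p
ON (a.id = p.author_id)
GROUP BY a.id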
Source: http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
I have this table (PERSONS) with 25M rows:
ID int(10) PK
points int(6) INDEX
some other columns
I want to show the user 4 random rows that are somewhat close to each other in points. After some searching and tuning I found this query, which generates random rows impressively fast:
SELECT person_id, points
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id>= r2.id and points > 0
ORDER BY r1.person_id ASC
LIMIT 4
So I run this query from PHP, and it gives great, fast results (below 0.05 seconds when warmed up). But these rows are really just random (with at least 1 point, because of points > 0). I would like to show rows that are somewhat close in points. It doesn't have to be every time, but say I run this query with LIMIT 50, then pick a random row in PHP and the 3 rows closest to it (based on points). I would think you would need to sort the result, pick a random row, and show the rows before/after it, but I have no idea how to build this, since I am quite new to PHP.
Any suggestions? All feedback is welcome :)
Build an index on your points column (if it does not already exist), then perform your randomisation logic on that:
ALTER TABLE persons ADD INDEX (points);
SELECT person_id, points
FROM persons JOIN (
SELECT RAND() * MAX(points) AS pivot
FROM persons
WHERE points > 0
) t ON t.pivot <= points
ORDER BY points
LIMIT 4
Note that this approach will select the pivot using a uniform probability distribution over the range of points values; if points are very non-uniform, you can end up pivoting on some values a lot more often than others (thereby resulting in seemingly "non-random" outcomes).
To resolve that, you can select a random record by a more uniformly distributed column (maybe person_id?) and then use the points value of that random record as the pivot; that is, substitute the following for the subquery in the above statement:
SELECT points AS pivot
FROM persons JOIN (
SELECT FLOOR(
MIN(person_id)
+ RAND() * (MAX(person_id)-MIN(person_id))
) AS random
FROM persons
WHERE points > 0
) r ON r.random <= person_id
WHERE points > 0
ORDER BY person_id
LIMIT 1
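Putting the two pieces together, the full statement reads:
SELECT person_id, points
FROM persons JOIN (
  SELECT points AS pivot
  FROM persons JOIN (
    SELECT FLOOR(
      MIN(person_id)
      + RAND() * (MAX(person_id)-MIN(person_id))
    ) AS random
    FROM persons
    WHERE points > 0
  ) r ON r.random <= person_id
  WHERE points > 0
  ORDER BY person_id
  LIMIT 1
) t ON t.pivot <= points
ORDER BY points
LIMIT 4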
Removing the subquery will drastically improve performance and caching. You could, for example, get the list of IDs once, put it in a file, and then pick random entries from it (for example by reading random lines from the file). This improves things a lot, as you can see by running EXPLAIN on this query and comparing it with a version that loads just the data for the 4 (still random) ids.
I would suggest doing two separate SQL queries in PHP, not a join/subquery. In many cases the optimizer cannot simplify your query and has to perform each part separately. So, in your case, if you have 1,000 persons the optimizer will do the following queries in the worst case:
Get 1,000 persons rows
Do a subselect for each person, which gets 1,000 persons rows
Join 1,000 persons with the joined rows, resulting in 1,000,000 rows
Filter all of them
In short:
1,001 queries with 1,000,000 rows
My advice?
Perform two queries with NO joins or subselects, as both (especially in combination) cause dramatic performance drops in most cases:
SELECT person_id, points
FROM persons
ORDER BY RAND() LIMIT 1
Now use the found points for your second query
SELECT person_id, points, ABS(points - <POINTS FROM ABOVE>) AS distance
FROM persons
ORDER BY distance ASC LIMIT 4
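The same two-step idea can be sketched in plain SQL by carrying the anchor through a MySQL user variable (note that ORDER BY RAND() scans the whole table, which will be slow over 25M rows; any cheaper random pick, such as the pivot technique above, works for step 1 as well):
-- step 1: pick a random anchor row and remember its points value
SELECT @anchor := points FROM persons ORDER BY RAND() LIMIT 1;

-- step 2: fetch the 4 rows whose points lie closest to the anchor
SELECT person_id, points, ABS(points - @anchor) AS distance
FROM persons
ORDER BY distance ASC
LIMIT 4;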
I'm trying to get 4 random results from a table that holds approx 7 million records. Additionally, I also want to get 4 random records from the same table that are filtered by category.
Now, as you can imagine, random sorting on a table this large causes the queries to take a few seconds, which is not ideal.
One other method I thought of for the non-filtered result set is to have PHP pick some random numbers between 1 and ~7,000,000 and then do an IN (...) in the query to grab only those rows. Yes, I know this method has a caveat: you may get fewer than 4 rows if a record with one of those ids no longer exists.
However, the above method obviously will not work with the category filtering, as PHP doesn't know which record numbers belong to which category and hence cannot pick the record numbers to select from.
Are there any better ways to do this? The only way I can think of is to store the record ids for each category in another table, select random results from that, and then fetch those record ids from the main table in a second query; but I'm sure there is a better way!?
You could of course use the RAND() function in a query with a LIMIT and a WHERE (for the category). However, as you pointed out, that entails a scan of the table, which takes time, especially in your case due to the volume of data.
Your other alternative, again as you pointed out, is to store id/category_id pairs in another table. That might prove a bit faster, but there still has to be a LIMIT and WHERE on that table, which will contain the same number of records as the master table.
A different approach (if applicable) would be to have one table per category and store the IDs in it. If your categories are fixed or do not change often, you should be able to use this approach. You effectively remove the WHERE clause, and a RAND() with a LIMIT on each category table will be faster, since each category table contains only a subset of the records from your main table (see the sketch at the end of this answer).
Some other alternatives would be to use a key/value store just for this operation; MongoDB or Google App Engine can help with that and are really fast.
You could also set up MySQL master/slave replication. The slave replicates content in real time, but when you need to perform the expensive query you query the slave instead of the master, passing the load to a different machine.
Finally, you could go with Sphinx, which is a lot easier to install and maintain. You could then treat each of those category queries as a document search and let Sphinx randomize the results. This way you offload the expensive operation to a different layer and let MySQL continue with other operations.
Just some issues to consider.
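A sketch of the table-per-category idea (the names category_ids_7 and main_table are hypothetical; the small table holds only the ids belonging to one category, so the RAND() sort touches far fewer rows than the 7M-row master table):
-- hypothetical small table holding only one category's ids
SELECT m.*
FROM category_ids_7 c
JOIN main_table m ON m.id = c.id
ORDER BY RAND()
LIMIT 4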
Working off your random number approach
Get the max id in the database.
Create a temp table to store your matches.
Loop n times doing the following
Generate a random number between 1 and maxId
Get the first record with a record Id greater than the random number and insert it into your temp table
Your temp table now contains your random results (one iteration of the loop is sketched below).
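One iteration of that loop might look like this (a sketch; it assumes an integer auto-increment ID column and a hypothetical temp table random_picks):
CREATE TEMPORARY TABLE IF NOT EXISTS random_picks LIKE myTable;

-- generate one random threshold, then grab the first row at or above it
INSERT INTO random_picks
SELECT m.*
FROM myTable m
JOIN (SELECT FLOOR(1 + RAND() * (SELECT MAX(ID) FROM myTable)) AS r) x
  ON m.ID >= x.r
ORDER BY m.ID
LIMIT 1;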
Or you could dynamically generate SQL with a UNION to do the query in one step.
-- {rand1}..{rand4} are placeholders: random numbers between 1 and MAX(ID), generated by the application
(SELECT * FROM myTable WHERE ID >= {rand1} AND Category = 'zzz' ORDER BY ID LIMIT 1)
UNION
(SELECT * FROM myTable WHERE ID >= {rand2} AND Category = 'zzz' ORDER BY ID LIMIT 1)
UNION
(SELECT * FROM myTable WHERE ID >= {rand3} AND Category = 'zzz' ORDER BY ID LIMIT 1)
UNION
(SELECT * FROM myTable WHERE ID >= {rand4} AND Category = 'zzz' ORDER BY ID LIMIT 1)
Note: my SQL may not be valid, as I'm not a MySQL guy, but the theory should be sound.
First you need to get the number of rows for the category, something like this:
select count(1) from tbl where category = ?
then select a random number
$offset = rand(0, $rowsNum - 1);
and select the row at that offset (keeping the same category filter)
SELECT * FROM tbl WHERE category = ? LIMIT $offset, 1
This way you avoid missing ids. The only problem is that you need to run the second query several times to collect several rows; a UNION may help in this case.
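For example, with the offsets computed in PHP beforehand (parentheses are required around each UNION member that carries its own LIMIT; the offsets below are hypothetical values of rand(0, $rowsNum - 1)):
(SELECT * FROM tbl WHERE category = ? LIMIT 123, 1)
UNION ALL
(SELECT * FROM tbl WHERE category = ? LIMIT 4567, 1)
UNION ALL
(SELECT * FROM tbl WHERE category = ? LIMIT 89, 1)
UNION ALL
(SELECT * FROM tbl WHERE category = ? LIMIT 20481, 1)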
For MySQL you can use
RAND()
SELECT column FROM table
ORDER BY RAND()
LIMIT 4
I am taking GPS data and drawing a map, and I would like to show the maximum speed at various points. I need to keep track of the speed value at each data point and, if it decreases, mark the previous point as a local maximum.
Is this something I could do nicely in MySQL, or do I just loop through it in PHP and pick out the values? It is quite simple to do via PHP, but that means pulling out loads of data that is not really required.
One data set could be up to 20k rows give or take a few thousand.
From the graph below I would expect 4 data points back. The table structure is simple: id, long, lat, speed (not that it matters much).
EDIT:
id is a uuid, not integer :/
Assuming you have ids > 0:
SELECT id, speed FROM (
    SELECT
        IF(speed < @speed, @id, 0) AS id,              -- emit the previous id when speed starts falling
        IF(speed < @speed, @id := 0, @id := id) AS ignoreme,
        @speed := speed AS speed                       -- remember this row's speed for the next comparison
    FROM
        (SELECT @speed := 0) AS initspeed,
        (SELECT @id := 0) AS initid,
        yourtable
    WHERE ...
) AS baseview
WHERE id > 0
This compares each row's speed with the previous one: for every falling interval, the inner query emits the id and speed of the last rising row, and 0 plus the speed in all other cases. From that, the outer query selects only the rows with a positive id.
I've tried this one and it works just fine. I'm not sure it will be faster than PHP, but it most likely will be. It creates two derived tables with row numbers (it is best to sort both selects by something, to be sure you get the same order; I'm ordering by id) and joins them with the second table shifted by 1. That way A.speed - B.speed gives the difference between each value and the one after it; at the end, you just keep the records where that difference is > 0. Hope this helps.
SELECT A.speed, A.speed - B.speed AS diff
FROM
    (SELECT @rownumA := @rownumA + 1 AS rownum, speed
     FROM speed_table, (SELECT @rownumA := 0) r
     ORDER BY id) A
INNER JOIN
    (SELECT @rownumB := @rownumB + 1 AS rownum, speed
     FROM speed_table, (SELECT @rownumB := 0) r
     ORDER BY id) B ON A.rownum = B.rownum - 1
WHERE A.speed - B.speed > 0
I have the following query:
$strQuery = "SELECT siteid, SUM(watts) AS wattage, unit, device, time FROM inverter WHERE siteid = '528' AND time Between '$time1' AND '$time2' Order By device Asc";
I'm making a graph in FusionCharts and need the total watts for each device, but with the query above all values are summed and placed on just the first device. I have 40 devices and need each one to show its own total watts produced.
On the chart I am displaying device as the x-axis label and the wattage as the value.
Thanks
SUM() is an aggregate function, which should be used together with a GROUP BY clause. If the GROUP BY clause is omitted, all selected rows are aggregated as a single group.
see docs: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
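A minimal illustration of the difference, using the columns from the question:
-- without GROUP BY: every selected row is aggregated into a single row
SELECT SUM(watts) AS wattage FROM inverter WHERE siteid = '528';

-- with GROUP BY: one aggregated row per device
SELECT device, SUM(watts) AS wattage
FROM inverter
WHERE siteid = '528'
GROUP BY device;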
Look into the GROUP BY clause:
SELECT siteid, device,
       SUM(watts) AS wattage,
       MIN(time) AS start_time,
       MAX(time) AS end_time,
       unit
FROM inverter
WHERE siteid = '528'
  AND time BETWEEN '$time1' AND '$time2'
GROUP BY device, unit
ORDER BY device ASC
Also, when you use aggregation and select fields that are not contained in the GROUP BY clause (here: time), you must make sure you really get the value you want.
The aggregated row you get for each device is calculated from many rows, each possibly with its own value in those fields. When you don't use an aggregate function (MIN, MAX, AVG, ...) to determine which of the multiple values your aggregated row should contain, you will receive a value picked more or less at random by MySQL.
You should either
not select such fields when they are not necessary (e.g. time),
use an appropriate aggregate function (MIN, MAX, AVG, ...) for each field that is not contained in the GROUP BY clause, to ensure you get the right value (here I added both start and end time for the measured timespan using MIN and MAX),
or include the field in the GROUP BY clause
(I did that for "unit" in the example, because you would get a wrong result when summing over both megawatts and kilowatts; I assume that is what you use "unit" for?)
Again: otherwise the resulting value in such fields is arbitrary and not deterministic.
You can just use the naive query as proposed by @Kaii and @cairnz.
However, if there are multiple times and units for the selected devices, MySQL will just pick a unit and time from one of the rows, more or less at random.
Why your query failed
You are using an aggregate function.
Unless you specify a GROUP BY clause, the aggregate function will simply fold all rows into one aggregate.
All the columns not covered by an aggregate function will be condensed into one value (one will simply be chosen arbitrarily).
To see what data lies underneath, I suggest running:
SELECT
siteid
, SUM(watts) AS wattage
, GROUP_CONCAT(DISTINCT unit) as units
, device
, GROUP_CONCAT(time) as times
FROM inverter
WHERE siteid = '528'
AND time BETWEEN '$time1' AND '$time2'
GROUP BY device
ORDER BY device ASC
The aggregate function GROUP_CONCAT shows a comma-separated list of all (DISTINCT) values lumped together by the aggregation.
Because you are only grouping by device, different kinds of units and times will be lumped together.
The more columns you add to the GROUP BY clause, the more you "zoom in" on the data and the more rows are shown.
Why you should be careful in listing 'naked' columns in the select that are not in the group by
If you list 'naked' columns (i.e. not part of an aggregate expression) that are not in the GROUP BY clause and are not functionally dependent on the GROUP BY columns, there will be multiple different values for them, of which only one will be shown.
This behavior is specific to MySQL; most other SQL flavors issue an error and refuse to run such a query (MySQL does too once the ONLY_FULL_GROUP_BY SQL mode is enabled).
The reason MySQL allows this dangerous behavior is that
A: it allows for faster queries,
B: if you specify a unique key in the group by clause, the other fields will be functionally dependent on that key (cf ANSI SQL 2003) and specifying more fields makes no sense,
C: sometimes you don't care about the values of those other fields and displaying a random representative is what you want.
Why can't I just list all non-aggregate columns in the group by
If you alter the group by statement, this has a profound effect on the output.
The more columns you group by (up to the point where you hit a unique key for the selected columns), the more detail you get, which negates the whole point of using aggregate functions.
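For example, with the question's table, each extra grouping column splits the aggregates further:
-- coarse: one row per device; each SUM spans all of that device's rows
SELECT device, SUM(watts) AS wattage FROM inverter GROUP BY device;

-- finer: one row per (device, unit); the same totals are split across more rows
SELECT device, unit, SUM(watts) AS wattage FROM inverter GROUP BY device, unit;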
The solution
If you just want to show a random unit and time, that's cool of course, but perhaps it makes more sense to use the following query:
SELECT
siteid
, SUM(watts) AS wattage
, COUNT(*) as NumberOfUnits
, device
, MIN(time) as Mintime
, MAX(time) as Maxtime
FROM inverter
WHERE siteid = '528'
AND time BETWEEN '$time1' AND '$time2'
GROUP BY device
ORDER BY device ASC
Creative use of the aggregate functions at your disposal is the answer, see:
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
Because you're using MySQL, I especially recommend studying GROUP_CONCAT, a very versatile and useful function.
You need to tell MySQL what to group by:
SELECT
siteid,
SUM(watts) AS wattage,
unit,
device,
MIN(time) AS StartTime,
MAX(time) AS EndTime
FROM inverter
WHERE siteid = '528' AND time Between '$time1' AND '$time2'
GROUP BY siteid, unit, device
Order By device Asc
EDIT: Using MIN() and MAX() on time lets you see which timespan the SUM() covers. You should not select time by itself without specifying which time value should be shown for each unit/device combination's wattage sum.