I have a table of over 300,000 rows and I would like to render this data on a graph, but 300,000 rows isn't really necessary all at once. For example, even though there may be 100 rows of data for a given day, I don't need to display all that data if I'm showing a whole year's worth of data. So I would like to "granularize" the data.
I was thinking of getting everything and then using a script to remove what I don't need, but that seems like it would be much slower and harder on the database.
So here's what I have so far.
SET @row_number := 0;
SELECT @row_number := @row_number + 1 AS row_number,
       price, region, timestamp
FROM pricehistory;
This gives me all the rows and numbers them. I was planning on adding a WHERE clause to keep every 1000th row (i.e. every nth row), like this:
SET @row_number := 0;
SELECT @row_number := @row_number + 1 AS row_number,
       price, region, timestamp
FROM pricehistory
WHERE row_number % 1000 = 0;
But MySQL doesn't see row_number as a column for some reason. Any ideas? I've looked at other solutions online, but they don't seem to work for MySQL in particular.
As Racil's comment suggested, you can just go by an auto-incremented id field if you have one; but you've stated the amount of data for different dates could be different, so this could make for a very distorted graph. If you select every 1000th record for a year and half the rows are from the last 3 months ("holiday shopping" for a commerce example), the latter half of a year graph will actually reflect the latter quarter of the year. For more useful results you're most likely better off with something like this:
SELECT region, DATE(timestamp) AS theDate
, AVG(price), MIN(price), MAX(price)
FROM pricehistory
GROUP BY region, theDate
;
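If you do still want every Nth row rather than daily aggregates: the reason your WHERE fails is that a column alias can't be referenced in the WHERE clause of the same query level. One workaround is to wrap the numbered query in a derived table; a minimal sketch, assuming MySQL 5.x user-variable behaviour and your column names:
SET @row_number := 0;
SELECT row_number, price, region, timestamp
FROM (
    SELECT @row_number := @row_number + 1 AS row_number,
           price, region, timestamp
    FROM pricehistory
    -- rows are numbered in whatever order the inner query returns them;
    -- add an ORDER BY here if a specific ordering matters
) AS numbered
WHERE row_number % 1000 = 0;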
It doesn't look like I'm going to get another answer, so I'll go ahead and write up the solution I came up with.
My data is pretty evenly distributed, since prices are grabbed at regular intervals, so there's no reason to worry about that.
Here's my solution.
Let's say I have 500,000 rows and I want to display a subset of those rows, say 5000 rows. 500000 / 5000 is 100, so I take 100 and use it in my select statement like this: SELECT * FROM pricehistory WHERE id % 100 = 0;
Here is the actual code
public function getScaleFactor($startDate, $endDate) {
    $numPricePoints = $this->getNumPricePointsBetweenDates($startDate, $endDate);
    $scaleFactor = 1;
    if ($numPricePoints > $this->desiredNumPricePoints) {
        $scaleFactor = floor($numPricePoints / $this->desiredNumPricePoints);
    }
    return $scaleFactor;
}
I then use $scaleFactor in the SQL like this: SELECT * FROM pricehistory WHERE id % {$scaleFactor} = 0;
This isn't a perfect solution because you don't always end up with 5000 rows exactly, but I don't NEED exactly 5000 rows. I'm just trying to reduce the resolution of the data while still getting a graph that looks close to what it would be had I used all 500,000 rows.
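For completeness, here is roughly how it plugs into the fetch (a sketch; $pdo is assumed to be a PDO connection, and getNumPricePointsBetweenDates() is assumed to do a COUNT(*) between the two dates):
// Sketch of the full fetch using the scale factor from above ($pdo = assumed PDO connection).
$scaleFactor = (int) $this->getScaleFactor($startDate, $endDate);
$sql  = "SELECT id, price, region, timestamp
         FROM pricehistory
         WHERE timestamp BETWEEN :start AND :end
           AND id % {$scaleFactor} = 0";
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':start' => $startDate, ':end' => $endDate));
$points = $stmt->fetchAll(PDO::FETCH_ASSOC);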
Related
I need to run queries with several conditions which will produce a large dataset. While all the conditions are straightforward, I need advice regarding 2 issues in terms of speed optimization:
1) If I need to run those queries from 1st Apr till 20th June of each year for the last 10 years, I know of 2 options:
a. Run the query 10 times
$year = 2015;
$start_month_date = "-04-01";
$end_month_date = "-06-20";
for ($i = 0; $i < 10; $i++) {
    $start = $year . $start_month_date;
    $end   = $year . $end_month_date;
    $result = mysql_query("....... WHERE .... AND `event_date` BETWEEN '$start' AND '$end'");
    // PUSH THE RESULT TO AN ARRAY
    $year = $year - 1;
}
b. Run the query a single time; however, the query will compare by DayOfYear (hence each date has to be converted to DayOfYear by the query)
$start = Date("z", strtotime("2015-04-01")) + 1;
$end = Date("z", strtotime("2015-06-20")) + 1;
$result = mysql_query("....... WHERE .... AND DAYOFYEAR(`event_date`) BETWEEN $start AND $end");
I am aware of the 1-day difference in day count between leap years and other years, but I can live with that. My sense is that 1.b is more optimized; I just want to verify.
2) I have a large query with 2 subqueries. When I want to limit the result by date, should I put the conditions inside or outside the subqueries?
a. Inside the subqueries, which means the condition has to be evaluated twice
SELECT X.a, X.b, Y.c FROM
(SELECT * FROM mytable WHERE `event_date` BETWEEN '$startdate' AND '$enddate' AND `case` = 'AAA' AND .......) X
JOIN
(SELECT * FROM mytable WHERE `event_date` BETWEEN '$startdate' AND '$enddate' AND `case` = 'BBB' AND .......) Y
WHERE X.`event_date` = Y.`event_date` AND ........... ORDER BY X.`event_date`
b. Outside the subqueries, which means the condition is evaluated once, but a larger dataset has to be joined (for which I need to set SQL_BIG_SELECTS = 1)
SELECT X.a, X.b, Y.c FROM
(SELECT * FROM mytable WHERE `case` = 'AAA' AND .......) X
JOIN
(SELECT * FROM mytable WHERE `case` = 'BBB' AND .......) Y
WHERE X.`event_date` = Y.`event_date` AND X.`event_date` BETWEEN '$startdate' AND '$enddate' AND ........... ORDER BY X.`event_date`
Again, in my opinion 2.a is more optimized, but I'm requesting your advice.
Thanks
(1) Running the queries 10 times with event_date BETWEEN $start AND $end will be faster when the SQL engine can take advantage of an index on event_date. This could be significant, but it depends on the rest of the query.
Also, because you are ordering the entire data set, running 10 queries is likely to be a bit faster. That's because sorting is O(n log(n)), meaning it takes disproportionately longer to sort larger data sets. As an example, if sorting 100 rows takes X time units, then sorting 1,000 rows takes roughly 15 * X time units (10 * log(1000) / log(100) = 15), whereas sorting 100 rows 10 times takes just 10 * X (this is purely for explanatory purposes).
(2) Don't use subqueries if you can avoid them in MySQL. The subqueries are materialized, which adds additional overhead. Plus, they then prevent the use of indexes. If you need to use subqueries, filter the data as much as possible in the subquery. This reduces the data that needs to be stored.
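For example, for the query in part 2, a sketch of what skipping the derived tables entirely could look like (the ....... placeholders stand for the other conditions from the question, and event_date is assumed to be the join key):
SELECT x.a, x.b, y.c
FROM mytable AS x
JOIN mytable AS y
  ON  y.`event_date` = x.`event_date`
 AND  y.`case` = 'BBB'
WHERE x.`case` = 'AAA'
  AND x.`event_date` BETWEEN '$startdate' AND '$enddate'
  -- plus the ....... conditions from the original query
ORDER BY x.`event_date`;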
I assume you have lots of rows over 10 years, otherwise this wouldn't be much of an issue.
Now the best bet is to run EXPLAIN on the different queries you plan to use; that will probably tell you which index each can use, as currently we don't know them (you didn't post the structure of the table).
1.b uses a function in the WHERE clause, so it will perform terribly: it won't be able to use an index on the date (assuming there is one), and will read the entire table.
One thing that you could do is ask the database to join the result sets of the 10 queries together using UNION. MySQL would combine the results instead of PHP... (see https://dev.mysql.com/doc/refman/5.0/en/union.html)
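A sketch of the shape that could take, using UNION ALL since the per-year date ranges cannot overlap (the ....... placeholders stand for the rest of your query):
SELECT ....... FROM ....... WHERE ....... AND `event_date` BETWEEN '2015-04-01' AND '2015-06-20'
UNION ALL
SELECT ....... FROM ....... WHERE ....... AND `event_date` BETWEEN '2014-04-01' AND '2014-06-20'
UNION ALL
-- ... one SELECT per remaining year, down to ...
SELECT ....... FROM ....... WHERE ....... AND `event_date` BETWEEN '2006-04-01' AND '2006-06-20';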
2 - As Gordon said, filter data as much as possible. However, instead of trying options blindly, you can use EXPLAIN and the database will help you decide which one makes the most sense.
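Running EXPLAIN is just a matter of prefixing the candidate query; for instance, a sketch against a simplified form of option 2.a (your real conditions go where the placeholders were):
EXPLAIN
SELECT * FROM mytable
WHERE `event_date` BETWEEN '2015-04-01' AND '2015-06-20'
  AND `case` = 'AAA';
-- the "key" and "rows" columns of the output show which index (if any) is used
-- and roughly how many rows will be examined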
I have a SQL query against a MySQL table with millions of records. It executes in phpMyAdmin in around 2 seconds, but when run from a PHP script it never finishes.
select
    concat(p1.`Date`, " ", p1.`Time`) as har_date_from,
    concat(p2.`Date`, " ", p2.`Time`) as har_date_to,
    (select concat(p3.`Date`, " ", p3.`Time`)
     from power_logger p3
     where p3.slno between 1851219 and 2042099
       and p3.meter_id = "logger1"
       and str_to_date(concat(p3.`Date`, " ", p3.`Time`), "%d/%m/%Y %H:%i:%s") >=
           str_to_date(concat(p1.`Date`, " ", p1.`Time`), "%d/%m/%Y %H:%i:%s")
     order by p3.slno limit 1) as cur_date_from,
    (select concat(p4.`Date`, " ", p4.`Time`)
     from power_logger p4
     where p4.slno between 1851219 and 2042099
       and p4.meter_id = "logger1"
       and str_to_date(concat(p4.`Date`, " ", p4.`Time`), "%d/%m/%Y %H:%i:%s") >=
           str_to_date(concat(p2.`Date`, " ", p2.`Time`), "%d/%m/%Y %H:%i:%s")
     order by p4.slno limit 1) as cur_date_to,
    p1.THD_A_N_Avg - p2.THD_A_N_Avg as thd_diff
from power_logger p1
join power_logger p2
  on p2.slno = p1.slno + 1
 and p1.meter_id = "fluke1"
 and p2.meter_id = p1.meter_id
 and p1.slno between 2058609 and 2062310
 and p1.THD_A_N_Avg - p2.THD_A_N_Avg >= 2.0000
php script:
$query=/*The query above passed as string*/
$mysql=mysql_connect('localhost','username','pwd') or die(mysql_error());
mysql_select_db('dbname',$mysql);
$rows=mysql_query($query,$mysql) or die(mysql_error());
There are no issues with MySQL connectivity and related setup, as I run a lot of other queries successfully. I have indexes on meter_id and on (Date, Time) together. slno is the auto-increment column.
I know similar questions have been asked (I found a lot in my research), but none of them really helped me. Thanks in advance to anybody who can help me find a solution.
Query description: this queries the power_logger table containing millions of records; THD_A_N_Avg, meter_id, slno, Date and Time are among its columns. It selects the date and time from two consecutive rows within a range of slnos where the difference between their THD_A_N_Avg values is greater than or equal to 2. Once those dates are fetched, it also has to fetch, within a different range of slnos, the date and time closest to the ones fetched earlier, thus forming har_date_from, har_date_to, cur_date_from and cur_date_to.
What messes up here is the nested select.
Usually PHPMyAdmin automatically adds "LIMIT 0, 30" at the end of the query, so you only load 30 rows at once. In your code you are trying to load everything at once, that's why it's taking so long.
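A quick way to check whether this is the cause is to add the same limit yourself when running from PHP; a sketch using the mysql_* calls from your script:
// If this comes back quickly, the query itself is fine and the time is being
// spent fetching/transferring the full result set.
$rows = mysql_query($query . " LIMIT 0, 30", $mysql) or die(mysql_error());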
I have this table:
person_id int(10) pk
fid bigint(20) unique
points int(6) index
birthday date index
4 FK columns int(6)
ENGINE = MyISAM
Important info: the table contains over 8 million rows and is growing fast (1.5M a day at the moment)
What I want: to select 4 random rows in a certain range when I order the table on points
How I do it now: in PHP I randomize a certain range; let's say this gives me 20% as the low range and 30% as the high range. Next I COUNT(*) the number of rows in the table. Then I determine the lowest row number: table count / 100 * low range, and the same for the high range. After that I calculate a random row using rand(lowest_row, highest_row), which gives me a row number within the range. And at last I select the random row by doing:
SELECT * FROM `persons` WHERE points > 0 ORDER BY points desc LIMIT $random_offset, 1;
The points > 0 is in the query since I only want randoms with at least 1 point.
The above query takes about 1.5 seconds to run, but since I need 4 rows it takes over 6 seconds, which is too slow for me. I figured the ORDER BY points takes the most time, so I was thinking about making a VIEW of the table, but I really have no experience with views, so what do you think? Is a view a good option, or are there better solutions?
ADDED:
I forgot to say that it is important that all rows have the same chance of being selected.
Thanks, I appreciate all the help! :)
Kevin
Your query is slow, and will keep getting slower as the table grows, because using LIMIT with a large offset here forces a full sort and then a scan past all the skipped rows to get the result. Instead you should do this on the PHP end of things as well (this kind of 'abuse' of LIMIT is actually a reason it's non-standard SQL; MSSQL and Oracle, for example, do not support it).
First ensure there's an index on points. This will make select max(points), min(points) from persons a query that'll return instantly. Next you can determine from those 2 results the points range, and use rand() to determine 4 points in the requested range. Then repeat for each result:
SELECT * FROM persons WHERE points < $myValue ORDER BY points DESC LIMIT 1
Since it only has to retrieve one row, and can determine which one via the index, this will execute in milliseconds as well.
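Put together in PHP, this could look roughly like the following (a sketch; mysqli is assumed, and points <= is used so the minimum value can still match a row):
// Sketch: pick 4 pseudo-random rows via the points index (mysqli connection $db assumed).
$res   = mysqli_query($db, "SELECT MIN(points) AS mn, MAX(points) AS mx FROM persons WHERE points > 0");
$range = mysqli_fetch_assoc($res);

$picked = array();
for ($i = 0; $i < 4; $i++) {
    $target = mt_rand((int)$range['mn'], (int)$range['mx']); // random point value in the range
    $sql    = "SELECT * FROM persons WHERE points <= $target ORDER BY points DESC LIMIT 1";
    $picked[] = mysqli_fetch_object(mysqli_query($db, $sql));
}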
Views aren't going to do anything to help your performance here. My suggestion would be to simply run:
SELECT * FROM `persons` WHERE points BETWEEN ? AND ?
Make sure you have an index on points. Also, you SHOULD replace * with only the fields you are concerned about, if applicable. Here, of course, ? represents the lower and upper bounds for your search.
You can then determine the number of rows returned in the result set using mysqli_num_rows() (or similar based on your DB library of choice).
You now have the total number of rows that meet your criteria. You can easily then calculate 4 random numbers within the range of results and use mysqli_data_seek() or similar to go directly to the record at the random offset and get the values you want from it.
Putting it all together:
$result = mysqli_query($db_conn, $sql); // here $sql is your SQL query
$num_records = 4; // your number of records to return
$num_rows = mysqli_num_rows($result);
$rows = array();
for ($i = 0; $i < $num_records; $i++) {
    $random_offset = rand(0, $num_rows - 1);
    mysqli_data_seek($result, $random_offset);
    $rows[] = mysqli_fetch_object($result);
}
mysqli_free_result($result);
In my MySQL database I have a table (PERSONS) with over 10 million rows; the two important columns are:
ID
POINTS
I would like to know the rank of the person with ID = randomid
I want to return to the person his "rank", which depends on his points. But his rank will not be the exact row number, but more like a percentage layer. Like: "You are in the top 5%" or "You are in the layer 10% - 15%".
Of course I could query the table and convert the row number to the layer % by dividing it by the total number of rows. But my question is: would it be faster (with 10M+ rows) to just grab several rows with LIMIT X, 1, where X is a row at percentage 100, 95, 90, 85, ... of the table? Next step: check if the points of this row are lower than the current person's points; if yes, grab the next layer % row; if not, return the previous layer's row.
In the persons table there are 9 columns with 2 bigints, 4 varchars 150, 1 date and 2 booleans.
Of course I would prefer to get the exact row rank, but from what I tested, this is slow and takes at least several seconds; with my way it can be done in a few hundredths of a second.
Also, the way I suggested is not precise when there are several layers with the same points, but it doesn't need to be that precise, so we can neglect that fact.
Extra info: I program in PHP, so if there is a specific solution for this in PHP + MySQL, that would be nice too.
Lastly, it's worth mentioning that the table grows by 20k rows an hour (almost 500k a day).
I appreciate all the help.
You could try this. I first count the number of rows with more points, and then add one to that, in case there are a number of rows with the same number of points. So if there are 10 rows with the same number of points, they all have the same rank as the first one in that group.
SELECT SUM(CASE WHEN points > (SELECT POINTS FROM YOUR_TABLE WHERE ID = randomid) THEN 1 ELSE 0 END) + 1 as Rank,
(SUM(CASE WHEN points > (SELECT POINTS FROM YOUR_TABLE WHERE ID = randomid) THEN 1 ELSE 0 END) + 1) / COUNT(*) as Pct
FROM YOUR_TABLE
If that is slow, I would run two queries. First get that ID's points and then plug that into a second query to determine the rank/pct.
SELECT POINTS
FROM YOUR_TABLE
WHERE ID = randomid
Then compute the rank and pct, plugging in the points from above.
SELECT SUM(CASE WHEN points > POINTS THEN 1 ELSE 0 END) + 1 as Rank,
(SUM(CASE WHEN points > POINTS THEN 1 ELSE 0 END) + 1) / COUNT(*) as Pct
FROM YOUR_TABLE
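In PHP, the two-query version could be glued together roughly like this (a sketch; PDO is assumed, with persons standing in for YOUR_TABLE):
// Sketch: look up the person's points, then compute rank and percentile ($pdo = assumed PDO connection).
$stmt = $pdo->prepare("SELECT points FROM persons WHERE id = ?");
$stmt->execute(array($randomId));
$points = $stmt->fetchColumn();

$stmt = $pdo->prepare(
    "SELECT SUM(CASE WHEN points > ? THEN 1 ELSE 0 END) + 1 AS ranking,
            (SUM(CASE WHEN points > ? THEN 1 ELSE 0 END) + 1) / COUNT(*) AS pct
     FROM persons"
);
$stmt->execute(array($points, $points));
$row = $stmt->fetch(PDO::FETCH_ASSOC); // e.g. $row['ranking'] = 12345, $row['pct'] = 0.0512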
Is there an 'easy way' to grab two rows of data from a table, and add rows with values 'in-between'?
I want to grab a latitude, a longitude and a timestamp from each row, compare the timestamp to the one from the previous row, and interpolate new rows if the gap is bigger than my minimum... e.g. grab two rows 1 minute apart and add rows for every 10 seconds...
Is using a stored procedure the best way to go about this? Easiest?
Currently using mySql and PHP...
I would just grab the data and do the math in PHP. SQL isn't all that versatile, and you'd be saving yourself a headache.
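For example, a sketch of the PHP math (column names lat, lng and stamp are assumed; $a and $b are two consecutive rows already fetched, with $b the later one):
// Linear interpolation between two points, adding one row every 10 seconds.
$step = 10; // seconds
$t0   = strtotime($a['stamp']);
$gap  = strtotime($b['stamp']) - $t0;

$interpolated = array();
for ($t = $step; $t < $gap; $t += $step) {
    $frac = $t / $gap;
    $interpolated[] = array(
        'stamp' => date('Y-m-d H:i:s', $t0 + $t),
        'lat'   => $a['lat'] + ($b['lat'] - $a['lat']) * $frac,
        'lng'   => $a['lng'] + ($b['lng'] - $a['lng']) * $frac,
    );
}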
EDIT: Actually, just for the fun of it, you could make the math easier by left-joining to a calendar table.
First you need a table ints with the values 0-9. Then you can do something like:
SELECT cal.t, lat, lng FROM (
    SELECT {start_time} + INTERVAL (t.i*1000 + u.i*100 + v.i*10) SECOND AS t
    FROM ints AS t
    JOIN ints AS u
    JOIN ints AS v
) AS cal
LEFT JOIN locations ON (cal.t = locations.stamp)
WHERE cal.t <= {end_time}
This would return a table with NULL values for lat and lng where there isn't an entry on the 10-second mark, so you could iterate through and do the math for just those. Keep in mind, this only works if all the datapoints you do have (other than the start and end) land right on a 10-second mark.
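For reference, a sketch of the ints helper table referred to above (name and layout assumed):
CREATE TABLE ints (i TINYINT NOT NULL PRIMARY KEY);
INSERT INTO ints (i) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);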