Query optimization: 1800 queries -> 50

I'm querying a postgresql database which holds an agenda-table:
agenda |> id (int) | start (timestamp) | end (timestamp) | facname | .....
I want to make a kind of summary of one day in the form of a 'timeline' consisting of a small picture for every 15-minute interval: on / off according to the availability of the facility.
Now it is relatively simple to query the database for every 15-minute interval, check whether a reservation is present, and change the img source accordingly.
But if you want an overview of 10 days and 5 different facilities, you end up querying the database
10 (days) * 36 (quarters a day) * 5 (facilities) = 1800 database queries per page load.
This results in a very heavy load.
Is there a way I can reduce the number of queries, and so the load?

To solve this, we first need a way to find, given a timestamp, which quarter of an hour it belongs to. For instance, 08:38 belongs to quarter 08:30, 08:51 to 08:45, and so on.
To do that, we can use a function like this:
CREATE FUNCTION date_trunc_quarter(timestamp)
RETURNS TIMESTAMP
LANGUAGE SQL
IMMUTABLE
AS $$
    SELECT * FROM
    generate_series(
        date_trunc('hour', $1),
        date_trunc('hour', $1) + interval '1 hour',
        interval '15 min'
    ) AS gen(quarter)
    WHERE gen.quarter <= $1
    ORDER BY gen.quarter DESC
    LIMIT 1
$$;
It uses the generate_series function to generate the quarter-hour boundaries (e.g. 08:00, 08:15, 08:30, 08:45 and 09:00) around the given timestamp (e.g. 08:38); to get the start of the hour it uses the well-known date_trunc function. It then keeps only the boundaries that are not later than the given timestamp, sorts them in descending order and takes the first, i.e. the latest one. The comparison is inclusive (<=) so that a timestamp falling exactly on a boundary (e.g. 08:30:00) is assigned to its own quarter. As at most four values remain after filtering, sorting them is not a big issue.
Now, with that you can easily query like this:
SELECT date_trunc_quarter(tstart) AS quarter, count(*)
FROM agenda
GROUP BY quarter
ORDER BY quarter;
I think it is fast enough, and to make it even faster, you can create an expression index on agenda:
CREATE INDEX idx_agenda_quarter ON agenda ((date_trunc_quarter(tstart)));
See this fiddle with a self-contained test case of it all.
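An alternative that needs no per-slot queries at all is to fetch every reservation for the displayed period once and mark the slots in application code. Below is a minimal sketch of that idea, written in Python for brevity (the same loops port directly to PHP). The names floor_to_quarter and build_timeline, and the (facname, start, end) row shape, are assumptions for illustration and not part of the original answer.

```python
from datetime import datetime, timedelta

QUARTER = timedelta(minutes=15)


def floor_to_quarter(ts):
    """Round a datetime down to the start of its 15-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def build_timeline(reservations, day_start, slots):
    """Return {facname: [bool, ...]} with True where a slot is occupied.

    reservations: iterable of (facname, start, end) rows, assumed to come
    from a single query covering the whole period being displayed.
    """
    period_end = day_start + slots * QUARTER
    timeline = {}
    for facname, start, end in reservations:
        row = timeline.setdefault(facname, [False] * slots)
        # Start at the slot containing the reservation start, clipped to the period.
        slot = max(floor_to_quarter(start), day_start)
        while slot < min(end, period_end):
            row[(slot - day_start) // QUARTER] = True
            slot += QUARTER
    return timeline
```

A facility with no reservations does not appear in the result, so the rendering code would treat a missing key as "all slots free".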

Related

Getting temperature difference between intervals

My question is more "theoretical" than practical. In other words, I'm not really looking for particular code for how to do something, but rather for advice about how to do it. I've been thinking about it for some time but cannot come up with a feasible solution.
So basically, I have a MySQL database that saves weather information from my weather station.
Column one contains date and time of measurement (Datetime format field), then there is a whole range of various columns like temp, humidity etc. The one I am interested in now is the one with the temperature. The data is sorted by date and time ascending, meaning the most recent value is always inserted to the end.
Now, what I want to do is use a PHP script to connect to the db, find temperature changes within a certain interval, and then find the maximum. For example, let's say I choose an interval of 3 h. Then I would like to find the time, across all the values, at which there was the most significant temperature change within those 3 h (or 5 h, 1 day etc.).
The problem is that I don't really know how to do this. If I just get the values from the db, I'm getting them one by one, but I can't think of a way of getting the value that is, let's say, 3 h before the current one. With that it would be easy: just subtract them and take the date from the datetime field at that time. But how do I get values that are 3 h apart? It cannot simply be a fixed number of rows back, because the data is not saved at regular intervals (they range between 5 and 10 min), so 3 h in the past could be a varying number of rows.
Any ideas how this could be done?
Thanks a lot

Not terribly hard actually. I would assume it's a two-column table with time and temp fields, where time is a DATETIME field. The maximum temperature within a given 3-hour window is then:
SELECT MAX(temp) FROM records
WHERE time >= '2013-10-14 12:00:00' AND time <= '2013-10-14 15:00:00'

SELECT t1.*, ABS(t1.temperature - t2.temperature) AS temp_change
FROM tablename t1
JOIN tablename t2
    ON t2.timecolumn <= (t1.timecolumn - INTERVAL 3 HOUR)
LEFT JOIN tablename t3
    ON t3.timecolumn <= (t1.timecolumn - INTERVAL 3 HOUR)
    AND t3.timecolumn > t2.timecolumn
WHERE
    t3.some_non_nullable_column IS NULL
ORDER BY ABS(t1.temperature - t2.temperature) DESC
LIMIT 1;
The table is joined to itself twice. t2 is the closest record at least 3 h before t1: the LEFT JOIN looks for a t3 that is also at least 3 h before t1 but later than t2, and the IS NULL test keeps only the rows where no such t3 exists (change itself is a reserved word in MySQL, hence the alias temp_change). With the proper indexes and a limited amount of data (where limited is in the eye of the beholder) this can be quite performant. However, if you need a lot of these queries on a big dataset, this is a prime candidate for denormalization, where you create a table which also stores the calculated offsets compared to the previous entry.
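If the rows are pulled into the script instead, the same "closest reading at least 3 h earlier" lookup can be done with a binary search over the sorted timestamps, which copes with the irregular 5-10 minute spacing. This is only a sketch in Python under the assumption that the readings are already sorted by time; largest_change and its argument layout are hypothetical names, not something from the question.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def largest_change(readings, interval):
    """readings: list of (time, temperature) sorted by time, irregularly spaced.

    Returns (time, change) for the reading whose temperature differs most
    from the closest reading taken at least `interval` earlier, or None if
    no reading has such a predecessor.
    """
    times = [t for t, _ in readings]
    best = None
    for when, temp in readings:
        # Index of the last reading at or before (when - interval).
        i = bisect_right(times, when - interval) - 1
        if i < 0:
            continue  # nothing is old enough to compare with yet
        change = abs(temp - readings[i][1])
        if best is None or change > best[1]:
            best = (when, change)
    return best
```

Calling it with timedelta(hours=3), timedelta(hours=5) or timedelta(days=1) covers the different intervals mentioned in the question.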

Best way to limit the show count of particular object for an hour?

I'm working on a web application written in PHP. I have some objects (represented as rows) in a MySQL table, and I need to show them randomly during the day.
How can I limit the show count of a particular object, e.g. to not more than 10 times per hour?
By the show count I mean how many times the object was rendered.
For example, there are 100 images and with each page view 5 random ones are shown. I need to even out the distribution of image shows by limiting each image's show count per hour, to prevent 1000 shows for one image and 3 for another.
Hope that explanation is useful.

Probably the simplest way to do it would be to add a field called last_shown to your table and then exclude it from the candidate list if it's been shown within the hour. eg something along these lines:
SELECT id FROM my_objects WHERE last_shown IS NULL OR last_shown < DATE_SUB(NOW(), INTERVAL 1 HOUR) ORDER BY RAND() LIMIT 1
Then when you display that actual object, timestamp the column, ie:
UPDATE my_objects SET last_shown = NOW() WHERE id = <the_id_you_displayed>
This approach is simpler, but just as effective. If you reduced the timeframe to once every 6 minutes, it would effectively be similar logic to '10 times within the hour', and not require an entire new reference table.

You could create a log table with id and date_displayed.
Each time you select rows at random, you make sure to select only rows which were displayed fewer than 10 times in the last hour.
SELECT * FROM table
WHERE id NOT IN (
SELECT id FROM log
WHERE date_displayed > now() - interval 1 hour
GROUP BY id HAVING COUNT(*) >= 10
)
ORDER BY rand()
Also, after one hour you no longer need older inserts, so you might want to do a DELETE query to remove old records.
DELETE FROM log WHERE date_displayed < now() - interval 1 hour
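The log-table logic can be sketched outside the database as follows. This is an in-memory illustration in Python, not a replacement for the queries: the class name ShowLimiter, its parameters and the use of plain numeric timestamps in seconds are all assumptions made for the example.

```python
import random
from collections import defaultdict, deque


class ShowLimiter:
    """Keeps a per-object log of display times, like the log table above."""

    def __init__(self, max_shows=10, window=3600):
        self.max_shows = max_shows
        self.window = window            # length of the window in seconds
        self.log = defaultdict(deque)   # object id -> display timestamps

    def pick(self, ids, count, now, rng=random):
        """Pick up to `count` random ids that are still under the limit."""
        eligible = []
        for obj_id in ids:
            shown = self.log[obj_id]
            # Drop entries that fell out of the window (the DELETE step).
            while shown and shown[0] <= now - self.window:
                shown.popleft()
            if len(shown) < self.max_shows:
                eligible.append(obj_id)
        chosen = rng.sample(eligible, min(count, len(eligible)))
        for obj_id in chosen:
            self.log[obj_id].append(now)  # the INSERT into the log
        return chosen
```

In a real multi-request PHP application the log has to live in shared storage (the table itself, or a cache), so this only shows the bookkeeping, not the persistence.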

PHP/MYSQL datetime ranges overlapping for users

Please, I need help with this (for better understanding please see the attached image) because I am completely stuck.
As you can see, I have users and they store their starting and ending datetimes in my DB as YYYY-mm-dd H:i:s. Now I need to find the time ranges in which the most users overlap. I would like to get the 3 datetime ranges shared by the most users. How can I do it?
I have no idea which MySQL query I should use, or maybe it would be better to select all datetimes (start and end) from the database and process them in PHP (but how?). As shown in the image, the result should be, for example, the time 8.30 - 10.00 for users A+B+C+D.
Table structure:
UserID | Start datetime | End datetime
--------------------------------------
A | 2012-04-03 4:00:00 | 2012-04-03 10:00:00
A | 2012-04-03 16:00:00 | 2012-04-03 20:00:00
B | 2012-04-03 8:30:00 | 2012-04-03 14:00:00
B | 2012-04-06 21:30:00 | 2012-04-06 23:00:00
C | 2012-04-03 12:00:00 | 2012-04-03 13:00:00
D | 2012-04-01 01:00:01 | 2012-04-05 12:00:59
E | 2012-04-03 8:30:00 | 2012-04-03 11:00:00
E | 2012-04-03 21:00:00 | 2012-04-03 23:00:00

What you effectively have is a collection of sets and want to determine if any of them have non-zero intersections. This is the exact question one asks when trying to find all the ancestors of a node in a nested set.
We can prove that for every overlap, at least one time window will have a start time that falls within all other overlapping time windows. Using this tidbit, we don't need to actually construct artificial timeslots in the day. Simply take a start time and see if it intersects any of the other time windows and then just count up the number of intersections.
So what's the query?
/*SELECT*/
SELECT DISTINCT
MAX(overlapping_windows.start_time) AS overlap_start_time,
MIN(overlapping_windows.end_time) AS overlap_end_time ,
(COUNT(overlapping_windows.id) - 1) AS num_overlaps
FROM user_times AS windows
INNER JOIN user_times AS overlapping_windows
ON windows.start_time BETWEEN overlapping_windows.start_time AND overlapping_windows.end_time
GROUP BY windows.id
ORDER BY num_overlaps DESC;
Depending on your table size and how often you plan on running this query, it might be worthwhile to put a spatial index on it (see below).
UPDATE
If you're running this query often, you'll need to use a spatial index. Because of the range-based traversal (i.e. does start_time fall within the range of start/end), a BTREE index will not do anything for you. IT HAS TO BE SPATIAL.
ALTER TABLE user_times ADD COLUMN time_window GEOMETRY NOT NULL DEFAULT 0;
UPDATE user_times SET time_window = GeomFromText(CONCAT('LineString( -1 ', start_time, ', 1 ', end_time, ')'));
CREATE SPATIAL INDEX time_window ON user_times (time_window);
Then you can update the ON clause in the above query to read
ON MBRWithin( Point(0,windows.start_time), overlapping_windows.time_window )
This will get you an indexed traversal for the query. Again, only do this if you're planning on running the query often.
Credit for the spatial index to Quassnoi's blog.
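For readers who want to see the start-time idea without SQL, here is a rough Python sketch of the same check: every window's start time is used as a probe, and all windows containing that probe form one overlap group. The function name busiest_overlaps and the (user, start, end) tuple layout are assumptions for the example; it is quadratic in the number of rows, which is fine for small tables only.

```python
from datetime import datetime


def busiest_overlaps(windows, top=3):
    """windows: list of (user, start, end).

    For each window's start time, collect every window that contains it;
    the shared range is [latest start, earliest end] of that group.
    Returns up to `top` tuples (user_count, start, end, users), busiest first.
    """
    results = set()
    for _, probe, _ in windows:
        group = [(u, s, e) for u, s, e in windows if s <= probe <= e]
        users = tuple(sorted({u for u, _, _ in group}))
        start = max(s for _, s, _ in group)
        end = min(e for _, _, e in group)
        results.add((len(users), start, end, users))
    # Most users first; ties are ordered by start time.
    return sorted(results, key=lambda r: (-r[0], r[1]))[:top]
```

Unlike the query, this counts distinct users rather than rows, so two windows of the same user inside one group are counted once.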

Something like this should get you started -
SELECT slots.time_slot, COUNT(DISTINCT user_bookings.user_id) AS num_users, GROUP_CONCAT(DISTINCT user_bookings.user_id ORDER BY user_bookings.user_id) AS user_list
FROM (
SELECT CURRENT_DATE + INTERVAL ((id-1)*30) MINUTE AS time_slot
FROM dummy
WHERE id BETWEEN 1 AND 48
) AS slots
LEFT JOIN user_bookings
ON slots.time_slot BETWEEN `user_bookings`.`start` AND `user_bookings`.`end`
GROUP BY slots.time_slot
ORDER BY num_users DESC
The idea is to create a derived table that consists of time slots for the day. In this example I have used dummy (which can be any table with an auto-increment id that is contiguous for the required set) to create a list of time slots by adding 30 minutes incrementally. The result of this is then joined to the bookings to be able to count the users booked in each time slot.
UPDATE For entire date/time range you could use a query like this to get the other data required -
SELECT MIN(`start`) AS `min_start`, MAX(`end`) AS `max_end`, DATEDIFF(MAX(`end`), MIN(`start`)) + 1 AS `num_days`
FROM user_bookings
These values can then be substituted into the original query or the two can be combined -
SELECT slots.time_slot, COUNT(DISTINCT user_bookings.user_id) AS num_users, GROUP_CONCAT(DISTINCT user_bookings.user_id ORDER BY user_bookings.user_id) AS user_list
FROM (
SELECT DATE(tmp.min_start) + INTERVAL ((id-1)*30) MINUTE AS time_slot
FROM dummy
INNER JOIN (
SELECT MIN(`start`) AS `min_start`, MAX(`end`) AS `max_end`, DATEDIFF(MAX(`end`), MIN(`start`)) + 1 AS `num_days`
FROM user_bookings
) AS tmp
WHERE dummy.id BETWEEN 1 AND (48 * tmp.num_days)
) AS slots
LEFT JOIN user_bookings
ON slots.time_slot BETWEEN `user_bookings`.`start` AND `user_bookings`.`end`
GROUP BY slots.time_slot
ORDER BY num_users DESC
EDIT I have added DISTINCT and ORDER BY clauses in the GROUP_CONCAT() in response to your last query.
Please note that you will need a much greater range of ids in the dummy table. I have not tested this query, so it may have syntax errors.

I would not do much in SQL, this is so much simpler in a programming language, SQL is not made for something like this.
Of course, it's just sensible to break the day down into "timeslots" - this is statistics. But as soon as you start handling dates over the 00:00 border, things start to get icky when you use joins and inner selects. Especially with MySQL which does not quite like inner selects.
Here's a possible SQL query
SELECT count(*) FROM `times`
WHERE
    ( DATEDIFF(`End`,`Start`) = 0 AND
      TIME(`Start`) < TIME('$SLOT_HIGH') AND
      TIME(`End`) > TIME('$SLOT_LOW') )
OR
    ( DATEDIFF(`End`,`Start`) > 0 AND
      ( TIME(`Start`) < TIME('$SLOT_HIGH') OR
        TIME(`End`) > TIME('$SLOT_LOW') ) )
Here's some pseudo code
granularity = 30*60; // 30 minutes
numslots = 24*60*60 / granularity;
stats = CreateArray(numslots);
for i=0, i < numslots, i++ do
stats[i] = GetCountFromSQL(i*granularity, (i+1)*granularity); // low, high
end
Yes, that makes numslots queries, but no joins no nothing, hence it should be quite fast. Also you can easily change the resolution.
And another positive thing is, you could "ask yourself", "I have two possible timeslots, and I need the one where more people are here, which one should I use?" and just run the query twice with respective ranges and you are not stuck with predefined time slots.
To only find full overlaps (an entry only counts if it covers the full slot) you have to switch low and high ranges in the query.
You might have noticed that I do not add times between entries that could span multiple days, however, adding a whole day, will just increase all slots by one, making that quite useless.
You could however add them by selecting sum(DAY(End) - DAY(Start)) and just add the return value to all slots.

Table seems pretty simple. I would keep your SQL query pretty simple:
SELECT * FROM tablename
Then, once you have the info saved in your PHP object, do the processing with PHP using loops and comparisons.
In simplest form:
/* $query is the mysqli_result returned by running the SELECT above */
for ($x = 0, $numrows = mysqli_num_rows($query); $x < $numrows; $x++) {
    /* Grab a row */
    $row = mysqli_fetch_assoc($query);
    /* Store userID, START, END */
    $userID = $row['userID'];
    $start = $row['START'];
    $end = $row['END'];
    /* Have an array for each user in which you store start and end times */
    if ($userID === 'A') {
        /* Store info in array_a */
    } else if ($userID === 'B') {
        /* etc...... */
    }
}
/* Now you have an array for each user with their start/stop times */
/* Do your loops and comparisons to find common time slots. */
/* Also, use strtotime() to switch date/time entries into comparable values */
Of course this is in very basic form. You'll probably want to do one loop through the array to first get all of the userIDs before you compare them in the loop shown above.
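One way to do the "loops and comparisons" step is a sweep over the sorted start and end times, keeping track of which users are currently inside a window. The following Python sketch is an assumption about how that could look, not the answerer's code; shared_ranges is a hypothetical name, and start/end are taken to be comparable values such as the integers returned by strtotime().

```python
from collections import Counter


def shared_ranges(windows):
    """windows: list of (user, start, end) with comparable start/end values.

    Returns (start, end, users) segments between consecutive boundaries,
    ordered with the segment shared by the most users first.
    """
    events = []
    for user, start, end in windows:
        events.append((start, 1, user))
        events.append((end, -1, user))
    events.sort(key=lambda e: e[0])

    active = Counter()   # user -> number of currently open windows
    segments = []
    previous = None
    for when, delta, user in events:
        # Close the segment that ran from the previous boundary to this one.
        if previous is not None and when > previous and active:
            segments.append((previous, when, sorted(active)))
        active[user] += delta
        if active[user] == 0:
            del active[user]
        previous = when
    return sorted(segments, key=lambda s: -len(s[2]))
```

Taking the first three segments of the result gives the three ranges shared by the most users, and each segment already carries its list of user IDs.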

Is it good or bad practice to alter start dates in a database to the next occurrence of an event?

I am trying to create an event calendar which, whilst initially quite small, could turn out to be quite large. To that end, and to future-proof it as much as possible, all events that occur in the past will be deleted from the database. However, is it bad practice to alter the start date of recurring events once they have happened to indicate when the next event will start? This makes it easier to perform search queries because theoretically no events will start more than, say, a week in the past, depending on how often the database is updated.
Is there a better way to do this?
My current intention is to have a table listing the event details along with a column for whether it is a yearly, monthly, weekly or daily recurrence. When somebody then searches for events between 2 dates, I simply look at each row and check if (EVENT START <= SEARCH FINISH && EVENT FINISH >= SEARCH START). This then gets all the possible events, and the recurring ones then need to be checked to see if they occur during the time period given. This is where I come a little unstuck, as to how to achieve this specifically. My thoughts are as follows:
Yearly: if EVENT START + 1 YEAR <= SEARCH FINISH || EVENT FINISH + 1 Year >= SEARCH START; repeat for +2 YEARS etc until EVENT START + NO YEARS > SEARCH FINISH.
Monthly: As above but + 1 month each time.
Weekly: As above but EVENT START and EVENT FINISH will be plus 7 DAYS BETWEEN RECURRENCE each iteration until EVENT START + 7 DAYS REPEATED > SEARCH FINISH.
Daily: As above but NO OF DAYS DIFFERENCE instead of 7 days for a week. This could be used to specify things like every 14 days (fortnight), every 10 days. Even every week could use this method.
However, when I think about the query that would have to be built to achieve this, I cannot help think that it will be very cumbersome and probably slow. Is there a better way to achieve the results I want? I have still not found a way to do things like occurs on the first Monday of a month or the last Friday of a month, or the second Saturday of April each year. Are these latter options even possible?
-- Edit: added below:
It might help a bit if I explain a bit more about what I am creating. That way guidance can be given with respect to that.
I am creating a website which allows organisations to add events, whether they are a one-off or recurring (daily, weekly, monthly, first Tuesday of a month etc.). The user of the site will then be able to search for events within a chosen distance (arbitrary 10, 25, 50, 100 miles, all of country) on a set date or between 2 given dates which could be from 1 day apart up to a couple of years apart (obviously events that far into the future will be minimal or non-existent depending on the dates used).
The EVENTS table itself currently holds a lot of information about the event, such as location, cost, age group etc. Would it be better to have this in a separate table which is looked up once it has been determined if the event is within the specified search parameters? Clearly not all of this information is needed until the detailed page view, maybe just a name, location, cost and brief description.
I appreciate there are many ways to skin a cat but I am unsure how to skin this one. The biggest thing I am struggling with is how to structure my data so that a query will know if the recurrence is within the specified date. Also, given that the mathematics to calculate distance between 2 lat/longs is relatively complex, I need to be able to build this calculation into my query, otherwise I will be doing the calculation in PHP anyway. Granted, there will be fewer results to process this way, but it still needs to be done.
Any further advice is greatly appreciated.

Creating events for each recurrence is unnecessary. It is much better to store the details that define how the event recurs. This question has been answered many times on SO.
One way to do this is to use a structure like this -
tblEvent
--------
id
name
description
date
tblEventRecurring
-----------------
event_id
date_part
end_date
Then you could use a query like this to retrieve events -
SELECT *
FROM `tblEvent`
LEFT JOIN `tblEventRecurring`
ON `tblEvent`.`id` = `tblEventRecurring`.`event_id`
WHERE (`tblEvent`.`date` = CURRENT_DATE AND `tblEventRecurring`.`event_id` IS NULL)
OR (
CURRENT_DATE BETWEEN `tblEvent`.`date` AND `tblEventRecurring`.`end_date`
AND (
(`tblEventRecurring`.`date_part` = 'D') OR
(`tblEventRecurring`.`date_part` = 'W' AND DAYOFWEEK(`tblEvent`.`date`) = DAYOFWEEK(CURRENT_DATE)) OR
(`tblEventRecurring`.`date_part` = 'M' AND DAYOFMONTH(`tblEvent`.`date`) = DAYOFMONTH(CURRENT_DATE))
)
)
UPDATE Added the following example of returning events for a given date range.
When returning dates for a given date range you can join the above query to a table representing the date range -
SET @start_date = '2012-03-26';
SET @end_date = '2012-04-01';
SELECT *
FROM (
    SELECT @start_date + INTERVAL num DAY AS `date`
    FROM dummy
    WHERE num < (DATEDIFF(@end_date, @start_date) + 1)
) AS `date_list`
INNER JOIN (
    SELECT `tblEvent`.`id`, `tblEvent`.`date`, `tblEvent`.`name`, `tblEventRecurring`.`date_part`, `tblEventRecurring`.`end_date`
    FROM `tblEvent`
    LEFT JOIN `tblEventRecurring`
        ON `tblEvent`.`id` = `tblEventRecurring`.`event_id`
    WHERE `tblEvent`.`date` BETWEEN @start_date AND @end_date
        OR (`tblEvent`.`date` < @end_date AND `tblEventRecurring`.`end_date` > @start_date)
) AS `events`
    ON `events`.`date` = `date_list`.`date`
    OR (
        `date_list`.`date` BETWEEN `events`.`date` AND `events`.`end_date`
        AND (
            (`events`.`date_part` = 'D') OR
            (`events`.`date_part` = 'W' AND DAYOFWEEK(`events`.`date`) = DAYOFWEEK(`date_list`.`date`)) OR
            (`events`.`date_part` = 'M' AND DAYOFMONTH(`events`.`date`) = DAYOFMONTH(`date_list`.`date`))
        )
    )
WHERE `date_list`.`date` BETWEEN @start_date AND @end_date
ORDER BY `date_list`.`date`;
You can replace the SQL variables with PHP vars if you would prefer. To display days without any events you can change the INNER JOIN between the two derived tables, date_list and events, to a LEFT JOIN.
The table dummy consists of a single column with numbers from 0 to whatever you anticipate needing. This example creates the dummy table with enough data to cover one month. You could easily populate it using an INSERT... SELECT... on the auto-increment primary key of another table -
CREATE TABLE `dummy` (
`num` SMALLINT UNSIGNED NOT NULL PRIMARY KEY
);
INSERT INTO `dummy` VALUES
(00), (01), (02), (03), (04), (05), (06), (07), (08), (09),
(10), (11), (12), (13), (14), (15), (16), (17), (18), (19),
(20), (21), (22), (23), (24), (25), (26), (27), (28), (29),
(30), (31);
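To make the recurrence rules in the join conditions easier to follow, here is the same D / W / M test written as plain functions. It is a Python sketch for illustration only; occurs_on and occurrences are hypothetical helper names, and a date_part of None stands for an event with no row in tblEventRecurring.

```python
from datetime import date, timedelta


def occurs_on(event_date, date_part, end_date, day):
    """Mirror of the join condition above for one event and one calendar day.

    date_part is None for a one-off event, or 'D', 'W' or 'M'.
    """
    if date_part is None:
        return day == event_date
    if not (event_date <= day <= end_date):
        return False
    if date_part == 'D':
        return True
    if date_part == 'W':
        return day.weekday() == event_date.weekday()
    if date_part == 'M':
        return day.day == event_date.day
    return False


def occurrences(event_date, date_part, end_date, range_start, range_end):
    """List every day in [range_start, range_end] on which the event occurs."""
    days = (range_end - range_start).days + 1
    candidates = (range_start + timedelta(days=n) for n in range(days))
    return [d for d in candidates
            if occurs_on(event_date, date_part, end_date, d)]
```

The generator over range_start + n days plays the role of the date_list derived table built from dummy.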

Break it up.
Have one table for events that haven't happened yet, with a recurring event ID. That way you can just put one-offs in there with a recurring event ID of null. Get rid of / archive past ones etc.
Have another for the data about recurring events.
When an event marked as recurring happens, go back to the recurring table, check to see if it's still enabled (you might want to add a range to them, i.e. do this every week for three months), and if all is okay, add a new record for the next time it occurs.
That's one way to do it anyway, and it gets rid of the problem of using event start for two different things, which is why your code is getting complicated.
If you want future jobs from this, i.e. everything that needs doing in the next month, then it would be a union query: one to get all the "current jobs", unioned with one to get all the jobs that will recur in the next month.
Can't stress this enough: get the data design right and the code "just happens". If your data is messed up, as in one field "start date" serving two different needs, then every time you go near it you have to deal with that dual use. Forget it once and you get anything from a painful mess to a disaster.
Adding a Recurring_Start_Date column would be better than your current plan, wouldn't it? You wouldn't be asking this question, because your data would fit your needs.

I assume you'll be searching through events much more frequently than you will be creating new ones. During event creation, I would create records for each occurrence of the event up to some reasonable amount of time (maybe for the next year or two).
It would also make things like "the third Thursday of each month" a little easier. If you tried to do any of the calculations in a query it would be difficult and probably slow.
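Rules like "the third Thursday of each month" are straightforward to compute at creation time. A minimal sketch in Python, using only the standard calendar module; the helper name nth_weekday and its n=-1 convention for "last" are assumptions for the example.

```python
import calendar
from datetime import date


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0) in a month; n=-1 means last.

    Raises IndexError if the month has no n-th such weekday.
    """
    days = [d for d in calendar.Calendar().itermonthdates(year, month)
            if d.month == month and d.weekday() == weekday]
    return days[n - 1] if n > 0 else days[n]


def third_thursdays(year):
    """One occurrence per month, ready to be inserted as individual rows."""
    return [nth_weekday(year, m, calendar.THURSDAY, 3) for m in range(1, 13)]
```

The same helper covers "first Monday of a month" (n=1) and "last Friday of a month" (n=-1) from the question.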

Best way to query calendar events?

I'm creating a calendar that displays a timetable of events for a month. Each day has several parameters that determine if more events can be scheduled for this day (how many staff are available, how many times are available etc).
My database is set up using three tables:
Regular Schedule - this is used to create an array for each day of the week that outlines how many staff are available, what hours they are available etc
Schedule Variations - If there are variations for a date, this overrides the information from the regular schedule array.
Events - Existing events, referenced by the date.
At this stage, the code loops through the days in the month and checks two to three things for each day.
Are there any variations in the schedule (public holiday, shorter hours etc)?
What hours/number of staff are available for this day?
(If staff are available) How many events have already been scheduled for this day?
Step 1 and step 3 require a database query - assuming 30 days a month, that's 60 queries per page view.
I'm worried about how this could scale, for a few users I don't imagine that it would be much of a problem, but if 20 people try and load the page at the same time, then it jumps to 1200 queries...
Any ideas or suggestions on how to do this more efficiently would be greatly appreciated!
Thanks!

I can't think of a good reason you'd need to limit each query to one day. Surely you can just select all the values between a pair of dates.
Similarly, you could use a join to get the number of events scheduled for a given day.
Then do the loop (for each day) on the array returned by the database query.

Create a table:
t_month (day INT)
INSERT
INTO t_month
VALUES
(1),
(2),
...
(31)
Then query:
SELECT *
FROM t_month, t_schedule
WHERE schedule_date = '2009-03-01' + INTERVAL (t_month.day - 1) DAY
AND schedule_date < '2009-03-01' + INTERVAL 1 MONTH
AND ...
Instead of 30 queries you get just one with a JOIN.
Other RDBMSs allow you to generate rowsets on the fly, but MySQL doesn't (at least before version 8.0, which added recursive common table expressions).
You can, though, replace t_month with an ugly
SELECT 1 AS day
UNION ALL
SELECT 2
UNION ALL
...
SELECT 31

I faced the same sort of issue with http://rosterus.com and we just load most of the data into arrays at the top of the page, and then query the array for the relevant data. Pages loaded 10x faster after that.
So run one or two wide queries that gather all the data you need, choose appropriate keys and store each result into an array. Then access the array instead of the database. PHP is very flexible with array indexing: you can use all sorts of things as keys, or several indexes.
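As a rough illustration of "query the array instead of the database", the sketch below indexes two pre-fetched result sets by date and then loops over the days of the month without any further queries. It is written in Python; the function names and the (day, payload) row shape are assumptions, and in PHP the dictionaries would simply be arrays keyed by the date string.

```python
from collections import defaultdict
from datetime import date, timedelta


def index_by_day(rows):
    """rows: iterable of (day, payload) pairs from one wide query."""
    by_day = defaultdict(list)
    for day, payload in rows:
        by_day[day].append(payload)
    return by_day


def month_overview(first_day, num_days, variations, events):
    """Combine two pre-fetched result sets without touching the database again.

    Returns one (day, variation or None, number of events) tuple per day.
    """
    variation_for = dict(variations)   # day -> variation, at most one per day
    events_on = index_by_day(events)
    overview = []
    for n in range(num_days):
        day = first_day + timedelta(days=n)
        overview.append(
            (day, variation_for.get(day), len(events_on.get(day, [])))
        )
    return overview
```

With this shape the page needs two queries per month (variations and events) regardless of how many days are displayed.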
