MySQL/PHP: Search rows by datetime RANGE given overlaps in the table

I have a table that shows me when people are available to work, like the following:
+------+---------------------+---------------------+
| name | start | end |
+------+---------------------+---------------------+
| Odin | 2015-07-01 11:00:00 | 2015-07-01 11:30:00 |
| Thor | 2015-07-01 11:00:00 | 2015-07-01 11:30:00 |
| Odin | 2015-07-01 11:20:00 | 2015-07-01 12:45:00 |
| Odin | 2015-07-01 12:30:00 | 2015-07-01 15:30:00 |
| Thor | 2015-07-01 15:00:00 | 2015-07-01 17:00:00 |
+------+---------------------+---------------------+
I'd like to check if a specific person is available to work in a given range. For example, I want to have a PHP function that returns the names of people available to work in a given range, like so: canWork($start, $end)
The important part is handling the overlaps, especially since the table could be very, very large. For example, if I called canWork('2015-07-01 11:10:00', '2015-07-01 15:30:00') I would expect to get Odin back, given that the 1st, 3rd and 4th rows of the table together cover that range.
Is there an easy way to do this with MySQL? Or PHP?

Try to avoid looping over data in this sort of large-data situation. In a similar exercise, SQL was able to deliver in seconds what took hours in code. Taking a smart look at the data pays off.
The smart step here is: you can reduce the number of possible matches by checking the SUM of the time. The length of the requested range must be less than or equal to the SUM of the durations of the matching records.
However, since a record's start time can be earlier than the starttime you are looking for, and its end time can be later than the endtime you are looking for, you first have to find the end time closest to endtime and the start time closest to starttime.
(end is a keyword in MySQL, so the column name is quoted in backticks below; endtime and starttime are the variables for the schedule check.)
Start time per user (last possible):
SELECT name,MAX(start) AS MAX_start
FROM scheduleTable
WHERE start<=starttime
GROUP BY name;
End time per user (first possible)
SELECT name,MIN(`end`) AS MIN_end
FROM scheduleTable
WHERE `end`>=endtime
GROUP BY name;
Joining these together gives a subset of possible users, which can then be filtered further:
SELECT a.name, MAX_start,MIN_end
FROM
(SELECT name,MIN(`end`) AS MIN_end
FROM scheduleTable
WHERE `end`>=endtime
GROUP BY name) a
INNER JOIN
(SELECT name,MAX(start) AS MAX_start
FROM scheduleTable
WHERE start<=starttime
GROUP BY name) b ON a.name=b.name;
This will give you a schedule with a valid end as close as possible to the endtime indicated for scheduling purpose but at least equal to the indicated endtime.
Applying the fact that all the time frames together must at least be equal to the endtime-starttime:
SELECT st.name
FROM scheduleTable st
INNER JOIN (
SELECT a.name, MAX_start AS start,MIN_end AS `end`
FROM
(SELECT name,MIN(`end`) AS MIN_end
FROM scheduleTable
WHERE `end`>=endtime
GROUP BY name) a
INNER JOIN
(SELECT name,MAX(start) AS MAX_start
FROM scheduleTable
WHERE start<=starttime
GROUP BY name) b ON a.name=b.name
) et ON st.name=et.name
WHERE st.start>=et.start AND st.`end`<=et.`end`
GROUP BY st.name
HAVING SUM(st.`end`-st.start)>=(endtime-starttime);
You might have to manipulate the start and end time to unix time or use mysql date time functions for the calculations.
There still might be gaps: Those need a second check. For this use the group_concat to get some data we can pass as 1 call into a function. The function results in 0 for: no gaps found, 1 for gaps found:
SELECT a.name
FROM (
SELECT st.name,
GROUP_CONCAT(start ORDER BY start ASC SEPARATOR ',') starttimelist,
GROUP_CONCAT(`end` ORDER BY `end` ASC SEPARATOR ',') endtimelist
FROM scheduleTable st
INNER JOIN (
SELECT a.name, MAX_start AS start,MIN_end AS `end`
FROM
(SELECT name,MIN(`end`) AS MIN_end
FROM scheduleTable
WHERE `end`>=endtime
GROUP BY name) a
INNER JOIN
(SELECT name,MAX(start) AS MAX_start
FROM scheduleTable
WHERE start<=starttime
GROUP BY name) b ON a.name=b.name
) et ON st.name=et.name
WHERE st.start>=et.start AND st.`end`<=et.`end`
GROUP BY st.name
HAVING SUM(st.`end`-st.start)>=(endtime-starttime)
) a
WHERE gapCheck(starttimelist,endtimelist)=0;
WARNING: Do not add DISTINCT to the GROUP_CONCAT: the start/endtimelist will have different lengths and the gapCheck function will fail.
The function gapCheck:
In this function the first start time and the last end time can be ignored: the first start time is at or before starttime and the last end time is at or after endtime. So no boundary checks are needed; only the transitions between records have to be checked for gaps.
CREATE FUNCTION gapCheck(starttimeList VARCHAR(200),endtimeList VARCHAR(200))
RETURNS INT DETERMINISTIC
BEGIN
DECLARE helperTimeStart,helperTimeEnd,prevHelperTimeEnd DATETIME;
DECLARE c,splitIndex,gap INT;
SET c=0;
SET gap=0;
WHILE(c=0) DO
SET splitIndex=INSTR(starttimeList,',');
IF(splitIndex>0) THEN
SET helperTimeStart=SUBSTRING(starttimeList,1,splitIndex-1);
SET starttimeList=SUBSTRING(starttimeList,splitIndex+1); /* String for the next iteration, without the comma */
SET splitIndex=INSTR(endtimeList,',');
SET helperTimeEnd=SUBSTRING(endtimeList,1,splitIndex-1);
SET endtimeList=SUBSTRING(endtimeList,splitIndex+1);
ELSE
SET helperTimeStart=starttimeList; /* End of list reached */
SET helperTimeEnd=endtimeList; /* end can be set too: Lists are of same length */
SET c=1;
END IF;
/* prevHelperTimeEnd is NULL on the first record, so the comparison is not true and the check is skipped: on the first record we can not check anything */
/* If previous end time < current start time: We have a gap */
IF prevHelperTimeEnd<helperTimeStart THEN
SET gap=1;
END IF;
/* save the end time for the next loop */
SET prevHelperTimeEnd=helperTimeEnd;
END WHILE;
RETURN gap;
END;

I think the shortest way to do this would be to
1) First merge all timelines for the same person with overlaps.
For example, row 1 and row 3 would be merged to change the end time of row 1 to '2015-07-01 12:45:00' (and row 3 would be deleted or marked used), and then row 1 and row 4 would be merged to again change the end time of row 1 to '2015-07-01 15:30:00'.
2) Once you have a table of non-overlapping timelines, this is a simple problem of finding rows where start <= $start and end >= $end.
For 1) I would prefer executing this process in PHP by first copying the whole table in a data structure
$a = array();
// After a "select all" query, loop over the result rows
// ($rows is assumed to hold them as associative arrays):
foreach ($rows as $row) {
    $a[$row['name']][$row['start']] = $row['end'];
}
And then removing all overlaps from that data structure:
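foreach ($a as $currName => &$timeArray) {
    ksort($timeArray); // sort each person's ranges by start time
    removeOverlaps($timeArray);
}
unset($timeArray); // break the reference left over from the loop

// Merges overlapping ranges in place; expects start => end, sorted by start.
function removeOverlaps(array &$timeArray) {
    $allKeys = array_keys($timeArray);
    $arrLength = count($allKeys);
    for ($i = 0; $i < $arrLength; ++$i) {
        $start = $allKeys[$i];
        if (array_key_exists($start, $timeArray)) {
            $end = $timeArray[$start];
            for ($j = $i + 1; $j < $arrLength; ++$j) {
                $newStart = $allKeys[$j];
                if ($newStart > $end) {
                    break; // sorted by start, so no later range can overlap
                }
                // Overlap: extend the current range and drop the merged one.
                $end = max($end, $timeArray[$newStart]);
                $timeArray[$start] = $end;
                unset($timeArray[$newStart]);
            }
        }
    }
}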
Then continue with 2).
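The two steps above (merge each person's overlapping ranges, then look for a single merged range that covers the request) can also be sketched outside PHP. The following Python version is only an illustration: the names merge_intervals and can_work are made up for this sketch, the rows are the sample table from the question, and it relies on 'YYYY-MM-DD HH:MM:SS' strings comparing correctly as plain text.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs; result is sorted by start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def can_work(rows, start, end):
    """Names whose merged availability contains the whole range [start, end]."""
    by_name = defaultdict(list)
    for name, s, e in rows:
        by_name[name].append((s, e))
    return sorted(
        name
        for name, intervals in by_name.items()
        if any(s <= start and e >= end for s, e in merge_intervals(intervals))
    )

# Sample rows from the question (name, start, end).
rows = [
    ("Odin", "2015-07-01 11:00:00", "2015-07-01 11:30:00"),
    ("Thor", "2015-07-01 11:00:00", "2015-07-01 11:30:00"),
    ("Odin", "2015-07-01 11:20:00", "2015-07-01 12:45:00"),
    ("Odin", "2015-07-01 12:30:00", "2015-07-01 15:30:00"),
    ("Thor", "2015-07-01 15:00:00", "2015-07-01 17:00:00"),
]

print(can_work(rows, "2015-07-01 11:10:00", "2015-07-01 15:30:00"))  # ['Odin']
```

Sorting once and merging in a single pass keeps the work at O(n log n) per person, which matters if the table is as large as the question suggests.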

Related

Codeigniter - How to select row id's of matching date column in query

I am trying to write a query that outputs the shiftId's into an array.
I have a table that looks like this.
+---------+----------+-------+
| shiftId | endTime | shift |
+---------+----------+-------+
| 1 | 03/03/19 | 1 |
| 2 | 03/03/19 | 2 |
| 3 | 03/01/19 | 1 |
| 4 | 03/01/19 | 2 |
+---------+----------+-------+
I want to return the shiftId of each date with the largest shift, and not sure how to go about.
I want my array to look like below, based on above table.
Array
(
[0] => 2
[1] => 4
)
I have tried to group_by date and then select_max of each shift but don't think I'm on the correct path. Any help would be appreciated.
I want to select shiftId foreach date where shift # is the largest.
You were on the right path!
Either use (this shows the SQL more clearly):
$query = $this->db->query('SELECT max(shiftId) shiftId FROM yourtable GROUP BY endTime')->result_array();
Or (if you want to use CI's query builder):
$query = $this->db->select_max('shiftId')->group_by('endTime')->get('yourtable')->result_array();
Both of these group the table by endTime, and then return the maximum shiftId for each group of identical endTimes. Both give an array that looks like this:
Array
(
[0] => Array
(
[shiftId] => 2
)
[1] => Array
(
[shiftId] => 4
)
)
To get rid of the shiftId index in the result and get the exact array structure from your OP, use:
array_column($query, 'shiftId');
Edit
If you want to get the shiftId for each endTime + MAX(shift) combination, use this:
SELECT shiftId FROM yourtable
WHERE CONCAT(endTime, "-", shift) IN (
SELECT CONCAT(endTime, "-", MAX(shift)) FROM yourtable GROUP BY endTime
)
The inner query (after IN) does more or less the same as the previous query: it groups the records in the table by endTime, then gets the maximum shift for each group of identical endTimes, and here it concatenates this with the endTime and a dash.
You need to concatenate endTime with MAX(shift) here, because MAX(shift) alone is not unique in the table (there's more than one shift with number 2, for example), and neither is endTime.
The outer query (SELECT shiftId...) then finds the matching shiftId for each endTime + MAX(shift) combination and returns that.
You need to use two (nested) queries for this, because the inner one uses grouping and the outer one doesn't, and you're not allowed to mix those two types in one query.
Note: CONCAT as used here is MySQL syntax. If you're using a different database type, you might have to look up what concatenation syntax it uses (could be + or || for example).
In CI:
$query = $this->db->query('SELECT shiftId FROM yourtable
WHERE CONCAT(endTime, "-", shift) IN (SELECT CONCAT(endTime, "-", MAX(shift)) FROM yourtable GROUP BY endTime)')->result_array();
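As an alternative to concatenating the two columns, the same "row with the largest shift per endTime" lookup can be written as a join against the grouped result, which compares the columns directly. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name shifts is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shifts (shiftId INTEGER, endTime TEXT, shift INTEGER)")
conn.executemany(
    "INSERT INTO shifts VALUES (?, ?, ?)",
    [(1, "03/03/19", 1), (2, "03/03/19", 2), (3, "03/01/19", 1), (4, "03/01/19", 2)],
)

# Join every row to the maximum shift of its endTime group and keep the matches.
query = """
    SELECT s.shiftId
    FROM shifts AS s
    JOIN (SELECT endTime, MAX(shift) AS maxShift
          FROM shifts
          GROUP BY endTime) AS m
      ON m.endTime = s.endTime AND m.maxShift = s.shift
    ORDER BY s.shiftId
"""
shift_ids = [row[0] for row in conn.execute(query)]
print(shift_ids)  # [2, 4]
```

Joining on both columns avoids building a string for every row, so the database is free to use an index on the joined columns if one exists.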

Get minimum values only for duplicates in table

I have the following SQL:
set #arbitraryMin = 'two weeks ago';
set #myDesiredMinimumTime = 'thisMorning';
select distinct order, box from db.table
where scantime >= #arbitraryMin
having min(scantime) >= #myDesiredMinimumTime
Essentially, we have a system where it is possible that there are multiple scans for a distinct box/order combo. I only want to get the ones where the minimum scantime is >= #myDesiredMinimumTime. The query above returns two columns with no values in them. I can do this with a sub query, but I was wondering if there was a way to do this without using one.
I am no SQL guru, so I appreciate any help. Table sample (sorry for format):
scantime | Order | Box
2017-06-29 12:34:56 | 123456 | 123
2107-06-29 12:12:12 | 123456 | 124
2017-06-28 14:50:00 | 123456 | 123
Note the two duplicate order/box combos on different days on rows 1 and 3. If I input my query with #arbitraryMin = '2017-06-28 00:00:00' and #myDesiredMinimumTime = '2017-06-29 00:00:00', I only want to get the last two rows, as the top one is a duplicate scan at a different time.
Thank you
That's invalid SQL. You can't have a HAVING clause without GROUP BY. So the line below is faulty:
having min(scantime) >= #myDesiredMinimumTime
You should put that condition in WHERE clause only
where scantime >= #arbitraryMin
and (select min(scantime) from db.table) >= #myDesiredMinimumTime
Thank you to Rahul.
I have found a solution:
select distinct order, box from db.table
where scantime <= #maxTime
group by ordernumber, boxnumber
having min(scantime) >= #myDesiredMinimumTime
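The GROUP BY ... HAVING MIN(...) pattern can be checked against the sample rows. The sketch below is an illustration under a few assumptions: it uses Python's sqlite3 module instead of the original database, the table name scans is invented, the column names ordernumber and boxnumber are taken from the solution above, and the second sample row is entered with the year 2017.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (scantime TEXT, ordernumber INTEGER, boxnumber INTEGER)")
conn.executemany(
    "INSERT INTO scans VALUES (?, ?, ?)",
    [
        ("2017-06-29 12:34:56", 123456, 123),
        ("2017-06-29 12:12:12", 123456, 124),
        ("2017-06-28 14:50:00", 123456, 123),
    ],
)

def first_scanned_since(conn, arbitrary_min, desired_min):
    """Order/box pairs whose earliest scan (within the window) is >= desired_min."""
    query = """
        SELECT ordernumber, boxnumber
        FROM scans
        WHERE scantime >= ?
        GROUP BY ordernumber, boxnumber
        HAVING MIN(scantime) >= ?
        ORDER BY ordernumber, boxnumber
    """
    return conn.execute(query, (arbitrary_min, desired_min)).fetchall()

print(first_scanned_since(conn, "2017-06-28 00:00:00", "2017-06-29 00:00:00"))
# [(123456, 124)]
```

Box 123 is dropped because its earliest scan falls on 2017-06-28, before the desired minimum; grouping already yields one row per order/box pair, so DISTINCT is not needed.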

MySQL Select from 3 tables and get attendance In and Out Time for all Members in specific date

I have three table Like this:
members_tbl
id | Fullname | Email | MobileNo
attendance_in_tbl
id | member_id | DateTimeIN
attendance_out_tbl
id | member_id | DateTime_OUT
I want to select all members for date: 2014-03-10 by this query:
SELECT
attendance_in.EDatetime,
members_info.mfullname,
attendance_out.ODatetime
FROM
attendance_in
LEFT JOIN members_info ON members_info.id = attendance_in.MemID
LEFT JOIN attendance_out ON attendance_out.MemID = attendance_in.MemID
WHERE date(attendance_in.EDatetime) OR date(attendance_out.ODatetime) = "2014-03-10"
But it gives me different results for attendance_out.
You have a mistake in your query.
You wrote:
WHERE date(attendance_in.EDatetime) /* wrong! */
OR date(attendance_out.ODatetime) = "2014-03-10"
This is wrong, as the first expression date(attendance_in.EDatetime) always evaluates to true.
You may want
WHERE date(attendance_in.EDatetime) = "2014-03-10"
OR date(attendance_out.ODatetime) = "2014-03-10"
But, this is guaranteed to perform poorly when your attendance_in and attendance_out tables get large, because it will have to scan them; it can't use an index.
You may find that it performs better to write this:
WHERE (attendance_in.EDatetime >='2014-03-10' AND
attendance_in.EDatetime < '2014-03-10' + INTERVAL 1 DAY)
OR (attendance_out.ODatetime >='2014-03-10' AND
attendance_out.ODatetime < '2014-03-10' + INTERVAL 1 DAY)
That will check whether either the check-in or the check-out time occurs on the day in question.
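The half-open range used above (on or after midnight of the day, strictly before midnight of the next day) is what lets the comparison apply to the raw column. A minimal Python sketch of the same boundary logic, with made-up helper names day_bounds and on_day:

```python
from datetime import date, datetime, time, timedelta

def day_bounds(day):
    """Return (start, end) so that a timestamp is on `day` when start <= ts < end."""
    start = datetime.combine(day, time.min)
    return start, start + timedelta(days=1)

def on_day(ts, day):
    start, end = day_bounds(day)
    return start <= ts < end

print(day_bounds(date(2014, 3, 10)))
print(on_day(datetime(2014, 3, 10, 23, 59, 59), date(2014, 3, 10)))  # True
print(on_day(datetime(2014, 3, 11, 0, 0, 0), date(2014, 3, 10)))     # False
```

Excluding the upper bound means a timestamp at exactly midnight belongs to one day only, whatever the column's fractional-second precision.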

MSSQL Aggregated time query with multiple columns

In this example, I am collecting some engine data on a car.
Variables
--------------------------------------
id | name
--------------------------------------
1 Headlights On
2 Tire Pressure
3 Speed
4 Engine Runtime in Seconds
...
Values
--------------------------------------
id | var_id | value | time
--------------------------------------
1 1 1 2013-05-28 16:42:00.100
2 1 0 2013-05-28 16:42:22.150
3 2 32.0 2013-05-28 16:42:22.153
4 3 65 2013-05-28 16:42:22.155
...
I want to write a query that returns a result set something like the following:
Input: 1,2,3
Time | Headlights On | Tire Pressure | Speed
---------------------------------------------------------------
2013-05-28 16:42:00 1
2013-05-28 16:42:22 0 32 65
Being able to modify the query to include only results for a given set of variables and at a specified interval say (1 second, 1 minute or 5 minutes) are also really important for my use case.
How do you write a query in T-SQL that will return a time-aggregated multi column result set at a specific interval?
1 minute aggregate:
SELECT {edit: aggregate functions over fields here} FROM Values WHERE {blah} GROUP BY DATEPART (minute, time);
5 minute aggregate:
SELECT {edit: aggregate functions over fields here} FROM Values WHERE {blah} GROUP BY
DATEPART(YEAR, time),
DATEPART(MONTH, time),
DATEPART(DAY, time),
DATEPART(HOUR, time),
(DATEPART(MINUTE, time) / 5);
For the reason this latter part is so convoluted, please see the SO post here: How to group time by hour or by 10 minutes.
Edit 1:
For the part "include only results for a given set of variables", my interpretation is that you want to isolate Values with var_id being within a specified set. If you can rely on the variable numbers/meanings not changing, the common SQL solution is the IN keyword (http://msdn.microsoft.com/en-us/library/ms177682.aspx).
This is what you would put into the WHERE clause above, e.g.
... WHERE var_id IN (2, 4) ...
If you can't rely on knowing the variable numbers but are certain about their names, you can replace the set by a sub-query, e.g.:
... WHERE var_id IN (SELECT id FROM Variables WHERE name IN ('Tire Pressure','Headlights On')) ...
The alternative interpretation is that you actually want to aggregate based on the variable ids as well. In this case, you'll have to include the var_id in your GROUP BY clause.
To make the results more crosstab-like, I guess you'll want to order by time aggregate that you're using. Hope that helps more.
Try
SELECT
VehicleID
, CASE WHEN Name = 'Headlights on' THEN 1
ELSE 0 END AS [Headlights on]
, CASE WHEN Name = 'Tyre pressure' THEN Value
ELSE CAST(NULL AS REAL) END AS [Tyre pressure]
, DateName(Year, DateField) [Year]
FROM
Table
ETC
Then aggregate as required
SELECT
VehicleID
, SUM([Headlights on]) AS [Headlights on]
FROM
(
Query above
) S
) S
GROUP BY
VehicleID
, [Year]
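To make the intended result shape concrete, here is a small Python sketch of the same idea: round each timestamp down to its interval, keep only the requested variables, and spread them into one column per variable. It is an illustration rather than T-SQL; the helper names bucket_start and pivot are made up, and taking the last value seen in each interval is an assumption about which aggregate is wanted.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket_start(ts, interval_seconds):
    """Round a timestamp down to the start of its interval."""
    elapsed = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=elapsed - elapsed % interval_seconds)

def pivot(values, names, wanted_ids, interval_seconds):
    """Map bucket start -> {variable name: last value seen in that bucket}."""
    table = defaultdict(dict)
    for var_id, value, ts in sorted(values, key=lambda v: v[2]):
        if var_id in wanted_ids:
            table[bucket_start(ts, interval_seconds)][names[var_id]] = value
    return dict(table)

# Sample data from the question: (var_id, value, time).
names = {1: "Headlights On", 2: "Tire Pressure", 3: "Speed"}
values = [
    (1, 1, datetime(2013, 5, 28, 16, 42, 0, 100000)),
    (1, 0, datetime(2013, 5, 28, 16, 42, 22, 150000)),
    (2, 32.0, datetime(2013, 5, 28, 16, 42, 22, 153000)),
    (3, 65, datetime(2013, 5, 28, 16, 42, 22, 155000)),
]

for bucket, row in sorted(pivot(values, names, {1, 2, 3}, 1).items()):
    print(bucket, row)
```

Passing 60 or 300 as the interval gives the 1-minute and 5-minute variants; because the bucket is derived from the full timestamp, rows from different hours or days never share a bucket.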

Count number of consecutive visits

Every time a logged in user visits the website their data is put into a table containing the userId and date (either one or zero row per user per day):
444631 2011-11-07
444631 2011-11-06
444631 2011-11-05
444631 2011-11-04
444631 2011-11-02
444631 2011-11-01
I need to have ready access to the number of consecutive visits when I pull the user data from the main user table.. In the case for this user, it would be 4.
Currently I'm doing this through a denormalized consecutivevisits counter in the main user table, however for unknown reasons it sometimes resets.. I want to try an approach that uses exclusively the data in the table above.
What's the best SQL query to get that number (4 in the example above)? There are users who have hundreds of visits, we have millions of registered users and hits per day.
EDIT: As per the comments below I'm posting the code I currently use to do this; it however has the problem that it sometimes resets for no reason and it also reset it for everyone during the weekend, most likely because of the DST change.
// Called every page load for logged in users
public static function OnVisit($user)
{
$lastVisit = $user->GetLastVisit(); /* Timestamp; db server is on the same timezone as www server */
if(!$lastVisit)
$delta = 2;
else
{
$today = date('Y/m/d');
if(date('Y/m/d', $lastVisit) == $today)
$delta = 0;
else if(date('Y/m/d', $lastVisit + (24 * 60 * 60)) == $today)
$delta = 1;
else
$delta = 2;
}
if(!$delta)
return;
$visits = $user->GetConsecutiveVisits();
$userId = $user->GetId();
/* NOTE: t_dailyvisit is the table I pasted above. The table is unused;
* I added it only to ensure that the counter sometimes really resets
* even if the user visits the website, and I could confirm that. */
q_Query("INSERT IGNORE INTO `t_dailyvisit` (`user`, `date`) VALUES ($userId, CURDATE())", DB_DATABASE_COMMON);
/* User skipped 1 or more days.. */
if($delta > 1)
$visits = 1;
else if($delta == 1)
$visits += 1;
q_Query("UPDATE `t_user` SET `consecutivevisits` = $visits, `lastvisit` = CURDATE(), `nvotesday` = 0 WHERE `id` = $userId", DB_DATABASE_COMMON);
$user->ForceCacheExpire();
}
I missed the mysql tag and wrote up this solution. Sadly, this does not work in MySQL versions before 8.0, which do not support window functions.
I post it anyway, as I put some effort into it. Tested with PostgreSQL. Would work similarly with Oracle or SQL Server (or any other decent RDBMS that supports window functions).
Test setup
CREATE TEMP TABLE v(id int, visit date);
INSERT INTO v VALUES
(444631, '2011-11-07')
,(444631, '2011-11-06')
,(444631, '2011-11-05')
,(444631, '2011-11-04')
,(444631, '2011-11-02')
,(444631, '2011-11-01')
,(444632, '2011-12-02')
,(444632, '2011-12-03')
,(444632, '2011-12-05');
Simple version
-- add 1 to "difference" to get number of days of the longest period
SELECT id, max(dur) + 1 as max_consecutive_days
FROM (
-- calculate date difference of min and max in the group
SELECT id, grp, max(visit) - min(visit) as dur
FROM (
-- consecutive days end up in a group
SELECT *, sum(step) OVER (ORDER BY id, rn) AS grp
FROM (
-- step up at the start of a new group of days
SELECT id
,row_number() OVER w AS rn
,visit
,CASE WHEN COALESCE(visit - lag(visit) OVER w, 1) = 1
THEN 0 ELSE 1 END AS step
FROM v
WINDOW w AS (PARTITION BY id ORDER BY visit)
ORDER BY 1,2
) x
) y
GROUP BY 1,2
) z
GROUP BY 1
ORDER BY 1
LIMIT 1;
Output:
id | max_consecutive_days
--------+----------------------
444631 | 4
Faster / Shorter
I later found an even better way. grp numbers are not continuous (but continuously rising). That doesn't matter, since they are just a means to an end:
SELECT id, max(dur) + 1 AS max_consecutive_days
FROM (
SELECT id, grp, max(visit) - min(visit) AS dur
FROM (
-- subtract an integer representing the number of day from the row_number()
-- creates a "group number" (grp) for consecutive days
SELECT id
,EXTRACT(epoch from visit)::int / 86400
- row_number() OVER (PARTITION BY id ORDER BY visit) AS grp
,visit
FROM v
ORDER BY 1,2
) x
GROUP BY 1,2
) y
GROUP BY 1
ORDER BY 1
LIMIT 1;
SQL Fiddle for both.
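The trick in the shorter query (subtract the row number from the day number, so that consecutive days end up with the same group value) can be tried out without a database. A minimal Python sketch, with the made-up function name longest_streak and the test data from above:

```python
from datetime import date
from itertools import groupby

def longest_streak(visits):
    """Length of the longest run of consecutive calendar days in `visits`."""
    days = sorted(set(visits))
    # Day number minus position is constant within a run of consecutive days.
    keys = (d.toordinal() - i for i, d in enumerate(days))
    return max((sum(1 for _ in grp) for _, grp in groupby(keys)), default=0)

visits_444631 = [
    date(2011, 11, 7), date(2011, 11, 6), date(2011, 11, 5),
    date(2011, 11, 4), date(2011, 11, 2), date(2011, 11, 1),
]
visits_444632 = [date(2011, 12, 2), date(2011, 12, 3), date(2011, 12, 5)]

print(longest_streak(visits_444631))  # 4
print(longest_streak(visits_444632))  # 2
```

As in the SQL, the group values themselves are meaningless; only the fact that they stay equal across consecutive days matters.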
More
A procedural solution for a similar problem.
You might be able to implement something similar in MySQL.
Closely related answers on dba.SE with extensive explanation here and here.
And on SO:
GROUP BY and aggregate sequential numeric values
If it is not necessary to have a log of every day the user was logged on to the website and you only want to know the consecutive days he was logged on, I would prefer this way:
Choose 3 columns: LastVisit (Date), ConsecutiveDays (int) and User.
On log-in you check the entry for the user. If the last visit was "Today - 1", add 1 to the column ConsecutiveDays and store "Today" in column LastVisit. If the last visit is older than "Today - 1", store 1 in ConsecutiveDays and "Today" in LastVisit. If the last visit was "Today", nothing needs to change.
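A minimal sketch of that update rule in Python, assuming calendar dates rather than timestamps are stored. The function name update_streak is made up for the example. Comparing dates avoids adding 24 * 60 * 60 seconds to a timestamp, which can land on the wrong day around a DST change.

```python
from datetime import date, timedelta

def update_streak(last_visit, consecutive_days, today):
    """Return the new (LastVisit, ConsecutiveDays) pair for a visit on `today`."""
    if last_visit == today:
        # Already counted today: nothing changes.
        return last_visit, consecutive_days
    if last_visit == today - timedelta(days=1):
        # Visited yesterday: the streak continues.
        return today, consecutive_days + 1
    # First visit ever (last_visit is None) or one or more days skipped.
    return today, 1

print(update_streak(date(2011, 11, 6), 3, date(2011, 11, 7)))  # continues: 4 days
print(update_streak(date(2011, 11, 2), 2, date(2011, 11, 4)))  # reset to 1
```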
HTH
