Optimising PHP/mysql algorithm

I need to compute some statistics for my application, so I need an algorithm that performs as well as possible. I have several questions.
I have a data structure like this in the MySQL database:
user_id  group_id  date
1        5         2012-11-20
1        2         2012-11-01
1        4         2012-11-01
1        3         2012-10-15
1        9         2013-01-18
...
I need to find the group(s) of a user at a specific date. For example, for user 1 at 2012-11-15 (15 November 2012), the query should return the most recent groups, which are 2 and 4 (a user can be in several groups at the same time), both dated 2012-11-01 (the closest date that is not later than the chosen one).
Normally I could do a SELECT ... WHERE date <= chosen date ORDER BY date DESC, etc., but that's not the point: if I have 1000 users, it takes 1000 requests to get all the results.
So here are my questions:
I am already looping through the array in PHP to avoid the high number of MySQL requests, but that is still not good because the array may hold 10000+ elements. Using a foreach (or for?) is quite costly.
So my first question is: given an array ordered by date (desc or asc), what's the fastest way to find the index of the closest element whose date is smaller (or greater) than a given date, other than using a for or foreach loop over every element?
If there is no solution for the first question, what kind of data structure would you suggest for this kind of problem?
Note: the date is in MySQL format; it is not converted to a timestamp when it is stored in the array.
EDIT: this is a sql fiddle http://sqlfiddle.com/#!2/dc28d/1
For dos_id = 6 and t = "2012-11-01", it should return only 2 and 5, at date "2010-12-10 13:16:58".
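For the in-memory lookup asked about above, a binary search avoids scanning every element. The following is a minimal sketch in Python (used here only for illustration; the same idea carries over to a hand-written binary search in PHP). It assumes one user's rows are already sorted by date in ascending order and that the dates stay in MySQL's YYYY-MM-DD string format, which compares correctly as plain text. The function name groups_at and the two parallel lists are hypothetical and not part of the original schema.

```python
from bisect import bisect_left, bisect_right

def groups_at(dates, groups, day):
    """Return the group ids found on the latest date that is <= day.

    dates  -- 'YYYY-MM-DD' strings sorted ascending (one user's rows)
    groups -- group ids, parallel to dates
    """
    hi = bisect_right(dates, day)            # index of the first date > day
    if hi == 0:
        return []                            # no row on or before that day
    lo = bisect_left(dates, dates[hi - 1])   # first row sharing that latest date
    return groups[lo:hi]

# Sample rows for user 1 from the question, sorted by date
dates = ["2012-10-15", "2012-11-01", "2012-11-01", "2012-11-20", "2013-01-18"]
groups = [3, 2, 4, 5, 9]
print(groups_at(dates, groups, "2012-11-15"))  # [2, 4]
```

Each lookup costs O(log n) comparisons instead of a full pass over the array, which is likely to matter once the array holds 10000+ rows.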

Not sure why you'd want to do this in php. Here's some SQL using joins instead to get most recent group(s) for all users given a date. Make sure you've got indexes on date and userid.
SELECT *
FROM test t1
LEFT JOIN test t2
ON t1.userid = t2.userid AND t2.thedate <= '2012-11-15' AND t2.thedate > t1.thedate
WHERE t1.thedate <= '2012-11-15' AND t2.userid IS NULL;
Or using your SQLFiddle
SELECT t1.*
FROM dossier_dans_groupe t1
LEFT JOIN dossier_dans_groupe t2
ON t1.dos_id = t2.dos_id AND t2.updated_at <= '2012-11-01'
AND t2.updated_at > t1.updated_at
WHERE t1.updated_at <= '2012-11-01' AND t2.dos_id IS NULL;
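To see the self-join at work on a small data set, here is a hedged sketch that runs the same anti-join pattern against an in-memory SQLite database from Python. SQLite stands in for MySQL only so that the example is self-contained; the column names (userid, groupid, thedate) follow the first query above, groupid is assumed, and the second user's rows are invented for illustration.

```python
import sqlite3

def latest_groups(rows, day):
    """rows: (userid, groupid, thedate) tuples.
    Returns each user's group rows on their latest date that is <= day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test (userid INTEGER, groupid INTEGER, thedate TEXT)")
    con.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT t1.userid, t1.groupid, t1.thedate
        FROM test t1
        LEFT JOIN test t2
          ON t1.userid = t2.userid
         AND t2.thedate <= :day
         AND t2.thedate > t1.thedate
        WHERE t1.thedate <= :day
          AND t2.userid IS NULL      -- keep rows with no later qualifying row
        ORDER BY t1.userid, t1.groupid
        """,
        {"day": day},
    ).fetchall()
    con.close()
    return result

rows = [
    (1, 5, "2012-11-20"), (1, 2, "2012-11-01"), (1, 4, "2012-11-01"),
    (1, 3, "2012-10-15"), (1, 9, "2013-01-18"),
    (2, 7, "2012-09-01"), (2, 8, "2012-12-01"),  # made-up second user
]
print(latest_groups(rows, "2012-11-15"))
```

One query returns the answer for every user at once, which is the point of doing this in SQL rather than issuing one request per user.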

This would give you a list of all users and their groups (1 row per group) for the latest date that is smaller than the one you specify (2012-11-15 below).
SELECT user_id, group_id, date
FROM table
WHERE date <= '2012-11-15'
AND NOT EXISTS (
SELECT 1
FROM table test
WHERE test.user_id = table.user_id
AND test.date > table.date
AND test.date <= '2012-11-15'
)

Related

Finding occupations via SQL and/or PHP

I am building a student web app. Among other tables, I have an enrollments table in which each enrollment runs between two dates.
This app uses MySQL 5.6 and PHP 7.2
It has the following fields:
IDStudent
StartDate
EndDate
IDCourse
Each course has a maximum capacity that cannot be exceeded.
I want to know, given a start date, an end date and an IDCourse, how many concurrent students are in a course. I get an approximate value by just counting the rows between the two dates:
SELECT COUNT(*) FROM enrollments
WHERE IDCourse = ?
AND (
(StartDate BETWEEN "<start date>" AND "<end date>")
OR
(EndDate BETWEEN "<start date>" AND "<end date>")
OR
(StartDate <= "<start date>" AND EndDate>= "<end date>")
)
But that doesn't take non-overlapping ranges into account. It counts every enrollment.
For example, I have this very simple case:
I want to find how many students are enrolled between 01/01/2021 and 05/01/2021 in a given course,
and I have these 3 enrollments in that course:
01/01/2021 - 02/01/2021
03/01/2021 - 04/01/2021
20/12/2020 - 01/02/2021
I should get a count of 2 and not 3, because enrollments 1 and 2 don't overlap each other, while 3 overlaps both.
I searched online but didn't find anything similar; maybe I am not using the correct keywords!
I found "Determine max number of overlapping DATETIME ranges", but that is for MySQL 8.
Many thanks for your help
Regards

I think you may need to create a calendar table covering the first start date to the last end date, count by date, and then select the max within the period you are interested in:
select max(stcount)
from
(
select c.dt, count(*) stcount from calendar_table c
join enrollments e on c.dt between e.StartDate and e.EndDate
group by c.dt
) countbydate
where dt between '2021-01-01' and '2021-01-05'
db-fiddle:
https://www.db-fiddle.com/f/dXuKMoRQ2ivLt5qi5AVFcG/0
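If the rows are pulled into application code instead, the same maximum can be found without a calendar table by using a sweep over start and end events. This is a different technique from the query above, sketched here in Python under the assumption that both dates of an enrollment are inclusive; the function name max_concurrent is hypothetical.

```python
from datetime import date, timedelta

def max_concurrent(enrollments, window_start, window_end):
    """enrollments: (start, end) date pairs, both inclusive.
    Returns the peak number of simultaneous enrollments inside the window."""
    events = []
    for start, end in enrollments:
        s = max(start, window_start)         # clip to the window
        e = min(end, window_end)
        if s > e:
            continue                         # entirely outside the window
        events.append((s, 1))                       # enrollment becomes active
        events.append((e + timedelta(days=1), -1))  # inactive the day after its end
    # Tuples sort by date first; on the same date, -1 sorts before +1, so a
    # range ending the day before another starts is not counted as overlapping.
    events.sort()
    current = best = 0
    for _, delta in events:
        current += delta
        best = max(best, current)
    return best

# The three enrollments from the question (dd/mm/yyyy in the original)
enrollments = [
    (date(2021, 1, 1), date(2021, 1, 2)),
    (date(2021, 1, 3), date(2021, 1, 4)),
    (date(2020, 12, 20), date(2021, 2, 1)),
]
print(max_concurrent(enrollments, date(2021, 1, 1), date(2021, 1, 5)))  # 2
```

Comparing the returned value with the course capacity then tells you whether another enrollment fits in that period.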

How to select a corresponding column in mysql for a MAX DATE grouped by DATE

mysql table: stats
columns:
date | stats
05-05-2015 22:25:00 | 78
05-05-2015 09:25:00 | 21
05-05-2015 05:25:00 | 25
05-04-2015 09:25:00 | 29
05-04-2015 05:25:00 | 15
sql query:
SELECT MAX(date) as date, stats FROM stats GROUP BY date(date) ORDER BY date DESC
When I do this, it does select one row per date (grouped by date, regardless of the time) and selects the largest date with MAX, but it does not select the corresponding stats column.
For example, it returns 05-05-2015 22:25:00 as the date and 25 as the stats. It should be selecting 78 as the stats. I've done my research and it seems solutions to this are out there, but I am not familiar with JOIN or the other less common MySQL features needed to achieve this, and it's hard for me to understand other examples/solutions, so I decided to post my own specific scenario.

This question is asked every single day in SO. Sometimes, it's correctly answered too. Anyway, purists won't like it but here's one option:
Select x.* from stats x join (SELECT MAX(date) max_date FROM stats GROUP BY date(date)) y on y.max_date = x.date;
Obviously, for this to work dates need to be stored using a datetime data type.
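A small, hedged demonstration of the join-on-MAX pattern follows, run against an in-memory SQLite database from Python so that it is self-contained. SQLite is only a stand-in for MySQL here, and the timestamps are stored as ISO strings (YYYY-MM-DD HH:MM:SS) so that MAX and date() behave like they would on a real datetime column; the function name last_stat_per_day is hypothetical.

```python
import sqlite3

def last_stat_per_day(rows):
    """rows: (date, stats) tuples with ISO 'YYYY-MM-DD HH:MM:SS' timestamps.
    Returns the full row holding the latest timestamp of each calendar day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE stats (date TEXT, stats INTEGER)")
    con.executemany("INSERT INTO stats VALUES (?, ?)", rows)
    result = con.execute(
        """
        SELECT x.date, x.stats
        FROM stats x
        JOIN (SELECT MAX(date) AS max_date
              FROM stats
              GROUP BY date(date)) y      -- one latest timestamp per day
          ON y.max_date = x.date          -- join back to fetch the whole row
        ORDER BY x.date DESC
        """
    ).fetchall()
    con.close()
    return result

rows = [
    ("2015-05-05 22:25:00", 78),
    ("2015-05-05 09:25:00", 21),
    ("2015-05-05 05:25:00", 25),
    ("2015-05-04 09:25:00", 29),
    ("2015-05-04 05:25:00", 15),
]
print(last_stat_per_day(rows))
```

The subquery only decides which timestamp wins for each day; the outer join is what brings the matching stats value along.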

Totals by month even if there is a missing month

Just a quick one, hopefully...
I am after some totals (sales value) by month from a single table.
The problem I have is:
If there are no sales for a month, the month is of course not returned in the results. Is there a way I can do this in a single query, so that if there were no sales in, e.g., "January 2015", the result would return "0.00 - January - 2015"?
The basic SQL I currently have is:
SELECT SUM(p.PaymentAmount) AS Total, MONTHNAME(p.PaymentDate) AS Month, YEAR(p.PaymentDate) AS Year
FROM tPayment p
WHERE p.PaymentType = 2
GROUP BY YEAR(p.PaymentDate), MONTH(p.PaymentDate)
I can't think of how to do this without selecting the date range in PHP and then querying each month and year... that just seems messy, so I would like to know if I can do this in a single query.
Any help is much appreciated!

You should create a separate table containing all dates, such as:
CREATE TABLE `dates` (
`uid` INT NOT NULL AUTO_INCREMENT,
`datestamp` DATE NOT NULL,
PRIMARY KEY (`uid`))
ENGINE = InnoDB;
and fill it (this assumes a helper table named integers with a column i holding the digits 0 to 9):
INSERT INTO dates (datestamp)
SELECT ADDDATE('2015-01-01', INTERVAL SomeNumber DAY)#set start date
FROM (SELECT a.i+b.i*10+c.i*100+d.i*1000 AS SomeNumber
FROM integers a, integers b, integers c, integers d) Sub1
WHERE SomeNumber BETWEEN 0 AND (365 * 3)#3 years
then you can join against it, driving the query from dates so that months without payments still appear
SELECT COALESCE(SUM(p.PaymentAmount), 0) AS Total, MONTHNAME(d.datestamp) AS Month, YEAR(d.datestamp) AS Year
FROM dates d
LEFT OUTER JOIN tPayment p
ON CAST(p.PaymentDate AS DATE) = d.datestamp
AND p.PaymentType = 2
GROUP BY YEAR(d.datestamp), MONTH(d.datestamp), MONTHNAME(d.datestamp)
ORDER BY YEAR(d.datestamp) DESC, MONTH(d.datestamp) DESC;
regardless of whether I fat-fingered the queries here, the concept should hold up for you
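As a rough illustration of the calendar-table idea, the sketch below zero-fills missing months using an in-memory SQLite database driven from Python. It is an assumption-laden stand-in rather than the MySQL setup above: the calendar table here is a hypothetical months table with one row per month, strftime replaces MONTHNAME/YEAR, and the payment rows are invented.

```python
import sqlite3

def monthly_totals(payments, month_starts):
    """payments: (PaymentDate, PaymentAmount, PaymentType) tuples.
    month_starts: 'YYYY-MM-01' strings, one per month to report.
    Returns (year-month, total) pairs, with 0 for months without type-2 payments."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tPayment (PaymentDate TEXT, PaymentAmount REAL, PaymentType INTEGER)")
    con.execute("CREATE TABLE months (month_start TEXT)")
    con.executemany("INSERT INTO tPayment VALUES (?, ?, ?)", payments)
    con.executemany("INSERT INTO months VALUES (?)", [(m,) for m in month_starts])
    result = con.execute(
        """
        SELECT strftime('%Y-%m', m.month_start) AS ym,
               COALESCE(SUM(p.PaymentAmount), 0) AS total
        FROM months m                        -- calendar side drives the query
        LEFT JOIN tPayment p
          ON strftime('%Y-%m', p.PaymentDate) = strftime('%Y-%m', m.month_start)
         AND p.PaymentType = 2               -- filter in ON, not WHERE
        GROUP BY ym
        ORDER BY ym
        """
    ).fetchall()
    con.close()
    return result

payments = [
    ("2015-01-10", 99.0, 1),   # wrong type, must not count
    ("2015-02-03", 10.0, 2),
    ("2015-02-20", 5.0, 2),
    ("2015-03-15", 7.5, 2),
]
print(monthly_totals(payments, ["2015-01-01", "2015-02-01", "2015-03-01"]))
```

Keeping the PaymentType condition in the ON clause matters: placed in WHERE, it would discard the NULL-extended rows and the empty months would disappear again.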

This wouldn't be my first choice method for accomplishing this task, but for the sake of providing multiple alternatives I offer this if you're trying to keep it all in MySQL and avoid creating an additional table.
SELECT
COALESCE(SUM(p.PaymentAmount), 0) AS Total,
months.m AS MonthNumber,
YEAR(p.PaymentDate) AS Year
FROM ( SELECT 1 AS m UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7 UNION ALL
SELECT 8 UNION ALL
SELECT 9 UNION ALL
SELECT 10 UNION ALL
SELECT 11 UNION ALL
SELECT 12
) AS months
LEFT OUTER JOIN tPayment p
ON MONTH(p.PaymentDate) = months.m
AND p.PaymentType = 2
GROUP BY months.m, YEAR(p.PaymentDate)
I think it'd be easier to check against it in some quick PHP code after you run the query.
As suggested by Jon B, creating a months table and joining against that would shorten and clean up the query quite a bit. If you're trying to keep it all in your MySQL query, I personally would choose his method.

If you have data in your table for all months -- but the where clause is filtering out all the rows from one or more months -- you can try conditional aggregation:
SELECT SUM(CASE WHEN p.PaymentType = 2 THEN p.PaymentAmount ELSE 0 END) AS Total,
MONTHNAME(p.PaymentDate) AS Month, YEAR(p.PaymentDate) AS Year
FROM tPayment p
GROUP BY YEAR(p.PaymentDate), MONTH(p.PaymentDate)
This isn't guaranteed to work (it depends on the data). But if it does, it is the simplest way to solve this problem.

Getting next available date from database

I have a system that allows users to assign a specific file to a past or present date. The limitation is that each user may upload only one file per day. When the user goes to upload a file, the date field must default to the current date, and when that date is not available it should show the first available date in the past, in DESC order. Below are the relevant field names.
file_id (INT - INDEX - AUTO INCREMENT)
user_id (INT - may index this)
upload_date (INT - stores date as a unix timestamp)
The only solution I have really found would be to build them all into an array in DESC order by date and loop through it until I find an empty slot. However, I feel this could really cause speed issues if the user had the past thousand days filled. I feel like I am overlooking a simple solution.
PLEASE NOTE: For one reason or another, the date is being stored as a Unix timestamp. I understand the downsides of that and am not concerned about correcting it at this time.

To get the most recent date that has not been used:
select user_id, max(date) - 1
from (select ud.*,
(select max(date) from upload_date ud2 where ud2.user_id = ud.user_id and ud2.date < ud.date
) as prevdate
from upload_date ud
) ud
where date(from_unixtime(ud.prevdate)) <> date(from_unixtime(ud.date)) - 1 or
ud.prevdate is null
group by user_id
This query first gets the previous date for any given day using a correlated subquery. It then converts the time values to dates and selects any row where the previous date has a gap. The largest of the date minus one is the date you are looking for.
This SQL is untested, so it may have syntax errors.

One way to approach this is with a classic "return missing rows" query. Basically, to get a "missing" row returned from the database, you need a way to generate the "missing" rows.
To build such a query, we can start with:
SELECT MAX(t.upload_date)
FROM mytable t
WHERE t.upload_date <= NOW()
AND t.user = 'someuser'
That gets the initial date, that we are going to work backwards from.
For the "one per day" requirement, you probably want to truncate that upload_date to midnight, at least for this query. For now, we'll assume that the expression in the SELECT list is already truncated, to illustrate the approach, without bogging down in the details of dealing with a unix timestamp.
To generate a descending list of dates, starting with that initial date retrieved by the previous query...
SELECT s.upload_date - INTERVAL n.d DAY AS available_date
FROM ( SELECT MAX(t.upload_date) AS upload_date
FROM mytable t
WHERE t.upload_date <= NOW()
AND t.user = 'someuser'
) s
CROSS
JOIN ( SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
) n
ORDER BY n.d
With that result, we can use an anti-join pattern to find which dates are not already used. This is a LEFT JOIN and a predicate that throws out matching rows:
SELECT s.upload_date - INTERVAL n.d DAY AS available_date
FROM ( SELECT MAX(t.upload_date) AS upload_date
FROM mytable t
WHERE t.upload_date <= NOW()
AND t.user = 'someuser'
) s
CROSS
JOIN ( SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
) n
LEFT
JOIN mytable u
ON u.upload_date = s.upload_date - INTERVAL n.d DAY
AND u.user = 'someuser'
WHERE u.upload_date IS NULL
ORDER BY n.d
LIMIT 1
That only looks back 9 days; to look back further, extend the inline view aliased as n so that it returns more consecutive integers. (There are some tricks we can play with cross joins to get a whole boatload of integers.)
All that remains is turning the "matching" criteria (which, as written, works with the MySQL DATE datatype):
ON u.upload_date = s.upload_date - INTERVAL n.d DAY
into something like this:
ON u.upload_date >= UNIX_TIMESTAMP(FROM_UNIXTIME(s.upload_date)-INTERVAL n.d+1 DAY)
AND u.upload_date < UNIX_TIMESTAMP(FROM_UNIXTIME(s.upload_date)-INTERVAL n.d DAY)
And futzing with the integer timestamp value to get a MySQL DATE out of it...
SELECT DATE(FROM_UNIXTIME(s.upload_date)) - INTERVAL n.d DAY AS available_date
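For comparison, the application-side version of the same search is short once the used days are held in a hash set, since each membership check is then constant time. The sketch below is in Python and is only an illustration, not the SQL approach above: it assumes upload_date values are Unix timestamps interpreted in UTC, and the function name first_free_day is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

def first_free_day(upload_timestamps, today):
    """upload_timestamps: Unix timestamps of one user's existing uploads.
    Returns today if it is free, otherwise the closest earlier free day."""
    # Truncate every timestamp to its calendar day (UTC is an assumption here).
    used = {
        datetime.fromtimestamp(ts, tz=timezone.utc).date()
        for ts in upload_timestamps
    }
    day = today
    while day in used:               # constant-time set lookup per step
        day -= timedelta(days=1)
    return day

def ts(year, month, day, hour=12):
    """Helper for the example: build a UTC Unix timestamp."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())

uploads = [ts(2024, 3, 10), ts(2024, 3, 9), ts(2024, 3, 7)]
print(first_free_day(uploads, date(2024, 3, 10)))  # 2024-03-08
```

Even with a thousand consecutive days filled, this is a thousand set lookups, which is usually negligible next to the cost of fetching the rows.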

PHP/MYSQL datetime ranges overlapping for users

Please, I need help with this (for better understanding, please see the attached image) because I am completely helpless.
As you can see, I have users, and their starting and ending datetimes are stored in my DB as YYYY-mm-dd H:i:s. Now I need to find the overlaps across all users, ranked by how many users share each overlapping time range. I would like to get the 3 datetime overlaps shared by the most users. How can I do it?
I have no idea which MySQL query I should use, or whether it would be better to select all datetimes (start and end) from the database and process them in PHP (but how?). As shown in the image, the result should be, for example, that the time 8.30 - 10.00 is the overlap for users A+B+C+D.
Table structure:
UserID | Start datetime | End datetime
--------------------------------------
A | 2012-04-03 4:00:00 | 2012-04-03 10:00:00
A | 2012-04-03 16:00:00 | 2012-04-03 20:00:00
B | 2012-04-03 8:30:00 | 2012-04-03 14:00:00
B | 2012-04-06 21:30:00 | 2012-04-06 23:00:00
C | 2012-04-03 12:00:00 | 2012-04-03 13:00:00
D | 2012-04-01 01:00:01 | 2012-04-05 12:00:59
E | 2012-04-03 8:30:00 | 2012-04-03 11:00:00
E | 2012-04-03 21:00:00 | 2012-04-03 23:00:00

What you effectively have is a collection of sets and want to determine if any of them have non-zero intersections. This is the exact question one asks when trying to find all the ancestors of a node in a nested set.
We can prove that for every overlap, at least one time window will have a start time that falls within all other overlapping time windows. Using this tidbit, we don't need to actually construct artificial timeslots in the day. Simply take a start time and see if it intersects any of the other time windows and then just count up the number of intersections.
So what's the query?
/*SELECT*/
SELECT DISTINCT
MAX(overlapping_windows.start_time) AS overlap_start_time,
MIN(overlapping_windows.end_time) AS overlap_end_time ,
(COUNT(overlapping_windows.id) - 1) AS num_overlaps
FROM user_times AS windows
INNER JOIN user_times AS overlapping_windows
ON windows.start_time BETWEEN overlapping_windows.start_time AND overlapping_windows.end_time
GROUP BY windows.id
ORDER BY num_overlaps DESC;
Depending on your table size and how often you plan on running this query, it might be worthwhile to drop a spatial index on it (see below).
UPDATE
If you're running this query often, you'll need to use a spatial index. Because of the range-based traversal (i.e. does start_time fall within the start/end range), a BTREE index will not do anything for you. IT HAS TO BE SPATIAL.
ALTER TABLE user_times ADD COLUMN time_window GEOMETRY NOT NULL DEFAULT 0;
UPDATE user_times SET time_window = GeomFromText(CONCAT('LineString( -1 ', start_time, ', 1 ', end_time, ')'));
CREATE SPATIAL INDEX time_window ON user_times (time_window);
Then you can update the ON clause in the above query to read
ON MBRWithin( Point(0,windows.start_time), overlapping_windows.time_window )
This will get you an indexed traversal for the query. Again, only do this if you're planning on running the query often.
Credit for the spatial index to Quassnoi's blog.
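The probing idea described above (take each start time and count the windows that contain it) can be sketched in application code as follows. This is a minimal Python illustration using the sample rows from the question, not the answer's SQL; the function names overlaps_by_start and dt are hypothetical, both window bounds are treated as inclusive, and the quadratic loop is only meant for small inputs.

```python
from datetime import datetime

def overlaps_by_start(windows):
    """windows: (user, start, end) tuples with datetime bounds.
    Probes every start time, collects the windows that contain it, and returns
    (overlap_start, overlap_end, users) tuples, most users first."""
    found = {}
    for _, probe, _ in windows:
        covering = [(u, s, e) for u, s, e in windows if s <= probe <= e]
        # All covering windows contain the probe, so their intersection
        # runs from the latest start to the earliest end.
        overlap = (max(s for _, s, _ in covering), min(e for _, _, e in covering))
        found[overlap] = sorted({u for u, _, _ in covering})
    ranked = [(start, end, users) for (start, end), users in found.items()]
    ranked.sort(key=lambda r: len(r[2]), reverse=True)
    return ranked

def dt(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

windows = [
    ("A", dt("2012-04-03 04:00:00"), dt("2012-04-03 10:00:00")),
    ("A", dt("2012-04-03 16:00:00"), dt("2012-04-03 20:00:00")),
    ("B", dt("2012-04-03 08:30:00"), dt("2012-04-03 14:00:00")),
    ("B", dt("2012-04-06 21:30:00"), dt("2012-04-06 23:00:00")),
    ("C", dt("2012-04-03 12:00:00"), dt("2012-04-03 13:00:00")),
    ("D", dt("2012-04-01 01:00:01"), dt("2012-04-05 12:00:59")),
    ("E", dt("2012-04-03 08:30:00"), dt("2012-04-03 11:00:00")),
    ("E", dt("2012-04-03 21:00:00"), dt("2012-04-03 23:00:00")),
]
for start, end, users in overlaps_by_start(windows)[:3]:
    print(start, end, users)
```

On the sample data the top entry is the 08:30 to 10:00 range on 2012-04-03, shared by users A, B, D and E.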

Something like this should get you started -
SELECT slots.time_slot, COUNT(*) AS num_users, GROUP_CONCAT(DISTINCT user_bookings.user_id ORDER BY user_bookings.user_id) AS user_list
FROM (
SELECT CURRENT_DATE + INTERVAL ((id-1)*30) MINUTE AS time_slot
FROM dummy
WHERE id BETWEEN 1 AND 48
) AS slots
LEFT JOIN user_bookings
ON slots.time_slot BETWEEN `user_bookings`.`start` AND `user_bookings`.`end`
GROUP BY slots.time_slot
ORDER BY num_users DESC
The idea is to create a derived table that consists of time slots for the day. In this example I have used dummy (which can be any table with an auto-increment id that is contiguous for the required set) to create a list of time slots by adding 30 minutes incrementally. The result is then joined to the bookings so that the number of bookings can be counted for each time slot.
UPDATE For entire date/time range you could use a query like this to get the other data required -
SELECT MIN(`start`) AS `min_start`, MAX(`end`) AS `max_end`, DATEDIFF(MAX(`end`), MIN(`start`)) + 1 AS `num_days`
FROM user_bookings
These values can then be substituted into the original query or the two can be combined -
SELECT slots.time_slot, COUNT(*) AS num_users, GROUP_CONCAT(DISTINCT user_bookings.user_id ORDER BY user_bookings.user_id) AS user_list
FROM (
SELECT DATE(tmp.min_start) + INTERVAL ((id-1)*30) MINUTE AS time_slot
FROM dummy
INNER JOIN (
SELECT MIN(`start`) AS `min_start`, MAX(`end`) AS `max_end`, DATEDIFF(MAX(`end`), MIN(`start`)) + 1 AS `num_days`
FROM user_bookings
) AS tmp
WHERE dummy.id BETWEEN 1 AND (48 * tmp.num_days)
) AS slots
LEFT JOIN user_bookings
ON slots.time_slot BETWEEN `user_bookings`.`start` AND `user_bookings`.`end`
GROUP BY slots.time_slot
ORDER BY num_users DESC
EDIT I have added DISTINCT and ORDER BY clauses in the GROUP_CONCAT() in response to your last query.
Please note that you will need a much greater range of ids in the dummy table. I have not tested this query, so it may have syntax errors.
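The time-slot idea can also be tried out in application code before committing to a query. The following Python sketch checks every 30-minute slot of one day against the bookings and ranks the slots by the number of distinct users; it is an illustration under assumptions, not a port of the SQL above (which counts joined rows rather than distinct users). The function name busiest_slots is hypothetical and the bookings are the sample rows from the question.

```python
from datetime import datetime, timedelta

def busiest_slots(bookings, day_start, num_slots=48, step=timedelta(minutes=30)):
    """bookings: (user, start, end) tuples with datetime bounds (inclusive).
    Returns (slot_time, users) pairs for one day, busiest slots first."""
    result = []
    for i in range(num_slots):
        slot = day_start + i * step
        users = sorted({u for u, s, e in bookings if s <= slot <= e})
        result.append((slot, users))
    # Stable sort: slots with equal counts stay in chronological order.
    result.sort(key=lambda r: len(r[1]), reverse=True)
    return result

bookings = [
    ("A", datetime(2012, 4, 3, 4, 0), datetime(2012, 4, 3, 10, 0)),
    ("A", datetime(2012, 4, 3, 16, 0), datetime(2012, 4, 3, 20, 0)),
    ("B", datetime(2012, 4, 3, 8, 30), datetime(2012, 4, 3, 14, 0)),
    ("B", datetime(2012, 4, 6, 21, 30), datetime(2012, 4, 6, 23, 0)),
    ("C", datetime(2012, 4, 3, 12, 0), datetime(2012, 4, 3, 13, 0)),
    ("D", datetime(2012, 4, 1, 1, 0, 1), datetime(2012, 4, 5, 12, 0, 59)),
    ("E", datetime(2012, 4, 3, 8, 30), datetime(2012, 4, 3, 11, 0)),
    ("E", datetime(2012, 4, 3, 21, 0), datetime(2012, 4, 3, 23, 0)),
]
for slot, users in busiest_slots(bookings, datetime(2012, 4, 3))[:3]:
    print(slot, users)
```

The slot width is the trade-off: finer slots find overlaps more precisely but multiply the number of checks, just as a larger dummy range does in the SQL version.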

I would not do much of this in SQL; it is so much simpler in a programming language. SQL is not made for something like this.
Of course, it's just sensible to break the day down into "timeslots" - this is statistics. But as soon as you start handling dates over the 00:00 border, things start to get icky when you use joins and inner selects. Especially with MySQL which does not quite like inner selects.
Here's a possible SQL query
SELECT count(*) FROM `times`
WHERE
( DATEDIFF(`End`,`Start`) = 0 AND
TIME(`Start`) < TIME('$SLOT_HIGH') AND
TIME(`End`) > TIME('$SLOT_LOW'))
OR
( DATEDIFF(`End`,`Start`) > 0 AND
( TIME(`Start`) < TIME('$SLOT_HIGH') OR
TIME(`End`) > TIME('$SLOT_LOW') ) )
Here's some pseudo code
granularity = 30*60; // 30 minutes
numslots = 24*60*60 / granularity;
stats = CreateArray(numslots);
for i=0, i < numslots, i++ do
stats[i] = GetCountFromSQL(i*granularity, (i+1)*granularity); // low, high
end
Yes, that makes numslots queries, but no joins no nothing, hence it should be quite fast. Also you can easily change the resolution.
And another positive thing is, you could "ask yourself", "I have two possible timeslots, and I need the one where more people are here, which one should I use?" and just run the query twice with respective ranges and you are not stuck with predefined time slots.
To only find full overlaps (an entry only counts if it covers the full slot) you have to switch low and high ranges in the query.
You might have noticed that I do not add times between entries that could span multiple days, however, adding a whole day, will just increase all slots by one, making that quite useless.
You could however add them by selecting sum(DAY(End) - DAY(Start)) and just add the return value to all slots.

Table seems pretty simple. I would keep your SQL query pretty simple:
SELECT * FROM tablename
Then, once you have the rows in PHP, do the processing there using loops and comparisons.
In simplest form:
/*Assumes $result is the mysqli_result of the query above; the mysql_* functions were removed in PHP 7*/
$times = array();
while ($row = mysqli_fetch_assoc($result)) {
/*store userID, START, END*/
$userID = $row['userID'];
$start = $row['START'];
$end = $row['END'];
/*Keep one array per user, keyed by userID, holding that user's start and end times*/
/*strtotime() turns the date/time entries into comparable integer values*/
$times[$userID][] = array('start' => strtotime($start), 'end' => strtotime($end));
}
/*Now you have an array for each user with their start/stop times*/
/*Do your loops and comparisons to find common time slots. */
Of course this is in very basic form. Because the array is keyed by userID, array_keys($times) gives you the list of all userIDs before you start comparing them.
