Advanced statistics in PHP and MySQL


I have a slight problem. I have a dataset containing values measured by a weather station, which I want to analyze further using a MySQL database and PHP.
Basically, the first column of the table contains the date and the other columns contain temperature, humidity, pressure, etc.
Calculating the mean, standard deviation, max, min, etc. is quite simple. However, there are no built-in functions for other parameters I need, such as kurtosis.
What I need is, for example, to calculate the skewness, mean, standard deviation, etc. for individual months, then days, and so on.
With the built-in functions it is easy; for example, finding some of the parameters for the individual months would be:
SELECT AVG(Temp), STD(Temp), MAX(Temp)
FROM database
GROUP BY YEAR(Date), MONTH(Date)
Obviously I cannot use this for the more advanced parameters. I thought about ways of achieving this and could only think of one solution: I wrote a function that processes the values and calculates things such as kurtosis using the appropriate formulae. But that means I would need to create arrays of data for each month, day, etc., depending on what I am currently calculating. So, for example, I would first need to take the data and split it into arrays, let's say Jan11, Feb11, Mar11, ..., where each array contains the data for that month. Then I would apply the function to those arrays and create new variables with the results (let's say kurtosis_jan11, kurtosis_feb11, etc.).
Now to my question: I need help with splitting the data. The problem is that I don't know in advance in which month the data starts and in which it ends, so I cannot set fixed variables for this. The program first has to check the first month and then create a new array for each month, day, etc. until it reaches the last record.
That would be one possible solution, but if anyone has other ideas about how to get around this problem, I would very much appreciate your help.
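One way to avoid fixed variables such as Jan11 or Feb11 is to key the groups by (year, month) while the rows are read, so a new group appears whenever a new month does. Below is a minimal sketch in Python, for illustration only (the same idea maps to PHP associative arrays). The sample rows, the function names, and the choice of population (not sample-corrected) skewness and excess kurtosis are assumptions, not something stated in the question:

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def group_by_month(rows):
    """Group (date, value) pairs into lists keyed by (year, month)."""
    groups = defaultdict(list)
    for day, value in rows:
        groups[(day.year, day.month)].append(value)
    return dict(groups)


def moments(values):
    """Return mean, population std dev, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # All values identical: skewness and kurtosis are undefined.
        return mean, 0.0, None, None
    return mean, sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical sample data standing in for rows fetched from the database.
rows = [
    (date(2011, 1, 1), 1.0),
    (date(2011, 1, 2), 2.0),
    (date(2011, 1, 3), 3.0),
    (date(2011, 2, 1), 4.0),
    (date(2011, 2, 2), 4.0),
]

# One result tuple per month, without knowing the months in advance.
stats_by_month = {key: moments(vals) for key, vals in group_by_month(rows).items()}
```

Grouping by day instead of month would only change the key, for example to (year, month, day).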

You can do more complex queries to achieve this. Here are some examples, including skewness: http://users.drew.edu/skass/sql/

SELECT AVG(Temp), STD(Temp), MAX(Temp)
FROM database
WHERE Date BETWEEN date_from AND date_to
GROUP BY YEAR(Date), MONTH(Date)
I think you want the data grouped within a date range.

Related

Fuzzy date match

I have a MySQL database of clients, and I crawled a website to retrieve all the reviews from the past few years. Now I am trying to match those reviews up with the clients so I can email them. The problem is that the review site allowed reviewers to enter anything they wanted for the name, so in some cases I have the full first name and last initial, and in other cases the first initial and full last name. It also gives an approximate time of posting, such as "1 week ago" or "6 months ago", which we have already converted to an approximate date.
Now I need to try matching those up to the clients. It seems the best way would be to do a fuzzy search on the names and then, once I find all John B%, look for the one with a job completion date nearest to the posting of the review, naturally eliminating any review that was posted before the job was completed.
I put together a small sample dataset where table1 is the clients, table2 is the review to match on here:
http://sqlfiddle.com/#!9/23928c/6/0
I was initially thinking of doing a date_diff, but then I need to sort by the lowest number. Before I tackle this on my own, I thought I would ask if anyone has any tricks they want to share.
I am using PHP / Laravel to query MySQL.

You can use DATEDIFF with absolute values:
ORDER BY ABS(DATEDIFF(`date`, $calculatedDate)) ASC
This puts the records that match your estimate most closely first, whether the difference is positive or negative.
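The matching rule described in the question (name pattern first, then the nearest job completion date that is not after the review) can also be sketched outside SQL. The snippet below is Python, for illustration only; the tuple layout, the sample clients, and the helper name `best_match` are all hypothetical:

```python
from datetime import date


def best_match(clients, name_prefix, last_initial, review_date):
    """Pick the client whose job finished closest before the review date.

    clients: list of (name, completion_date) tuples.
    Matches names like "John B%": first-name prefix plus last-name initial.
    """
    candidates = []
    for name, completed in clients:
        first, _, last = name.partition(" ")
        if not first.lower().startswith(name_prefix.lower()):
            continue
        if not last.lower().startswith(last_initial.lower()):
            continue
        if completed > review_date:
            # A review cannot precede the job it is about.
            continue
        candidates.append(((review_date - completed).days, name, completed))
    if not candidates:
        return None
    # Smallest day difference first, i.e. ascending order.
    candidates.sort()
    return candidates[0][1:]


# Hypothetical client rows.
clients = [
    ("John Brown", date(2016, 1, 10)),
    ("John Baker", date(2016, 3, 1)),
    ("John Black", date(2016, 4, 20)),
    ("Jane Brown", date(2016, 3, 5)),
]

match = best_match(clients, "John", "B", date(2016, 3, 15))
```

Because the review date is itself only approximate, a real implementation might allow a small tolerance instead of the strict "not after the review" cutoff used here.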

MySql: saving date ranges VS saving single day

I am currently working on a simple booking system and I need to select some ranges and save them to a mysql database.
The problem I am facing is deciding if it's better to save a range, or to save each day separately.
There will be around 500 properties, and each will have from 2 to 5 months booked.
So the client will insert his property and will choose some dates that will be unavailable. The same will happen when someone books a property.
I was thinking of having a separate table for unavailable dates only, so if a property is booked from 10 May to 20 May, instead of having one record (2016-05-10 => 2016-05-20) I will have 10 records, one for each booked day.
I think this is easier to work with when searching between dates, but I am not sure.
Will the performance be noticeably worse?
Should I save the ranges or single days?
Thank you

I would advise that all "events" go into one table and they all have a start and end datetime. Use of indexes on these fields is of course recommended.
The reasons are that when you are looking for bookings and available events, you are not selecting from two different tables (or joining them). Storing a full range is also much better for the code, as you can easily perform the checks within a SQL query, and all PHP code that handles events works the same way for both. If you store one event type differently from another, you'll find loads of "ifs" in your code and find it harder to write the SQL.
I run many booking systems at present and have made mistakes in this area before so I know this is good advice - and also a good question.
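With start/end ranges, checking whether a new booking collides with an existing one reduces to one comparison per stored range. A small Python sketch of that test, assuming inclusive dates; the function names and the sample bookings are made up for illustration:

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if two inclusive date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a


def is_available(bookings, start, end):
    """True if [start, end] does not collide with any stored booking."""
    return not any(ranges_overlap(s, e, start, end) for s, e in bookings)


# Hypothetical stored ranges for one property.
bookings = [
    (date(2016, 5, 10), date(2016, 5, 20)),
    (date(2016, 6, 1), date(2016, 6, 7)),
]

free_late_may = is_available(bookings, date(2016, 5, 21), date(2016, 5, 31))
free_mid_may = is_available(bookings, date(2016, 5, 18), date(2016, 5, 25))
```

The same condition can be written in a WHERE clause against indexed start and end columns, so the check does not require one row per day.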

This is too much for a comment, so I will leave it as an answer.
So the table's primary key would be the property_id and the date within a particular month.
I don't recommend it. Think of a scenario where you apply this logic to a system that runs for 5 or 10 years: the performance will get worse. You will get approximately 30*12*1 = 360 rows for 1 year. Instead, implement logic to calculate the duration of a booking and store it in the table against the user.

codeigniter active record querying for results based on hours as well as by state

I have two queries that I think will ultimately be similar to each other. I have a user database from which I want to pull tracking figures based on time, for example registrations per hour. I want the query to return results in hourly (or weekly, or monthly) increments, i.e. 1,000 registrations in November, 1,014 in December, and so on, and similarly for weeks and hours.
I also have a similar query where I want to generate a list of states with a count next to each of how many users I have per state.
My issue is that I'm currently thinking too one-dimensionally: the best idea I can come up with at the moment is, in the case of the states, making 50 queries. I know that's insane, and there has to be an easier, less intensive way. So I'm hoping someone here can give me a general idea, because I don't know which is the best course of action: using DISTINCT, GROUP BY, or something else.

Experiment a bit and see if that doesn't help you focus on the question a bit more.
Try selecting from your registrations table and appending the time buckets you are interested in to the select list, like this:
select userid, regid, date_time, week(date_time), year(date_time), day(date_time)
from registrations;
You can then roll up and count things in that table by using GROUP BY and an aggregate function, like this:
select count(distinct userid), year(date_time)
from registrations
group by year(date_time)
Read about date and time functions:
MySQL Date Time Functions
Read about aggregate functions:
MySQL Group By
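GROUP BY with COUNT produces one row per bucket, which is why a single query can replace the 50 per-state queries. The same roll-up is shown below in Python, purely as an illustration of what the grouping does; the sample records are invented:

```python
from collections import Counter
from datetime import datetime

# Hypothetical registrations: (user id, state, registration time).
registrations = [
    (1, "NY", datetime(2011, 11, 3, 9, 15)),
    (2, "CA", datetime(2011, 11, 3, 9, 40)),
    (3, "NY", datetime(2011, 12, 1, 14, 5)),
    (4, "TX", datetime(2011, 12, 24, 14, 30)),
]

# Equivalent of counting rows grouped by state.
per_state = Counter(state for _, state, _ in registrations)

# Equivalent of grouping by YEAR(date_time), MONTH(date_time).
per_month = Counter((t.year, t.month) for _, _, t in registrations)

# Equivalent of grouping by calendar date and hour.
per_hour = Counter((t.date(), t.hour) for _, _, t in registrations)
```

Each Counter holds one entry per bucket, just as the grouped query returns one row per state, month, or hour.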

Too many SQL calls on page load?

I'm constructing a website for a small group of parents at a private daycare centre. One of the desired functions of the site is a calendar where you can pick which days you will be responsible for cleaning the premises. I have made a working calendar: I found a simple script online and modified it a bit to fit our purpose. Technically it works well, but I'm starting to wonder whether I should alter the way it extracts information from the database.
The calendar is presented monthly, and drawn as a table using a for-loop. That means that said for-loop is run 28-31 times each time the page is loaded depending on the month. To present who is responsible for cleaning each day, I have added a call to a MySQL database where each member's cleaning day is stored. The pseudo code looks like this, simplified:
Draw table month
for day = start_of_month to end_of_month
    type day
    select member from cleaning_schedule where picked_day = day
    type member
This means that each reload of the page makes at least 28 SELECT calls to the database, which seems inefficient to me and might make the site susceptible to a DDoS attack. Is there a more efficient way of getting the same result? There are much more complex booking calendars out there; how do they handle it?

SELECT picked_day, member FROM cleaning_schedule WHERE picked_day BETWEEN '2012-05-01' AND '2012-05-31' ORDER BY picked_day ASC
You can loop through the results of that query, each row will have a date and a person from the range you picked, in order of ascending dates.
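Looping over that single result set can be sketched as follows: index the rows by day once, then look each day up while drawing the table. Python is used for illustration only; the rows stand in for the query result, and the member names are invented:

```python
import calendar
from collections import defaultdict
from datetime import date

# Hypothetical rows as returned by the single monthly query.
rows = [
    (date(2012, 5, 2), "Anna"),
    (date(2012, 5, 2), "Erik"),
    (date(2012, 5, 17), "Maria"),
]

# Index once so each day is a dictionary lookup, not a database call.
members_by_day = defaultdict(list)
for picked_day, member in rows:
    members_by_day[picked_day].append(member)

year, month = 2012, 5
days_in_month = calendar.monthrange(year, month)[1]

# Build one output line per calendar day, including days nobody picked.
lines = []
for day_number in range(1, days_in_month + 1):
    day = date(year, month, day_number)
    members = ", ".join(members_by_day.get(day, [])) or "-"
    lines.append(f"{day.isoformat()}: {members}")
```

The loop still runs once per day, but the database is queried only once per page load.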

The MySQL query cache will save your bacon.
Short version: if you repeat the same SQL query often, it will end up being served without table access as long as the underlying tables have not changed. So the first call for a month will be about 35 SQL queries, which is a lot but not too much. The second load of the same page will get the results back blazingly fast from the cache.
In my experience, this tends to be much faster than creating fancy join queries, even where those would be possible.

Not that 28 calls is a big deal but I would use a join and call in the entire month's data in one hit. You can then iterate through the MySQL Query result as if it was an array.

You can use greater-than and less-than comparisons in SQL. So instead of doing one select per day, you can write one select for the entire month:
SELECT day, member FROM cleaning_schedule
WHERE day >= :first_day_of_month AND day <= :last_day_of_month
ORDER BY day;
Then you need to pay attention in your program to handle multiple members per day. Although the program logic will be a bit more complex, the program will be faster: The interprocess or even network based communication is a lot slower than the additional logic.
Depending on the data structure, the following statement might be possible and more convenient:
SELECT day, group_concat(member) FROM cleaning_schedule
WHERE day >= :first_day_of_month AND day <= :last_day_of_month
GROUP BY day
ORDER BY day;

28 queries isn't a massive issue and is pretty common for most commercial websites, but I'd recommend grabbing the data for each month in one hit and then looping through the records day by day.

Storing Date Range in MySQL Solution

I am working on a script that requires giving the admin the ability to insert the dates on which he wants a parking lot to be available. The admin inserts dates as a range.
I am having a hard time working out the best way to store the dates in MySQL.
Should I store the dates using two columns, AVAILABLE_FROM_DATE and AVAILABLE_UNTIL_DATE?
PLID  AVAILABLE_FROM_DATE  AVAILABLE_UNTIL_DATE
1     2012-04-01           2012-04-03
1     2012-04-05           2012-04-15
2     2012-04-21           2012-04-30
OR should I just use a single column AVAILABLE_DATE and store the range the admin selects as a new row for each date in the range?
[EDIT START]
What I mean above by using a single column is not joining or splitting the dates into a single column. I actually mean storing one date per row in a single column, like below:
PLID  AVAILABLE_DATE
1     2012-04-01
1     2012-04-02
1     2012-04-03
and so on for all the available dates I want to store.
[EDIT END]
Basically, the admin will want to insert a date range during which the parking lot is available, and members should be able to choose that slot if they are looking for a slot within that range.
OR is there some better and simpler way to do this?
I am currently trying the first method, using separate columns for the range, but I am having trouble getting the desired results when looking for parking lots within a range.
[EDIT START]
SELECT * FROM `parking_lot_dates`
WHERE (available_from_date BETWEEN '2012-04-22' AND '2012-04-30'
AND (available_until_date BETWEEN '2012-04-22' AND '2012-04-30'))
I run the query above on the rows shown earlier, and it returns nothing.
I want it to return the last row, the one with PLID 2.
[EDIT END]
Thank you in advance.

Regarding your EDIT with the query, you have the logic inside out. You need to compare whether each date you are checking is inside the range BETWEEN available_from_date and available_until_date, like this:
SELECT * FROM `parking_lot_dates`
WHERE
(
'2012-04-22' BETWEEN available_from_date AND available_until_date
AND '2012-04-30' BETWEEN available_from_date AND available_until_date
)
Demo: http://www.sqlfiddle.com/#!2/911a3/2
Edit: If you want to allow partial-range matches, you'll need both types of logic. For example, if the parking lot is available 4-22 to 4-27 and you need it 4-23 to 4-28, you can use it for the dates 4-23 to 4-27, but not 4-28.
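The partial-match case amounts to intersecting the requested range with each available range. A hedged Python sketch of that idea, assuming inclusive dates; `usable_range` is a hypothetical helper for illustration and not part of the query above:

```python
from datetime import date


def usable_range(avail_from, avail_until, want_from, want_until):
    """Return the inclusive part of the request covered by availability, or None."""
    start = max(avail_from, want_from)
    end = min(avail_until, want_until)
    if start > end:
        # The two ranges do not share a single day.
        return None
    return start, end


# The example from the text: available 4-22 to 4-27, requested 4-23 to 4-28.
partial = usable_range(
    date(2012, 4, 22), date(2012, 4, 27),
    date(2012, 4, 23), date(2012, 4, 28),
)

# A request entirely outside the available range yields no usable days.
nothing = usable_range(
    date(2012, 4, 1), date(2012, 4, 3),
    date(2012, 4, 22), date(2012, 4, 30),
)
```

A request is fully covered exactly when the returned pair equals the requested start and end dates.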

Why complicate it so much?
SELECT *
FROM `parking_lot_dates`
WHERE available_from_date <= '2012-04-22'
AND available_until_date >= '2012-04-30';

I have personally found it better to have 2 columns, a start and an end time. Searching for a specific date, or just reading the data, seems easier to me that way.

Using 1 column to store those dates is bad design from a database point of view (not normalized). It's better to have 2 columns because the results can be retrieved more easily, whereas extracting the information from a single column would mean doing some sort of split. That is not elegant, and it doesn't hold up well when requirements change.
