I have a table of temperature data, updated every 5-15 mins by multiple sensors.
The data is essentially this: unique id, device(sensor id), timestamp, value(float)
The sensors do not have accurate clocks, so the readings are doomed to drift over time, which means I can't use things like GROUP BY hour in MySQL to get a reading of the last 24h of temperature data.
My solution as a PHP programmer would be to make a pre-processor that reads all the unprocessed readings and "joins" them into a table.
There must be others besides me who need to "downscale" x-minute/hour readings down to one per hour, to use in, let's say, graphing.
My problem is how to calculate the rounded hourly value from one or several readings.
For example, I have 12 readings over 2.5 hours, and I need an explicit value for each whole hour covered by these readings.
Data:
Date, Device, Value
2016-06-27 12:15:15, TA, 23.5
2016-06-27 12:30:19, TA, 23.1
2016-06-27 12:45:35, TA, 22.9
2016-06-27 13:00:55, TA, 22.5
2016-06-27 13:05:15, TA, 22.8
2016-06-27 13:35:35, TA, 23.2
I'm not that much into statistical math, so "standard deviation" and the like are cities in Russia to me.
Also, the devices go to sleep sometimes and do not always transmit a temperature.
Feel free to ask me to add info to the question, as I'm not sure what you guys need to answer this.
The most important parts are these:
1. I'm using MySQL, and that's not going to change.
2. I'm hoping for a solution (or tips) in PHP, though tips in other languages would also help my understanding; I'm primarily a PHP programmer, so answers in that language would be most appreciated.
Edit: I would like to specify a few points.
Because the time data recorded from the sensors may be inaccurate, I'm relying on the SQL insert time. That way the time is controlled by one device only, the controller that's inserting the data.
For example, if I select 30 timestamp/value pairs in a 24h period, I would like to "combine" these to 24 timestamp/value pairs, using an average to combine the overflowing data.
I'm not that good at explaining, but I hope this makes it clearer.
Also, I would love a clean SQL way of doing it, but also a PHP way of looping through 30 rows to produce 24 whole-hour rows of data.
My goal is to have one row for every hour, with an accurate timestamp and temperature value. Mainly because most graphing libraries expect that kind of input. Especially when I have more than one series in a graph.
At some point, I may find it useful to show a graph for let's say the last six hours, with a 15 minute accuracy.
The point is that I don't want to change the raw data, just find a way to extract/compute linear results from it.
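To make the "PHP way of looping" concrete, here is a rough sketch of what I have in mind; the $rows array and the 'inserted_at'/'value' keys are placeholders, not my real schema:

// Rough sketch only: $rows stands in for the last 24h of readings for one device,
// ordered by insert time; 'inserted_at' and 'value' are placeholder column names.
$rows = [
    ['inserted_at' => '2016-06-27 12:15:15', 'value' => 23.5],
    ['inserted_at' => '2016-06-27 12:30:19', 'value' => 23.1],
    ['inserted_at' => '2016-06-27 13:00:55', 'value' => 22.5],
];

// Bucket every reading by its whole hour, then average each bucket.
$buckets = [];
foreach ($rows as $row) {
    $hour = date('Y-m-d H:00:00', strtotime($row['inserted_at']));
    $buckets[$hour][] = $row['value'];
}

$hourly = [];
foreach ($buckets as $hour => $values) {
    $hourly[] = ['hour' => $hour, 'value' => array_sum($values) / count($values)];
}
// $hourly now holds one timestamp/value pair per hour that had at least one reading;
// hours where the device slept simply have no row.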
How I would try to handle this:
Take the day's start value, 01/01/2016 00:00:00, and do a BETWEEN query in MySQL, progressing every hour. So the first query would be something like:
select avg(temp_value) from table where date between '2016-01-01 00:00:00' and '2016-01-01 00:59:59'
The SQL will need adapting to your schema, and the entire 24-hour period can be written out programmatically, but I think this will start you on your way.
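If you'd rather not run 24 separate queries, a single GROUP BY on the truncated insert hour does the same job in one pass. This is only a sketch; the readings table and the device/inserted_at/temp_value columns are guesses based on the description above:

// Sketch: hourly averages for the last 24 hours in one query.
// Table and column names (readings, device, inserted_at, temp_value) are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=sensors;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->query("
    SELECT device,
           DATE_FORMAT(inserted_at, '%Y-%m-%d %H:00:00') AS hour_start,
           AVG(temp_value) AS avg_value
    FROM readings
    WHERE inserted_at >= NOW() - INTERVAL 24 HOUR
    GROUP BY device, hour_start
    ORDER BY device, hour_start
");
$hourly = $stmt->fetchAll(PDO::FETCH_ASSOC);
// Each row looks like ['device' => 'TA', 'hour_start' => '2016-06-27 12:00:00', 'avg_value' => '23.1666']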
Related
I am currently working on a simple booking system and I need to select some ranges and save them to a mysql database.
The problem I am facing is deciding if it's better to save a range, or to save each day separately.
There will be around 500 properties, and each will have from 2 to 5 months booked.
So the client will insert his property and will choose some dates that will be unavailable. The same will happen when someone books a property.
I was thinking of having a separate table for unavailable dates only, so if a property is booked from 10 May to 20 May, instead of having one record (2016-05-10 => 2016-05-20) I will have 10 records, one for each booked day.
I think this is easier to work with when searching between dates, but I am not sure.
Will the performance be noticeably worse?
Should I save the ranges or single days?
Thank you
I would advise that all "events" go into one table and that they all have a start and end datetime. Using indexes on these fields is of course recommended.
The reasons: when you are looking for bookings and available events, you are not selecting from two different tables (or joining them). And storing the full range is much better for the code, as you can easily perform the checks within a SQL query, and all the PHP code that handles events works the same way for both. If you store one event type differently from another, you'll find loads of "if"s in your code and find it harder to write the SQL.
I run many booking systems at present and have made mistakes in this area before so I know this is good advice - and also a good question.
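For example, with a single events table holding a start and end per row, an availability check for a requested range becomes one overlap query. A sketch, where the events table and the property_id/starts_at/ends_at columns are only illustrative:

// Sketch of the single-table overlap check; two ranges collide when
// existing.starts_at < wanted_end AND existing.ends_at > wanted_start.
$pdo  = new PDO('mysql:host=localhost;dbname=bookings;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare("
    SELECT COUNT(*)
    FROM events
    WHERE property_id = :property
      AND starts_at < :wanted_end
      AND ends_at   > :wanted_start
");
$stmt->execute([
    ':property'     => 1,
    ':wanted_start' => '2016-05-10 00:00:00',
    ':wanted_end'   => '2016-05-20 00:00:00',
]);
$isAvailable = ((int) $stmt->fetchColumn()) === 0;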
This is too much for a comment, so I will leave it as an answer.
With that design, the table's primary key would be the property_id plus the date of the particular day.
I don't recommend it. Think of a scenario where you apply this logic to a system running for 5 or 10 years: the performance will get worse. You will get approximately 30 * 12 * 1 = 360 rows per property for one year. Instead, implement logic that calculates the duration of a booking and add it to the table against the user.
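A rough sketch of that idea, storing one row per booking with its duration computed up front (the bookings table and every column name here are made up for illustration):

// Sketch only: one row per booking, duration calculated in PHP before inserting.
$starts = '2016-05-10';
$ends   = '2016-05-20';
$durationDays = (new DateTime($starts))->diff(new DateTime($ends))->days;

$pdo  = new PDO('mysql:host=localhost;dbname=bookings;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare("
    INSERT INTO bookings (property_id, user_id, starts_on, ends_on, duration_days)
    VALUES (?, ?, ?, ?, ?)
");
$stmt->execute([1, 42, $starts, $ends, $durationDays]);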
I have a simple PHP/HTML page that runs MySQL queries to pull temperature data and display it on a graph. Every once in a while there is some bad data read from my sensors (DHT11 temp/RH sensors, read by an Arduino): there will be a spike that is too high or too low, so I know it's not a good data point. I have found this is easy to deal with if it is "way" out of range, as in not a sane temperature; I just use a BETWEEN statement to filter out any records that cannot possibly be true.
I do realize that ultimately this should be fixed at the source so these bad readings never post in the first place, however as a debugging tool, I do actually want to record those errors in my DB, so I can track down the points in time when my hardware was erroring.
However, this does not help with the occasional spikes that actually fall within the range of sane temperatures. For example, if it is 65 F outside and the sensor occasionally throws an odd reading and I get a 107 F reading, it totally screws up my graphs, scaling, etc. I can't filter that with a BETWEEN (that I know of), because 107 F is actually a plausible summertime temperature in my region.
Is there a way to filter out values based on their neighboring rows? Say I am reading five rows for the sake of simplicity and their result is 77, 77, 76, 102, 77: can I say "anything that is more than (x) difference from the sequential rows around it, ignore it because it's bad data"?
[/longWinded]
It is hard to answer without your schema, so I made a SQLFiddle to reproduce your problem.
You need to average the temperature over a time window and then compare that value with the current row. If the difference is too big, we don't select the row. In my Fiddle this is done by:
abs(temp - (SELECT AVG(temp) FROM temperature AS t
            WHERE t.timeRead BETWEEN
                  DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
                  AND
                  DATE_ADD(temperature.timeRead, INTERVAL 3 HOUR))) < 8
This condition calculates the average temperature over the previous 3 hours and the next 3 hours. If the difference is more than 8 degrees, we skip the row.
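Put together as a complete query (the temperature/temp/timeRead names come from my Fiddle, so they will need mapping onto your real schema), it might be run from PHP like this:

// Sketch of the full query: keep only rows within 8 degrees of the average
// of their +/- 3 hour neighbourhood. Schema names are from the Fiddle, not yours.
$pdo  = new PDO('mysql:host=localhost;dbname=sensors;charset=utf8mb4', 'user', 'pass');
$rows = $pdo->query("
    SELECT temperature.timeRead, temperature.temp
    FROM temperature
    WHERE ABS(temperature.temp - (
              SELECT AVG(t.temp)
              FROM temperature AS t
              WHERE t.timeRead BETWEEN DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
                                   AND DATE_ADD(temperature.timeRead, INTERVAL 3 HOUR)
          )) < 8
    ORDER BY temperature.timeRead
")->fetchAll(PDO::FETCH_ASSOC);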
I have a system that logs date:time and it returns results such as:
05.28.2013 11:58pm
05.27.2013 10:20pm
05.26.2013 09:47pm
05.25.2013 07:30pm
05.24.2013 06:24pm
05.23.2013 05:36pm
What I would like to be able to do is have a list of date:time predictions for the next few days, so a person could see when the next event might occur.
Example of prediction results:
06.01.2013 04:06pm
05.31.2013 03:29pm
05.30.2013 01:14pm
Thoughts on how to go about doing time prediction of this kind with php?
The basic answer is "no". Programming tools are not designed to do prediction. Statistical tools are designed for that purpose. You should be thinking more about R, SPSS, SAS, or some other similar tool. Some databases have rudimentary data analysis tools built-in, which is another (often inferior) option.
The standard statistical technique for time-series prediction is called ARIMA analysis (auto-regressive integrated moving average). It is unlikely that you are going to be implementing that in php/SQL. The standard statistical technique for estimating time between events is Poisson regression. It is also highly unlikely that you are going to be implementing that in php/SQL.
I observe that your data points are once per day in the evening. I might guess that this is the end of some process that runs during the day. The end time is based on the start time and the duration of the process.
What can you do? Often a reasonable prediction is "what happened yesterday". You would be surprised at how hard it is to beat this prediction for weather forecasting and for estimating the stock market. Another very reasonable method is the average of historical values.
If you know something about your process, then an average by day of the week can work well. You can also get more sophisticated, and do Monte Carlo estimates, by measuring the average and standard deviation, and then pulling a random value from a statistical distribution. However, the average value would work just as well in your case.
I would suggest that you study a bit about statistics/data mining/predictive analytics before attempting to do any "predictions". At the very least, if you really have a problem in this domain, you should be looking for the right tools to use.
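That said, the "average of historical values" idea is simple enough to sketch in PHP: take the average gap between the logged events and project it forward. This is a rough guess, not a statistical forecast; the timestamps are the ones from the question:

// Sketch: predict the next few events by adding the average historical gap
// to the last known event.
$log = [
    '2013-05-23 17:36:00',
    '2013-05-24 18:24:00',
    '2013-05-25 19:30:00',
    '2013-05-26 21:47:00',
    '2013-05-27 22:20:00',
    '2013-05-28 23:58:00',
];
$epochs = array_map('strtotime', $log);
sort($epochs);

// Average interval between consecutive events.
$gaps = [];
for ($i = 1, $n = count($epochs); $i < $n; $i++) {
    $gaps[] = $epochs[$i] - $epochs[$i - 1];
}
$avgGap = array_sum($gaps) / count($gaps);

// Project the next three occurrences from the last known event.
$last = end($epochs);
for ($k = 1; $k <= 3; $k++) {
    echo date('m.d.Y h:ia', (int) round($last + $k * $avgGap)), PHP_EOL;
}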
As Gordon Linoff posted, the simple answer is "no", but you can write some code that will give a rough guess on what the next time will be.
I wrote a very basic example on how to do this on my site http://livinglion.com/2013/05/next-occurrence-in-datetime-sequence/
Here's a possible way that this could be done, using PHP + MySQL:
You can have a table with two fields: a DATE field and a TIME field (essentially storing the date + time portion separately). Say that the table is named "timeData" and the fields are:
eventDate: date
eventTime: time
Your primary key would be the combination of eventDate and eventTime, so that they're never repeated as a pair.
Then, you can do a query like:
SELECT eventTime, count(*) as counter FROM timeData GROUP BY eventTime ORDER BY counter DESC LIMIT 0, 10
The aforementioned query returns the 10 most frequent event times, ordered by frequency. You can then order these again from smallest to largest.
This way, you can return quite accurate time prediction results, which will become even more accurate as you gather data each day.
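A hedged sketch of how that query might be fed back into PHP with PDO (the connection details are placeholders):

// Run the frequency query, then re-sort the top times from smallest to largest.
$pdo  = new PDO('mysql:host=localhost;dbname=events;charset=utf8mb4', 'user', 'pass');
$rows = $pdo->query("
    SELECT eventTime, COUNT(*) AS counter
    FROM timeData
    GROUP BY eventTime
    ORDER BY counter DESC
    LIMIT 0, 10
")->fetchAll(PDO::FETCH_ASSOC);

usort($rows, function ($a, $b) {
    return strcmp($a['eventTime'], $b['eventTime']);   // TIME strings sort chronologically
});

foreach ($rows as $row) {
    echo $row['eventTime'], ' (seen ', $row['counter'], ' times)', PHP_EOL;
}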
Hep hey!
I am building a statistical overview of how many people are supposed to be at work in any given 5-minute interval on a given day.
Say we have 6 people working at 10.50, the same at 10.55, then one goes home and we have 5 people working at 11.00.
Now, the way I imagined keeping track of this was to have an array with 12 x 24 = 288 elements (1 element per 5 minutes over a 24-hour interval), where I run through each employee's shift and increment the elements for the 5-minute intervals their shift covers.
(Say a person works from 9.00 to 10.00; then I will increment the values for 9.00, 9.05, 9.10 and so on up to 10.00 by one.)
I need the data to make a diagram later, that is why i store it in an array.
Now my question is: which is the fastest way to do this?
Should I start out with an array that contains all the time elements and increment them as I run through the employees' shift hours ($arr['9.05']++), or should I start with an empty array and check whether the element for a given time exists, creating it if not and incrementing it if it does?
Or is there in general a smarter way to do this?
I ask because I can see this becoming a pretty heavy operation if you have 50+ employees who have to run through this function, so the smarter it can be made, the better :)
PS: The shift times come from a database that I do not have access to, so I only have the timestamps of the start and the finish of each shift.
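To show what I mean, here is a minimal sketch of the pre-filled-array variant; the $shifts timestamps are made up, since I only get start/finish pairs from the other database:

// Sketch: one counter per 5-minute slot over 24 hours (12 * 24 = 288 slots), keyed like '9.05'.
$shifts = [
    ['start' => strtotime('2016-01-01 09:00'), 'end' => strtotime('2016-01-01 10:00')],
    ['start' => strtotime('2016-01-01 10:50'), 'end' => strtotime('2016-01-01 13:05')],
];

$slots = [];
for ($minute = 0; $minute < 24 * 60; $minute += 5) {
    $slots[sprintf('%d.%02d', floor($minute / 60), $minute % 60)] = 0;
}

// Increment every slot that falls inside a shift (inclusive at both ends,
// matching the 9.00-to-10.00 example above).
foreach ($shifts as $shift) {
    $day = date('Y-m-d ', $shift['start']);
    foreach (array_keys($slots) as $key) {
        $slotTime = strtotime($day . str_replace('.', ':', $key));
        if ($slotTime >= $shift['start'] && $slotTime <= $shift['end']) {
            $slots[$key]++;
        }
    }
}
// $slots now feeds straight into the diagram.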
I've been stuck on this for two days and have gotten nowhere. I tend to think ahead to the problems that will come up in the future. My server's time is set to UTC, the Linux box is fully updated with the time zones, and so is the data in my database.
I'll explain my system for the best answer.
This site sells "items" but can only sell during each store's open hours. The stores can have split hours, e.g. open from 8am-12pm and 1pm-8pm, etc.
So my hours table looks like:
id (1) | store_id (1) | opens (08:00) | closes (21:00)
Above has the sample data next to the column name. Basically store id#1 may be in Los Angeles (US/Pacific) or it may be in New York City (US/Eastern).
What is the best way to ensure that I don't miss an hour of downtime, so I can disallow users from ordering from these stores during their off hours? If I'm off by one hour, that's one hour when no one can order even though the store is really open, or an hour when users will order even though it is really closed, or vice versa depending on the time changes.
Has anyone dealt with this? And if so, how did you do it?
What is the best way to go about solving this issue? I've been dealing with it and it's been eating my brain for the past 48 hours.
Please help! :)
It's actually super easy to achieve in Postgres and MySQL. All you have to do is store the time zone alongside each store, set the server TZ to UTC, then convert between the two.
EX:
SELECT
  CASE WHEN (
    (CAST((CURRENT_TIMESTAMP at time zone s.timezone) as time) BETWEEN h.opens AND h.closes) AND
    h.day = extract(dow from CURRENT_TIMESTAMP at time zone s.timezone)
  ) THEN 0 ELSE 1 END as closed
FROM store s
LEFT JOIN store_hours h ON s.id = h.store_id
WHERE h.day = extract(dow from CURRENT_TIMESTAMP at time zone s.timezone)
It's something like that. I had to do the typecasting that way because I was limited to using Doctrine 1.2.
Works like a charm even with DST changes.
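In MySQL syntax the same idea might look roughly like the sketch below. It needs the MySQL time zone tables to be loaded for CONVERT_TZ() to accept named zones, and note that DAYOFWEEK() numbers days 1 = Sunday, unlike Postgres' dow; the table and column names just follow the query above:

// Sketch of the MySQL flavour: convert the current UTC time into each store's
// local time zone, then compare it against that store's hours for the local day of week.
$pdo  = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$rows = $pdo->query("
    SELECT s.id,
           CASE WHEN TIME(CONVERT_TZ(UTC_TIMESTAMP(), 'UTC', s.timezone))
                     BETWEEN h.opens AND h.closes
                THEN 0 ELSE 1 END AS closed
    FROM store s
    LEFT JOIN store_hours h
           ON s.id = h.store_id
          AND h.day = DAYOFWEEK(CONVERT_TZ(UTC_TIMESTAMP(), 'UTC', s.timezone))
")->fetchAll(PDO::FETCH_ASSOC);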
One thing to bear in mind is that some places (think Arizona) don't do DST. You might want to make sure that your database has enough information so you can distinguish between LA and Phoenix should that prove necessary.
Assuming you follow ITroubs' advice, and put offsets in the database (and possibly information about whether a store is in a DST-respecting locale), you could do the following:
Build your code so it checks whether DST is in effect and builds your queries appropriately. If all your stores are in NY and LA, then you can just add 1 to the offset when needed. If not, you'll need a query that uses different rules for DST and non-DST stores. Something like:
SELECT store_id
FROM hours
WHERE
(supportsDST = true AND opens < dstAdjustedNow AND closes > dstAdjustedNow)
OR (supportsDST = false AND opens < UTCNow AND closes > UTCNow)
If you go this route, I recommend trying to centralize and isolate the code that deals with this as much as possible.
Also, you don't mention this, but I assume that a store with split time would have two rows in the hours table, one for each block that it's open.
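For the split-hours example from the question (open 8am-12pm and 1pm-8pm), that would simply be two rows in the same format as above:
id (2) | store_id (1) | opens (08:00) | closes (12:00)
id (3) | store_id (1) | opens (13:00) | closes (20:00)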