Detect points of change in sensor data - php

I have lots of sensor data from which I need to be able to detect changes reliably. Basically it comes from a water level sensor in a remote client, which uses an accelerometer and a float to measure the water level. My problem is that the data can be noisy at times (it varies by 2-5 units per measurement) and sometimes I need to detect changes as small as 7-9 units.
When I graph the data it's quite obvious to the human eye that there's a change, but how would I go about it programmatically? Right now I'm just trying to detect changes bigger than x, but that's not very reliable. I've attached a sample graph and marked the changes with arrows. The huge changes at the beginning are just testing, so they are not normal behaviour for the data.
The data is in a MySQL database and the code is in PHP, so if you could point me in the right direction I'd highly appreciate it!
EDIT: There can also be some spikes in the data which are not valid readings but rather errors in the data.
EDIT: Example data can be found from http://pastebin.com/x8C9AtAk
The algorithm would need to run every 30 minutes or so and should be able to detect changes within the last 2-4 pings. Pings arrive at 3-5 minute intervals.

I made some awk that you, or someone else, might like to experiment with. I average the last 10 (m) samples excluding the current one, and also average the last 2 samples (n) and then calculate the difference between the two and output a message if the absolute difference exceeds a threshold.
#!/bin/bash
awk -F, '
# j will count number of samples
# we will average last m samples and last n samples
BEGIN {j=0;m=10;n=2}
{d[j]=$3;id[j++]=$1" "$2} # Store this point in array d[]
END { # Do this at end after reading all samples
for(i=m;i<j;i++){ # Iterate over all samples, skipping the first m so the trailing average has full history
totlastm=0 # Calculate average over last m not incl current
for(k=m;k>0;k--)totlastm+=d[i-k]
avelastm=totlastm/m # Average = total/m
totlastn=0 # Calculate average over last n
for(k=n-1;k>=0;k--)totlastn+=d[i-k]
avelastn=totlastn/n # Average = total/n
dif=avelastm-avelastn # Calculate difference between ave last m and ave last n
if(dif<0)dif=-dif # Make absolute
mesg="";
if(dif>4)mesg="<-Change detected"; # Make message if change large
printf "%s: Sample[%d]=%d,ave(%d)=%.2f,ave(%d)=%.2f,dif=%.2f%s\n",id[i],i,d[i],m,avelastm,n,avelastn,dif,mesg;
}
}
' <(tr -d '"' < levels.txt)
The last bit <(tr...) just removes the double quotes before sending the file levels.txt to awk.
Here is an excerpt from the output:
18393344 2014-03-01 14:08:34: Sample[1319]=343,ave(10)=342.00,ave(2)=342.00,dif=0.00
18393576 2014-03-01 14:13:37: Sample[1320]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18393808 2014-03-01 14:18:39: Sample[1321]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18394036 2014-03-01 14:23:45: Sample[1322]=342,ave(10)=342.30,ave(2)=342.50,dif=0.20
18394266 2014-03-01 14:28:47: Sample[1323]=341,ave(10)=342.20,ave(2)=341.50,dif=0.70
18394683 2014-03-01 14:38:16: Sample[1324]=346,ave(10)=342.20,ave(2)=343.50,dif=1.30
18394923 2014-03-01 14:43:17: Sample[1325]=348,ave(10)=342.70,ave(2)=347.00,dif=4.30<-Change detected
18395167 2014-03-01 14:48:25: Sample[1326]=345,ave(10)=343.20,ave(2)=346.50,dif=3.30
18395409 2014-03-01 14:53:28: Sample[1327]=347,ave(10)=343.60,ave(2)=346.00,dif=2.40
18395645 2014-03-01 14:58:30: Sample[1328]=347,ave(10)=343.90,ave(2)=347.00,dif=3.10
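Since the question asks for PHP, here is a rough, hedged translation of the same idea into PHP: compare the average of the m samples preceding the most recent n with the average of the most recent n samples, and flag a change when the difference exceeds a threshold. The function name, the $samples array and the example values are illustrative; $samples is assumed to hold the most recent readings fetched from MySQL, ordered oldest to newest.
<?php
// Sketch of the moving-average comparison in PHP (names are illustrative).
// $samples: ordered array of numeric readings, oldest first.
function detectChange(array $samples, int $m = 10, int $n = 2, float $threshold = 4.0): bool
{
    $count = count($samples);
    if ($count < $m + $n) {
        return false; // not enough history yet
    }

    // Baseline: average of the m samples preceding the most recent n
    $baseline = array_slice($samples, $count - $m - $n, $m);
    $avgBaseline = array_sum($baseline) / $m;

    // Average of the most recent n samples
    $recent = array_slice($samples, $count - $n, $n);
    $avgRecent = array_sum($recent) / $n;

    return abs($avgRecent - $avgBaseline) > $threshold;
}

// Example with made-up readings similar to the excerpt above:
$samples = [343, 343, 342, 342, 341, 342, 343, 343, 342, 341, 346, 348];
echo detectChange($samples) ? "Change detected\n" : "No change\n";
Run every 30 minutes over, say, the last 50 readings, this mirrors the awk output above; the threshold of 4 is the same experimental value and will need tuning against the noise level.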

The right way to go about problems of this kind is to build a model of the phenomenon of interest and also a model of the noise process, and then make inferences about the phenomenon given some data. These inferences are necessarily probabilistic. The general computation you need to carry out is

P(H_k | data) = P(data | H_k) P(H_k) / sum_j P(data | H_j) P(H_j)

(a generalized form of Bayes' rule), where the H_k are all the hypotheses of interest, such as "a step of a given magnitude at a given time" or "just noise of a given magnitude". In this case there might be a large number of plausible hypotheses, covering all possible magnitudes and times. You might need to limit the range of hypotheses considered in order to make the problem tractable, e.g. only looking back a certain number of time steps.
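To make that concrete, here is a minimal PHP sketch of such a comparison. It is an illustration only, not part of the original answer: it assumes Gaussian noise with a known standard deviation, uses uniform priors, and considers just two families of hypotheses over a short window - "no change" and "a single step at position t". All names are hypothetical.
<?php
// Illustrative Bayesian comparison: "no change" vs "single step at position t",
// assuming independent Gaussian noise with standard deviation $sigma.
function logLikelihood(array $window, array $means, float $sigma): float
{
    $ll = 0.0;
    foreach ($window as $i => $x) {
        $ll += -0.5 * (($x - $means[$i]) ** 2) / ($sigma ** 2);
    }
    return $ll; // constant terms cancel when hypotheses are compared
}

function stepPosteriors(array $window, float $sigma = 3.0): array
{
    $n = count($window);
    $logLik = [];

    // H0: no change, a constant level equal to the window mean
    $mean = array_sum($window) / $n;
    $logLik['no_change'] = logLikelihood($window, array_fill(0, $n, $mean), $sigma);

    // H_t: one step at position t (mean before t, mean from t onwards)
    for ($t = 2; $t <= $n - 2; $t++) {
        $before = array_slice($window, 0, $t);
        $after  = array_slice($window, $t);
        $means  = array_merge(
            array_fill(0, $t, array_sum($before) / count($before)),
            array_fill(0, $n - $t, array_sum($after) / count($after))
        );
        $logLik["step_at_$t"] = logLikelihood($window, $means, $sigma);
    }

    // Bayes' rule with uniform priors: posterior proportional to likelihood
    $max = max($logLik);
    $unnormalized = array_map(fn($ll) => exp($ll - $max), $logLik);
    $total = array_sum($unnormalized);

    return array_map(fn($p) => $p / $total, $unnormalized);
}

// Example: the level jumps by roughly 8 units partway through the window
$posterior = stepPosteriors([342, 343, 342, 341, 343, 350, 351, 350]);
arsort($posterior);
print_r($posterior); // the best-supported hypothesis comes first
A "change detected" decision then falls out naturally: sum the posterior mass of the step hypotheses and compare it to a cut-off such as 0.95.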

Related

Word Occurrences count in PHP - scaling and optimization

You have a large collection of very short (<100 characters) to medium-long (~5k characters) text items in a MySQL table. You want a solid and fast algorithm to obtain an aggregated word occurrence count across all the items.
Items selected can be as few as one (or none) and as many as 1M+, so the size of the text that has to be analyzed can vary wildly.
The current solution involves (a rough sketch of this pipeline follows the list):
reading text from the selected records into a text variable
preg_replacing out everything that is not a "word", leaving a somewhat cleaned up, shorter text (#hashtags, @mentions, http(s):// links, numeric sequences such as phone numbers, etc. are not "words" and are stripped out)
exploding what's left into a "buffer" array
taking out everything that's shorter than two characters and dumping everything else into a "master" array
the resulting array is then transformed with array_count_values, sorted (arsort), spliced a first time down to a more manageable size, then filtered against stopword lists in several languages, processed and spliced some more, and finally output as JSON as the list of the 50 most frequent words.
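For reference, a condensed PHP sketch of that pipeline (the regex, the length threshold and the variables $rowsFromMysql and $stopwords are placeholders, not the asker's actual code):
<?php
// Condensed sketch of the pipeline described above (placeholder regex and names).
$text = implode(' ', $rowsFromMysql);                       // selected records, concatenated

// strip links, #hashtags/@mentions and numeric sequences (placeholder pattern)
$clean = preg_replace('/https?:\/\/\S+|[#@]\w+|\b\d[\d\s()-]*\b/u', ' ', $text);

$buffer = preg_split('/\s+/u', $clean, -1, PREG_SPLIT_NO_EMPTY);     // explode into words
$master = array_filter($buffer, fn($w) => mb_strlen($w) >= 2);       // drop 1-character tokens

$counts = array_count_values($master);                      // the bottleneck as item count grows
arsort($counts);
$counts = array_slice($counts, 0, 500, true);               // first splice to a manageable size
$counts = array_diff_key($counts, array_flip($stopwords));  // remove stopwords
echo json_encode(array_slice($counts, 0, 50, true));        // top 50 as JSON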
By tracing timing on the various steps of the operation sequence, the apparent bottleneck is in the query at first but as the item count increases, it rapidly moves to the array_count_values function (everything after that is more or less immediate).
On a ~ 10k items test run the total time for execution is ~3s from beginning to end; with 20k items it takes ~7s.
A (rather extreme but not impossible) case with 1.3M items takes 1 minute for the MySQL query alone, and the parsing handles roughly ~75k items per minute (so about 17 minutes is the estimate).
The result should be visualized as the response to an AJAX call, so with this kind of timing the UX is clearly disrupted. I'm looking for ways to optimize everything as much as possible. A 30s load time may be acceptable (however unrealistic), 10 minutes (or more) is not.
I've tried batch processing by array_count_values-ing the chunks and then adding the resulting count arrays to a master array by key, but that helps only so much - the sum of the parts takes about as long as (or slightly longer than) the whole.
I only need the 50 most frequent occurrences at the top of the list, so there's possibly some room for improvement by cutting a few corners.
dump the column(s) into a text file. Suggest SELECT ... INTO OUTFILE 'x.txt'...
(if on linux etc): tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c
To get the top 50:
tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c | sort -nbr | head -50
If you need to tweak the definition of "word", such as dealing with punctuation, see the documentation on tr. You may have trouble with contractions versus phrases in single-quotes.
Some of the cleansing of the text can (and should) be done in SQL:
WHERE col NOT LIKE 'http%'
Some can (and should) be done in the shell script.
For example, this will get rid of #variables and phone numbers and 0-, 1-, and 2-character "words":
egrep -v '^#|^([-0-9]+$|$|.$|..)$'
That can be in the pipe stream just before the first sort.
The only limitation is disk space. Each step in the script is quite fast.
Test case:
The table: 176K rows of mostly text (including some linefeeds)
$ wc x3.txt
3,428,398 lines 31,925,449 'words' 225,339,960 bytes
$ time tr -s '[:blank:]' '\n' <x.txt | egrep -v '^#|^([-0-9]+$|$|.$|..)$' |
sort | uniq -c | sort -nbr | head -50
658569 the
306135 and
194778 live
175529 rel="nofollow"
161684 you
156377 for
126378 that
121560 this
119729 with
...
real 2m16.926s
user 2m23.888s
sys 0m1.380s
Fast enough?
I was watching it via top. It seems that the slowest part is the first sort. The SELECT took under 2 seconds (however, the table was probably cached in RAM).

Peak counter algorithm

I was searching for an effective peak counter algorithm but couldn't find one that suits my needs:
C-like language (it will be in PHP but I could translate from JS/C/C++)
Data is not traversable forward (no access to future samples)
Static thresholds - it detects peaks that are over 270 watts, AND those peaks are short, so there is also a time restriction.
I already checked here:
Peak signal detection in realtime timeseries data
But that does not seem to be a viable solution for me.
Here is some example data:
The data has the following format:
$a = [
    'timestamp' => 1500391300.342,
    'power' => 383.87
];
On this chart there are 14 peaks, so I need an algorithm to count them in a loop.
Note that each peak can have more than 10 points. The data is a continuous array that keeps filling up; there is access to at least 100 previous points, but no access to future data.
I have also prepared a Calc (ODC) document with 15 peaks and some example data.
Here it is
So far I have a simple algorithm that counts rising slopes, but it does not work correctly (there may be no single jump over 270 W: the rise may be spread over 5 points, or the jump may take too long to count as a peak):
if ($previousLog['power'] - $deviceLog['power'] > 270) {
$numberShots++;
}
Thanks in advance, any tips would help.
Simple hysteresis should work:
1. Find a sample that is above 270 W.
2. While samples are above 240 W, keep track of the largest sample seen, and its timestamp.
3. When you find a sample that is below 240 W, go back to step 1.
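A sketch of that hysteresis in PHP, written against the record format from the question (the function name and the example $logs array are assumptions; the 270 W / 240 W thresholds are the values from the steps above):
<?php
// Hysteresis peak counter: a peak starts when power rises above $high and
// ends when it falls back below $low, so a single noisy peak is counted once.
function countPeaks(array $logs, float $high = 270.0, float $low = 240.0): int
{
    $peaks = 0;
    $inPeak = false;

    foreach ($logs as $log) {
        if (!$inPeak && $log['power'] > $high) {
            $inPeak = true;   // step 1: a sample above 270 W starts a peak
            $peaks++;
        } elseif ($inPeak && $log['power'] < $low) {
            $inPeak = false;  // step 3: a sample below 240 W ends it
        }
        // step 2 (tracking the largest sample and its timestamp) would go here
    }

    return $peaks;
}

// Example with the question's record format:
$logs = [
    ['timestamp' => 1500391300.342, 'power' => 120.00],
    ['timestamp' => 1500391310.342, 'power' => 383.87],
    ['timestamp' => 1500391320.342, 'power' => 310.00],
    ['timestamp' => 1500391330.342, 'power' => 90.00],
];
echo countPeaks($logs); // 1
If the time restriction matters, the same loop can record the timestamp at which the peak started and discard peaks that stay above the lower threshold for too long.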

Convert mysql irregular time series data to regular sequence

I have a table of temperature data, updated every 5-15 mins by multiple sensors.
The data is essentially this: unique id, device (sensor id), timestamp, value (float).
The sensors do not have accurate clocks, so the readings are doomed to skew over time, which means I can't simply use something like GROUP BY hour in MySQL to get a reading for each of the last 24 hours of temperature data.
My solution as a PHP programmer would be to make a pre-processor that reads all the unprocessed readings and "joins" them into a table.
There must be others besides me who need to "downscale" x-minute/hour readings down to one per hour, to use in, let's say, graphing.
My problem is how to calculate the rounded hour value from one or several readings.
For example, I have 12 readings over 2.5 hours, and I need an explicit value for each whole hour covered by these readings.
Data:
Date Device Value
2016-06-27 12:15:15, TA, 23.5
2016-06-27 12:30:19, TA, 23.1
2016-06-27 12:45:35, TA, 22.9
2016-06-27 13:00:55, TA, 22.5
2016-06-27 13:05:15, TA, 22.8
2016-06-27 13:35:35, TA, 23.2
I'm not that much into statistical math, so "standard deviation" and the like are cities in Russia to me.
Also, the devices sometimes go to sleep and do not always transmit a temperature.
Feel free to ask me to add info to the question, as I'm not sure what you guys need to answer this.
The most important parts are these:
1. I'm using MySQL, and that's not going to change.
2. I'm hoping for a solution (or tips) in PHP, though tips in other languages would also help my understanding. I'm primarily a PHP programmer, so answers in that language would be most appreciated.
Edit: I would like to specify a few points.
Because the time data recorded from the sensors may be inaccurate, I'm relying on the SQL insert time. That way the time is controlled by one device only, the controller that's inserting the data.
For example, if I select 30 timestamp/value pairs in a 24h period, I would like to "combine" these to 24 timestamp/value pairs, using an average to combine the overflowing data.
I'm not that good at explaining, but I hope this makes it clearer.
I'd love either a clean SQL way of doing it, or a PHP way of looping through 30 rows to produce 24 whole-hour rows of data.
My goal is to have one row for every hour, with an accurate timestamp and temperature value. Mainly because most graphing libraries expect that kind of input. Especially when I have more than one series in a graph.
At some point, I may find it useful to show a graph for let's say the last six hours, with a 15 minute accuracy.
The key point is that I don't want to change the raw data, just find a way to extract/compute linear results from it.
How I would try to handle this: take the day's start value, 01/01/2016 00:00:00, and run a BETWEEN query in MySQL, progressing hour by hour. So the first query would be something like
select avg(temp_value) from table where date between '01/01/2016 00:00:00' and '01/01/2016 00:59:59'
and so on, hour by hour.
The SQL isn't exactly correct, and the entire 24-hour period can be written out programmatically, but I think this will start you on your way.
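A rough PHP sketch of that hourly loop, using PDO; the table and column names (readings, device, inserted_at, value) and the connection details are assumptions, and the insert time is used as suggested in the question:
<?php
// Sketch: one AVG query per hour for the day being summarized.
// Table/column names (readings, device, inserted_at, value) are assumed.
$pdo = new PDO('mysql:host=localhost;dbname=sensors', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT AVG(value)
       FROM readings
      WHERE device = :device
        AND inserted_at >= :start
        AND inserted_at < :end'
);

$hourly = [];
$dayStart = new DateTime('today');          // midnight of the day to summarize
for ($h = 0; $h < 24; $h++) {
    $from = (clone $dayStart)->modify("+{$h} hours");
    $to   = (clone $from)->modify('+1 hour');

    $stmt->execute([
        ':device' => 'TA',
        ':start'  => $from->format('Y-m-d H:i:s'),
        ':end'    => $to->format('Y-m-d H:i:s'),
    ]);
    $avg = $stmt->fetchColumn();

    // Hours with no readings (sensor asleep) come back as NULL; keep them null
    $hourly[$from->format('Y-m-d H:00:00')] = is_numeric($avg) ? (float) $avg : null;
}

print_r($hourly);
The same result can also be produced in a single query with GROUP BY DATE_FORMAT(inserted_at, '%Y-%m-%d %H:00:00'), which avoids 24 round trips; the loop version is just closer to the answer as written.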

Add margin to X-axis on RRDtool graph

A customer of mine wants to get better insight into the dBm values that an optical SFP sends and receives. Every 5 minutes I poll these values and update them in an RRD file. The RRD graph I create with the RRD file as its source is created in the following way:
/usr/bin/rrdtool graph /var/www/customer/tmp/ZtIKQOJZFf.png --alt-autoscale
--rigid --start now-3600 --end now --width 800 --height 350
-c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000
-c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal
--font LEGEND:8:'DejaVuSansMono' --font AXIS:7:'DejaVuSansMono' --font-render-mode normal
-E COMMENT:'Bits/s Last Avg Max \n'
DEF:sfptxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:AVERAGE
DEF:sfprxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:AVERAGE
DEF:sfptxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:MAX
DEF:sfprxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:MAX
LINE1.25:sfptxpower#000099:'tx ' GPRINT:sfptxpower:LAST:%6.2lf%s\g
GPRINT:sfptxpower:AVERAGE:%6.2lf%s\g GPRINT:sfptxpower_max:MAX:%6.2lf%s\g
COMMENT:'\n' LINE1.25:sfprxpower#B80000:'rx '
GPRINT:sfprxpower:LAST:%6.2lf%s\g GPRINT:sfprxpower:AVERAGE:%6.2lf%s\g
GPRINT:sfprxpower_max:MAX:%6.2lf%s\g COMMENT:'\n'
which draws a graph just how it is supposed to be. However, the graph that comes out of it is not very readable as both tx and rx values make up the border of the graph:
My question therefore is: is it possible to add some sort of margin (like a percentage?) to the X-axis so that both lines can be easily seen on the graph?
RRDTool graph has four different scaling modes you can select via options: autoscale (the default), alt-autoscale, specified-expandable, and specified-rigid.
Autoscale - this scales the graph to fit the data, using the default algorithm. You get this by omitting the other scaling options. It will try to limit the Y-axis range to common ranges -- in your case, probably 0 to -5. Sometimes it works well, sometimes it doesn't.
Alt-Autoscale - this is like autoscale, but clings closely to the actual data max and min. You choose this with --alt-autoscale and it is what you are currently using.
Specified, expandable - This lets you specify a max/min for the Y axis, but they are expanded out if the data are outside this range. You choose this by specifying --upper-limit and/or --lower-limit but NOT --rigid. In your case, if you give an upper limit of -2 and a lower limit of -4 it would look good, and the graph range would be expanded if your data go to -5.
Specified, rigid - This is like above, but the limits are fixed where you specify them. If the data go outside this range, then the line is not displayed. You specify this by using --rigid when giving an upper or lower bound.
Note that with the Specified types, you can specify only one end of the range so as to get a specified type at one end and continue to use an autoscale type for the other.
From this, I would suggest that you remove the --rigid and --alt-autoscale options, and instead specify --upper-limit -2 and --lower-limit -4 to display your data more neatly. If they leave this range then you will continue to get a graph as currently - whether this works or not depends on the nature of the data and how much they normally can vary.
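Concretely, the relevant part of the command above would change roughly like this (the -2 and -4 limits are just the example values suggested above; pick whatever brackets the usual dBm range):
/usr/bin/rrdtool graph /var/www/customer/tmp/ZtIKQOJZFf.png
--upper-limit -2 --lower-limit -4
--start now-3600 --end now --width 800 --height 350
... (the remaining -c, --font, DEF, LINE and GPRINT arguments stay unchanged)
That is, --rigid and --alt-autoscale are dropped and replaced with the specified, expandable limits.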

MySQL: ignore results where difference is more than x between rows

I have a simple PHP/HTML page that runs MySQL queries to pull temperature data and display on a graph. Every once in a while there is some bad data read from my sensors (DHT11 Temp / RH sensors, read by Arduino), where there will be a spike that is too high or too low, so I know it's not a good data point. I have found this is easy to deal with if it is "way" out of range, as in not a sane temperature, I just use a BETWEEN statement to filter out any records that are not possibly true.
I do realize that ultimately this should be fixed at the source so these bad readings never post in the first place, however as a debugging tool, I do actually want to record those errors in my DB, so I can track down the points in time when my hardware was erroring.
However, this does not help with the occasional spikes that actually fall within the range of sane temperatures. For example, if it is 65 F outside and the sensor occasionally throws an odd reading, a 107 F value totally screws up my graphs, scaling, etc. I can't filter that with a BETWEEN (that I know of), because 107 F is actually a plausible summertime temperature in my region.
Is there a way to filter out values based on their neighboring rows? Taking five rows for the sake of simplicity, if their result is 77, 77, 76, 102, 77 ... can I say "anything that differs by more than (x) from the sequential rows around it should be ignored, because it's bad data"?
[/longWinded]
It is hard to answer without your schema so I did a SQLFiddle to reproduce your problem.
You need to average the temperature over a time frame and then compare that value with the current row. If the difference is too big, we don't select the row. In my Fiddle this is done by:
ABS(temp - (SELECT AVG(temp) FROM temperature AS t
            WHERE t.timeRead BETWEEN
                  DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
              AND DATE_ADD(temperature.timeRead, INTERVAL 3 HOUR))) < 8
This condition calculates the average temperature over the previous 3 hours and the next 3 hours. If the difference is more than 8 degrees, the row is skipped.
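Put together, a query along those lines could be run from PHP roughly as follows. The table and column names (temperature, temp, timeRead) follow the Fiddle, the connection details are placeholders, and the 6-hour window and 8-degree threshold are the values from above:
<?php
// Select readings, skipping rows that deviate by more than 8 degrees from the
// average of their surrounding +/- 3 hour window (names as in the Fiddle).
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'pass');

$sql = "
    SELECT timeRead, temp
      FROM temperature
     WHERE ABS(temp - (SELECT AVG(t.temp)
                         FROM temperature AS t
                        WHERE t.timeRead BETWEEN
                              DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
                          AND DATE_ADD(temperature.timeRead, INTERVAL 3 HOUR)
                       )) < 8
     ORDER BY timeRead";

foreach ($pdo->query($sql) as $row) {
    // each $row is a reading whose neighbourhood average is within 8 degrees
    printf("%s  %.1f\n", $row['timeRead'], $row['temp']);
}
Note that the correlated subquery runs for every row, so on a large table an index on timeRead (or pre-aggregating per hour, as in the previous question) keeps it usable.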
