A customer of mine wants better insight into the dBm values that an optical SFP sends and receives. Every 5 minutes I poll these values and update an RRD file. The graph that uses this RRD file as its source is created in the following way:
/usr/bin/rrdtool graph /var/www/customer/tmp/ZtIKQOJZFf.png --alt-autoscale
--rigid --start now-3600 --end now --width 800 --height 350
-c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000
-c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal
--font LEGEND:8:'DejaVuSansMono' --font AXIS:7:'DejaVuSansMono' --font-render-mode normal
-E COMMENT:'Bits/s Last Avg Max \n'
DEF:sfptxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:AVERAGE
DEF:sfprxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:AVERAGE
DEF:sfptxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:MAX
DEF:sfprxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:MAX
LINE1.25:sfptxpower#000099:'tx ' GPRINT:sfptxpower:LAST:%6.2lf%s\g
GPRINT:sfptxpower:AVERAGE:%6.2lf%s\g GPRINT:sfptxpower_max:MAX:%6.2lf%s\g
COMMENT:'\n' LINE1.25:sfprxpower#B80000:'rx '
GPRINT:sfprxpower:LAST:%6.2lf%s\g GPRINT:sfprxpower:AVERAGE:%6.2lf%s\g
GPRINT:sfprxpower_max:MAX:%6.2lf%s\g COMMENT:'\n'
This draws the graph just as it is supposed to. However, the result is not very readable, because the tx and rx lines sit right on the borders of the graph:
My question therefore is: is it possible to add some sort of margin (a percentage, perhaps?) to the Y-axis so that both lines can be easily seen on the graph?
RRDTool graph has four different scaling modes you can select via options: autoscale (the default), alt-autoscale, specified-expandable, and specified-rigid.
Autoscale - this scales the graph to fit the data using the default algorithm. You get it by omitting the other scaling options. It tries to snap the Y-axis to common ranges -- in your case, probably 0 to -5. Sometimes it works well, sometimes it doesn't.
Alt-Autoscale - this is like autoscale, but clings closely to the actual data max and min. You choose this with --alt-autoscale and it is what you are currently using.
Specified, expandable - This lets you specify a max/min for the Y axis, but they are expanded out if the data are outside this range. You choose this by specifying --upper-limit and/or --lower-limit but NOT --rigid. In your case, if you give an upper limit of -2 and a lower limit of -4 it would look good, and the graph range would be expanded if your data go to -5.
Specified, rigid - This is like above, but the limits are fixed where you specify them. If the data go outside this range, then the line is not displayed. You specify this by using --rigid when giving an upper or lower bound.
Note that with the Specified types you can set just one end of the range, giving a fixed limit at that end while the other end continues to autoscale.
Based on this, I would suggest removing the --rigid and --alt-autoscale options and instead specifying --upper-limit -2 and --lower-limit -4 to display your data more neatly. If the data leave this range, the axis expands and you get a graph much like the current one - how well this works depends on the nature of the data and how much they normally vary.
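For example, a trimmed-down sketch of your command showing just the scaling change (the colour, font and GPRINT legend arguments are omitted here for brevity and stay exactly as you have them):
/usr/bin/rrdtool graph /var/www/customer/tmp/ZtIKQOJZFf.png \
    --upper-limit -2 --lower-limit -4 \
    --start now-3600 --end now --width 800 --height 350 \
    DEF:sfptxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:AVERAGE \
    DEF:sfprxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:AVERAGE \
    LINE1.25:sfptxpower#000099:'tx ' \
    LINE1.25:sfprxpower#B80000:'rx '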
You have a large collection of very short (<100 characters) to medium-long (~5k characters) text items in a MySQL table. You want a solid and fast algorithm to obtain an aggregated word occurrence count across all the items.
Items selected can be as few as one (or none) and as many as 1M+, so the size of the text that has to be analyzed can vary wildly.
The current solution involves:
reading text from the selected records into a text variable
preg_replacing out everything that is not a "word", yielding a somewhat cleaned-up, shorter text (hashtags, @mentions, http(s):// links, numeric sequences such as phone numbers, etc. are not "words" and are parsed out)
exploding what's left into a "buffer" array
taking out everything that's shorter than two characters and dumping everything else into a "master" array
the resulting array is then transformed with array_count_values, sorted (arsort), spliced a first time to have a more manageable array size, then parsed against stopwords lists in several languages, processed and spliced some more and finally output in JSON form as the list of the 50 most frequent words.
Tracing the timing of the various steps shows that the apparent bottleneck is the query at first, but as the item count increases it rapidly moves to the array_count_values call (everything after that is more or less immediate).
On a ~ 10k items test run the total time for execution is ~3s from beginning to end; with 20k items it takes ~7s.
A (rather extreme but not impossible) case with 1.3M items takes 1m for the MySQL query and then parses roughly ~75k items per minute (so 17 minutes or so is the estimate).
The result is meant to be displayed in response to an AJAX call, so with this kind of timing the UX is evidently disrupted. I'm looking for ways to optimize everything as much as possible. A 30s load time might be acceptable (however unrealistic), 10 minutes (or more) is not.
I've tried batch processing by array_count_values-ing the chunks and then adding the resulting count arrays into a master array by key, but that helps only so much - the sum of the parts is equal to (or slightly larger than) the total in terms of timing.
I only need the 50 most frequent occurrences at the top of the list, so there's possibly some room for improvement by cutting a few corners.
Dump the column(s) into a text file. Suggest SELECT ... INTO OUTFILE 'x.txt'...
(if on Linux etc.): tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c
To get the top 50:
tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c | sort -nbr | head -50
If you need to tweak the definition of "word", such as dealing with punctuation, see the documentation on tr. You may have trouble with contractions versus phrases in single-quotes.
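For instance, a variant of the pipeline (a sketch, not tested on your data) that splits on anything other than letters, digits and apostrophes - so contractions like don't survive - and folds upper case to lower case before counting:
tr -cs "[:alnum:]'" '\n' <x.txt | tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -nbr | head -50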
Some of the cleansing of the text can (and should) be done in SQL:
WHERE col NOT LIKE 'http%'
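Putting the dump step together, here is a sketch assuming a hypothetical table items with the text in a column body (adjust the names to your schema; the output path must be writable by the MySQL server and may be restricted by secure_file_priv):
# database, table and column names below are placeholders
mysql -u youruser -p yourdb <<'SQL'
SELECT body
FROM items
WHERE body NOT LIKE 'http%'
INTO OUTFILE '/tmp/x.txt';
SQL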
Some can (and should) be done in the shell script.
For example, this will get rid of #variables and phone numbers and 0-, 1-, and 2-character "words":
egrep -v '^#|^([-0-9]+$|$|.$|..)$'
That can be in the pipe stream just before the first sort.
The only limitation is disk space. Each step in the script is quite fast.
Test case:
The table: 176K rows of mostly text (including some linefeeds)
$ wc x3.txt
3,428,398 lines 31,925,449 'words' 225,339,960 bytes
$ time tr -s '[:blank:]' '\n' <x.txt | egrep -v '^#|^([-0-9]+$|$|.$|..)$' |
sort | uniq -c | sort -nbr | head -50
658569 the
306135 and
194778 live
175529 rel="nofollow"
161684 you
156377 for
126378 that
121560 this
119729 with
...
real 2m16.926s
user 2m23.888s
sys 0m1.380s
Fast enough?
I was watching it via top. It seems that the slowest part is the first sort. The SELECT took under 2 seconds (however, the table was probably cached in RAM).
We have MRTG set up to monitor the network, and it uses RRDtool to store and plot the graph data. I have created a script which fetches data from the RRD files; from the fetched data I need the max in and max out values over 24 hours. With these max values I calculate the bandwidth utilization for each customer/link.
My question is: is there a single rrdtool command to fetch the max in, max out, min in and min out values from an RRD file?
Since I am a newbie to RRD, I would appreciate it if the command were included with your solution.
Please help.
With MRTG-created RRD files, the 'in' and 'out' datasources are named 'ds0' and 'ds1' respectively. There are 8 RRAs; these correspond to granularities of 5min, 30min, 2hr and 1day, with both AVG and MAX rollups. By default these will be of length 400 (older versions of MRTG) or length 800 (newer versions of MRTG), which means you are likely to have a time window of 2 days, 2 weeks, 2 months and 2 years respectively for these RRAs. (Note that RRDtool 1.5 may omit the 1pdp MAX RRA, as it is functionally identical to the 1pdp AVG RRA.)
What this means for you is the following:
You do not have a MIN type RRA. If working over the most recent 2 days, then this can be calculated from the highest-granularity AVG RRA. Otherwise, your data will be increasingly inaccurate.
Your lowest-granularity RRA holds MAX values on a per-day basis. However, these days are split at midnight UTC rather than midnight local time. You do not specify which 24hr windows you need to calculate for.
If you are only interested in calculating for the most recent 24h period, then all calculations can use the highest-granularity RRA.
Note that, because step boundaries are all calculated in UTC, unless you live in that timezone you can't use fetch or xport to obtain the data you need, as you need to summarise over a general time window.
To retrieve the data you can use something like this:
rrdtool graph /dev/null -e 00:00 -s "end-1day" --step 300
DEF:inrmax=target.rrd:ds0:AVERAGE:step=300:reduce=MAXIMUM
DEF:outrmax=target.rrd:ds1:AVERAGE:step=300:reduce=MAXIMUM
DEF:inrmin=target.rrd:ds0:AVERAGE:step=300:reduce=MINIMUM
DEF:outrmin=target.rrd:ds1:AVERAGE:step=300:reduce=MINIMUM
VDEF:inmax=inrmax,MAXIMUM
VDEF:inmin=inrmin,MINIMUM
VDEF:outmax=outrmax,MAXIMUM
VDEF:outmin=outrmin,MINIMUM
LINE:inrmax
PRINT:inmax:"In Max=%lf"
PRINT:inmin:"In Min=%lf"
PRINT:outmax:"Out Max=%lf"
PRINT:outmin:"Out Min=%lf"
A few notes on this:
We use 'graph' rather than fetch or xport so that we can use a generic time window; fetch and xport only work on step boundaries.
We are summarising the highest-granularity RRA on the fly
We use /dev/null as we don't actually want the graph image
We have to define a dummy line in the graph else we get nothing
The DEF lines specify the highest-granularity step and a reduction CF. You might be able to skip this part if you're using 5min step
We calculate the summary values using VDEF and then print them on stdout using PRINT
The first line of the output will be the graph size; you can discard this
When you call rrdtool::graph from your PHP script, simply pass it the parameters in the same way as you would for command-line operation. If you're not using Linux you might need to use something other than /dev/null.
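For example, a minimal shell sketch that runs the same kind of command (reusing the target.rrd and ds0 names from the command above; ds1 works the same way) and keeps only the PRINT output lines, wherever the graph-size line appears:
rrdtool graph /dev/null -e 00:00 -s "end-1day" --step 300 \
    DEF:inrmax=target.rrd:ds0:AVERAGE:step=300:reduce=MAXIMUM \
    DEF:inrmin=target.rrd:ds0:AVERAGE:step=300:reduce=MINIMUM \
    VDEF:inmax=inrmax,MAXIMUM \
    VDEF:inmin=inrmin,MINIMUM \
    LINE:inrmax \
    PRINT:inmax:"In Max=%lf" \
    PRINT:inmin:"In Min=%lf" \
    | grep '='    # keep only the name=value lines, drop the graph-size line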
I have lots of sensor data in which I need to detect changes reliably. Basically it comes from a water level sensor in a remote client, which uses an accelerometer & float to measure the water level. My problem is that the data can be noisy at times (it varies by 2-5 units per measurement) and sometimes I need to detect changes as small as 7-9 units.
When I graph the data it's quite obvious to the human eye that there's a change, but how would I go about it programmatically? Right now I'm just trying to detect changes bigger than x, but it's not very reliable. I've attached a sample graph and marked the changes with arrows. The huge changes at the beginning are just testing, so they're not normal behaviour for the data.
The data is in a MySQL database and the code is in PHP, so if you could point me in the right direction I'd highly appreciate it!
EDIT: There can also be some spikes in the data which are not valid changes but rather glitches in the data.
EDIT: Example data can be found from http://pastebin.com/x8C9AtAk
The algorithm would need to run every 30 mins or so and should be able to detect changes within the last 2-4 pings. Each ping arrives at a 3-5 minute interval.
I made some awk that you, or someone else, might like to experiment with. I average the last 10 (m) samples excluding the current one, and also average the last 2 samples (n) and then calculate the difference between the two and output a message if the absolute difference exceeds a threshold.
#!/bin/bash
awk -F, '
# j will count number of samples
# we will average last m samples and last n samples
BEGIN {j=0;m=10;n=2}
{d[j]=$3;id[j++]=$1" "$2} # Store this point in array d[]
END { # Do this at end after reading all samples
for(i=m;i<j;i++){ # Iterate over all samples, skipping the first m so both averages have full windows
totlastm=0 # Calculate average over last m not incl current
for(k=m;k>0;k--)totlastm+=d[i-k]
avelastm=totlastm/m # Average = total/m
totlastn=0 # Calculate average over last n
for(k=n-1;k>=0;k--)totlastn+=d[i-k]
avelastn=totlastn/n # Average = total/n
dif=avelastm-avelastn # Calculate difference between ave last m and ave last n
if(dif<0)dif=-dif # Make absolute
mesg="";
if(dif>4)mesg="<-Change detected"; # Make message if change large
printf "%s: Sample[%d]=%d,ave(%d)=%.2f,ave(%d)=%.2f,dif=%.2f%s\n",id[i],i,d[i],m,avelastm,n,avelastn,dif,mesg;
}
}
' <(tr -d '"' < levels.txt)
The last bit <(tr...) just removes the double quotes before sending the file levels.txt to awk.
Here is an excerpt from the output:
18393344 2014-03-01 14:08:34: Sample[1319]=343,ave(10)=342.00,ave(2)=342.00,dif=0.00
18393576 2014-03-01 14:13:37: Sample[1320]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18393808 2014-03-01 14:18:39: Sample[1321]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18394036 2014-03-01 14:23:45: Sample[1322]=342,ave(10)=342.30,ave(2)=342.50,dif=0.20
18394266 2014-03-01 14:28:47: Sample[1323]=341,ave(10)=342.20,ave(2)=341.50,dif=0.70
18394683 2014-03-01 14:38:16: Sample[1324]=346,ave(10)=342.20,ave(2)=343.50,dif=1.30
18394923 2014-03-01 14:43:17: Sample[1325]=348,ave(10)=342.70,ave(2)=347.00,dif=4.30<-Change detected
18395167 2014-03-01 14:48:25: Sample[1326]=345,ave(10)=343.20,ave(2)=346.50,dif=3.30
18395409 2014-03-01 14:53:28: Sample[1327]=347,ave(10)=343.60,ave(2)=346.00,dif=2.40
18395645 2014-03-01 14:58:30: Sample[1328]=347,ave(10)=343.90,ave(2)=347.00,dif=3.10
The right way to go about problems of this kind is to build a model of the phenomenon of interest and also a model of the noise process, and then make inferences about the phenomenon given some data. These inferences are necessarily probabilistic. The general computation you need to carry out is P(H_k | data) = P(data | H_k) P(H_k) / (sum_j P(data | H_j) P(H_j)) (a generalized form of Bayes' rule), where the H_k are all the hypotheses of interest, such as "step of magnitude m at time t" or "noise of magnitude s". In this case there might be a large number of plausible hypotheses, covering all possible magnitudes and times. You might need to limit the range of hypotheses considered in order to make the problem tractable, e.g. only looking back a certain number of time steps.
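To make that concrete, here is a minimal sketch under strong simplifying assumptions: uniform priors over the hypotheses, Gaussian noise of known level, and a maximum-likelihood shortcut instead of the full posterior, in which case the most probable hypothesis is simply the one with the smallest residual sum of squares. It reuses the levels.txt file and comma-separated third column from the awk example above; the noise level sigma and the threshold are assumed values you would tune on real data.
#!/bin/bash
awk -F, '
{ x[n++] = $3 }                        # read the level column into x[]
END {
    if (n < 4) exit                    # not enough samples to test
    # H0: no change - a single mean over the whole window
    tot = 0; for (i = 0; i < n; i++) tot += x[i]
    mu = tot / n
    rss0 = 0; for (i = 0; i < n; i++) rss0 += (x[i] - mu) ^ 2
    # H_k: a step between samples k-1 and k - one mean before, another after
    best = rss0; bestk = -1
    for (k = 2; k <= n - 2; k++) {
        s1 = 0; for (i = 0; i < k; i++) s1 += x[i]
        m1 = s1 / k; m2 = (tot - s1) / (n - k)
        rss = 0
        for (i = 0; i < k; i++) rss += (x[i] - m1) ^ 2
        for (i = k; i < n; i++) rss += (x[i] - m2) ^ 2
        if (rss < best) { best = rss; bestk = k }
    }
    # With Gaussian noise of standard deviation sigma, twice the log likelihood
    # ratio between the best step hypothesis and "no change" is (rss0-best)/sigma^2
    sigma = 3; threshold = 10          # assumed values - tune on real data
    score = (rss0 - best) / (sigma ^ 2)
    if (bestk >= 0 && score > threshold)
        printf "Change most likely between samples %d and %d (score %.1f)\n", bestk - 1, bestk, score
    else
        printf "No change detected (best score %.1f)\n", score
}
' <(tr -d '"' < levels.txt)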
I'm using the Grid component in FusionCharts and need a date string to be used in the value place. It always fails when I do this, as it is looking for a number. Is there any way it can allow text to be used as well?
Thanks
The ability to have date-time axis values is not yet available in FusionCharts. Nevertheless, your use case does not really fit a grid.
Ideally, the grid component's right column should show a number (value). The left column is text showing the labels. For a grid, it makes very little sense to have text in both columns.
Instead of a date, the right column should show a duration - how many months or days or hours, etc.
Excerpts from FusionCharts documentation: http://docs.fusioncharts.com/charts/contents/advanced/number-format/Number_Scaling.html
Say we're plotting a chart which indicates the time taken by a list of automated processes. Each process in the list can take anything from a few seconds to a few days, and we have the data for each process in seconds. Now, if we were to show all the data on the chart in seconds only, it wouldn't be very legible. What we can do is build a scale indicating time and then specify it to the chart. This scale, in human terms, would look something like this:
60 seconds = 1 minute
60 minutes = 1 hr
24 hrs = 1 day
7 days = 1 week
Now, to convert this scale into FusionCharts XML format, you do it as follows:
First you need to define the unit of the data you're providing. In this example, you're providing all data in seconds, so the default number scale is represented in seconds: <chart defaultNumberScale='s' ...>
Next, we define our own scale for the chart as: <chart numberScaleValue='60,60,24,7' numberScaleUnit='min,hr,day,wk' >. If you look at this carefully and match it against our scale, you'll see that the numeric figures on the left-hand side of the scale go into numberScaleValue and the units on the right-hand side go into numberScaleUnit - all separated by commas.
Set the chart formatting flags to on as: <chart formatNumber='1' formatNumberScale='1' ...>
The entire XML would look like:
<chart defaultNumberScale='s' numberScaleValue='60,60,24,7' numberScaleUnit='min,hr,day,wk'>
  <set label='A' value='38' />
  <set label='B' value='150' />
  <set label='C' value='11050' />
  <set label='D' value='334345' />
  <set label='E' value='1334345' />
</chart>
A sample grid (not with the above data) would look like this:
How could I retrieve a list of cities which are en route (waypoints) between 2 GPS coordinates?
I have a table of all cities, lat-lon.
So if I have a starting location (lat-lon) and ending location (lat-lon)...
Surely it must be fairly easy to determine the cities (from the table) to pass by (waypoints) to get from the start (lat-lon) to the end (lat-lon)?
I have looked at different algorithms and at bearing calculations, but it's still not clear to me.
If you're using the between point A and B method, then you'd just query the cities with Latitude and Longitude between the first and the second, respectively.
If you want to get the cities that are within X miles of a straight line from A to B, then you'd calculate the starting point and slope, and then query cities which are within X miles of the line that this creates.
If you're not using a simple point A to point B method which ignores roads, then you'll need some kind of data on the actual roads between A and B for us to give you an answer. This can be done using a Node system in your db, and it can also be done by using various geolocation APIs that are out there.
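For the straight-line corridor idea above, here is a rough sketch using a flat-earth approximation of the perpendicular distance from each city to the (infinite) line through A and B, with a latitude bound to roughly restrict it to the segment. The cities/lat/lon names come from the question; the name column, the credentials and the coordinates are placeholders:
# example start/end coordinates and corridor half-width; adjust to your data
mysql -u youruser -p yourdb <<'SQL'
SET @latA = 50.85, @lonA = 4.35,    -- start point
    @latB = 52.37, @lonB = 4.90,    -- end point
    @km   = 25;                     -- corridor half-width in km
SELECT name, lat, lon
FROM cities
WHERE ABS( (lon - @lonA) * COS(RADIANS(@latA)) * (@latB - @latA)
         - (lat - @latA) * (@lonB - @lonA) * COS(RADIANS(@latA)) ) * 111.32
      / SQRT( POW((@lonB - @lonA) * COS(RADIANS(@latA)), 2) + POW(@latB - @latA, 2) ) <= @km
  AND lat BETWEEN LEAST(@latA, @latB) AND GREATEST(@latA, @latB);
SQL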
The solution to this can be found with standard discrete routing algorithms.
Those algorithms need a set of nodes (start, destination, your cities) and edges between those nodes (representing the possible roads, or more generally the distances between the locations).
Nodes and edges form a graph; the start point and destination are known, so you can use algorithms like A* or Dijkstra to find a route through this graph.
A typical problem with this approach is that you don't have definitions for the edges (the usable direct paths between locations). You could create such a "road network" in various ways, for example:
Initialize "Network_ID" with 0.
Take your starting location and find the closest other location. Measure the distance and multiply it by a factor. Now connect to the current location every location whose distance is less than this value and which is not yet connected to it. Add all locations that were connected in this step to a list, and mark the current location with the current "Network_ID". Repeat this step for the next location on that list. If the list runs out of locations, increment "Network_ID", choose a random location that has not yet been processed, and repeat.
After all locations have been processed you have one or more road networks (if more than one, they are not connected yet; add a suitable connecting edge between them, or restart the process with a greater factor).
You have to make sure that either the start and destination have the same Network_ID, or that the two networks have been connected.
Hmm... I have used BETWEEN min AND max for something like this, but not quite the same.
Try maybe:
SELECT * from `cities` WHERE `lat` BETWEEN 'minlat' AND 'maxlat' AND `lon` BETWEEN 'minlon' and 'maxlon';
Something like that may work.
Look at the MySQL comparison operators here:
http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html
I know this is a late answer, but if you are still working on this problem you should read this:
http://dev.mysql.com/doc/refman/5.6/en/spatial-extensions.html