I am currently monitoring 5 different buildings, and each building has around 300 rooms. Each room has 4 sensors: three that monitor temperature at different points of the room, and one that measures the amount of power (kilowatts) the room is consuming.
I am currently polling every sensor every 15 minutes, which produces 576,000 entries per day (5 buildings x 300 rooms x 4 sensors x 96 polls per day), and the number of buildings I am monitoring is soon going to increase.
I am currently storing all the information in MySQL. I have a MySQL table for each sensor type, so the tables are named 'power', 'temp1', 'temp2', 'temp3'. The columns within these tables are 'id', 'building_id', 'epoch' and 'value'.
I then use this data to produce graphs using the Chart.js library, and statistics such as the amount of power used per building within a certain time period; I do all of this using PHP.
I don't believe my MySQL database is going to be able to handle this without serious scaling and clustering.
I need to be able to view historic data for 5 years, although some of the granularity can be lost after a certain period of time.
I have been informed that RRD might be able to solve my problem and have done some research on it but still have some questions.
Will it still allow me to create my own graphs, specifically using the Chart.js library? If I can get time/value JSON data out of the RRD, this should be OK.
How many different RRD files will I need to create? Would I need one per building? Per room? Per sensor? Is it still going to be easy to manage?
I have PHP scripts that run at 15-minute intervals, pull the data from the sensors using SNMP and then insert it into MySQL. If I could use the same scripts to also insert into the RRD, that would be great; from what I have seen, you can use PHP to insert into RRD, so that should be OK.
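For example, I imagine the extra step in the polling script would look roughly like this with the PECL rrd extension's rrd_update(); the file naming and the order of the values are just my guesses at this point:

```php
<?php
// Rough sketch: push one room's 15-minute sample into an RRD file
// (via the PECL rrd extension) right after the existing MySQL insert.
// $building, $room and the sensor readings would come from the SNMP code;
// the file layout and value order are assumptions.

$building = 2;
$room     = 117;
$rrdFile  = "/var/rrd/building{$building}_room{$room}.rrd";

$temp1 = 21.4;   // placeholders for the values read via SNMP
$temp2 = 22.0;
$temp3 = 20.8;
$power = 3.25;   // kilowatts

// "N" means "now"; the value order must match the data sources
// defined when the RRD file was created.
$ok = rrd_update($rrdFile, ["N:{$temp1}:{$temp2}:{$temp3}:{$power}"]);

if (!$ok) {
    error_log("RRD update failed: " . rrd_error());
}
```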
EDIT: I am currently reading http://michael.bouvy.net/blog/en/2013/04/28/graph-data-rrdtool-sensors-arduino/ which has started to answer some of my questions.
Whether you have one RRD file with 6,000 metrics, 5 files with 1,200 metrics each, etc., depends on how you are managing the data.
Firstly, you should not group together metrics whose samples arrive at different points in time. So, if you sample one room at a time, you should probably have one RRD file per room (with 4 metrics in it). This will depend on what manages your sensors: whether you have one device per room or per building. Retrieving the data and graphing it works whether you have one file or a thousand (though the 'thousand' scenario works much better in the latest version of RRDTool).
Secondly, are you likely to add new data points (i.e. buildings or rooms)? You cannot (easily) add new metrics to an existing RRD file. So, if you expect to add a new building in the future, or to add or remove a room, then maybe one RRD per building or one per room would be better.
Without having any more information, I would guess that you'd be better off with one RRD per room (containing 4 metrics), updating each file separately. Name the files according to the building and room IDs, and each file can hold the power and three temperature values keyed by epoch.
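A sketch of what creating such a per-room file could look like with the PECL rrd extension; the DS names, the heartbeat (2 x the step) and the RRA retention lengths below are only illustrative choices, not a requirement:

```php
<?php
// Illustrative only: one RRD file per room, 4 data sources, 900 s (15 min)
// base step, coarser averages retained for up to 5 years. File naming,
// heartbeat and RRA lengths are assumptions to adapt to your needs.

$rrdFile = "/var/rrd/building2_room117.rrd";

$ok = rrd_create($rrdFile, [
    "--step", "900",
    // DS:name:GAUGE:heartbeat:min:max
    "DS:temp1:GAUGE:1800:U:U",
    "DS:temp2:GAUGE:1800:U:U",
    "DS:temp3:GAUGE:1800:U:U",
    "DS:power:GAUGE:1800:0:U",
    // RRA:AVERAGE:xff:steps:rows
    "RRA:AVERAGE:0.5:1:2976",   // 15-min resolution for ~31 days
    "RRA:AVERAGE:0.5:4:2190",   // 1-hour averages for ~91 days
    "RRA:AVERAGE:0.5:24:1460",  // 6-hour averages for ~1 year
    "RRA:AVERAGE:0.5:96:1825",  // 1-day averages for 5 years
]);

if (!$ok) {
    error_log("RRD create failed: " . rrd_error());
}
```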
For graphing, RRDTool is of course capable of creating its own graphs by accessing the data directly. However, if you want to extract the data and build the graph yourself, this is possible: the Xport function will let you pull out the necessary data points (possibly from multiple RRD files, and with aggregation), which you can then pass to the graphing library of your choice. There is also the Fetch function if you want the raw data.
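For instance, a small PHP endpoint could xport a week of power data from one room's file and hand it to Chart.js as JSON; the file and DS names match the create sketch above and are only examples, and the exact shape of rrd_xport()'s return array should be checked against the PECL rrd docs:

```php
<?php
// Sketch: export time/value pairs from one room's RRD as JSON for Chart.js.
// Assumes the PECL rrd extension; file path and DS name are examples.

$rrdFile = "/var/rrd/building2_room117.rrd";

$result = rrd_xport([
    "--start", "now-7d",
    "--end", "now",
    "DEF:p={$rrdFile}:power:AVERAGE",
    "XPORT:p:power",
]);

if ($result === false) {
    http_response_code(500);
    die("rrd_xport failed: " . rrd_error());
}

// Each exported series sits under $result["data"]; its "data" member maps
// epoch timestamps to values (NAN for unknown samples).
$points = [];
foreach ($result["data"][0]["data"] as $epoch => $value) {
    $points[] = [
        "x" => $epoch * 1000,  // Chart.js time axes expect milliseconds
        "y" => (is_float($value) && is_nan($value)) ? null : $value,
    ];
}

header("Content-Type: application/json");
echo json_encode($points);
```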
If your data samples arrive at 15-minute intervals, make sure you set your RRD step, heartbeat and RRAs up correctly. In particular, the RRAs specify what aggregation is performed and how long data are kept at the higher granularities. The RRAs should in general correspond to the resolutions you expect to graph the data at (which is why people generally use 5min/30min/2h/1d, as these correspond nicely to daily, weekly, monthly and yearly graphs at a 400px width).
You might want to take a look at time-series databases and test a few systems that offer built-in visualization, an API for performing aggregations, and PHP wrappers. Time-series databases are optimized for efficient storage of timestamped data and have built-in functionality for time-series transformations.
https://en.wikipedia.org/wiki/Time_series_database
Related
I'm building a web application using PHP and MySQL. The application is mainly to monitor temperature using embedded devices distributed across many areas.
Each device posts a reading every 2 hours to the DB. I currently have a table with the following columns:
Device ID
Time and Date (for the reading)
Temperature Reading.
My concerns are that:
over time the DB will become very large, and
if I query the table for a certain device (by device ID) over a certain time span (for example, the last week), this can take a long time (see the sketch below for the kind of query I mean).
Is there a better structure I can use, maybe a different table for each device, or different tables for each day (or week)?
I'm looking for:
easy backup,
fast responses (less complex query statements)
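To make that concrete, here is roughly the table and the 'one device, last week' query I mean (column names simplified); one of the things I am unsure about is whether a composite index on device and time is enough, or whether splitting tables is really needed:

```php
<?php
// Sketch of the current structure and the slow query I am worried about.
// Column names are simplified; the composite index is one option I am
// considering rather than something already in place.

$pdo = new PDO("mysql:host=localhost;dbname=monitoring", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$pdo->exec("
    CREATE TABLE IF NOT EXISTS readings (
        id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        device_id    INT UNSIGNED    NOT NULL,
        reading_time DATETIME        NOT NULL,
        temperature  DECIMAL(5,2)    NOT NULL,
        KEY idx_device_time (device_id, reading_time)
    )
");

// 'Last week for one device' - the query that worries me as the table grows.
$stmt = $pdo->prepare("
    SELECT reading_time, temperature
    FROM readings
    WHERE device_id = :device
      AND reading_time >= NOW() - INTERVAL 7 DAY
    ORDER BY reading_time
");
$stmt->execute(["device" => 42]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```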
I'm the developer for a real estate syndication website and am currently having trouble figuring out a way to update massive numbers of listings/records efficiently (2,000,000+ listings).
We currently accept XML feeds, containing real estate listings, from about ~20 different websites. Most of the incoming feeds are small (~100 or so listings), but we have a couple of XML feeds that contain ~1,000,000 listings. The small feeds are parsed quickly and easily; however, the large feeds take upwards of 2-3 hours each.
The current "live" database table that contains the listings for viewing on the site is MyISAM. I chose MyISAM because ~95% of the queries to the table are SELECTs. Really the only time there are writes (UPDATE/INSERT queries) are during the time the XML feeds are being processed.
The current process is as follows:
There is a CRON in place that starts the main parsing script.
It loops through a feeds table and grabs the external XML feed source files. It then runs through each file, and for every record in the XML it checks against the listings table to see whether the listing needs to be updated or inserted (if it's a new listing).
This is all happening against the live table. What I'd like to find out is if anybody has any better logic to make these updates/inserts happen in the background so as to not slow down the production tables, and ultimately, the user experience.
Would a delta table be the best choice? Maybe do all the heavy work on a separate database and just copy the new table over to the production database? On a separate workhorse domain altogether? Should I have a separate listings table that does all the parsing which would be InnoDB instead of MyISAM?
What we're trying to accomplish is to have our system be able to update listings frequently throughout the day without slowing the site down. Our competitors boast that they are updating their listings every 5 minutes in some cases. I just don't see how that's even possible.
I'm working right now so this is more of a brain dump just to get the ball rolling. If anybody would like me to provide table schematics, I'd be more than happy.
In summary: I'm looking for a way to frequently update millions of records in our database (daily) via a couple dozen external XML feeds/files. I just need some logic on how to effectively, and efficiently, make this happen so as to not drag the production server down with it.
Firstly, for your existing 3-hour import, try wrapping every 100 inserts in a transaction. They will be written to the database in one go, and that might speed things up dramatically. Play around with the value of 100 - the best value will depend on how resilient you want the import to be and how much memory your transaction cache has. (This will of course require you to switch to a transactional engine such as InnoDB.)
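A rough sketch of the batching idea with PDO; the column names and the ON DUPLICATE KEY UPDATE statement are placeholders for whatever your import currently does, and the batch size of 100 is only a starting point:

```php
<?php
// Sketch only: commit every $batchSize writes in one transaction.
// Assumes the listings table is on a transactional engine (e.g. InnoDB);
// $records stands in for the rows parsed out of an XML feed.

$pdo = new PDO("mysql:host=localhost;dbname=realestate", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $pdo->prepare("
    INSERT INTO listings (listing_id, price, status)
    VALUES (:listing_id, :price, :status)
    ON DUPLICATE KEY UPDATE price = VALUES(price), status = VALUES(status)
");

$records   = [];   // ...filled by the existing XML parsing code
$batchSize = 100;
$count     = 0;

$pdo->beginTransaction();
foreach ($records as $record) {
    $stmt->execute($record);

    if (++$count % $batchSize === 0) {
        $pdo->commit();             // flush this batch to disk in one go
        $pdo->beginTransaction();
    }
}
$pdo->commit();                     // commit the final partial batch
```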
For providers that are known to offer larger files, try keeping a copy of the previous XML download, and then do a text diff between the old one and the new one. If you set your context settings (i.e. the number of unchanged lines shown around changed lines) appropriately, you should be able to capture the primary keys of the changed items. You would then only need to do a small number of updates.
Of course, it would help if your providers maintain the order of their XML listings. If they don't, a text sort followed by a diff may still be faster than importing everything.
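A very rough illustration of the idea, assuming GNU diff is available and that each listing's ID appears on the lines that change (feed layout, file paths and the regex are all assumptions):

```php
<?php
// Rough illustration: diff yesterday's feed against today's and collect the
// IDs found on changed lines, so only those listings get re-imported.
// Assumes GNU diff on the PATH and IDs appearing as <ListingID>123</ListingID>.

$oldFile = "/feeds/provider_a/yesterday.xml";
$newFile = "/feeds/provider_a/today.xml";

// -U0: no context lines, we only want the lines that actually changed
$diff = shell_exec("diff -U0 " . escapeshellarg($oldFile) . " " . escapeshellarg($newFile));

$changedIds = [];
foreach (explode("\n", (string) $diff) as $line) {
    // Keep added/removed lines only, skipping the '+++' / '---' headers
    if (preg_match('/^[+-](?![+-])/', $line)
        && preg_match('/<ListingID>(\d+)<\/ListingID>/', $line, $m)) {
        $changedIds[$m[1]] = true;
    }
}

$changedIds = array_keys($changedIds);
// ...now re-parse and upsert only these listings instead of the whole feed.
```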
FWIW, I think a complete refresh every 5 minutes is probably not feasible. I expect your providers would not be happy with you downloading 1M records at this frequency!
I'm in a situation where I have to build a statistics module that can store user-related statistical information.
Basically, all that's stored is an event identifier, a datetime, the number of times this event has been fired, and the ID of the object being interacted with.
I've made similar systems before, but never anything that has to store the volume of information this one will.
My suggestion would be a simple table in the database,
e.g. "statistics", containing the following columns:
id (Primary, auto-increment)
amount (integer)
event (enum: list, click, view, contact)
datetime (datetime)
object_id (integer)
Usually this method works fine, enabling me to store statistics about the object in a given timeframe (inserting a new row with a datetime every hour or every 15 minutes, so the statistics update every 15 minutes).
Now, my questions are:
Are there better or more optimized methods of achieving this and building a custom statistics module?
Since this new site will receive massive traffic, how do I deal with the trade-off that an index on object_id will slow down insert/update response times?
How do you even achieve live statistics like, say, analytics dashboards do? Is this solely about server size and processing power, or is there a best practice?
I hope my questions are understandable, and I'm looking forward to getting wiser on this topic.
Best regards,
Jonas
I believe one of the issues you are going to run into is that you want two worlds, transactional and analytical, at the same time. That is fine in small cases, but it becomes a problem when you start to scale, especially into the realm of 500M+ records.
I would suggest separating the two: generate events and keep track of just the event itself, then run analytical queries to get things such as the count of events per object interaction. You could have these counts, or other metric calculations, aggregated into a report table periodically.
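As a sketch of that periodic roll-up (assuming MySQL; the table and column names are made up), something like this could run from cron once an hour:

```php
<?php
// Sketch: roll raw event rows up into an hourly report table. Assumes a
// unique key on (object_id, event, hour_start) so re-runs are idempotent;
// all names here are assumptions.

$pdo = new PDO("mysql:host=localhost;dbname=stats", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

// Aggregate the previous complete hour of raw events.
$pdo->exec("
    INSERT INTO event_report_hourly (object_id, event, hour_start, total)
    SELECT object_id,
           event,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS hour_start,
           COUNT(*)                                     AS total
    FROM events
    WHERE created_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
      AND created_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP BY object_id, event, hour_start
    ON DUPLICATE KEY UPDATE total = VALUES(total)
");
```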
As for tracking events, you could either keep them in a table of event occurrences, or have something in front of the database do the tracking and provide the periodic aggregations to the database. Think of the world of monitoring systems, which use collection agents to generate events that go to an aggregation layer, which then writes a periodic metric snapshot to an analytical store (e.g. collectd to StatsD / Graphite to Whisper).
Disclaimer: I am an architect for InfiniDB.
Not sure what kind of data source you are using, but as you grow and work out how much history to keep, you will probably face sizing problems, as most people do when collecting event or monitoring data. If you are on MySQL / MariaDB / PostgreSQL, I would suggest you check out InfiniDB (an open-source columnar MPP database for analytics); it is fully open source (GPLv2) and will provide the performance you need to run queries over billions of rows and TBs of data to answer those analytical questions.
I have multiple devices (eleven, to be specific) which send information every second. This information is received by an Apache server, parsed by a PHP script, stored in the database and finally displayed in a GUI.
What I am doing right now is checking whether a row for the current day exists; if it doesn't, I create a new one, otherwise I update it.
The reason I do it like that is that I need to poll the information from the database and display it in a C++ application to make it look sort of real-time; if I created a row every time a device sent information, processing and reading the data would take a significant amount of time as well as system resources (memory, CPU, etc.), making the display of data not quite real-time.
I wrote a report generation tool which takes the information for every day (from 00:00:00 to 23:59:59) and puts it in an Excel spreadsheet.
My questions are basically:
Is it possible to do the insertion/updating part directly in the database server, or do I have to do the logic in the PHP script?
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
Regarding the report generation: if I want to sample intervals, let's say starting yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure, so what do I need to consider in order to design a data structure that would allow me to create such queries?
The components I use:
- Apache 2.4.4
- PostgreSQL 9.2.3-2
- PHP 5.4.13
My recommendation: just store all the information your devices are sending. With proper indexes and queries you can process and retrieve information from the DB really fast.
For your questions:
Yes, it is possible to build any logic you desire inside a Postgres DB using SQL, PL/pgSQL, PL/PHP, PL/Java, PL/Python and many of the other procedural languages Postgres supports.
As I said before, proper indexing can do magic.
If you cannot get the desired query speed with the full table, you can create a small table with one row for every device and keep the last known values in it to show them in near real-time.
1) The technique is called upsert. In PG 9.1+ it can be done with a writable CTE (http://www.depesz.com/2011/03/16/waiting-for-9-1-writable-cte/); a sketch follows after point 2.
2) If you really want it to be real-time you should be sending the data directly to the application; storing it in memory or in a plain-text file will also be faster if you only care about the last few values. But PG does have LISTEN/NOTIFY channels, so your lag will probably be just 100-200 ms, and that shouldn't matter much given you're only displaying it.
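A sketch of the writable-CTE upsert from point 1, run from PHP against PostgreSQL 9.1+; the table "device_latest" (one row per device holding its last reading) and the column names are assumptions:

```php
<?php
// Sketch: keep one row per device with its latest reading, updated once a
// second via a writable CTE (PG 9.1+). Names are assumptions. With only a
// handful of writers this is fine; heavy concurrency would need retry logic.

$pdo = new PDO("pgsql:host=localhost;dbname=telemetry", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$sql = "
    WITH updated AS (
        UPDATE device_latest
           SET reading = CAST(? AS numeric), recorded_at = now()
         WHERE device_id = CAST(? AS integer)
     RETURNING device_id
    )
    INSERT INTO device_latest (device_id, reading, recorded_at)
    SELECT CAST(? AS integer), CAST(? AS numeric), now()
    WHERE NOT EXISTS (SELECT 1 FROM updated)
";

$deviceId = 7;
$reading  = 23.4;

$stmt = $pdo->prepare($sql);
// Positional parameters: reading and device_id for the UPDATE,
// then device_id and reading again for the INSERT branch.
$stmt->execute([$reading, $deviceId, $deviceId, $reading]);
```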
I think you are overestimating the memory and system requirements given the process you have described. Adding a row of data every second (or 11 per second) is not a resource hog. In fact, it is likely more time-consuming to UPDATE than to ADD a new row. Also, if you add a TIMESTAMP to your table, sort operations are lightning fast. Just add some garbage-collection handling (deletion of old data) as a cron job once a day or so and you are golden.
However, to answer your questions:
Is it possible to do the insertion/updating part directly in the database server or do I have to do the logic in the PHP script?
Writing logic from within the database engine is usually not very straightforward. To keep it simple, stick with the logic in the PHP script: either UPDATE table SET var1='assignment1', var2='assignment2' WHERE id='checkedID', or the equivalent INSERT when the row does not exist yet.
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
It's hard to answer because you haven't described the display device connectivity. There are more efficient ways to do the process, however none that have the locking mechanisms required for such frequent updating.
Regarding the report generation, if I want to sample intervals, let's say starting from yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure, so what do I need to consider in order to make a data structure which would allow me to create such queries?
You could use a TIMESTAMP column type. This would record the date and time of the UPDATE operation. Then it's just a simple WHERE clause using date functions within the database query.
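A minimal sketch of that kind of range query against a TIMESTAMP column (table and column names are assumptions):

```php
<?php
// Sketch: once every sample carries its own TIMESTAMP, arbitrary windows
// such as 'yesterday 15:50 to today 12:45' become a simple WHERE clause.
// Table and column names are assumptions.

$pdo = new PDO("pgsql:host=localhost;dbname=telemetry", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $pdo->prepare("
    SELECT device_id, recorded_at, reading
    FROM device_readings
    WHERE recorded_at >= ?
      AND recorded_at <  ?
    ORDER BY recorded_at
");

$stmt->execute([
    date("Y-m-d", strtotime("yesterday")) . " 15:50:00",
    date("Y-m-d") . " 12:45:00",
]);

$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```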
I have a non-computer-related data logger that collects data from the field. This data is stored as text files, and I manually lump the files together and organize them. The current format is one CSV file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. Some of the data is organized in bins (item_type, item_class, item_dimension_class), and other data is more unique, such as item_weight, item_color, date_collected, and so on...
Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer.
I'd like to publish the data on the web using a Postgres DB; however, I need to find or implement a statistical tool that'll take a large Postgres table and return statistical results within an adequate time frame. I'm not familiar with Python for the web; however, I'm proficient with PHP on the web side and Python on the offline side.
Users should be allowed to create their own histograms and data analyses. For example, one user might search for all blue items shipped between week x and week y, while another might look at the weight distribution of all items by hour over the whole year.
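For instance, the second request above boils down to a parameterized aggregate query roughly like this (the table name is made up, the columns match the fields listed earlier):

```php
<?php
// Sketch of the 'weight distribution by hour over a year' request against
// Postgres via PHP. Table name and the year window are placeholders.

$pdo = new PDO("pgsql:host=localhost;dbname=fielddata", "user", "pass",
               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $pdo->prepare("
    SELECT EXTRACT(HOUR FROM date_collected) AS hour_of_day,
           COUNT(*)                          AS n_items,
           AVG(item_weight)                  AS avg_weight
    FROM items
    WHERE date_collected >= ? AND date_collected < ?
    GROUP BY hour_of_day
    ORDER BY hour_of_day
");
$stmt->execute(["2013-01-01", "2014-01-01"]);

$histogram = $stmt->fetchAll(PDO::FETCH_ASSOC);
// $histogram can then be handed to whatever web front end does the charting.
```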
I was thinking of creating and indexing my own statistical tools, or automating the process somehow to emulate most queries, but this seemed inefficient.
I'm looking forward to hearing your ideas
Thanks
I think you can utilize your current combination (python/numpy/matplotlib) fully if the number of users is not too big. I do some similar work, and my data size is a little more than 10 GB. Data are stored in a few SQLite files, and I use numpy to analyze the data, PIL/matplotlib to generate chart files (PNG, GIF), CherryPy as a web server, and Mako as a template language.
If you need more of a client/server database, then you can migrate to PostgreSQL, but you can still fully use your current programs if you go with a Python web framework like CherryPy.