MySQL DB: Efficient structure for data logging / retrieving by many users - php

I'm building a web application using php and MySQL. The application is mainly to monitor temperature using embedded devices distributed in many areas.
Each device posts a reading every 2 hours to the DB. I currently have a table with the following columns:
Device ID
Time and Date (for the reading)
Temperature Reading.
My concerns are that:
over time the DB will become very large, and
if I query the table for a certain device (by device ID) over a certain time span (for example, the last week), the query could take a long time.
Is there a better structure I can use, maybe a different table for each device, or different tables for each day (or week)?
I'm looking for:
easy backup,
fast responses (less complex query statements)
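For illustration, here is a minimal sketch of one way to keep the single-table design workable: cluster the rows by device and time with a composite primary key, so a per-device, per-week query is a short range scan rather than a crawl over the whole table. The table, column, and variable names below are assumptions, not taken from the question.

<?php
// Minimal sketch with hypothetical names; assumes the PDO MySQL driver is available.
$pdo = new PDO('mysql:host=localhost;dbname=monitoring', 'user', 'pass');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS readings (
        device_id    SMALLINT UNSIGNED NOT NULL,
        reading_time DATETIME          NOT NULL,
        temperature  DECIMAL(4,1)      NOT NULL,
        PRIMARY KEY (device_id, reading_time)  -- clusters each device's readings together
    ) ENGINE=InnoDB
");

// Last week's readings for one device: a narrow range scan on the primary key.
$deviceId = 17;  // placeholder device ID
$stmt = $pdo->prepare("
    SELECT reading_time, temperature
    FROM   readings
    WHERE  device_id = ?
      AND  reading_time >= NOW() - INTERVAL 7 DAY
    ORDER  BY reading_time
");
$stmt->execute([$deviceId]);
$lastWeek = $stmt->fetchAll(PDO::FETCH_ASSOC);

With InnoDB's clustered primary key, each device's rows are stored together in time order, so this query reads only the slice it needs regardless of how large the table grows.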

Related

Best way for storing millions of records a day of data that can be grouped for statistic purposes?

I'm developing a custom tracking tool for marketing campaigns. This tool is in the middle between the ads and the landing pages. It takes care of saving all data from the user, such as the info in the user-agent, the IP, the clicks on the landing page and the geocoding data of the IPs of the users (country, ISP, etc).
At the moment I have some design issues:
The traffic on these campaigns is very, very high, so potentially I have millions of row inserts a day. This system can have more than one user, so I can't store all this data in a single table because it would become a mess. Maybe I can split the data into more tables, one table per user, but I'm not sure about this solution.
The data saving process must be done as quickly as possible (a few milliseconds), so I think that NodeJS is much better than PHP for doing this, especially with regard to speed and server resources. I do not want the server to crash from lack of RAM.
I need to group these data for statistical purposes. For example, I have one row for every user that visits my landing page, but I need to group these data to show the number of impressions on this specific landing page. So all these queries need to be executed as fast as possible with this large number of rows.
I need to geocode the IP addresses, so I need accurate information like the country, the ISP, the type of connection etc., but this can slow down the data saving process if I call an API service. And this must be done in real time and can't be done later.
After the saving process, the system should do a redirect to the landing page. Time is important for not losing any possible lead.
Basically, I'm finding the best solutions for:
Efficiently manage a very large database
Saving data from the users in the shortest time possible (ms)
If possible, geocode an IP in the shortest time possible, without blocking execution
Optimize the schema and the queries for generating statistics
Do you have any suggestion? Thanks in advance.
One table per user is a worse mess; don't do that.
Millions of rows a day -- dozens, maybe hundreds, per second? That probably requires some form of 'staging' -- collecting multiple rows, then batch-inserting them. Before discussing further, please elaborate on the data flow: Single vs. multiple clients. UI vs. batch processes. Tentative CREATE TABLE. Etc.
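As a rough illustration of the 'staging' idea (the schema, buffer layout, and names below are assumptions): buffer incoming hits in the application and flush them as one multi-row INSERT inside a transaction, instead of issuing one INSERT per request.

<?php
// Sketch only: batch-inserting buffered rows (hypothetical schema and names).
function flushHits(PDO $pdo, array $buffer): void {
    if ($buffer === []) {
        return;
    }
    // One (?, ?, ?, ?) group per buffered row.
    $placeholders = implode(',', array_fill(0, count($buffer), '(?, ?, ?, ?)'));
    $params = [];
    foreach ($buffer as $hit) {
        array_push($params, $hit['campaign_id'], $hit['ip'], $hit['user_agent'], $hit['hit_time']);
    }
    $pdo->beginTransaction();
    $pdo->prepare("INSERT INTO hits (campaign_id, ip, user_agent, hit_time) VALUES $placeholders")
        ->execute($params);   // a single multi-row INSERT instead of one INSERT per row
    $pdo->commit();
}

The buffer could live in the application, in Redis, or in a lightly indexed staging table; the point is that the main table receives large batches rather than a stream of single-row writes.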
Statistical -- Plan on creating and incrementally maintaining "Summary tables".
Are you trying to map user IP addresses to Country? That is a separate question, and it has been answered.
"Must" "real-time" "milliseconds". Face it, you will have to make some trade-offs.
More info: Go to http://mysql.rjweb.org/ ; from there, see the three blogs on Data Warehouse Techniques.
How to store by day
InnoDB stores data in PRIMARY KEY order. So, to get all the rows for one day adjacent to each other, one must start the PK with the datetime. For huge databases, this may improve certain queries significantly by allowing the query to scan the data sequentially, thereby minimizing disk I/O.
If you already have id AUTO_INCREMENT (and if you continue to need it), then do this:
PRIMARY KEY(datetime, id), -- to get clustering, and be UNIQUE
INDEX(id) -- to keep AUTO_INCREMENT happy
If you have a year's worth of data, and the data won't fit in RAM, then this technique is very effective for small time ranges. But if your time range is bigger than the cache, you will be at the mercy of I/O speed.
Maintaining summary tables with changing data
This may be possible; I need to better understand the data and the changes.
You cannot scan a million rows in sub-second time, regardless of caching, tuning, and other optimizations. You can get the desired data much faster with a Summary table.
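As a concrete (hypothetical) sketch of such a summary table: keep one row per landing page per hour and bump its counter as raw rows arrive, so the statistics queries touch thousands of rows rather than millions. All names below are invented.

<?php
// Sketch only: an incrementally maintained summary table (hypothetical names).
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'pass');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS impressions_by_hour (
        landing_page_id INT UNSIGNED NOT NULL,
        hour_start      DATETIME     NOT NULL,   -- the raw timestamp truncated to the hour
        impressions     INT UNSIGNED NOT NULL,
        PRIMARY KEY (landing_page_id, hour_start)
    ) ENGINE=InnoDB
");

// Run this per batch of raw rows (or per row, if you must).
$pageId       = 7;                        // placeholders for illustration
$hourStart    = date('Y-m-d H:00:00');
$countInBatch = 250;

$stmt = $pdo->prepare("
    INSERT INTO impressions_by_hour (landing_page_id, hour_start, impressions)
    VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE impressions = impressions + VALUES(impressions)
");
$stmt->execute([$pageId, $hourStart, $countInBatch]);

The ON DUPLICATE KEY UPDATE makes the maintenance incremental: the first batch of an hour creates the row, and later batches simply add to it.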
Shrink the data
Don't use BIGINT (8 bytes) if INT (4 bytes) will suffice; don't use INT if MEDIUMINT (3 bytes) will do. Etc.
Use UNSIGNED where appropriate.
Normalize repeated strings.
Smaller data is more cacheable, and hence faster when you do have to hit the disk.
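A hypothetical before/after of that advice: move repeated strings into lookup tables and keep only small fixed-width columns in the big table. Every name and type choice below is an assumption for illustration.

<?php
// Sketch only: a shrunken, normalized version of a hypothetical hits table.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'pass');

// Repeated strings (user agents, and similarly countries, ISPs) move into lookup tables...
$pdo->exec("
    CREATE TABLE IF NOT EXISTS user_agents (
        id    MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        agent VARCHAR(255)       NOT NULL,
        UNIQUE KEY (agent)
    ) ENGINE=InnoDB
");

// ...and the fact table keeps only small fixed-width columns.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS hits (
        id            INT UNSIGNED       NOT NULL AUTO_INCREMENT,
        hit_time      DATETIME           NOT NULL,
        campaign_id   MEDIUMINT UNSIGNED NOT NULL,   -- not BIGINT
        user_agent_id MEDIUMINT UNSIGNED NOT NULL,   -- 3 bytes instead of repeating a long string
        ip            INT UNSIGNED       NOT NULL,   -- IPv4 packed with INET_ATON() (sketch ignores IPv6)
        country       CHAR(2)            NOT NULL,   -- ISO country code
        PRIMARY KEY (hit_time, id),                  -- clustering by time, as discussed above
        KEY (id)                                     -- keeps AUTO_INCREMENT happy
    ) ENGINE=InnoDB
");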

RRD Tool Structure and understanding

I am currently monitoring 5 different buildings; for each building there are around 300 rooms. Each room has 4 sensors, three which monitor temperature at different points of the room and one for the amount of power (kilowatts) the room is consuming.
I am currently polling every sensor every 15 minutes, which is producing 576,000 entries per day, and the number of buildings I am monitoring is soon going to increase.
I am currently storing all the information in MySQL, I have a MySQL table for each sensor type so the tables are named 'power', 'temp1', 'temp2', 'temp3'. The columns within these tables are 'id', 'building_id', 'epoch', 'value'.
I then use this data to produce graphs using the Chart.js library and statistics such as the amount of power used per building within a certain time period, etc. I do all this using PHP.
I don't believe my MySQL database is going to be able to handle this without serious scaling and clustering.
I need to be able to view historic data for 5 years although some of the granularity can be lost after a certain period of time.
I have been informed that RRD might be able to solve my problem and have done some research on it but still have some questions.
Will it still allow me to create my own graphs using specifically the Chart.js library? If I can get time / value JSON data from the RRD this should be ok.
Also, how many different RRD files will I need to create? Would I need one per building? Per room? Per sensor? Is it still going to be easy to manage?
I have PHP scripts which run at 15-minute intervals that pull the data from the sensors using SNMP and then insert the data into MySQL. If I can use the same scripts to also insert into the RRD, that would be great; from what I have seen you can use PHP to insert into RRD, so that should be OK.
EDIT: I am currently reading http://michael.bouvy.net/blog/en/2013/04/28/graph-data-rrdtool-sensors-arduino/ which has started to answer some of my questions.
Whether you have one RRD file with 6000 metrics, or 5 files with 1200 metrics etc depends on how you are managing the data.
Firstly, you should not group together metrics for which the samples arrive at different points in time. So, if you sample one room at a time, you should probably have one RRD file per room (with 4 metrics in it). This will depend on what manages your sensors: whether you have one device per room or per building. Retrieving the data and graphing it works whether you have one file or a thousand (though the 'thousand' scenario works much better in the latest version of RRDTool).
Secondly, are you likely to add new data points (i.e., buildings or rooms)? You cannot (easily) add new metrics to an existing RRD file. So, if you expect to add a new building in the future, or to add or remove a room, then maybe one RRD per building or one per room would be better.
Without having any more information, I would guess that you'd be better off with one RRD per room (containing 4 metrics) and update them separately. Name the files according to the building and room IDs, and they can hold the power and 3 temperature values according to Epoch.
For graphing, RRDTool is of course capable of creating its own graphs by accessing the data directly. However, if you want to extract the data and put them into a graph yourself this is possible; the Xport function will allow you to pull out the necessary datapoints (possibly from multiple RRD files and with aggregation) which you can then pass to the graphing library of your choice. There is also the Fetch function if you want the raw data.
If your data samples are coming at 15min intervals, make sure you set your RRD Interval, Heartbeat and RRAs up correctly. In particular, the RRAs will specify what aggregation is performed and for how long data are kept at higher granularities. The RRAs should in general correspond to the resolutions you expect to graph the data at (which is why people generally use 5min/30min/2h/1d, as these correspond nicely to daily, weekly, monthly and yearly graphs at a 400px width).
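If the php-rrd (PECL) extension is available, the same 15-minute cron script that writes to MySQL could also feed one RRD per room. A rough sketch follows; the file path, DS names, and RRA retention choices are assumptions to adapt, not a recommendation.

<?php
// Sketch only: one RRD per room, 4 metrics, 15-minute (900 s) step. Requires php-rrd.
$file = '/var/rrd/building1_room101.rrd';

if (!file_exists($file)) {
    rrd_create($file, [
        '--step', '900',
        'DS:power:GAUGE:1800:0:U',    // heartbeat = 2 * step
        'DS:temp1:GAUGE:1800:U:U',
        'DS:temp2:GAUGE:1800:U:U',
        'DS:temp3:GAUGE:1800:U:U',
        'RRA:AVERAGE:0.5:1:2976',     // 15-min resolution kept for ~31 days
        'RRA:AVERAGE:0.5:4:8760',     // 1-hour resolution kept for ~1 year
        'RRA:AVERAGE:0.5:96:1825',    // 1-day resolution kept for ~5 years
    ]);
}

// After the SNMP poll (values are placeholders): power, temp1, temp2, temp3.
rrd_update($file, ['N:1.2:21.5:22.0:20.8']);   // N = "now"

The three RRAs here are one way to meet the "5 years with reduced granularity" requirement; rrd_xport or rrd_fetch can later pull time/value pairs out for Chart.js.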
You might want to take a look at time-series databases and test a few systems that have built-in visualization, an API that allows you to perform aggregations and PHP wrappers. Time-series databases are optimized for efficient storage of timestamped data and have built-in functionality for time-series transformations.
https://en.wikipedia.org/wiki/Time_series_database

Best practice for custom statistics

I'm in a situation where I have to build a statistics module which can store user-related statistical information.
Basically, all that's stored is an event identifier, a datetime object, the number of times this event has been fired, and the ID of the object which is being interacted with.
I've made similar systems before, but never anything that has to store the amount of information this one does.
My suggestion would be a simple table in the database,
e.g. "statistics", containing the following columns:
id (Primary, auto-increment)
amount (integer)
event (enum: list, click, view, contact)
datetime (datetime)
object_id (integer)
Usually, this method works fine, enabling me to store statistics about the object in a given timeframe (inserting a new datetime every hour or 15 minutes, so the statistics will update every 15 minutes).
Now, my questions are:
Are there better or more optimized methods of achieving and building a custom statistics module?
Since this new site will receive massive traffic, how do I deal with the trade-off that an index on the object ID will cause slower update response times?
How do you even achieve live statistics like, e.g., Analytics does? Is this solely about server size and processing power? Or is there a best practice?
I hope my questions are understandable, and I'm looking forward to getting wiser on this topic.
Best regards,
Jonas
I believe one of the issues you are going to run into is wanting two worlds, transactional and analytical, at once. That is fine in small cases, but not when you start to scale, especially into the realm of 500M+ records.
I would suggest separating the two: generate events and keep track of just the event itself. You would then run analytical queries to get things such as the count of events per object interaction. You could have these counts or other metric calculations aggregated into a report table periodically.
As for tracking events, you could either do that by keeping them in a table of event occurrences, or have something in front of the database doing this tracking that then provides the periodic aggregations to the database. Think of the world of monitoring systems, which use collection agents to generate events that go to an aggregation layer, which then writes a periodic metric snapshot to an analytical area (e.g. CollectD to StatsD / Graphite to Whisper).
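A rough sketch of the "periodic aggregation into a report table" idea, run once per hour from cron. The database, table, and column names are invented, and it assumes event_report has a UNIQUE key on (object_id, event, hour_start).

<?php
// Sketch only: roll raw events from the previous full hour into a report table.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

$pdo->exec("
    INSERT INTO event_report (object_id, event, hour_start, amount)
    SELECT object_id,
           event,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS hour_start,
           COUNT(*)                                     AS amount
    FROM   events_raw
    WHERE  created_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
      AND  created_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP  BY object_id, event, hour_start
    ON DUPLICATE KEY UPDATE amount = VALUES(amount)
");

Because the window is the previous complete hour, re-running the job simply recomputes that hour's totals, and the dashboards query the small report table instead of the raw event stream.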
Disclaimer: I am an architect for InfiniDB.
Not sure what kind of data source you are using, but as you grow and determine the amount of history to keep, etc., you will probably face sizing problems, as most people typically do when they are collecting event data or monitoring data. If you are on MySQL / MariaDB / PostgreSQL, I would suggest you check out InfiniDB (an open-source columnar MPP database for analytics); it is fully open source (GPLv2) and will provide the performance you need to run queries over billions of rows and TBs of data to answer those analytical questions.

How to build a proper Database for a traffic analytics system?

How do I build a proper structure for an analytics service? Currently I have 1 table that stores data about every user that visits the page with my client's ID, so later my clients will be able to see the statistics for a specific date.
I've thought about it a bit today and I'm wondering: let's say I have 1,000 users and each has around 1,000 impressions on their sites daily; that means I get 1,000,000 (1M) new records every day in a single table. How will it work after 2 months or so (when the table reaches 60 million records)?
I just think that after some time it will have so many records that the PHP queries to pull out the data will be really heavy, slow and take a lot of resources. Is that true? And how do I prevent that?
A friend of mine is working on something similar, and he is going to make a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. 1 million records a day is roughly 12 write queries per second. That's achievable, but pulling the data out while writing at the same time will make your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing, such as: use an appropriate database engine (InnoDB and not MyISAM), make sure you have a fast enough HDD subsystem (RAID, not regular drives, since they can and will fail at some point), design your database optimally, inspect queries with EXPLAIN to see where you might have gone wrong with them, and maybe even use a different storage engine - personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'd be doing your querying, sorting, filtering on the database side and not on PHP side.
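To illustrate the EXPLAIN advice above, a minimal sketch against a hypothetical visits table (table and column names are invented):

<?php
// Sketch only: check the execution plan of a hot query before it goes live.
$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');

$plan = $pdo->query("
    EXPLAIN
    SELECT COUNT(*)
    FROM   visits
    WHERE  client_id = 42
      AND  visited_at >= CURDATE() - INTERVAL 7 DAY
")->fetchAll(PDO::FETCH_ASSOC);

print_r($plan);   // watch the 'key' and 'rows' columns: a missing index shows up here as a full scan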
Take a look at the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
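A toy sketch of that approach (the file path, schema, and variable names are placeholders): append one line per hit with an exclusive lock, then let a cron job rotate and bulk-load the file during a quiet period. LOAD DATA LOCAL INFILE requires local_infile to be enabled on both server and client.

<?php
// Sketch only: append-to-log on the hot path, bulk-load later (hypothetical names).

// 1) On each request: a cheap file append, no database work.
$clientId = 123;   // placeholders: a real tracker would take these from the request
$pageId   = 456;
$line = implode("\t", [$clientId, $pageId, $_SERVER['REMOTE_ADDR'], date('Y-m-d H:i:s')]) . "\n";
file_put_contents('/var/log/tracker/hits.log', $line, FILE_APPEND | LOCK_EX);

// 2) From cron, off-peak: rotate the file and bulk-load it in one statement.
if (file_exists('/var/log/tracker/hits.log')) {
    rename('/var/log/tracker/hits.log', '/var/log/tracker/hits.batch');
    $pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'pass',
                   [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
    $pdo->exec("
        LOAD DATA LOCAL INFILE '/var/log/tracker/hits.batch'
        INTO TABLE hits (client_id, page_id, ip, visited_at)
    ");
    unlink('/var/log/tracker/hits.batch');
}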
You could normalize the impressions data like this:
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages).
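The increment itself can be done in a single statement that creates the row on the first impression and bumps it afterwards. A sketch using the column names above; the connection and ID values are placeholders, and it assumes a UNIQUE key on (Client_ID, Page_ID).

<?php
// Sketch only: one statement per impression against the PagesClientsVisits table.
$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
$clientId = 1;   // placeholders
$pageId   = 42;

$stmt = $pdo->prepare("
    INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
    VALUES (?, ?, 1)
    ON DUPLICATE KEY UPDATE Visits = Visits + 1
");
$stmt->execute([$clientId, $pageId]);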
Having a table with 60 million records can be OK. That is what a database is for. But you should be careful about how many fields you have in the table, and about what datatype (and therefore size) each field has.
You will create some kind of reports on the data. Think about what data you really need for those reports. For example, you might need only the number of visits per user on every page. A simple count would do the trick.
What you also can do is generate the report every night and delete the raw data afterwards.
So, read and think about it.

Approaches to gathering large visiting statistics

I have a website where users can post their articles, and I would like to give each article's author full stats about its visits and referrers. The realization seems quite straightforward here: just store a database record for every visit and then use aggregate functions to draw graphs and so on.
The problem is that articles receive about 300k views in 24 hours, so in just a month the stats table will get about 9 million records, which is a very big number because my server isn't very powerful.
Is there a solution to this kind of task? Is there an algorithm or caching mechanism that allows storing long-term statistics without losing accuracy?
P.S. Here is my original stats table:
visitid INT
articleid INT
ip INT
datetime DATETIME
Assuming a home-brewed usage-tracking solution (as opposed to, say, GA as suggested in the other response), a two-database setup may be what you are looking for:
a "realtime" database which captures the visit events as they come in.
an "offline" database where the data from the "realtime" database is collected on a regular basis, for being [optionally] aggregated and indexed.
The purpose for this setup is mostly driven by operational concerns. The "realtime" database is not indexed (or minimally indexed), for fast insertion, and it is regularly emptied, typically each night, when the traffic is lighter, as the "offline" database picks up the events collected through the day.
Both databases can have the very same schema, or the "offline" database may introduce various forms of aggregation. The specific aggregation details applied to the offline database can vary greatly depending on the desire to keep the database's size in check and depending on the data which is deemed important (most statistics/aggregation functions introduce some information loss, and one needs to decide which losses are acceptable and which are not).
Because of the "half life" nature of the value of usage logs, whereby the relative value of details decays with time, a common strategy is to aggregate info in multiple tiers, whereby the data collected in the last, say, X days remains mostly untouched, the data collected between X and Y days is partially aggregated, and finally, data older than Y days only keep the most salient info (say, number of hits).
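A crude sketch of the nightly hand-off, using the column names from the stats table in the question; the "realtime" and "offline" database names and the daily table are invented, and it assumes the rollup runs once per day after midnight.

<?php
// Sketch only: nightly rollup from a lightly indexed "realtime" DB into an "offline" one.
$pdo = new PDO('mysql:host=localhost', 'user', 'pass');

// Aggregate everything up to (but not including) today into a per-article, per-day summary.
$pdo->exec("
    INSERT INTO offline.article_daily (articleid, visit_date, hits, unique_ips)
    SELECT articleid,
           DATE(`datetime`)        AS visit_date,
           COUNT(*)                AS hits,
           COUNT(DISTINCT ip)      AS unique_ips
    FROM   realtime.visits
    WHERE  `datetime` < CURDATE()
    GROUP  BY articleid, DATE(`datetime`)
");

// New rows always carry today's datetime, so clearing the old ones afterwards is safe.
$pdo->exec("DELETE FROM realtime.visits WHERE `datetime` < CURDATE()");

Older tiers (say, per-week or per-month totals) can be produced the same way from the daily table, which is how the "half life" aggregation described above is usually implemented.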
Unless you're particularly keen on storing your statistical data yourself, you might consider using Google Analytics or one of its modern counterparts, which are much better than the old remotely hosted hit counters of the 90s. You can find the API to the Google Analytics PHP interface at http://code.google.com/p/gapi-google-analytics-php-interface/
