MongoDB aggregation by time interval in PHP

I'm using MongoDB to store server statistics that are captured every 15 seconds (so 4 rows get inserted per minute per server), and I'm trying to get this data plotted onto a graph for all data between two timestamps.
For example, the following query can be used:
$tbl->find(
    array(
        "timestamp" => array('$gte' => '1396310400', '$lte' => '1396915200'),
        "service" => 'a715feac3db42f54edbc50ef6fa057b3'
    ),
    array("timestamp" => 1, "system" => 1)
);
Which spits out a bunch of rows that look like this:
Array
(
    [53933ad8532965621d97dd3b] => Array
        (
            [_id] => MongoId Object
                (
                    [$id] => 53933ad8532965621d97dd3b
                )
            [system] => Array
                (
                    [load] => 0.55
                    [uptime] => 1171204.47
                    [processes] => 222
                )
            [timestamp] => 1396310403
        )
)
This works fine for small date ranges, as I can pass this data directly into Flot or HighCharts and let it prettify the time scales itself. However, this doesn't work for large data sets (for example, querying over a month).
What I'm trying to do is group the data by hour (or by 15 minutes), and return the average values (in this example, it's system.load that I'm plotting) for that given time period.
I know that the aggregate function is what I need to be using, but despite my best efforts I've been unable to get this working.
Right now I'm letting PHP do all of the work (grouping the results by timestamp and working out the averages) but it's extremely slow and I know MongoDB would handle it better.
Any insight would be greatly appreciated!
Edit:
I've been trying to follow the answer posted here but am still struggling - MongoDB Aggregation PHP, Group by Hours

I'm looking at your initial query at the top of your question, and it immediately tells me that your "timestamp" values are actually strings. So no doubt, when you are reading this information and doing your "manual aggregation", you are actually casting these values, and possibly others, into types that you can manipulate, sum, and average.
So the first part here is to fix your data, which looks like it came from a logging source but never had its values converted. I consider it reasonably possible that this applies not just to the timestamp values but probably also to your metrics under system.
This leaves you with a choice of how to store your timestamp. You can either keep it as a timestamp number, as it currently is in string form, or you can opt for converting to a BSON date type. The first is a simple integer cast saved back to the document; for the other, you should be able to feed the value to the date type supported by the driver and again save the data back.
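For example, a minimal one-off conversion sketch, assuming the legacy PHP driver and the $tbl collection handle from the question:
$cursor = $tbl->find(array(), array("timestamp" => 1));
foreach ($cursor as $doc) {
    $tbl->update(
        array("_id" => $doc["_id"]),
        array('$set' => array(
            // simple integer cast of the string timestamp:
            "timestamp" => (int) $doc["timestamp"]
            // or, to store a BSON date instead:
            // "timestamp" => new MongoDate((int) $doc["timestamp"])
        ))
    );
}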
When you have done this, you can happily use the aggregation functions. As an example, if you choose to keep this as a number, you just apply date math in order to get the grouping boundaries:
db.collection.aggregate([
    // Match documents on the range you want
    { "$match": {
        "timestamp": {
            "$gte": 1396310400, "$lte": 1396915200
        },
        "service": "a715feac3db42f54edbc50ef6fa057b3"
    }},
    // Group on the time intervals, 15 minutes here
    { "$group": {
        "_id": {
            "service": "$service",
            "time": {
                "$subtract": [
                    "$timestamp",
                    { "$mod": [ "$timestamp", 60 * 15 ] }
                ]
            }
        },
        "load": { "$avg": "$system.load" }
    }},
    // Project to the output form you want
    { "$project": {
        "service": "$_id.service",
        "time": "$_id.time",
        "load": 1
    }}
])
Or, to be PHP specific:
$tbl->aggregate(array(
    array(
        '$match' => array(
            'timestamp' => array(
                '$gte' => 1396310400, '$lte' => 1396915200
            ),
            'service' => 'a715feac3db42f54edbc50ef6fa057b3'
        )
    ),
    array(
        '$group' => array(
            '_id' => array(
                'service' => '$service',
                'time' => array(
                    '$subtract' => array(
                        '$timestamp',
                        array( '$mod' => array( '$timestamp', 60 * 15 ) )
                    )
                )
            ),
            'load' => array( '$avg' => '$system.load' )
        )
    ),
    array(
        '$project' => array(
            'service' => '$_id.service',
            'time' => '$_id.time',
            'load' => 1
        )
    )
));
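Either way the output is ready to plot. As a hedged sketch (assuming the legacy driver, where aggregate() returns the documents under a 'result' key), reshaping it into the [time, value] pairs Flot and HighCharts expect could look like:
$response = $tbl->aggregate( /* the pipeline above */ );
$points = array();
foreach ($response['result'] as $row) {
    // charting libraries expect millisecond timestamps on the x-axis
    $points[] = array($row['time'] * 1000, $row['load']);
}
echo json_encode($points);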
Otherwise if you choose to convert to BSON dates then you can use the date aggregation operators instead:
db.collection.aggregate([
    { "$match": {
        "timestamp": {
            "$gte": new Date("2014-04-01"), "$lte": new Date("2014-04-08")
        },
        "service": "a715feac3db42f54edbc50ef6fa057b3"
    }},
    { "$group": {
        "_id": {
            "service": "$service",
            "time": {
                "dayOfYear": { "$dayOfYear": "$timestamp" },
                "hour": { "$hour": "$timestamp" },
                "minute": {
                    "$subtract": [
                        { "$minute": "$timestamp" },
                        { "$mod": [
                            { "$minute": "$timestamp" },
                            15
                        ]}
                    ]
                }
            }
        },
        "load": { "$avg": "$system.load" }
    }},
    { "$project": {
        "service": "$_id.service",
        "time": "$_id.time",
        "load": 1
    }}
])
So there you have the help of the date aggregation operators to break up the parts of the date you have, while still using the same modulo operation in order to get interval values.
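For completeness, a PHP rendering of that date-operator pipeline, as a sketch assuming the legacy driver's MongoDate type for both the stored field and the range bounds:
$tbl->aggregate(array(
    array(
        '$match' => array(
            'timestamp' => array(
                '$gte' => new MongoDate(strtotime('2014-04-01')),
                '$lte' => new MongoDate(strtotime('2014-04-08'))
            ),
            'service' => 'a715feac3db42f54edbc50ef6fa057b3'
        )
    ),
    array(
        '$group' => array(
            '_id' => array(
                'service' => '$service',
                'time' => array(
                    'dayOfYear' => array('$dayOfYear' => '$timestamp'),
                    'hour' => array('$hour' => '$timestamp'),
                    'minute' => array(
                        '$subtract' => array(
                            array('$minute' => '$timestamp'),
                            array('$mod' => array(array('$minute' => '$timestamp'), 15))
                        )
                    )
                )
            ),
            'load' => array('$avg' => '$system.load')
        )
    )
));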
If you still prefer the date math approach, you can do this with date objects as well, since the result of subtracting one date object from another is the epoch timestamp value (in milliseconds). So moving a BSON date to an epoch timestamp is just a matter of:
{
    "$subtract": [
        "$dateObjectField",
        new Date("1970-01-01")
    ]
}
So any "date" values you pass into the pipeline here can be cast using the native type methods of your driver, and they will be serialized correctly when the request is sent to MongoDB. The other advantage is that the same is true when you read them back, so there is no more need for conversion in client processing.
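Putting those two pieces together, a sketch of a $group stage that buckets BSON dates into 15-minute intervals via epoch math, assuming the legacy driver where new MongoDate(0) is the epoch:
array(
    '$group' => array(
        '_id' => array(
            '$subtract' => array(
                array('$subtract' => array('$timestamp', new MongoDate(0))),
                array('$mod' => array(
                    array('$subtract' => array('$timestamp', new MongoDate(0))),
                    1000 * 60 * 15 // bucket size in milliseconds
                ))
            )
        ),
        'load' => array('$avg' => '$system.load')
    )
)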

Related

comparing watch times of video with watch time coming from app to avoid duplicate entries in db

{
    "start": "0",
    "end": "5"
},
{
    "start": "5",
    "end": "25"
},
{
    "start": "20",
    "end": "50"
},
{
    "start": "60",
    "end": "150"
},
{
    "start": "40",
    "end": "60"
},
{
    "start": "0",
    "end": "10"
},
{
    "start": "1",
    "end": "2"
},
{
    "start": "2",
    "end": "3"
},
{
    "start": "40",
    "end": "50"
}
This is the data stored in the database, and I receive a payload from the app which sometimes has duplicate entries, like the following request:
[
    [
        'start' => 0,
        'end' => 1
    ],
    [
        'start' => 1,
        'end' => 2
    ],
    [
        'start' => 2,
        'end' => 3
    ],
    [
        'start' => 4,
        'end' => 5
    ]
]
All of these entries are already covered by the interval starting at 0 and ending at 10, so I don't want them to be inserted into the database again.
I can help you with some steps you should follow (a sketch of the merge logic is included after these steps):
1. Take the smallest start from all the "watching" object arrays and create a new array of intervals whose first entry has this smallest start.
2. Then compare the end of that interval with all other starts using <=, and if it matches, replace the end with that other interval's end. Keep doing this until you have resolved all the overlaps.
3. Calculate the time difference between start and end for every merged interval, then sum the differences.
4. Calculate the percentage with the sum.
5. You can save the result from step 2 in your database if you need to return to the frontend which times a user has watched.
You may not get a clear understanding of this logic from this answer alone, but you can follow this approach to solve your problem.
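A minimal sketch of steps 1-2 in PHP, assuming intervals as ['start' => int, 'end' => int] arrays like the payload above:
function mergeIntervals(array $intervals) {
    // Sort by start so any overlap is always with the previous merged interval
    usort($intervals, function ($a, $b) {
        return $a['start'] - $b['start'];
    });
    $merged = array();
    foreach ($intervals as $current) {
        $last = count($merged) - 1;
        if ($last >= 0 && $current['start'] <= $merged[$last]['end']) {
            // Overlap: extend the previous interval if this one reaches further
            $merged[$last]['end'] = max($merged[$last]['end'], $current['end']);
        } else {
            $merged[] = $current;
        }
    }
    return $merged;
}

// Steps 3-4: the total watched time is the sum of the merged spans
$total = 0;
foreach (mergeIntervals($intervals) as $span) {
    $total += $span['end'] - $span['start'];
}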

How to filter elasticsearch by date field with date modification

To filter by date, I use the following query:
'body' => [
    'query' => [
        'bool' => [
            'filter' => [
                'range' => [
                    'expire_at' => [
                        'gte' => now()
                    ]
                ]
            ]
        ]
    ]
]
UPD: All records have another date field - last_checked. The question is how to select records in which, for example, (expire_at - 7 days) > last_checked?
Check the range query documentation: you can use date math in the range parameters. E.g., expire_at - 7 days < now() means that the expiration will be within the next 7 days. Then you can do:
"range": {
"expire_at": {
"lt": "now+7d/d"
}
}
Note that this will include also already expired items. If you want to avoid that, you can add the condition that the expiration date is not met yet:
"range": {
"expire_at": {
"lt": "now+7d/d",
"gte": "now/d"
}
}
Use this code:
"expire_at" => array(
"lt" => "now+7d/d"
)

Merging doc_count result from keyed buckets

I have a query like
'aggs' => [
    'deadline' => [
        'date_histogram' => [
            'field' => 'deadline',
            'interval' => 'month',
            'keyed' => true,
            'format' => 'MMM'
        ]
    ]
]
The result I am getting is buckets with month names as keys.
The problem I am facing is that buckets keyed by month name for a previous year are overwritten by the same month of the next year (because the key is, obviously, the same).
I want results where the doc_count of the overwritten buckets from the previous year is merged with the doc_count of the next.
You can either add a separate month field during indexing and perform the aggregation on it, or use the script below:
{
    "size": 0,
    "aggs": {
        "deadline": {
            "histogram": {
                "script": { "inline": "return doc['deadline'].value.getMonthOfYear()" },
                "interval": 1
            }
        }
    }
}
Creating a separate month field will have better performance.
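For illustration, a hypothetical indexing call with such a precomputed field might look like this (the deadline_month field name and the index values are made up for the example):
$client->index([
    'index' => 'your_index',
    'type' => '_doc',
    'id' => $id,
    'body' => [
        'deadline' => '2017-03-15',
        'deadline_month' => 3, // precomputed at index time; aggregate on this field
    ],
]);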
Replace the format from MMM to YYYY-MMM as below:
'aggs' => [
    'deadline' => [
        'date_histogram' => [
            'field' => 'deadline',
            'interval' => 'month',
            'keyed' => true,
            'format' => 'YYYY-MMM'
        ]
    ]
]
After this you can handle the merging process at your application level
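For example, a hedged sketch of that application-level merge, assuming the keyed response shape of elasticsearch-php (with 'keyed' => true, the buckets come back as an associative array keyed by the formatted date):
$totals = [];
foreach ($response['aggregations']['deadline']['buckets'] as $key => $bucket) {
    $month = substr($key, strpos($key, '-') + 1); // '2016-Mar' -> 'Mar'
    if (!isset($totals[$month])) {
        $totals[$month] = 0;
    }
    // the same month from different years now accumulates instead of overwriting
    $totals[$month] += $bucket['doc_count'];
}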

Elasticsearch php : aggregations of documents with date interval

I'm trying to build a faceted search using Elasticsearch-php 6.0, but I'm having a hard time figuring out how to use a date range aggregation. Here's what I'm trying to do:
Mapping sample :
"mappings": {
"_doc": {
"properties": {
...
"timeframe": {
"properties": {
"gte": {
"type": "date",
"format": "yyyy"
},
"lte": {
"type": "date",
"format": "yyyy"
}
}
}
...
In my document, I have this property:
"timeframe":[{"gte":"1701","lte":"1800"}]
I want to be able to display a facet with a date range slider, where the user can input a range (min value - max value). Ideally, those min-max values should be returned automatically by the Elasticsearch aggregation, given the current query.
Here's the aggregation I'm trying to write in "pseudo code", to give you an idea:
"aggs": {
"date_range": {
"field": "timeframe",
"format": "yyyy",
"ranges": [{
"from": min(timeframe.gte),
"to": max(timeframe.lte)
}]
}
}
I think I need to use the Date Range aggregation, min/max aggregations, and pipeline aggregations, but the more I read about them, the more confused I get. I can't figure out how to glue this whole thing together.
Keep in mind I can change the mapping and/or the document structure if this is not the correct way to achieve this.
Thanks!
As for me, with the official "elasticsearch/elasticsearch" package of ES itself, I was able to find a range of my required documents with the query below. You need to read the documentation, as you'll be needing the format.
$from_date = '2018-03-08T17:58:03Z';
$to_date = '2018-04-08T17:58:03Z';
$params = [
    'index' => 'your_index',
    'type' => 'your_type',
    'body' => [
        'query' => [
            'range' => [
                'my_date_field' => [
                    // gte = greater than or equal, lte = less than or equal
                    'gte' => $from_date,
                    // 'lte' => $to_date,
                    'format' => "yyyy-MM-dd||yyyy-MM-dd'T'HH:mm:ss'Z'",
                    'boost' => 2.0
                ]
            ]
        ],
    ]
];
$search = $client->search($params);
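That answers the range filtering, but for the min/max bounds the question asks about, a sketch using min and max aggregations might look like the following (untested; the field paths are taken from the mapping above):
$params = [
    'index' => 'your_index',
    'body' => [
        'size' => 0, // only the aggregation results are needed
        'aggs' => [
            'min_from' => ['min' => ['field' => 'timeframe.gte']],
            'max_to'   => ['max' => ['field' => 'timeframe.lte']],
        ],
    ],
];
$bounds = $client->search($params);
// $bounds['aggregations']['min_from']['value'] and
// $bounds['aggregations']['max_to']['value'] give the slider limits
// for the current query (as epoch milliseconds for date fields).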

Symfony/Doctrine/MongoDB Get every Nth item

I have a dataset which contains datapoints for every 5 seconds of the day. This results in a dataset of 17,280 items a day.
This set is way too big and I want it smaller (I'm using these items to draw a graph).
Since the graph's x-axis is over time, I decided a gap of 5 minutes per datapoint is good enough. That results in 288 datapoints a day: a lot less, and good enough to make a graph.
My MongoCollection looks like this:
{
    "timestamp": "12323455",
    "someKey": 123,
    "someOtherKey": 345,
    "someOtherOtherKey": 6789
}
The data gets posted into the database every 5 seconds, so the timestamp will differ by 5 seconds for each result.
As my x-axis is divided into 5-minute sequences, I'd love to calculate the average values of someKey, someOtherKey and someOtherOtherKey over these 5 minutes.
This new average will be one of the datapoints in my graph.
How would one get all the datapoints from 1 day, with each average 5 minutes apart from each other? (288 datapoints per day.)
For now I'm selecting every document since midnight today:
$result = $collection
    ->createQueryBuilder()
    ->field('timestamp')->gte($todayMidnight)
    ->sort('timestamp', 'desc')
    ->getQuery()
    ->execute();
How would one filter this list of data (within the same query) to get a datapoint for every 5 minutes (each datapoint being an average of the points within those 5 minutes)?
It would be nice to have this query built with Doctrine, as I'll need it in my Symfony application.
EDIT
I've first tried to get my query working within the mongo shell.
As suggested in the comments, I should start using aggregation.
The query I've made so far is based upon another question asked here on Stack Overflow.
This is the current query:
db.Pizza.aggregate([
    {
        $match: {
            timestamp: {$gte: 1464559200}
        }
    },
    {
        $group: {
            _id: {
                $subtract: [
                    "$timestamp",
                    {"$mod": ["$timestamp", 300]}
                ]
            },
            "timestamp": {"$first": "$timestamp"},
            "someKey": {"$first": "$someKey"},
            "someOtherKey": {"$first": "$someOtherKey"},
            "someOtherOtherKey": {"$first": "$someOtherOtherKey"}
        }
    }
])
This query will give me the last result for each 300 seconds (5 minutes) since today's midnight.
I want it to get all documents within those 300 seconds and calculate an average over the columns someKey, someOtherKey and someOtherOtherKey.
So if we take this example dataset:
{
    "timestamp": "1464559215",
    "someKey": 123,
    "someOtherKey": 345,
    "someOtherOtherKey": 6789
},
{
    "timestamp": "1464559220",
    "someKey": 54,
    "someOtherKey": 20,
    "someOtherOtherKey": 511
},
{
    "timestamp": "1464559225",
    "someKey": 654,
    "someOtherKey": 10,
    "someOtherOtherKey": 80
},
{
    "timestamp": "1464559505",
    "someKey": 90,
    "someOtherKey": 51,
    "someOtherOtherKey": 1
}
The query should return 2 rows, namely:
{
    "timestamp": "1464559225",
    "someKey": 277,
    "someOtherKey": 125,
    "someOtherOtherKey": 2460
},
{
    "timestamp": "1464559505",
    "someKey": 90,
    "someOtherKey": 51,
    "someOtherOtherKey": 1
}
The first result is calculated like this:
Result 1 - someKey = (123+54+654)/3 = 277
Result 1 - someOtherKey = (345+20+10)/3 = 125
Result 1 - someOtherOtherKey = (6789+511+80)/3 = 2460
How would one make this calculation within the mongoshell with the aggregation function?
Based on the answers given here on Stack Overflow, I've managed to get exactly what I wanted.
This is the big aggregation query I have to make to get all my results back:
db.Pizza.aggregate([
    {
        $match: {
            timestamp: {$gte: 1464559200}
        }
    },
    {
        $group: {
            _id: {
                $subtract: [
                    '$timestamp',
                    {$mod: ['$timestamp', 300]}
                ]
            },
            timestamp: {$last: '$timestamp'},
            someKey: {$avg: '$someKey'},
            someOtherKey: {$avg: '$someOtherKey'},
            someOtherOtherKey: {$avg: '$someOtherOtherKey'}
        }
    },
    {
        $project: {
            _id: 0,
            timestamp: '$timestamp',
            someKey: '$someKey',
            someOtherKey: '$someOtherKey',
            someOtherOtherKey: '$someOtherOtherKey'
        }
    }
])
The Match part gets every result after today's midnight (the timestamp of today at midnight).
The Group part is the most interesting part. Here we take every document found, calculate a modulus to bucket it into a 300-second (5-minute) interval, and fill the timestamp property with the last timestamp in each bucket.
The Project part is necessary to remove the _id from the actual result, as the result no longer represents a single document in the database.
The given answers this answer is based on:
MongoDB - Aggregate max/min/average for multiple variables at once
How to subtract in mongodb php
MongoDB : Aggregation framework : Get last dated document per grouping ID
Doctrine Solution
$collection->aggregate([
    [
        '$match' => [
            'timestamp' => ['$gte' => 1464559200]
        ]
    ],
    [
        '$group' => [
            '_id' => [
                '$subtract' => [
                    '$timestamp',
                    [
                        '$mod' => ['$timestamp', 300]
                    ]
                ]
            ],
            'timestamp' => [
                '$last' => '$timestamp'
            ],
            $someKey => [
                '$avg' => '$' . $someKey
            ],
            $someOtherKey => [
                '$avg' => '$' . $someOtherKey
            ],
            $someOtherOtherKey => [
                '$avg' => '$' . $someOtherOtherKey
            ]
        ]
    ]
]);
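As a usage note, aggregate() here should return an iterator over plain result documents (that is at least the behavior of Doctrine MongoDB 1.x), so feeding the graph is just a matter of reshaping, for example:
$result = $collection->aggregate([ /* the pipeline above */ ]);
$points = [];
foreach ($result as $doc) {
    // millisecond x-values for the charting library
    $points[] = [$doc['timestamp'] * 1000, $doc['someKey']];
}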
