I have a dataset that contains a datapoint for every 5 seconds of the day, which comes to 17280 items per day.
This set is far too big and I want to make it smaller (I'm using these items to draw a graph).
Since the graph's x-axis is time, I decided that one datapoint per 5 minutes is good enough. That comes to 288 datapoints per day, which is a lot less and still enough to draw a graph.
My MongoCollection looks like this:
{
"timestamp":"12323455",
"someKey":123,
"someOtherKey": 345,
"someOtherOtherKey": 6789
}
The data gets posted every 5 seconds into the database. So the timestamp will differ 5 seconds for each result.
As my x-axis is divided into 5-minute intervals, I'd love to calculate the average values of someKey, someOtherKey and someOtherOtherKey over each of these 5-minute intervals.
Each average will be one of the datapoints in my graph.
How would one get all the datapoints for 1 day as averages spaced 5 minutes apart from each other (288 datapoints per day)?
For now I'm selecting every document since midnight today:
$result = $collection
->createQueryBuilder()
->field('timestamp')->gte($todayMidnight)
->sort('timestamp', 'DSC')
->getQuery()
->execute();
How would one filter this list of data (within the same query) to get one datapoint for every 5 minutes (the datapoint being an average of the points within these 5 minutes)?
It would be nice to have this query built with Doctrine, as I'll need it in my Symfony application.
EDIT
I've tried to get my query working in the mongo shell first.
As suggested in the comments, I should start using aggregation.
The query I've made so far is based on another question asked here on Stack Overflow.
This is the current query:
db.Pizza.aggregate([
{
$match:
{
timestamp: {$gte: 1464559200}
}
},
{
$group:
{
_id:
{
$subtract: [
"$timestamp",
{"$mod": ["$timestamp", 300]}
]
},
"timestamp":{"$first":"$timestamp"},
"someKey":{"$first":"$someKey"},
"someOtherKey":{"$first":"$someOtherKey"},
"someOtherOtherKey":{"$first":"$someOtherOtherKey"}
}
}
])
This query gives me the first document of every 300-second (5-minute) window since midnight today.
Instead, I want it to take all documents within each 300-second window and calculate the average of the fields someKey, someOtherKey and someOtherOtherKey.
So if we take this example dataset:
{
"timestamp":"1464559215",
"someKey":123,
"someOtherKey": 345,
"someOtherOtherKey": 6789
},
{
"timestamp":"1464559220",
"someKey":54,
"someOtherKey": 20,
"someOtherOtherKey": 511
},
{
"timestamp":"1464559225",
"someKey":654,
"someOtherKey": 10,
"someOtherOtherKey": 80
},
{
"timestamp":"1464559505",
"someKey":90,
"someOtherKey": 51,
"someOtherOtherKey": 1
}
The query should return 2 rows namely:
{
"timestamp":"1464559225",
"someKey":277,
"someOtherKey": 125,
"someOtherOtherKey": 2460
},
{
"timestamp":"1464559505",
"someKey":90,
"someOtherKey": 51,
"someOtherOtherKey": 1
}
The first result is calculated like this:
Result 1 - someKey = (123+54+654)/3 = 277
Result 1 - someOtherKey = (345+20+10)/3 = 125
Result 1 - someOtherOtherKey = (6789+511+80)/3 = 2460
How would one make this calculation within the mongoshell with the aggregation function?
Based on the answers given here on Stack Overflow, I've managed to get exactly what I wanted.
This is the big aggregation query I have to run to get all my results back:
db.Pizza.aggregate([
{
$match:
{
timestamp: {$gte: 1464559200}
}
},
{
$group:
{
_id:
{
$subtract: [
'$timestamp',
{$mod: ['$timestamp', 300]}
]
},
timestamp: {$last: '$timestamp'},
someKey: {$avg: '$someKey'},
someOtherKey: {$avg: '$someOtherKey'},
someOtherOtherKey: {$avg: '$someOtherOtherKey'}
}
},
{
$project:
{
_id: 0,
timestamp: '$timestamp',
someKey: '$someKey',
someOtherKey:'$someOtherKey',
someOtherOtherKey:'$someOtherOtherKey'
}
}
])
The $match stage selects every document from today's midnight onward (the value is the timestamp of midnight today).
The $group stage is the most interesting part. For every matched document it subtracts the timestamp modulo 300 from the timestamp, which yields the start of the 300-second (5-minute) window the document falls in, and uses that value as the group key. Within each group, timestamp is set to the timestamp of the last document and every other field is averaged with $avg.
The $project stage removes the _id from the result, as the result no longer represents a document in the database.
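To make the grouping step concrete, here is a minimal plain-JavaScript sketch of what the $group stage computes, run against the example dataset from the question. It does not talk to MongoDB; bucketAverages is a hypothetical helper name used only for illustration, and it casts the string timestamps to numbers, which the real pipeline would need stored as numbers already.

```javascript
// Plain-JavaScript imitation of the $match-free part of the pipeline.
// bucketAverages is a hypothetical helper, not part of any MongoDB driver.
function bucketAverages(docs, windowSeconds, keys) {
  const buckets = new Map(); // bucket start -> running totals
  for (const doc of docs) {
    const ts = Number(doc.timestamp); // the sample data stores timestamps as strings
    const bucketStart = ts - (ts % windowSeconds); // same idea as $subtract with $mod
    if (!buckets.has(bucketStart)) {
      const sums = {};
      for (const k of keys) sums[k] = 0;
      buckets.set(bucketStart, { count: 0, last: ts, sums });
    }
    const bucket = buckets.get(bucketStart);
    bucket.count += 1;
    bucket.last = ts; // like $last: the last document seen in this bucket
    for (const k of keys) bucket.sums[k] += doc[k];
  }
  // One output row per bucket, averages instead of sums (like $avg)
  return [...buckets.values()].map((bucket) => {
    const row = { timestamp: bucket.last };
    for (const k of keys) row[k] = bucket.sums[k] / bucket.count;
    return row;
  });
}

const sample = [
  { timestamp: "1464559215", someKey: 123, someOtherKey: 345, someOtherOtherKey: 6789 },
  { timestamp: "1464559220", someKey: 54, someOtherKey: 20, someOtherOtherKey: 511 },
  { timestamp: "1464559225", someKey: 654, someOtherKey: 10, someOtherOtherKey: 80 },
  { timestamp: "1464559505", someKey: 90, someOtherKey: 51, someOtherOtherKey: 1 },
];

const rows = bucketAverages(sample, 300, ["someKey", "someOtherKey", "someOtherOtherKey"]);
console.log(rows);
```

The first three documents fall into the window starting at 1464559200 and the fourth into the window starting at 1464559500, so two rows come back. Note that in MongoDB the document picked by $last depends on the order in which documents reach the $group stage, so a $sort on timestamp before $group is probably needed if that field matters.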
Answers this answer is based on:
MongoDB - Aggregate max/min/average for multiple variables at once
How to subtract in mongodb php
MongoDB : Aggregation framework : Get last dated document per grouping ID
Doctrine Solution
$collection->aggregate([
[
'$match' => [
'timestamp' => ['$gte' => 1464559200]
]
],
[
'$group' => [
'_id' => [
'$subtract' => [
'$timestamp',
[
'$mod' => ['$timestamp',300]
]
]
],
'timestamp' => [
'$last' => '$timestamp'
],
'someKey' => [
'$avg' => '$someKey'
],
'someOtherKey' => [
'$avg' => '$someOtherKey'
],
'someOtherOtherKey' => [
'$avg' => '$someOtherOtherKey'
]
]
]
]);
Related
{ "start": "0", "end": "5" },
{ "start": "5", "end": "25" },
{ "start": "20", "end": "50" },
{ "start": "60", "end": "150" },
{ "start": "40", "end": "60" },
{ "start": "0", "end": "10" },
{ "start": "1", "end": "2" },
{ "start": "2", "end": "3" },
{ "start": "40", "end": "50" }
This is the data stored in the database. I receive a payload from the app which sometimes contains duplicate entries, like the following request:
[
[
'start' => 0,
'end' => 1
],
[
'start' => 1,
'end' => 2
],
[
'start' => 2,
'end' => 3
],
[
'start' => 4,
'end' => 5
]
]
All of the entries above are already covered by the stored interval that starts at 0 and ends at 10, so I don't want them to be inserted into the database again.
I can help you with some steps you should follow:
1. Take the smallest start from all the "watching" object's arrays and create a new array of arrays whose first entry starts at this smallest value.
2. Compare the end of that entry with the start of each other array; if the start is <= the end, the two overlap, so replace the end with the larger of the two ends. Keep doing this until all overlaps are resolved.
3. Calculate the time difference between start and end for each resulting array and then sum the differences.
4. Calculate the percentage from the sum.
5. You can save the result from step 2 in your database if you need to return to the frontend which times a user has watched.
This answer may not make the logic completely clear, but you can follow this approach to solve your problem.
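The merging steps above can be sketched in plain JavaScript as follows. This is only an illustration under assumptions: mergeIntervals and isCovered are hypothetical helper names, the stored values are cast from strings to numbers, and intervals that merely touch (start equal to the previous end) are treated as overlapping, following the <= comparison in the steps.

```javascript
// Hypothetical helpers for illustration; not part of any library.
function mergeIntervals(intervals) {
  // Cast to numbers and sort by start so overlaps are always adjacent
  const sorted = intervals
    .map((i) => ({ start: Number(i.start), end: Number(i.end) }))
    .sort((a, b) => a.start - b.start);
  const merged = [];
  for (const current of sorted) {
    const last = merged[merged.length - 1];
    if (last && current.start <= last.end) {
      // Overlap (or touching): extend the previous interval if needed
      last.end = Math.max(last.end, current.end);
    } else {
      merged.push({ start: current.start, end: current.end });
    }
  }
  return merged;
}

// True when the interval lies completely inside one merged interval
function isCovered(interval, merged) {
  return merged.some((m) => m.start <= interval.start && interval.end <= m.end);
}

const stored = [
  { start: "0", end: "5" }, { start: "5", end: "25" }, { start: "20", end: "50" },
  { start: "60", end: "150" }, { start: "40", end: "60" }, { start: "0", end: "10" },
  { start: "1", end: "2" }, { start: "2", end: "3" }, { start: "40", end: "50" },
];
const payload = [
  { start: 0, end: 1 }, { start: 1, end: 2 }, { start: 2, end: 3 }, { start: 4, end: 5 },
];

const merged = mergeIntervals(stored);
const totalWatched = merged.reduce((sum, m) => sum + (m.end - m.start), 0);
const toInsert = payload.filter((p) => !isCovered(p, merged));
console.log(merged, totalWatched, toInsert);
```

With the stored data from the question everything collapses into a single interval, so none of the payload entries would need to be inserted. Whether touching intervals should count as one continuous range is a design choice that depends on how the app reports watch positions.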
I'm building the aggregation in Compass first and then need to convert it for the PHP library.
So far, I have a 1st stage that filters the documents on 2 fields using $match:
comp.id (sub-document / array)
playerId
Code is:
$match (from drop-down)
{
"comp.id" : ObjectId('607019361c071256e4f0d0d5'),
"playerId" : "609d0993906429612483cea0"
}
This returns 2 documents.
The document has a sub-array holes, for the holes played in a round of golf. This sub-array has fields (among others):
holes.no
holes.par
holes.grossScore
holes.nettPoints
So each round has 1 document, with a holes sub-array of (typically) 18 array elements (holes), or 9 for half-round. A player will play each round multiple times - hence multiple documents.
I would like to find the highest holes.nettPoints across the documents. I think I need to $group with $max on the holes.nettPoints field, so I would find the highest score for each hole across all rounds.
I have tried this, but Compass says it's not properly formatted:
$group drop-down
{
_id: holes.no,
"highest":
{ $max: "$holes.nettPoints" }
}
"highest" can be any name I want?
EDIT FOLLOWING PROVIDED ANSWER
The answer marked as the solution was enough of a pointer to how the Aggregation Framework operates: it is multi-stage, i.e. the documents that go into one stage become new documents as the output of that stage, and so on.
For posterity, I ended up using the following aggregation:
[{$match: {
"comp.id" : ObjectId('607019361c071256e4f0d0d5'),
"playerId" : "609d0993906429612483cea0",
"comp.courseId" : "608955aaabebbd503ba6e116"
}
}, {$unwind: {
path : "$holes"
}}, {$group: {
_id: "$holes.no",
hole: {
$max: "$holes"
}
}}, {$sort: {
"hole.no": 1
}}]
In PHP speak, it looks like:
$match = [
'$match' => [
'comp.id' => new MongoDB\BSON\ObjectID( $compId ),
'playerId' => $playerId,
'comp.courseId' => $courseId
]
];
$unwind = [
'$unwind' => [
'path' => '$holes'
]
];
$group = [
'$group' => [
'_id' => '$holes.no',
'hole' => [
'$max' => '$holes'
]
]
];
$sort = [
'$sort' => [
'hole.no' => 1
]
];
$cursor = $collection->aggregate([$match, $unwind, $group, $sort]);
It is not complete (looking at adding a $sum accumulator across the courseId, not individual documents), but answers the question posted.
$match your conditions
$unwind deconstruct holes array
$sort by nettPoints in descending order
$group by no and select first holes object
[
{
$match: {
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"playerId": "609d0993906429612483cea0"
}
},
{ $unwind: "$holes" },
{ $sort: { "holes.nettPoints": -1 } },
{
$group: {
_id: "$holes.no",
highest: { $first: "$holes" }
}
}
]
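To see what the $unwind, $sort and $group stages do to the documents, here is a small in-memory imitation in plain JavaScript. It is a sketch only: highestPerHole is a hypothetical helper, and the two rounds are made-up sample data with just two holes each.

```javascript
// Hypothetical in-memory imitation of the pipeline; no MongoDB involved.
function highestPerHole(rounds) {
  // $unwind: one entry per element of each holes array
  const unwound = rounds.flatMap((round) => round.holes);
  // $sort: highest nettPoints first
  unwound.sort((a, b) => b.nettPoints - a.nettPoints);
  // $group with $first: keep the first (highest) entry seen per hole number
  const best = new Map();
  for (const hole of unwound) {
    if (!best.has(hole.no)) best.set(hole.no, hole);
  }
  return [...best.entries()].map(([no, highest]) => ({ _id: no, highest }));
}

// Made-up sample rounds for illustration
const rounds = [
  { holes: [
    { no: 1, par: 4, grossScore: 5, nettPoints: 2 },
    { no: 2, par: 3, grossScore: 3, nettPoints: 3 },
  ] },
  { holes: [
    { no: 1, par: 4, grossScore: 4, nettPoints: 3 },
    { no: 2, par: 3, grossScore: 5, nettPoints: 1 },
  ] },
];

const result = highestPerHole(rounds);
console.log(result);
```

Because the whole hole object is kept, the other fields (par, grossScore) of the best-scoring entry stay available in the output.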
I'm using MongoDB to store server statistics that are captured every 15 seconds (so 4 rows get inserted each minute per server) and am trying to get this data plotted onto a graph for all data between a certain timestamp.
For example, the following query can be used:
$tbl->find(
array(
"timestamp" => array('$gte' => '1396310400', '$lte' => '1396915200'),
"service" => 'a715feac3db42f54edbc50ef6fa057b3'
),
array("timestamp" => 1, "system" => 1)
);
Which spits out a bunch of rows that look like this:
Array
(
[53933ad8532965621d97dd3b] => Array
(
[_id] => MongoId Object
(
[$id] => 53933ad8532965621d97dd3b
)
[system] => Array
(
[load] => 0.55
[uptime] => 1171204.47
[processes] => 222
)
[timestamp] => 1396310403
)
)
This works fine for small data ranges, as I can pass this data directly into Flot or HighCharts and let it prettify the time scales itself. However this doesn't work for large data sets (for example querying over a month).
What I'm trying to do is group the data by hour (or 15 minutes), and return the average values (in this example, it's system.load that I'm plotting) for that given time period.
I know that the aggregate function is what I need be using, but despite my best efforts I've been unable to get this working.
Right now I'm letting PHP do all of the work (grouping the results by timestamp and working out the averages) but it's extremely slow and I know MongoDB would handle it better.
Any insight would be greatly appreciated!
Edit:
I've been trying to follow the answer posted here but am still struggling - MongoDB Aggregation PHP, Group by Hours
I'm looking at your initial query at the top of your question and it immediately tells me that your "timestamp" values are actually strings. So no doubt that when you are reading this information and doing your "manual aggregation" you are actually casting these values, and possibly others into types that you can manipulate, sum and average.
So the first part here is to fix your data. It looks like it has come from a logging source, but the values were never converted. I consider it reasonably likely that this applies not just to the timestamp values but also to your metrics under system.
This leaves you with a choice of how to store your timestamp. You can either keep it as a timestamp number (it is currently that number in string form), or you can convert it to a BSON date type. The first is a simple integer cast and save back; for the second you should be able to feed the value to the Date type supported by the driver and again save back the data.
Once you have done this, you can happily use the aggregation functions. For example, if you choose to keep this as a number, you just apply date math to get the grouping boundaries:
db.collection.aggregate([
// Match documents on the range you want
{ "$match": {
"timestamp": {
"$gte": 1396310400, "$lte": 1396915200
},
"service": "a715feac3db42f54edbc50ef6fa057b3"
}},
// Group on the time intervals, 15 minutes here
{ "$group": {
"_id": {
"service": "$service",
"time": {
"$subtract": [
"$timestamp",
{ "$mod": [ "$timestamp", 60 * 15 ] }
]
}
},
"load": { "$avg": "$system.load" }
}},
// Project to the output form you want
{ "$project": {
"service": "$_id.service",
"time" : "$_id.time",
"load": 1
}}
])
Or, to be PHP-specific:
$tbl->aggregate(array(
array(
'$match' => array(
'timestamp' => array(
'$gte' => 1396310400, '$lte' => 1396915200
),
'service' => 'a715feac3db42f54edbc50ef6fa057b3'
)
),
array(
'$group' => array(
'_id' => array(
'service' => '$service',
'time' => array(
'$subtract' => array(
'$timestamp',
array( '$mod' => array('$timestamp', 60 * 15 ) )
)
)
),
'load' => array( '$avg' => '$system.load' )
)
),
array(
'$project' => array(
'service' => '$_id.service',
'time' => '$_id.time',
'load' => 1
)
)
));
Otherwise if you choose to convert to BSON dates then you can use the date aggregation operators instead:
db.collection.aggregate([
{ "$match": {
"timestamp": {
"$gte": new Date("2014-04-01"), "$lte": new Date("2014-04-08")
},
"service": "a715feac3db42f54edbc50ef6fa057b3"
}},
{ "$group": {
"_id": {
"service": "$service",
"time": {
"dayOfYear": { "$dayOfYear": "$timestamp" },
"hour": { "$hour": "$timestamp" },
"minute": {
"$subtract": [
{ "$minute": "$timestamp" },
{
"$mod": [
{ "$minute": "$timestamp" },
15
]
}
]
}
}
},
"load": { "$avg": "$system.load" }
}},
{ "$project": {
"service": "$_id.service",
"time": "$_id.time",
"load": 1
}}
])
So the date aggregation operators break the date into its parts, and the same modulo operation still gives you the interval values.
If you still prefer the date math approach, you can do this with date objects as well, because subtracting one date object from another yields the difference in milliseconds. Subtracting the epoch date therefore turns a BSON date into an epoch timestamp in milliseconds:
{
"$subtract": [
"$dateObjectField",
new Date("1970-01-01")
]
}
So any "date" values you pass in to the pipeline here you can cast using the native type methods of your driver and it will be serialized correctly when the request is sent to MongoDB. The other advantage is the same is true when you read them back, so there is no more need for conversion in client processing.
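The date subtraction can be checked outside the pipeline with ordinary JavaScript Date objects, which behave the same way for this purpose. The sketch below is an illustration only; bucketStartMs is a hypothetical helper that applies the same modulo idea to millisecond values.

```javascript
// Subtracting two Date objects yields the difference in milliseconds
const epoch = new Date("1970-01-01");
const start = new Date("2014-04-01"); // date-only ISO strings are parsed as UTC
const millis = start - epoch;
const seconds = millis / 1000; // matches the numeric $gte bound used earlier

// Hypothetical helper: round a date down to the start of its N-minute interval
function bucketStartMs(date, minutes) {
  const size = minutes * 60 * 1000; // interval length in milliseconds
  const t = date.getTime();
  return new Date(t - (t % size));
}

const bucket = bucketStartMs(new Date("2014-04-01T10:37:20Z"), 15);
console.log(millis, seconds, bucket.toISOString());
```

The key point is the unit: the numeric approach earlier used 60 * 15 seconds, while date differences are in milliseconds, so the modulo operand needs the extra factor of 1000.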
I have a collection with documents that look like this:
{
_id: ObjectId("516eb5d2ef4501a804000000"),
accountCreated: "2013-04-17 16:46",
accountLevel: 0,
responderCount: 0
}
I want to group and count these documents based on the accountCreated date (count per day), but I am stuck with the handling of dates since the date includes time as well.
This is what I have, but it groups on the full value including the time, which means lots of entries that always have 1 as accounts.
$g = $form->mCollectionUsers->aggregate(array(
array( '$group' => array( '_id' => '$accountCreated', 'accounts' => array( '$sum' => 1 ) ) )
));
Is there a way to rewrite the date to only take day in account and skip the time?
I have found this example but I can't really figure out how to adapt it to my case.
If accountCreated is a date you can do it like this (I'll use the mongo shell syntax since I'm not familiar with the php driver):
db.mCollectionUsers.aggregate([
{$project :{
day : {"$dayOfMonth" : "$accountCreated"},
month : {"$month" : "$accountCreated"},
year : {"$year" : "$accountCreated"}
}},
{$group: {
_id : {year : "$year", month : "$month", day : "$day"},
accounts : { "$sum" : 1}
}}
]);
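In the sample document accountCreated is a string in the form "YYYY-MM-DD HH:MM", not a date, so the date operators above would not apply directly. As a hedged alternative, assuming every value follows that exact format, the first 10 characters can serve as the day key. The pipeline constant below uses the $substrCP operator and is untested against a real collection; countPerDay is a hypothetical plain-JavaScript equivalent, and the sample documents are made up.

```javascript
// Assumed pipeline for string values: group on the first 10 characters
const pipeline = [
  {
    $group: {
      _id: { $substrCP: ["$accountCreated", 0, 10] }, // "YYYY-MM-DD"
      accounts: { $sum: 1 },
    },
  },
];

// Hypothetical client-side equivalent of the same grouping
function countPerDay(docs) {
  const counts = {};
  for (const doc of docs) {
    const day = doc.accountCreated.slice(0, 10); // drop the time part
    counts[day] = (counts[day] || 0) + 1;
  }
  return counts;
}

// Made-up sample documents for illustration
const docs = [
  { accountCreated: "2013-04-17 16:46" },
  { accountCreated: "2013-04-17 18:02" },
  { accountCreated: "2013-04-18 09:15" },
];

const perDay = countPerDay(docs);
console.log(perDay);
```

Converting the field to a real date remains the cleaner option if the data can be migrated.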
If you want to display the date properly:
db.mCollectionUsers.aggregate([
{
$group: {
_id: { $dateToString: { format: '%Y-%m-%d', date: '$accountCreated' } },
count: { $sum: 1 }
}
},
{
$project: {
_id: 0,
date: '$_id',
count: 1
}
}
])
The result would look like:
[
{
"date": "2020-11-11",
"count": 8
},
{
"date": "2020-11-13",
"count": 3
},
{
"date": "2020-11-16",
"count": 3
}
]
I'm trying to display data on a line graph using Google Charts. The data displays fine, however I would like to set a date range to be displayed.
The data is sent from the database in a JSON literal format:
{
"cols": [
{"label": "Week", "type": "date"},
{"label": "Speed", "type": "number"},
{"type":"string","p":{"role":"tooltip"}},
{"type":"string","p":{"role":"tooltip"}},
{"type":"string","p":{"role":"tooltip"}},
{"type":"string","p":{"role":"tooltip"}}
],
"rows": [
{"c":[{"v": "Date('.$date.')"},{"v": null},{"v": null},{"v": null},{"v": null},{"v": null}]},
{"c":[{"v": "Date('.$date.')"},{"v": null},{"v": null},{"v": null},{"v": null},{"v": null}]}
]
}
Data is either displayed by week or month (null for easy reading) for example this week:
2012, 02, 06
2012, 02, 07
2012, 02, 09
Data isn't set for every day of the week, so in this example only the dates above are shown. What I would like to show is the whole range from the start of the week (2012, 02, 06) to the end of the week (2012, 02, 12), similar to the third example here.
I managed to get the whole week showing by checking whether each date exists in the database and, if not, appending an extra row with null data. However, this meant the line was not continuous and the dates were not in order.
Could anyone offer any advice on how I could go about doing this?
Thanks!
Did you try leaving the missing dates be missing dates (i.e. let the database return 2 values instead of 7)?
The continuous axis should handle missing dates, you just need to set the axis range from start to end of the week.
UPDATE
for interactive line chart the axis ranges can be set like this (as inspired by this thread):
hAxis: {...
viewWindowMode: 'explicit',
viewWindow: {min: new Date(2007,1,1),
max: new Date(2010,1,1)}
...}
see http://jsfiddle.net/REgJu/
"I managed to get the whole week showing by checking if the date exists in the database and if not append an extra row with null data, this however meant the line was not continuous and the dates were not in order."
I think you are on the right track; you just need to do it in a slightly different way. I use code like the following to make the data continuous.
$data = array(
1 => 50,
2 => 75,
4 => 65,
7 => 60,
);
$daysAgoStart = 1;
$daysAgoEnd = 14;
$continuousData = array();
for($daysAgo=$daysAgoStart ; $daysAgo<=$daysAgoEnd ; $daysAgo++){
if(array_key_exists($daysAgo, $data) == true){
$continuousData[$daysAgo] = $data[$daysAgo];
}
else{
$continuousData[$daysAgo] = 0;
}
}
$continuousData will now hold:
$continuousData = array(
1 => 50,
2 => 75,
3 => 0,
4 => 65,
5 => 0,
6 => 0,
7 => 60,
8 => 0,
9 => 0,
10 => 0,
11 => 0,
12 => 0,
13 => 0,
14 => 0,
);
in that order, and then the data can be used in the charts without any gaps.
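If the gap filling is done on the client before the rows are handed to the chart, the same idea can be sketched in JavaScript. fillMissingDays is a hypothetical helper, and filling with 0 simply follows the PHP example above; whether 0 or some other placeholder is the right filler depends on what the chart should show for days without data.

```javascript
// Hypothetical helper: return [day, value] pairs for every day in the range
function fillMissingDays(data, start, end, filler = 0) {
  const filled = [];
  for (let day = start; day <= end; day++) {
    const hasValue = Object.prototype.hasOwnProperty.call(data, day);
    filled.push([day, hasValue ? data[day] : filler]);
  }
  return filled;
}

const sparse = { 1: 50, 2: 75, 4: 65, 7: 60 };
const filled = fillMissingDays(sparse, 1, 7);
console.log(filled);
```

Building the rows in a single ordered loop is what keeps the dates in order, which was the problem with appending the missing rows afterwards.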
Perhaps you can use a different chart type? Dygraphs looks like it might be helpful.
Otherwise you may have to write your own custom chart type.