MongoDB Aggregation "group" with "max" field within a sub-array - php

I am building this in Compass initially, and then need to convert it to the PHP library.
So far, I have a 1st stage that filters the documents on 2 fields using $match:
comp.id (sub-document / array)
playerId
Code is:
$match (from drop-down)
{
"comp.id" : ObjectId('607019361c071256e4f0d0d5'),
"playerId" : "609d0993906429612483cea0"
}
This returns 2 documents.
The document has a sub-array holes, for the holes played in a round of golf. This sub-array has fields (among others):
holes.no
holes.par
holes.grossScore
holes.nettPoints
So each round has one document, with a holes sub-array of (typically) 18 array elements (holes), or 9 for a half-round. A player will play each round multiple times, hence multiple documents.
I would like to find the highest holes.nettPoints across the documents. I think I need to $group with $max on the holes.nettPoints field, so I would find the highest score for each hole across all rounds.
I have tried this, but Compass says it's not properly formatted:
$group drop-down
{
_id: holes.no,
"highest":
{ $max: "$holes.nettPoints" }
}
"highest" can be any name I want?
EDIT FOLLOWING PROVIDED ANSWER
The answer marked as the solution was enough of a pointer to how the Aggregation Framework operates (multi-stage pipelines, i.e. the documents that are input to one stage become the new documents output by that stage, and so on).
For posterity, I ended up using the following aggregation:
[
  { $match: {
      "comp.id": ObjectId('607019361c071256e4f0d0d5'),
      "playerId": "609d0993906429612483cea0",
      "comp.courseId": "608955aaabebbd503ba6e116"
  }},
  { $unwind: {
      path: "$holes"
  }},
  { $group: {
      _id: "$holes.no",
      hole: { $max: "$holes" }
  }},
  { $sort: {
      "hole": 1
  }}
]
In PHP speak, it looks like:
$match = [
    '$match' => [
        'comp.id' => new MongoDB\BSON\ObjectID( $compId ),
        'playerId' => $playerId,
        'comp.courseId' => $courseId
    ]
];
$unwind = [
    '$unwind' => [
        'path' => '$holes'
    ]
];
$group = [
    '$group' => [
        '_id' => '$holes.no',
        'hole' => [
            '$max' => '$holes'
        ]
    ]
];
$sort = [
    '$sort' => [
        'hole.no' => 1
    ]
];
$cursor = $collection->aggregate([$match, $unwind, $group, $sort]);
It is not complete (I am looking at adding a $sum accumulator across the courseId rather than individual documents), but it answers the question posted.
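As one possible interpretation of that (an assumption on my part, not final code), a $sum accumulator could be added alongside the $max to total the nett points for each hole across all matched rounds; the pointsTotal field name is made up for this sketch:
$group = [
    '$group' => [
        '_id' => '$holes.no',
        'hole' => [
            '$max' => '$holes'
        ],
        // pointsTotal is a hypothetical name for this sketch
        'pointsTotal' => [
            '$sum' => '$holes.nettPoints'
        ]
    ]
];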

$match your conditions
$unwind to deconstruct the holes array
$sort by nettPoints in descending order
$group by no and select the first holes object
[
  {
    $match: {
      "comp.id": ObjectId("607019361c071256e4f0d0d5"),
      "playerId": "609d0993906429612483cea0"
    }
  },
  { $unwind: "$holes" },
  { $sort: { "holes.nettPoints": -1 } },
  {
    $group: {
      _id: "$holes.no",
      highest: { $first: "$holes" }
    }
  }
]
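Translated to the MongoDB PHP library used in the question, that pipeline would look roughly like this (a sketch, reusing the $compId and $playerId variables from the question's edit):
$pipeline = [
    ['$match' => [
        'comp.id' => new MongoDB\BSON\ObjectID( $compId ),
        'playerId' => $playerId
    ]],
    ['$unwind' => '$holes'],
    ['$sort' => ['holes.nettPoints' => -1]],
    ['$group' => [
        '_id' => '$holes.no',
        'highest' => ['$first' => '$holes']
    ]]
];
$cursor = $collection->aggregate($pipeline);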

Related

Can I add tags to a "deeper" key in an Elasticsearch document?

I have products with tags, and the tags are inside tagtypes.
This is a sample document that I added to the index:
{
  "_index" : "products",
  "_type" : "_doc",
  "_id" : "1219",
  "_score" : 1.0,
  "_source" : {
    "id" : "1219",
    "product_no" : "26426492261",
    "merchant_id" : 11,
    "name" : "Apple »Magic Keyboard für das 12,9\" iPad Pro (4. Generation)« iPad-Tastatur",
    "category" : "Technik>Multimedia>Zubehör>Tastatur>iPad Tastatur",
    "deep_link" : "https://foo",
    "short_description" : null,
    "long_description" : "Apple:",
    "brand" : "Apple",
    "merchant_image_url" : "http://something",
    "tagtypes" : [
      [
        {
          "Memory" : [ ]
        }
      ]
    ]
  }
},
That tagtype "Memory" is dynamically created while indexing the products.
I tried to add tags to that key
//attach tags also to ES
$params = [
'index' => 'products',
'id' => $product['_id'],
'body' => [
'script' => [
'source' => 'if (!ctx._source.tagtypes.'.$tagType->name.'.contains(params.tag)) { ctx._source.tagtypes.'.$tagType->name.'.add(params.tag) }',
'lang' => 'painless',
'params' => [
'tag' => $tag->value
]
]
]
];
But I receive an error like:
{
  "error": {
    "root_cause": [
      { "type": "illegal_argument_exception", "reason": "failed to execute script" }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to execute script",
    "caused_by": {
      "type": "script_exception",
      "reason": "runtime error",
      "script_stack": [
        "if (!ctx._source.tagtypes[\"Memory\"].contains(params.tag)) { ",
        " ^---- HERE"
      ],
      "script": "if (!ctx._source.tagtypes[\"Memory\"].contains(params.tag)) { ctx._source.tagtypes[\"Memory\"].add(params.tag) }",
      "lang": "painless",
      "position": { "offset": 16, "start": 0, "end": 60 },
      "caused_by": {
        "type": "wrong_method_type_exception",
        "reason": "cannot convert MethodHandle(List,int)int to (Object,String)String"
      }
    }
  },
  "status": 400
}
Could anyone help me with that? I couldn't find any documentation about it, as the examples are often too basic.
Is it generally possible to save to "deeper" keys like this?
Or can I just create "tags" as a simple list (without any depth)?
Thanks in advance
Adrian!
Your field tagtypes is an array of arrays of objects which themselves contain one-key arrays.
When you're dealing with such "deep" structures, you'll need some form of iteration to update them.
For loops are a good place to start, but they often lead to java.util.ConcurrentModificationException. So it's easier to work with temporary copies of the data and then replace the corresponding _source attribute when done iterating:
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      if (ctx._source.tagtypes == null) { return; }
      def originalTagtypes = ctx._source.tagtypes;
      def newTagtypes = [];
      for (outerGroup in originalTagtypes) {
        // keep what we've got
        newTagtypes.addAll(outerGroup);
        // group already present?
        def atLeastOneGroupContainsTag = outerGroup.stream().anyMatch(tagGroup -> tagGroup.containsKey(params.tag));
        // if not, add it as a hashmap of one single empty list
        if (!atLeastOneGroupContainsTag) {
          Map m = new HashMap();
          m.put(params.tag, []);
          newTagtypes.add(m);
        }
      }
      ctx._source.tagtypes = [newTagtypes];
    """,
    "lang": "painless",
    "params": {
      "tag": "CPU"
    }
  }
}
which'll end up updating the tagtypes like so:
{
  ...
  "tagtypes" : [
    [
      {
        "Memory" : [ ]
      },
      {
        "CPU" : [ ]    <---
      }
    ]
  ],
  ...
}
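Since the question is calling Elasticsearch from PHP, here is a hedged sketch of sending that same script as an update-by-query with the official elasticsearch-php client's updateByQuery(); the $painlessSource variable is assumed to hold the Painless source shown above, and the tag value comes from the question's $tag->value:
// Sketch: run the Painless script above against the products index from PHP.
$params = [
    'index' => 'products',
    'body' => [
        'query' => ['match_all' => (object) []],   // or narrow this down to the product's _id
        'script' => [
            'source' => $painlessSource,           // assumed variable holding the script shown above
            'lang' => 'painless',
            'params' => ['tag' => $tag->value]
        ]
    ]
];
$response = $client->updateByQuery($params);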
You're right when you say that the documentation examples are too basic. Shameless plug: I recently published a handbook that aims to address exactly that. You'll find lots of non-trivial scripting examples to gain a better understanding of the Painless scripting language.

Elasticsearch php : aggregations of documents with date interval

I'm trying to build a faceted search using Elasticsearch-php 6.0, but I'm having a hard time figuring out how to use a date range aggregation. Here's what I'm trying to do:
Mapping sample:
"mappings": {
"_doc": {
"properties": {
...
"timeframe": {
"properties": {
"gte": {
"type": "date",
"format": "yyyy"
},
"lte": {
"type": "date",
"format": "yyyy"
}
}
}
...
In my document, I have this property:
"timeframe":[{"gte":"1701","lte":"1800"}]
I want to be able to display a facet with a date range slider, where the user can input a range (min value to max value). Ideally, those min/max values should be returned automatically by an Elasticsearch aggregation, given the current query.
Here's the aggregation I'm trying to write in "pseudo code", to give you an idea:
"aggs": {
"date_range": {
"field": "timeframe",
"format": "yyyy",
"ranges": [{
"from": min(timeframe.gte),
"to": max(timeframe.lte)
}]
}
}
I think I need to use the date range aggregation, min/max aggregations, and pipeline aggregations, but the more I read about them, the more confused I get. I can't work out how to glue all of this together.
Keep in mind I can change the mapping and / or the document structure if this is not the correct way to achieve this.
Thanks!
For me, with the official "elasticsearch/elasticsearch" package, I was able to find the documents I needed within a range using the query below.
You will need to read the documentation, as you'll need the date format.
$from_date = '2018-03-08T17:58:03Z';
$to_date = '2018-04-08T17:58:03Z';
$params = [
    'index' => 'your_index',
    'type' => 'your_type',
    'body' => [
        'query' => [
            'range' => [
                'my_date_field' => [
                    // gte = greater than or equal, lte = less than or equal
                    'gte' => $from_date,
                    // 'lte' => $to_date,
                    'format' => "yyyy-MM-dd||yyyy-MM-dd'T'HH:mm:ss'Z'",
                    'boost' => 2.0
                ]
            ]
        ],
    ]
];
$search = $client->search($params);

PHP MongoDB Driver and Aggregations

I am taking my first steps with MongoDB and PHP and trying to figure out how aggregations work. I have an approximate idea of how to use them from the command line, but I am trying to translate this for the PHP driver. I am using the restaurants example DB, a list of records like this:
{
  "_id" : ObjectId("59a5211e107765480896f3f8"),
  "address" : {
    "building" : "284",
    "coord" : [
      -73.9829239,
      40.6580753
    ],
    "street" : "Prospect Park West",
    "zipcode" : "11215"
  },
  "borough" : "Brooklyn",
  "cuisine" : "American",
  "grades" : [
    {
      "date" : ISODate("2014-11-19T00:00:00Z"),
      "grade" : "A",
      "score" : 11
    },
    {
      "date" : ISODate("2013-11-14T00:00:00Z"),
      "grade" : "A",
      "score" : 2
    },
    {
      "date" : ISODate("2012-12-05T00:00:00Z"),
      "grade" : "A",
      "score" : 13
    },
    {
      "date" : ISODate("2012-05-17T00:00:00Z"),
      "grade" : "A",
      "score" : 11
    }
  ],
  "name" : "The Movable Feast",
  "restaurant_id" : "40361606"
}
I just want to count how many restaurants there are per location. What I am doing is:
$client = new MongoDB\Client("mongodb://localhost:27017");
$collection = $client->myNewDb->restaurants;
$results = $collection->aggregate(
    [
        'name' => '$name'
    ],
    [
        '$group' => [
            'cuisine' => ['sum' => '$sum']
        ]
    ]
);
and I am getting this error
Fatal error: Uncaught exception 'MongoDB\Exception\InvalidArgumentException'
with message '$pipeline is not a list (unexpected index: "name")'
Any idea? I can't find any good documentation on php.net.
thanks
M
Just take a look at the documentation and you will see that the pipeline must be passed as an array.
The aggregate method accepts two parameters, $pipeline and $options (public function aggregate(array $pipeline, array $options = [])).
Also, as was mentioned before, the $group stage must have an _id element.
Groups documents by some specified expression and outputs to the next
stage a document for each distinct grouping. The output documents
contain an _id field which contains the distinct group by key. The
output documents can also contain computed fields that hold the values
of some accumulator expression grouped by the $group's _id field.
$group does not order its output documents.
https://docs.mongodb.com/manual/reference/operator/aggregation/group/
So your code must look like this:
$results = $collection->aggregate([
[
'$group' => [
'_id' => '$cuisine',
'sum' => ['$sum' => 1],
'names' => ['$push' => '$name']
]
]
]);
This code groups documents by the cuisine element, counts the items, and collects all name values into an array.
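For completeness, aggregate() in the MongoDB PHP library returns a traversable cursor, so reading the grouped results could look like this (a minimal sketch; the field names match the $group stage above):
foreach ($results as $doc) {
    // _id holds the cuisine, sum the count, names the restaurant names
    echo $doc['_id'], ': ', $doc['sum'], PHP_EOL;
}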

Query where timestamp field is older than another timestamp field in MongoDB with PHP

How can I obtain an object from a MongoDB collection where a specific field1 (timestamp or date) is older/newer than another specific field2 (timestamp or date)?
Given the following example object:
// MongoDB 3.2
{
  name: 'test',
  updated_on: Timestamp(1474416000, 0),
  export: {
    active: true,
    last_exported_on: Timestamp(1474329600, 0)
  }
}
This object should match a query like: where export.active is true and updated_on > export.last_exported_on
I've tried it with the aggregation framework, since I've read that $where can be very slow, but without any success.
// PHP 5.4 (and MongoDB PHP lib. http://mongodb.github.io/mongo-php-library)
$collection->aggregate([
['$project' => [
'dst' => ['$cmp' => ['updated_on', 'export.last_exported_on']],
'name' => true
]],
['$match' => ['dst' => ['$gt' => 0], 'export.active' => ['$eq' => true]]],
['$limit' => 1]
]);
I can change the timestamps into dates or anything else, but I don't think the field type is the problem.
Edit: Not all objects have the last_exported_on field, or the export field at all. Besides that, both can be null, empty, or 000000.
That's because after you do the $project you end up only with the dst and _id fields, so you cannot $match on export.active. You need to match on export.active before the projection. After that you need another match on the dst field.
[
  {
    $match: {
      "export.active": true
    }
  },
  {
    $project: {
      dst: {
        $cmp: [
          "$updated_on",
          "$export.last_exported_on"
        ]
      }
    }
  },
  {
    $match: {
      dst: 1
    }
  }
]
Edit
Alternatively, you can preserve export.active in the projection and spare yourself another $match:
[
  {
    $project: {
      "export.active": 1,
      cmp: {
        $cmp: [
          "$updated_on",
          "$export.last_exported_on"
        ]
      }
    }
  },
  {
    $match: {
      cmp: 1,
      "export.active": true
    }
  }
]
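Translated to the MongoDB PHP library syntax used in the question, the second pipeline might look like this (a sketch; the $limit is carried over from the original query):
$collection->aggregate([
    ['$project' => [
        'export.active' => 1,
        'cmp' => ['$cmp' => ['$updated_on', '$export.last_exported_on']]
    ]],
    ['$match' => ['cmp' => 1, 'export.active' => true]],
    ['$limit' => 1]   // as in the original query
]);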

MongoDB aggregation by time interval PHP

I'm using MongoDB to store server statistics that are captured every 15 seconds (so 4 rows get inserted each minute per server), and I am trying to plot this data onto a graph for all data between two timestamps.
For example, the following query can be used:
$tbl->find(
array(
"timestamp" => array('$gte' => '1396310400', '$lte' => '1396915200'),
"service" => 'a715feac3db42f54edbc50ef6fa057b3'
),
array("timestamp" => 1, "system" => 1)
);
Which spits out a bunch of rows that look like this:
Array
(
    [53933ad8532965621d97dd3b] => Array
        (
            [_id] => MongoId Object
                (
                    [$id] => 53933ad8532965621d97dd3b
                )
            [system] => Array
                (
                    [load] => 0.55
                    [uptime] => 1171204.47
                    [processes] => 222
                )
            [timestamp] => 1396310403
        )
)
This works fine for small data ranges, as I can pass this data directly into Flot or HighCharts and let it prettify the time scales itself. However this doesn't work for large data sets (for example querying over a month).
What I'm trying to do is group the data by hour (or 15 minutes) and return the average values (in this example, it's system.load that I'm plotting) for each time period.
I know that the aggregate function is what I need to be using, but despite my best efforts I've been unable to get it working.
Right now I'm letting PHP do all of the work (grouping the results by timestamp and working out the averages) but it's extremely slow and I know MongoDB would handle it better.
Any insight would be greatly appreciated!
Edit:
I've been trying to follow the answer posted here but am still struggling - MongoDB Aggregation PHP, Group by Hours
I'm looking at your initial query at the top of your question and it immediately tells me that your "timestamp" values are actually strings. So no doubt that when you are reading this information and doing your "manual aggregation" you are actually casting these values, and possibly others into types that you can manipulate, sum and average.
So the first part here is to fix your data, which looks like it has come from a logging source whose values you never converted. It's reasonably possible that this applies not just to the timestamp values but also to your metrics under system.
This leaves you with a choice of how to store your timestamp. You can either keep it as a numeric timestamp, as it currently is in string form, or you can opt for converting it to a BSON date type. The first option is a simple integer cast and save back; for the other, you should feed the value to the Date type supported by the driver and again save the data back.
When you have done this, you can happily use the aggregation functions. As an example, if you choose to keep this as a number, you just apply date math in order to get the grouping boundaries:
db.collection.aggregate([
  // Match documents on the range you want
  { "$match": {
    "timestamp": {
      "$gte": 1396310400, "$lte": 1396915200
    },
    "service": "a715feac3db42f54edbc50ef6fa057b3"
  }},
  // Group on the time intervals, 15 minutes here
  { "$group": {
    "_id": {
      "service": "$service",
      "time": {
        "$subtract": [
          "$timestamp",
          { "$mod": [ "$timestamp", 60 * 15 ] }
        ]
      }
    },
    "load": { "$avg": "$system.load" }
  }},
  // Project to the output form you want
  { "$project": {
    "service": "$_id.service",
    "time": "$_id.time",
    "load": 1
  }}
])
Or, to be PHP-specific:
$tbl->aggregate(array(
array(
'$match' => array(
'timestamp' => array(
'$gte' => 1396310400, '$lte' => 1396915200
),
'service' => 'a715feac3db42f54edbc50ef6fa057b3'
)
),
array(
'$group' => array(
'_id' => array(
'service' => '$service',
'time' => array(
'$subtract' => array(
'$timestamp',
array( '$mod' => array('$timestamp', 60 * 15 ) )
)
)
),
'load' => array( '$avg' => '$system.load' )
)
),
array(
'$project' => array(
'service' => '$_id.service',
'time' => '$_id.time',
'load' => 1
)
)
))
Otherwise if you choose to convert to BSON dates then you can use the date aggregation operators instead:
db.collection.aggregate([
  { "$match": {
    "timestamp": {
      "$gte": new Date("2014-04-01"), "$lte": new Date("2014-04-08")
    },
    "service": "a715feac3db42f54edbc50ef6fa057b3"
  }},
  { "$group": {
    "_id": {
      "service": "$service",
      "time": {
        "dayOfYear": { "$dayOfYear": "$timestamp" },
        "hour": { "$hour": "$timestamp" },
        "minute": {
          "$subtract": [
            { "$minute": "$timestamp" },
            { "$mod": [ { "$minute": "$timestamp" }, 15 ] }
          ]
        }
      }
    },
    "load": { "$avg": "$system.load" }
  }},
  { "$project": {
    "service": "$_id.service",
    "time": "$_id.time",
    "load": 1
  }}
])
So there you have the help of the date aggregation operators to break up the parts of the date you have, while still using the same modulo operation to get interval values.
If you still prefer the date math approach, you can still do this with date objects, as the result of subtracting one date object from another is the difference in milliseconds, which gives you the epoch timestamp when you subtract the epoch date itself. So converting a BSON date to an epoch timestamp is just a matter of:
{
  "$subtract": [
    "$dateObjectField",
    new Date("1970-01-01")
  ]
}
So any "date" values you pass in to the pipeline here you can cast using the native type methods of your driver and it will be serialized correctly when the request is sent to MongoDB. The other advantage is the same is true when you read them back, so there is no more need for conversion in client processing.
