Elasticsearch PHP: aggregations of documents with date interval

I'm trying to build a faceted search using Elasticsearch-php 6.0, but I'm having a hard time figuring out how to use a date range aggregation. Here's what I'm trying to do:
Mapping sample:
"mappings": {
"_doc": {
"properties": {
...
"timeframe": {
"properties": {
"gte": {
"type": "date",
"format": "yyyy"
},
"lte": {
"type": "date",
"format": "yyyy"
}
}
}
...
In my document, I have this property:
"timeframe":[{"gte":"1701","lte":"1800"}]
I want to be able to display a facet with a date range slider, where the user can input a range (min value - max value). Ideally, those min and max values should be returned automatically by the Elasticsearch aggregation for the current query.
Here's the aggregation I'm trying to write in "pseudo code", to give you an idea:
"aggs": {
"date_range": {
"field": "timeframe",
"format": "yyyy",
"ranges": [{
"from": min(timeframe.gte),
"to": max(timeframe.lte)
}]
}
}
I think I need to use a Date Range aggregation, min/max aggregations, and pipeline aggregations, but the more I read about them, the more confused I get. I can't figure out how to glue them all together.
Keep in mind I can change the mapping and/or the document structure if this is not the correct way to achieve this.
Thanks!

Using the official "elasticsearch/elasticsearch" PHP package, I was able to fetch the documents within a date range with the query below.
Check the documentation for the date format syntax, because the format has to match the dates you pass in.
$from_date = '2018-03-08T17:58:03Z';
$to_date = '2018-04-08T17:58:03Z';
$params = [
'index' => 'your_index',
'type' => 'your_type',
'body' => [
'query' => [
'range' => [
'my_date_field' => [
// gte = greater than or equal, lte = less than or equal
'gte' => $from_date,
// 'lte' => $to_date,
'format' => "yyyy-MM-dd||yyyy-MM-dd'T'HH:mm:ss'Z'",
'boost' => 2.0
]
]
],
]
];
$search = $client->search($params);
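For the slider bounds the question asks about, one possible approach is a pair of min and max aggregations over the two date sub-fields. The sketch below only builds the request body as a plain JavaScript object; the aggregation names min_year and max_year and the helper buildBoundsBody are hypothetical, and the field names are taken from the mapping in the question. The same structure could be passed as the 'body' of the PHP search call above.

```javascript
// Build a search body that returns no hits (size 0), only the two bounds.
// "min_year" / "max_year" are hypothetical aggregation names.
function buildBoundsBody(query) {
  return {
    size: 0,
    query: query,
    aggs: {
      min_year: { min: { field: "timeframe.gte" } },
      max_year: { max: { field: "timeframe.lte" } }
    }
  };
}

// Assumption: the current user query; match_all is used as a stand-in.
const boundsBody = buildBoundsBody({ match_all: {} });
console.log(JSON.stringify(boundsBody, null, 2));
```

Because the aggregations run in the context of the query, the bounds should follow whatever filters are currently applied.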

Related

MongoDB Aggregation "group" with "max" field within a sub-array

Using Compass initially, I then need to convert it into the PHP library.
So far, I have a 1st stage that filters the documents on 2 fields using $match:
comp.id (sub-document / array)
playerId
Code is:
$match (from drop-down)
{
"comp.id" : ObjectId('607019361c071256e4f0d0d5'),
"playerId" : "609d0993906429612483cea0"
}
This returns 2 documents.
The document has a sub-array holes, for the holes played in a round of golf. This sub-array has fields (among others):
holes.no
holes.par
holes.grossScore
holes.nettPoints
So each round has 1 document, with a holes sub-array of (typically) 18 array elements (holes), or 9 for half-round. A player will play each round multiple times - hence multiple documents.
I would like to find the highest holes.nettPoints across the documents. I think I need to $group with $max on the holes.nettPoints field, so I would find the highest score for each hole across all rounds.
I have tried this, but Compass says it's not properly formatted:
$group drop-down
{
_id: holes.no,
"highest":
{ $max: "$holes.nettPoints" }
}
"highest" can be any name I want?
EDIT FOLLOWING PROVIDED ANSWER
The answer marked as the solution was enough of a pointer to how the Aggregation Framework operates (multi-stage pipelines, i.e. the documents that go into one stage come out as new documents that feed the next stage, and so on).
For the purposes of posterity, I ended up using the following aggregation:
[{$match: {
"comp.id" : ObjectId('607019361c071256e4f0d0d5'),
"playerId" : "609d0993906429612483cea0",
"comp.courseId" : "608955aaabebbd503ba6e116"
}
}, {$unwind: {
path : "$holes"
}}, {$group: {
_id: "$holes.no",
hole: {
$max: "$holes"
}
}}, {$sort: {
"hole.no": 1
}}]
In PHP speak, it looks like:
$match = [
'$match' => [
'comp.id' => new MongoDB\BSON\ObjectID( $compId ),
'playerId' => $playerId,
'comp.courseId' => $courseId
]
];
$unwind = [
'$unwind' => [
'path' => '$holes'
]
];
$group = [
'$group' => [
'_id' => '$holes.no',
'hole' => [
'$max' => '$holes'
]
]
];
$sort = [
'$sort' => [
'hole.no' => 1
]
];
$cursor = $collection->aggregate([$match, $unwind, $group, $sort]);
It is not complete (looking at adding a $sum accumulator across the courseId, not individual documents), but answers the question posted.
$match your conditions
$unwind deconstruct holes array
$sort by nettPoints in descending order
$group by no and select first holes object
[
{
$match: {
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"playerId": "609d0993906429612483cea0"
}
},
{ $unwind: "$holes" },
{ $sort: { "holes.nettPoints": -1 } },
{
$group: {
_id: "$holes.no",
highest: { $first: "$holes" }
}
}
]
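To see why sorting before grouping with $first yields the best score per hole, here is a minimal plain-JavaScript model of the $unwind, $sort and $group stages. It does not talk to MongoDB; the sample rounds and the helper name highestPerHole are made up for illustration.

```javascript
// Hypothetical sample rounds; only the fields used by the pipeline are kept.
const rounds = [
  { holes: [{ no: 1, nettPoints: 2 }, { no: 2, nettPoints: 3 }] },
  { holes: [{ no: 1, nettPoints: 4 }, { no: 2, nettPoints: 1 }] }
];

function highestPerHole(docs) {
  // $unwind: one record per element of the holes array
  const unwound = docs.flatMap(doc => doc.holes);
  // $sort: nettPoints descending
  unwound.sort((a, b) => b.nettPoints - a.nettPoints);
  // $group with $first: keep the first record seen for each hole number
  const byHole = new Map();
  for (const hole of unwound) {
    if (!byHole.has(hole.no)) byHole.set(hole.no, hole);
  }
  return byHole;
}

const best = highestPerHole(rounds);
console.log(best.get(1).nettPoints, best.get(2).nettPoints); // 4 3
```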

Can i add tags to a "deeper" key in an Elastic Search document?

I have products with tags, and the tags are grouped into tag types.
This is a sample document that I added to the index:
{
"_index" : "products",
"_type" : "_doc",
"_id" : "1219",
"_score" : 1.0,
"_source" : {
"id" : "1219",
"product_no" : "26426492261",
"merchant_id" : 11,
"name" : "Apple »Magic Keyboard für das 12,9\" iPad Pro (4. Generation)« iPad-Tastatur",
"category" : "Technik>Multimedia>Zubehör>Tastatur>iPad Tastatur",
"deep_link" : "https://foo",
"short_description" : null,
"long_description" : "Apple:",
"brand" : "Apple",
"merchant_image_url" : "http://something",
"tagtypes" : [
[
{
"Memory" : [ ]
}
]
]
}
},
That tagtype "Memory" is dynamically created while indexing the products.
I tried to add tags to that key
//attach tags also to ES
$params = [
'index' => 'products',
'id' => $product['_id'],
'body' => [
'script' => [
'source' => 'if (!ctx._source.tagtypes.'.$tagType->name.'.contains(params.tag)) { ctx._source.tagtypes.'.$tagType->name.'.add(params.tag) }',
'lang' => 'painless',
'params' => [
'tag' => $tag->value
]
]
]
];
But I receive an error like:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to execute script"}],"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["if (!ctx._source.tagtypes[\"Memory\"].contains(params.tag)) { "," ^---- HERE"],"script":"if (!ctx._source.tagtypes[\"Memory\"].contains(params.tag)) { ctx._source.tagtypes[\"Memory\"].add(params.tag) }","lang":"painless","position":{"offset":16,"start":0,"end":60},"caused_by":{"type":"wrong_method_type_exception","reason":"cannot convert MethodHandle(List,int)int to (Object,String)String"}}},"status":400}
Could anyone help me with that? I couldn't find any documentation about it, as the examples are often too basic.
Is it generally possible to save to "deeper keys" like this?
Or should I just create "tags" as a simple list (without any depth)?
Thanks in advance
Adrian!
Your field tagtypes is an array of arrays of objects, each of which holds a single key that maps to an array. That is why ctx._source.tagtypes["Memory"] fails: tagtypes is a list, so it cannot be indexed with a string key.
When you're dealing with such "deep" structures, you'll need some form of iteration to update them.
For loops are a good place to start, but they often lead to a java.util.ConcurrentModificationException when a collection is modified while it is being iterated. So it's easier to work with temporary copies of the data and then replace the corresponding _source attribute when done with the iterations:
{
"query": {
"match_all": {}
},
"script": {
"source": """
if (ctx._source.tagtypes == null) { return; }
def originalTagtypes = ctx._source.tagtypes;
def newTagtypes = [];
for (outerGroup in originalTagtypes) {
// keep what we've got
newTagtypes.addAll(outerGroup);
// group already present?
def atLeastOneGroupContainsTag = outerGroup.stream().anyMatch(tagGroup -> tagGroup.containsKey(params.tag));
// if not, add it as a hashmap of one single empty list
if (!atLeastOneGroupContainsTag) {
Map m = new HashMap();
m.put(params.tag, []);
newTagtypes.add(m);
}
}
ctx._source.tagtypes = [newTagtypes];
""",
"lang": "painless",
"params": {
"tag": "CPU"
}
}
}
which'll end up updating the tagtypes like so:
{
...
"tagtypes" : [
[
{
"Memory" : [ ]
},
{
"CPU" : [ ] <---
}
]
],
...
}
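The copy-then-replace idea can be tried outside Elasticsearch. The following is a rough JavaScript model of what the Painless script does to _source, not Painless itself; the helper name addTagType and the sample document are assumptions for illustration.

```javascript
// Returns a copy of the source with an empty group added for tagType
// when no existing group already has that key.
function addTagType(source, tagType) {
  if (source.tagtypes == null) return source;
  const newTagtypes = [];
  for (const outerGroup of source.tagtypes) {
    // keep what we've got
    newTagtypes.push(...outerGroup);
    // group already present?
    const present = outerGroup.some(group =>
      Object.prototype.hasOwnProperty.call(group, tagType));
    // if not, add it as an object with one empty list
    if (!present) newTagtypes.push({ [tagType]: [] });
  }
  // replace the whole attribute instead of mutating it during iteration
  return { ...source, tagtypes: [newTagtypes] };
}

const productSource = { id: "1219", tagtypes: [[{ Memory: [] }]] };
const updatedSource = addTagType(productSource, "CPU");
console.log(JSON.stringify(updatedSource.tagtypes)); // [[{"Memory":[]},{"CPU":[]}]]
```

Running the function a second time with the same tag type leaves the structure unchanged, which matches the "group already present?" check in the script.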
You're right when you say that the documentation examples are too basic. Shameless plug: I recently published a handbook that aims to address exactly that. You'll find lots of non-trivial scripting examples to gain a better understanding of the Painless scripting language.

Query where timestamp field is older than another timestamp field in MongoDB with PHP

How can I obtain an object from a MongoDB collection where a specific field1 (timestamp or date) is older/newer than another specific field2 (timestamp or date)?
Given the following example object:
// MongoDB 3.2
{
name: 'test',
updated_on: Timestamp(1474416000, 0),
export: {
active: true,
last_exported_on: Timestamp(1474329600, 0)
}
}
This object should match a query like: where export.active is true and updated_on > export.last_exported_on
I've tried it with the aggregation framework, since I've read that $where can be very slow, but without any success.
// PHP 5.4 (and MongoDB PHP lib. http://mongodb.github.io/mongo-php-library)
$collection->aggregate([
['$project' => [
'dst' => ['$cmp' => ['updated_on', 'export.last_exported_on']],
'name' => true
]],
['$match' => ['dst' => ['$gt' => 0], 'export.active' => ['$eq' => true]]],
['$limit' => 1]
]);
I can change timestamps into date or anything else, but I don't see the problem in the type.
Edit: Not all objects have the last_exported_on or the export fields at all. Besides that both can be null or empty or 000000.
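That's because after the $project stage only the fields you projected (_id, dst and name) remain, so you cannot $match on export.active afterwards. You need to match on export.active before the projection, and then add another match on the dst field. Also note that field paths inside $cmp need the $ prefix ("$updated_on"); without it the two arguments are compared as literal strings.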
[
{
$match: {
"export.active": true
}
},
{
$project: {
dst: {
$cmp: [
"$updated_on",
"$export.last_exported_on"
]
}
}
},
{
$match: {
dst: 1
}
}
]
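The order of the stages can be modelled in plain JavaScript. This is only a sketch: timestamps are reduced to plain seconds, the sample objects are invented, name is kept in the projection purely for display, and MongoDB's handling of missing or null fields is not modelled.

```javascript
// Hypothetical sample data; timestamps modelled as seconds since the epoch.
const objects = [
  { name: "test", updated_on: 1474416000,
    export: { active: true, last_exported_on: 1474329600 } },
  { name: "fresh", updated_on: 1474329600,
    export: { active: true, last_exported_on: 1474416000 } },
  { name: "inactive", updated_on: 1474416000,
    export: { active: false, last_exported_on: 1474329600 } }
];

// Same contract as $cmp: 1, 0 or -1
const cmp = (a, b) => (a > b ? 1 : a < b ? -1 : 0);

const needsExport = objects
  // first $match: filter while export.active is still available
  .filter(o => o.export && o.export.active === true)
  // $project: compute the comparison result
  .map(o => ({ name: o.name, dst: cmp(o.updated_on, o.export.last_exported_on) }))
  // second $match: keep documents where updated_on is newer
  .filter(o => o.dst === 1);

console.log(needsExport.map(o => o.name)); // [ 'test' ]
```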
Edit
Alternatively, you can make sure to preserve export.active and to spare another $match:
[
{
$project: {
"export.active": 1,
cmp: {
$cmp: [
"$updated_on",
"$export.last_exported_on"
]
}
}
},
{
$match: {
cmp: 1,
"export.active": true
}
}
]

Return a subset of array where a given field is present

I would like to filter the Categories embedded Array to get only those which have a parent key.
{
"_id": ObjectId("5737283639533c000978ae71"),
"name": "Swiss",
"Categories": [
{
"name": "Management",
"_id": ObjectId("5738982e39533c00070f6a53")
},
{
"name": "Relations",
"_id": ObjectId("5738984a39533c000978ae72"),
"parent": ObjectId("5738982e39533c00070f6a53")
},
{
"name": "Ambiance",
"_id": ObjectId("57389bed39533c000b148164")
}
]
}
I've tried with find, but without success.
After some research it seems that it can be done with the aggregation command, but I don't like the way it works and would prefer to use only the find command.
Also, I'm wondering whether, in terms of performance, it would be better to store the Categories in a separate collection.
Edit: I would like to get something like this as the find output:
[
{
"name": "Relations",
"_id": ObjectId("5738984a39533c000978ae72"),
"parent": ObjectId("5738982e39533c00070f6a53")
}
]
The optimal way to do this is with the aggregation framework in MongoDB 3.2. All you need is to $project your documents and use the $filter operator to return the subset of the "Categories" array that matches your criteria. To do this, use the $ifNull operator to give a "default" value to the "parent" field in the sub-documents where that field is missing, then use $ne in your cond expression, which determines whether a given element should be included in the subset.
db.collection.aggregate([
{ "$project" : {
"_id": 0,
"Categories": {
"$filter": {
"input": "$Categories",
"as": "catg",
"cond": {
"$ne": [
{ "$ifNull": [ "$$catg.parent", false ] },
false
]
}
}
}
}}
])
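The $ifNull plus $ne condition amounts to "keep the element if parent is present". A small plain-JavaScript equivalent, run on the sample document from the question with the ObjectIds shortened to strings for illustration, looks like this:

```javascript
// Sample document from the question; ObjectIds are represented as strings here.
const shop = {
  name: "Swiss",
  Categories: [
    { name: "Management", _id: "5738982e39533c00070f6a53" },
    { name: "Relations", _id: "5738984a39533c000978ae72",
      parent: "5738982e39533c00070f6a53" },
    { name: "Ambiance", _id: "57389bed39533c000b148164" }
  ]
};

// (catg.parent ?? false) plays the role of $ifNull: missing or null becomes false;
// the !== false comparison plays the role of $ne.
const withParent = shop.Categories.filter(catg => (catg.parent ?? false) !== false);

console.log(withParent.map(c => c.name)); // [ 'Relations' ]
```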
In version 3.0 and earlier you need a different approach: use the $map operator to return a given element if it matches your criteria, or false otherwise, then use the $setDifference operator to filter out the elements of the returned array that are equal to false. Of course $setDifference is fine only as long as the data being filtered is "unique", since it treats the array as a set.
db.collection.aggregate([
{ "$project" : {
"_id": 0,
"Categories": {
"$setDifference": [
{ "$map": {
"input": "$Categories",
"as": "catg",
"in": {
"$cond": [
{ "$ne": [
{ "$ifNull": [ "$$catg.parent", false ] },
false
]},
"$$catg",
false
]}
}
},
[ false ]
]
}
}}
])
Translation in PHP gives:
// Single quotes are required: in double quotes PHP would try to
// interpolate "$project", "$filter", etc. as variables.
$collection->aggregate(
    array(
        array('$project' => array(
            '_id' => 0,
            'Categories' => array(
                '$filter' => array(
                    'input' => '$Categories',
                    'as' => 'catg',
                    'cond' => array(
                        '$ne' => array(
                            array('$ifNull' => array('$$catg.parent', false)),
                            false
                        )
                    )
                )
            )
        ))
    )
);
And something like this:
// $setDifference takes two arrays; $cond takes [if, then, else].
$collection->aggregate(
    array(
        array('$project' => array(
            '_id' => 0,
            'Categories' => array(
                '$setDifference' => array(
                    array('$map' => array(
                        'input' => '$Categories',
                        'as' => 'catg',
                        'in' => array(
                            '$cond' => array(
                                array('$ne' => array(
                                    array('$ifNull' => array('$$catg.parent', false)),
                                    false
                                )),
                                '$$catg',
                                false
                            )
                        )
                    )),
                    array(false)
                )
            )
        ))
    )
);
Alternatively, if you want to stay with find, try the following query:
db.mycoll.find({Categories:{$elemMatch:{parent:{$exists:true}}}},
{Categories:{$elemMatch:{parent:{$exists:true}}}})
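The above example uses the $elemMatch operator twice: in the query to select documents that contain a category with a parent, and in the projection to return the matching element. Note that the $elemMatch projection returns only the first matching element of the array.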

MongoDB aggregation by time interval PHP

I'm using MongoDB to store server statistics that are captured every 15 seconds (so 4 rows get inserted each minute per server) and am trying to get this data plotted onto a graph for all data between a certain timestamp.
For example, the following query can be used:
$tbl->find(
array(
"timestamp" => array('$gte' => '1396310400', '$lte' => '1396915200'),
"service" => 'a715feac3db42f54edbc50ef6fa057b3'
),
array("timestamp" => 1, "system" => 1)
);
Which spits out a bunch of rows that look like this:
Array
(
[53933ad8532965621d97dd3b] => Array
(
[_id] => MongoId Object
(
[$id] => 53933ad8532965621d97dd3b
)
[system] => Array
(
[load] => 0.55
[uptime] => 1171204.47
[processes] => 222
)
[timestamp] => 1396310403
)
)
This works fine for small data ranges, as I can pass this data directly into Flot or HighCharts and let it prettify the time scales itself. However this doesn't work for large data sets (for example querying over a month).
What I'm trying to do is group the data by hour (or 15 minutes), and return the average values (in this example, it's system.load that I'm plotting) for that given time period.
I know that the aggregate function is what I need to be using, but despite my best efforts I've been unable to get this working.
Right now I'm letting PHP do all of the work (grouping the results by timestamp and working out the averages) but it's extremely slow and I know MongoDB would handle it better.
Any insight would be greatly appreciated!
Edit:
I've been trying to follow the answer posted here but am still struggling - MongoDB Aggregation PHP, Group by Hours
I'm looking at your initial query at the top of your question and it immediately tells me that your "timestamp" values are actually strings. So no doubt that when you are reading this information and doing your "manual aggregation" you are actually casting these values, and possibly others into types that you can manipulate, sum and average.
So the first part here is to fix your data, that looks like it has come from a logging source but you have never converted the values. I'm considering it reasonably possible that this is not just the timestamp values but probably also your metrics under system.
This leaves you with a choice of how to store your timestamp. You can either just keep that as a timestamp number as it currently is in string form, or you can opt for converting to a BSON date type. The first one will be a simple integer cast and save back, the other you should be able to feed to the Date type that is supported by the driver and again save back the data.
When you have done this, then you can happily use the aggregation functions. So as an example for if you choose to keep this as a number, then you just apply date math in order to get the grouping boundaries:
db.collection.aggregate([
// Match documents on the range you want
{ "$match": {
"timestamp": {
"$gte": 1396310400, "$lte": 1396915200
},
"service": "a715feac3db42f54edbc50ef6fa057b3"
}},
// Group on the time intervals, 15 minutes here
{ "$group": {
"_id": {
"service": "$service",
"time": {
"$subtract": [
"$timestamp",
{ "$mod": [ "$timestamp", 60 * 15 ] }
]
}
},
"load": { "$avg": "$system.load" }
}},
// Project to the output form you want
{ "$project": {
"service": "$_id.service",
"time" : "$_id.time",
"load": 1
}}
])
Or, to be PHP-specific:
$tbl->aggregate(array(
array(
'$match' => array(
'timestamp' => array(
'$gte' => 1396310400, '$lte' => 1396915200
),
'service' => 'a715feac3db42f54edbc50ef6fa057b3'
)
),
array(
'$group' => array(
'_id' => array(
'service' => '$service',
'time' => array(
'$subtract' => array(
'$timestamp',
array( '$mod' => array('$timestamp', 60 * 15 ) )
)
)
),
'load' => array( '$avg' => '$system.load' )
)
),
array(
'$project' => array(
'service' => '$_id.service',
'time' => '$_id.time',
'load' => 1
)
)
));
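The $subtract / $mod pair simply rounds each timestamp down to the start of its 15-minute interval. A minimal plain-JavaScript model of the $group stage, using made-up sample rows and a hypothetical helper name averageLoadByBucket, shows the arithmetic:

```javascript
const BUCKET_SECONDS = 60 * 15;

// Groups rows by the start of their 15-minute interval and averages system.load.
function averageLoadByBucket(rows) {
  const sums = new Map();
  for (const row of rows) {
    // same as { $subtract: [ "$timestamp", { $mod: [ "$timestamp", 900 ] } ] }
    const bucket = row.timestamp - (row.timestamp % BUCKET_SECONDS);
    const entry = sums.get(bucket) || { total: 0, count: 0 };
    entry.total += row.system.load;
    entry.count += 1;
    sums.set(bucket, entry);
  }
  return [...sums].map(([time, { total, count }]) => ({ time, load: total / count }));
}

// Hypothetical samples: two in the first interval, one in the next.
const samples = [
  { timestamp: 1396310403, system: { load: 0.5 } },
  { timestamp: 1396310418, system: { load: 1.5 } },
  { timestamp: 1396311300, system: { load: 2.0 } }
];

const buckets = averageLoadByBucket(samples);
console.log(buckets);
// [ { time: 1396310400, load: 1 }, { time: 1396311300, load: 2 } ]
```

Note that this only works once the timestamps are stored as numbers; with string values the modulo has nothing to operate on.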
Otherwise if you choose to convert to BSON dates then you can use the date aggregation operators instead:
db.collection.aggregate([
{ "$match": {
"timestamp": {
"$gte": new Date("2014-04-01"), "$lte": new Date("2014-04-08")
},
"service": "a715feac3db42f54edbc50ef6fa057b3"
}},
{ "$group": {
"_id": {
"service": "$service",
"time": {
"dayOfYear": { "$dayOfYear": "$timestamp" },
"hour": { "$hour": "$timestamp" },
"minute": {
"$subtract": [
{ "$minute": "$timestamp" },
{
"$mod": [
{ "$minute": "$timestamp" },
15
]
}
]
}
}
},
"load": { "$avg": "$system.load" }
}},
{ "$project": {
"service": "$_id.service",
"time": "$_id.time",
"load": 1
}}
])
So there you have the help of the date aggregation operators to break up the parts of the date you have, while still using the same modulo operation to get interval values.
If you still prefer the date math approach, you can do this with date objects as well, because subtracting one date object from another gives the difference in milliseconds. So moving a BSON date to an epoch timestamp (in milliseconds) is just a matter of:
{
"$subtract": [
"$dateObjectField",
new Date("1970-01-01")
]
}
So any "date" values you pass in to the pipeline here you can cast using the native type methods of your driver and it will be serialized correctly when the request is sent to MongoDB. The other advantage is the same is true when you read them back, so there is no more need for conversion in client processing.
