I have the following document mapped in ES 5:
{
"appName" : {
"mappings" : {
"market_audit" : {
"properties" : {
"generation_date": {
"type": "date"
},
"customers" : {
"type" : "nested",
"properties" : {
"customer_id" : {
"type" : "integer"
},
[... other properties ...]
}
Several entries in the "customers" node may have the same customer_id, and I am trying to retrieve only the entries having a specific customer_id (e.g. "1") along with the "generation_date" of the top-level document (only the latest document is to be processed).
I was able to come up with the following query:
{
"query": {},
"sort": [
{ "generation_date": "desc" }
],
"size": 1,
"aggregations": {
"nested": {
"nested": {
"path": "customers"
},
"aggregations": {
"filter": {
"filter": {
"match": {
"customers.customer_id": {
"query": "1"
}
}
},
"aggregations": {
"tophits_agg": {
"top_hits": {}
}
}
}
}
}
}
}
This query gets me the data I'm interested in, located in the "aggregations" array (along with the "hits" one that contains the whole document). The issue is that the framework I use (ONGR's ElasticSearch bundle along with the DSL bundle, on Symfony 3) complains that no buckets are available every time I try to access the actual data.
I've read the ES documentation but could not come up with a working query that added buckets. I'm sure I am missing something, so a little help would be more than welcome. If you have an idea of how to modify the query appropriately, I think I can come up with the PHP code to produce it.
EDIT: since this question got some views and no answer (and I'm still stuck), I would settle for any query that allows me to retrieve information about a specific "customer" (using customer_id) from the latest document generated (according to the "generation_date" field). The query I gave is just what I was able to come up with, and I'm pretty sure there's a far better way to do it. Any suggestions?
EDIT 2:
Here's the data sent to ES:
{
"index": {
"_type": "market_data_audit_document"
}
}
{
"customers": [
{
"customer_id": 1,
"colocation_name": "colo1",
"colocation_id": 26,
"device_name": "device 1",
"channels": [
{
"name": "channel1-5",
"multicast":"1.2.1.5",
"sugar_state":4,
"network_state":1
}
]
},
{
"customer_id":2,
"colocation_name":"colo2",
"colocation_id":27,
"device_name":"device 2",
"channels": [
{
"name":"channel2-5",
"multicast":"1.2.2.5",
"sugar_state":4,
"network_state":1
}
]
},
{
"customer_id":3,
"colocation_name":"colo3",
"colocation_id":28,
"device_name":"device 3",
"channels": [
{
"name":"channel3-5",
"multicast":"1.2.3.5",
"sugar_state":4,
"network_state":1
}
]
},
{
"customer_id":4,
"colocation_name":"colo4",
"colocation_id":29,
"device_name":"device 4"
,"channels": [
{
"name":"channel4-5",
"multicast":"1.2.4.5",
"sugar_state":4,
"network_state":1
}
]
},
{
"customer_id":5,
"colocation_name":"colo5",
"colocation_id":30,
"device_name":"device 5",
"channels": [
{
"name":"channel5-5",
"multicast":"1.2.5.5",
"sugar_state":4,
"network_state":1
}
]
}
],
"generation_date":"2017-02-27T10:55:45+0100"
}
Unfortunately, when I tried to send the query listed in this post, I discovered that the aggregation does not do what I expected: it returns "good" data, but from ALL the stored documents! Here's an output example:
{
"timed_out" : false,
"took" : 60,
"hits" : {
"total" : 2,
"hits" : [
{
"_source" : {
"customers" : [
{
"colocation_id" : 26,
"channels" : [
{
"name" : "channel1-5",
"sugar_state" : 4,
"network_state" : 1,
"multicast" : "1.2.1.5"
}
],
"customer_id" : 1,
"colocation_name" : "colo1",
"device_name" : "device 1"
},
{
"colocation_id" : 27,
"channels" : [
{
"multicast" : "1.2.2.5",
"network_state" : 1,
"name" : "channel2-5",
"sugar_state" : 4
}
],
"customer_id" : 2,
"device_name" : "device 2",
"colocation_name" : "colo2"
},
{
"device_name" : "device 3",
"colocation_name" : "colo3",
"customer_id" : 3,
"channels" : [
{
"multicast" : "1.2.3.5",
"network_state" : 1,
"sugar_state" : 4,
"name" : "channel3-5"
}
],
"colocation_id" : 28
},
{
"channels" : [
{
"sugar_state" : 4,
"name" : "channel4-5",
"multicast" : "1.2.4.5",
"network_state" : 1
}
],
"customer_id" : 4,
"colocation_id" : 29,
"colocation_name" : "colo4",
"device_name" : "device 4"
},
{
"device_name" : "device 5",
"colocation_name" : "colo5",
"colocation_id" : 30,
"channels" : [
{
"sugar_state" : 4,
"name" : "channel5-5",
"multicast" : "1.2.5.5",
"network_state" : 1
}
],
"customer_id" : 5
}
],
"generation_date" : "2017-02-27T11:45:37+0100"
},
"_type" : "market_data_audit_document",
"sort" : [
1488192337000
],
"_index" : "mars",
"_score" : null,
"_id" : "AVp_LPeJdrvi0cWb8CrL"
}
],
"max_score" : null
},
"aggregations" : {
"nested" : {
"doc_count" : 10,
"filter" : {
"doc_count" : 2,
"tophits_agg" : {
"hits" : {
"max_score" : 1,
"total" : 2,
"hits" : [
{
"_nested" : {
"offset" : 0,
"field" : "customers"
},
"_score" : 1,
"_source" : {
"channels" : [
{
"name" : "channel1-5",
"sugar_state" : 4,
"multicast" : "1.2.1.5",
"network_state" : 1
}
],
"customer_id" : 1,
"colocation_id" : 26,
"colocation_name" : "colo1",
"device_name" : "device 1"
}
},
{
"_source" : {
"colocation_id" : 26,
"customer_id" : 1,
"channels" : [
{
"multicast" : "1.2.1.5",
"network_state" : 1,
"name" : "channel1-5",
"sugar_state" : 4
}
],
"device_name" : "device 1",
"colocation_name" : "colo1"
},
"_nested" : {
"offset" : 0,
"field" : "customers"
},
"_score" : 1
}
]
}
}
}
}
},
"_shards" : {
"total" : 13,
"successful" : 1,
"failures" : [
{
"reason" : {
"index" : ".kibana",
"index_uuid" : "bTkwoysSQ0y8Tt9yYFRStg",
"type" : "query_shard_exception",
"reason" : "No mapping found for [generation_date] in order to sort on"
},
"shard" : 0,
"node" : "4ZUgOm4VRry6EtUK15UH3Q",
"index" : ".kibana"
},
{
"reason" : {
"index_uuid" : "lN2mVF9bRjuDtiBF2qACfA",
"index" : "archiv1_log",
"type" : "query_shard_exception",
"reason" : "No mapping found for [generation_date] in order to sort on"
},
"shard" : 0,
"node" : "4ZUgOm4VRry6EtUK15UH3Q",
"index" : "archiv1_log"
},
{
"index" : "archiv1_session",
"shard" : 0,
"node" : "4ZUgOm4VRry6EtUK15UH3Q",
"reason" : {
"type" : "query_shard_exception",
"index" : "archiv1_session",
"index_uuid" : "cmMAW04YTtCb0khEqHpNyA",
"reason" : "No mapping found for [generation_date] in order to sort on"
}
},
{
"shard" : 0,
"node" : "4ZUgOm4VRry6EtUK15UH3Q",
"reason" : {
"reason" : "No mapping found for [generation_date] in order to sort on",
"index" : "archiv1_users_dev",
"index_uuid" : "AH48gIf5T0CXSQaE7uvVRg",
"type" : "query_shard_exception"
},
"index" : "archiv1_users_dev"
}
],
"failed" : 12
}
}
Based on your description:
you store documents in Elasticsearch with a bunch of properties
each document contains a list of customers in an array (nested documents)
you want to extract only the nested documents related to a given customer_id
your library does not handle Elasticsearch responses without buckets
you are expecting Elasticsearch to return nested documents
Problem
There are two kinds of aggregations:
buckets
metrics
In your case you have two aggregations under the nested aggregation: a filter and a metric.
Filter:
A filter aggregation defines a single bucket of all the matching documents, but its result does not contain a "buckets" key.
Top hits is a metric aggregation and does not provide buckets either.
Workaround:
I doubt that your PHP library will handle the nested aggregation result correctly, but you could use a filters aggregation instead of a filter aggregation to get a bucket list:
{
"aggregations": {
"nested": {
"nested": {
"path": "customers"
},
"aggregations": {
"filters_customer": {
"filters": {
"filters": [
{
"match": {
"customers.customer_id": "1"
}
}
]
},
"aggregations": {
"top_hits_customer": {
"top_hits": {}
}
}
}
}
}
}
}
This will return something like:
{
"aggregations": {
"nested": {
"doc_count": 15,
"filters_customer": {
"buckets": [
{
"doc_count": 3,
"top_hits_customer": {
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_nested": {
"field": "customers",
"offset": 0
},
"_score": 1,
"_source": {
"customer_id": 1,
"foo": "bar"
}
},
{
"_nested": {
"field": "customers",
"offset": 0
},
"_score": 1,
"_source": {
"customer_id": 1,
"foo": "bar"
}
},
{
"_nested": {
"field": "customers",
"offset": 0
},
"_score": 1,
"_source": {
"customer_id": 1,
"foo": "bar"
}
}
]
}
}
}
]
}
}
}
}
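The difference between the two response shapes shown in this thread (a single filter without "buckets", and filters with "buckets") can also be absorbed in client code. The following is a minimal Python sketch, assuming the response has already been decoded into a dict; the helper name extract_customer_hits and the trimmed sample data are illustrative and are not part of ONGR or Elasticsearch:

```python
def extract_customer_hits(agg):
    """Return the top_hits sources from a `filter` or `filters` aggregation result.

    `agg` is the dict found under the nested aggregation, for example
    response["aggregations"]["nested"]["filter"].
    """
    # A `filters` aggregation wraps its results in "buckets";
    # a single `filter` aggregation is itself the one and only bucket.
    buckets = agg.get("buckets", [agg])
    if isinstance(buckets, dict):  # named filters come back keyed by name
        buckets = list(buckets.values())
    sources = []
    for bucket in buckets:
        for sub in bucket.values():
            # A top_hits sub-aggregation is recognisable by its "hits" object.
            if isinstance(sub, dict) and "hits" in sub:
                sources.extend(hit["_source"] for hit in sub["hits"]["hits"])
    return sources


# Trimmed-down versions of the two response shapes (illustrative data).
filter_style = {
    "doc_count": 2,
    "tophits_agg": {"hits": {"total": 2, "hits": [
        {"_source": {"customer_id": 1, "device_name": "device 1"}},
        {"_source": {"customer_id": 1, "device_name": "device 1"}},
    ]}},
}
filters_style = {"buckets": [filter_style]}

print(extract_customer_hits(filter_style))
print(extract_customer_hits(filters_style))
```

Both calls print the same list, which is the point: the data is identical and only the wrapping differs.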
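Note on your EDIT 2
The sort and size parameters only affect the hits section; aggregations run over every document matched by the query, so Elasticsearch aggregates over all documents, not only the 'TOP 1' document based on your report date. One way to split your results by report is to use a terms aggregation on the report date: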
{
"query": {},
"size": 0,
"aggregations": {
"grp_report": {
"terms": {
"field": "generation_date"
},
"aggregations": {
"nested_customers": {
"nested": {
"path": "customers"
},
"aggregations": {
"filters_customer": {
"filters": {
"filters": [
{
"match": {
"customers.customer_id": "1"
}
}
]
},
"aggregations": {
"top_hits_customer": {
"top_hits": {}
}
}
}
}
}
}
}
}
}
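To make the grouping concrete, here is a plain-Python sketch of what this aggregation computes: one bucket per generation_date, each holding the nested entries for one customer, from which the client picks the latest. It runs on in-memory dicts rather than against Elasticsearch; customers_per_report and latest_bucket are hypothetical helper names, and the date format is assumed from the sample data in the question.

```python
from collections import defaultdict
from datetime import datetime

# Format assumed from the sample data, e.g. 2017-02-27T10:55:45+0100.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def customers_per_report(docs, customer_id):
    """Group the matching nested entries by the report's generation_date."""
    buckets = defaultdict(list)
    for doc in docs:
        for customer in doc["customers"]:
            if customer["customer_id"] == customer_id:
                buckets[doc["generation_date"]].append(customer)
    return dict(buckets)


def latest_bucket(buckets):
    """Return (generation_date, entries) for the most recent report.

    Raises ValueError when `buckets` is empty.
    """
    latest = max(buckets, key=lambda d: datetime.strptime(d, DATE_FORMAT))
    return latest, buckets[latest]


docs = [
    {"generation_date": "2017-02-27T10:55:45+0100",
     "customers": [{"customer_id": 1, "device_name": "device 1"},
                   {"customer_id": 2, "device_name": "device 2"}]},
    {"generation_date": "2017-02-27T11:45:37+0100",
     "customers": [{"customer_id": 1, "device_name": "device 1"},
                   {"customer_id": 3, "device_name": "device 3"}]},
]

buckets = customers_per_report(docs, 1)
print(latest_bucket(buckets))
```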
Advice:
Avoid complex documents; prefer splitting your report into small documents linked by a shared key (a report id, for example). You will then be able to filter and aggregate easily without any nested documents. Copy onto each customer document the information you will filter on across all types (redundancy is not a problem in this case).
Use case examples:
reports listing
show customer information per report
show the history of a customer across multiple reports
Current document example: /indexName/market_audit
{
"generation_date": "...",
"customers": [
{
"id": 1,
"foo": "bar 1"
},
{
"id": 2,
"foo": "bar 2"
},
{
"id": 3,
"foo": "bar 3"
}
]
}
Reformatted documents:
/indexName/market_audit_report
{
"report_id" : "123456",
"generation_date": "...",
"foo":"bar"
}
/indexName/market_audit_customer documents
{
"report_id" : "123456",
"customer_id": 1,
"foo": "bar 1"
}
{
"report_id" : "123456",
"customer_id": 2,
"foo": "bar 2"
}
{
"report_id" : "123456",
"customer_id": 3,
"foo": "bar 3"
}
If you know your report id, you will be able to get all your data in one request:
a filter on report id
a terms aggregation on type
a filter on type report
a top_hits aggregation to get the report
a filter aggregation to keep only type customer and customer id 1
a top_hits aggregation to get the customer 1 info
Or:
a filter on report id
a terms aggregation on type
a filter on type report
a top_hits aggregation to get the report
a terms aggregation on customer id
a top_hits aggregation to retrieve information per customer
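The first variant above can be sketched in plain Python over the flattened documents. This only emulates the filtering logic in memory and is not an Elasticsearch request; the "type" key is a hypothetical stand-in for the document type, and report_with_customer is an illustrative name.

```python
def report_with_customer(docs, report_id, customer_id):
    """Emulate: filter on report id, split by type, keep one customer."""
    # Filter on report id.
    in_report = [d for d in docs if d["report_id"] == report_id]
    # Type "report": there should be a single report document per id.
    report = next(
        (d for d in in_report if d["type"] == "market_audit_report"), None
    )
    # Type "customer", restricted to the requested customer id.
    customers = [
        d for d in in_report
        if d["type"] == "market_audit_customer"
        and d["customer_id"] == customer_id
    ]
    return report, customers


docs = [
    {"type": "market_audit_report", "report_id": "123456", "foo": "bar"},
    {"type": "market_audit_customer", "report_id": "123456",
     "customer_id": 1, "foo": "bar 1"},
    {"type": "market_audit_customer", "report_id": "123456",
     "customer_id": 2, "foo": "bar 2"},
    {"type": "market_audit_customer", "report_id": "999999",
     "customer_id": 1, "foo": "other report"},
]

report, customers = report_with_customer(docs, "123456", 1)
print(report)
print(customers)
```

Because every customer document carries its own report_id, no nested mapping is needed to answer the question.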
Top Hits Aggregation Size
Do not forget to provide a size in your top_hits aggregation, otherwise you will get only the top 3 hits (the default size).
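As a small illustration, a request body with an explicit size can be built as a plain dict before being serialised to JSON. This is a sketch only: top_hits_agg is a hypothetical helper, and the value 10 is an arbitrary example.

```python
import json


def top_hits_agg(size):
    """Build a top_hits aggregation body with an explicit size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return {"top_hits": {"size": size}}


# The aggregation name matches the one used in the queries of this answer.
aggs = {"top_hits_customer": top_hits_agg(10)}
print(json.dumps(aggs, indent=2))
```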
Judging from the first line of the Elasticsearch aggregations documentation, I think you have misunderstood how aggregations work:
The aggregations framework helps provide aggregated data based on a search query
Since your query has no filter at all, getting ALL the stored documents back in hits.hits is the expected result.
You then use a filter aggregation, which does select the desired documents, but they end up in the aggregations property of the response.
If I'm right, I'd recommend keeping it as simple as you can, so here's my guess at a query. Since you are on ES 5, where the filtered query has been removed, the nested condition goes into the filter clause of a bool query:
{
"query": {
"bool": {
"filter": {
"nested": {
"path" : "customers",
"query": {
"bool": {
"must" : [
{ "term": { "customers.customer_id" : 1 } }
]
}
}
}
}
}
},
"aggregations": {
"tophits_agg": {
"top_hits": {}
}
}
}
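A variation that is not part of the answer above, offered only as a sketch: the nested query accepts an inner_hits option, which returns the matching nested entries next to each top-level hit, so sort and size can select the latest document without any aggregation. The Python below only builds the request body; latest_customer_query is a hypothetical helper. Two caveats: it selects the latest document that contains the customer, which may not be the latest document overall, and inner_hits returns 3 entries by default unless a size is given.

```python
import json


def latest_customer_query(customer_id):
    """Build a search body: newest matching document plus its matching entries."""
    return {
        "size": 1,                                  # only the newest document
        "sort": [{"generation_date": "desc"}],
        "_source": ["generation_date"],             # skip the full customers array
        "query": {
            "nested": {
                "path": "customers",
                "query": {"term": {"customers.customer_id": customer_id}},
                "inner_hits": {},                   # return the matching nested entries
            }
        },
    }


print(json.dumps(latest_customer_query(1), indent=2))
```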
I have this problem. I have this document in a MongoDB collection:
{
"_id" : ObjectId("585fe33d3c63b4a81e00002b"),
"class" : [
{
"name" : "class 1",
"people" : [
{
"id" : "58596",
"name" : "mark",
},
{
"id" : "45643",
"name" : "Susan",
},
{
"id" : "85952",
"name" : "Loris",
}
]
},
{
"name" : "class 2",
"people" : [
{
"id" : "58456",
"name" : "Sissi",
},
{
"id" : "45643",
"name" : "Susan",
}
]
}
]
}
I use PHP and I would like to get the names of the classes that contain a person with a specific name, and save them in an array.
For example, if I choose Susan I would like to have the array ["class 1", "class 2"].
I have used findOne so far, but this time I need to use find.
You can make use of the MongoDB aggregation framework.
Here is a query which will give you the data in the desired form. There may well be a better and more efficient way to handle this, but this is what I came up with.
db.collection.aggregate([
{"$unwind":"$class"},
{"$unwind":"$class.people"},
{"$match": {
"class.people.name":"Susan"
}
},
{"$group":{
"_id":"$class.people.name",
"classes":{"$push":"$class.name"}
}
}
])
Result:
{ "_id" : "Susan", "classes" : [ "class 1", "class 2" ] }
Try adding a $match stage before the first $unwind, so that documents which cannot match are dropped from the pipeline before they are unwound.
Refer to Aggregation Pipeline Optimization for improved performance.
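The effect of the early $match can be illustrated with a plain-Python emulation of the pipeline. This is a sketch on in-memory dicts, not a MongoDB call; classes_for_person is an illustrative name, and the sample data is a trimmed copy of the document in the question.

```python
def classes_for_person(docs, person_name):
    """Emulate $match -> $unwind -> $unwind -> $match -> $group in plain Python."""
    # Early $match: drop documents that cannot contribute before unwinding them.
    candidates = [
        doc for doc in docs
        if any(p["name"] == person_name
               for cls in doc["class"] for p in cls["people"])
    ]
    classes = []
    for doc in candidates:
        for cls in doc["class"]:               # $unwind "$class"
            for person in cls["people"]:       # $unwind "$class.people"
                if person["name"] == person_name:   # second $match
                    classes.append(cls["name"])     # $group with $push
    # Unlike the real pipeline, this returns an empty list when nothing matches.
    return {"_id": person_name, "classes": classes}


docs = [{
    "class": [
        {"name": "class 1", "people": [
            {"id": "58596", "name": "mark"},
            {"id": "45643", "name": "Susan"},
            {"id": "85952", "name": "Loris"},
        ]},
        {"name": "class 2", "people": [
            {"id": "58456", "name": "Sissi"},
            {"id": "45643", "name": "Susan"},
        ]},
    ],
}]

print(classes_for_person(docs, "Susan"))
```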
I'm trying to retrieve a list of sub arrays of a document which meets a particular condition.
"_id" : "something",
"players" : [
{
"Name" : "Sunny",
"score": 20
},
{
"Name" : "John",
"score" : 40
},
{
"Name" : "Alice",
"score" : 20
},
etc...
]
I want only the elements with score = 20 in the output, like:
{
"Name" : "Sunny",
"score": 20
},
{
"Name" : "Alice",
"score" : 20
}
I tried querying with:
db.collection.find(
{ "players.score":20, "_id":"something" },
{ "players" :1 }
)
But it gives me all 3 sub-documents, like:
{
"Name" : "Sunny",
"score": 20
},
{
"Name" : "John",
"score" : 40
},
{
"Name" : "Alice",
"score" : 20
}
If I use the positional "$" operator or $elemMatch, like:
db.collection.find({ "players.$.score":20, "_id":"something" })
it gives only the very first matching element, for example:
{
"Name" : "Sunny",
"score": 20
}
Can anyone help me out with the correct query for this?
Thanks in advance :)
Yes, the positional $ operator only matches the first element found in a matching array condition. To filter down to just the elements you want, use aggregate:
db.collection.aggregate([
// Unwind the array (de-normalize)
{ "$unwind": "$players" },
// Match just the elements you want
{ "$match": { "players.score": 20 } },
// Push everything back into an array like it was
// (field names are case-sensitive: the documents use "Name")
{ "$group": {
"_id": "$_id",
"players": { "$push": {
"Name": "$players.Name",
"score": "$players.score"
}}
}}
])
If your document has more detail and you need that back as well, see here.
For the record, the other operator you were trying to use, besides plain dot notation, is called $elemMatch.
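For readers who want to see what the unwind, match and group stages do to a single document, here is a plain-Python emulation. It is a sketch on an in-memory dict, not a MongoDB call, and filter_players is an illustrative name.

```python
def filter_players(doc, score):
    """Emulate $unwind -> $match -> $group for one document."""
    # $unwind produces one record per array element.
    unwound = [{"_id": doc["_id"], "players": p} for p in doc["players"]]
    # $match keeps only the records whose element has the wanted score.
    matched = [r for r in unwound if r["players"]["score"] == score]
    # $group pushes the surviving elements back into one array per _id.
    # Unlike the real pipeline, an empty array is returned when nothing matches.
    return {"_id": doc["_id"], "players": [r["players"] for r in matched]}


doc = {
    "_id": "something",
    "players": [
        {"Name": "Sunny", "score": 20},
        {"Name": "John", "score": 40},
        {"Name": "Alice", "score": 20},
    ],
}

print(filter_players(doc, 20))
```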
Can't seem to find an answer to my doubt, so I decided to post the question and see if someone can help me.
In my application, I have an array of ids which comes from the backend and which is ordered already as I want, for example:
[0] => 23, [1] => 12, [2] => 45, [3] => 21
I then ask Elasticsearch for the information corresponding to each id present in this array, using a terms filter. The problem is that the results don't come back in the order of the ids I sent, so the results get mixed up, like: [0] => 21, [1] => 45, [2] => 23, [3] => 12
Note that I can't reproduce in Elasticsearch the sorting that orders the array in the backend.
I also can't order them in PHP, as I'm retrieving paginated results from Elasticsearch: if each page had 2 results, Elasticsearch could give me the info only for [0] => 21, [1] => 45, so sorting in PHP would not help.
How can I get the results ordered by the input array? Any ideas?
Thanks in advance
Here is one way you can do it, with custom scripted scoring.
First I created some dummy data:
curl -XPUT "http://localhost:9200/test_index"
curl -XPOST "http://localhost:9200/test_index/_bulk" -d'
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 1 } }
{ "name" : "Document 1", "id" : 1 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 2 } }
{ "name" : "Document 2", "id" : 2 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 3 } }
{ "name" : "Document 3", "id" : 3 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 4 } }
{ "name" : "Document 4", "id" : 4 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 5 } }
{ "name" : "Document 5", "id" : 5 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 6 } }
{ "name" : "Document 6", "id" : 6 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 7 } }
{ "name" : "Document 7", "id" : 7 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 8 } }
{ "name" : "Document 8", "id" : 8 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 9 } }
{ "name" : "Document 9", "id" : 9 }
{ "index" : { "_index" : "test_index", "_type" : "docs", "_id" : 10 } }
{ "name" : "Document 10", "id" : 10 }
'
I used an "id" field even though it's redundant, since the "_id" field gets converted to a string, and the scripting is easier with integers.
You can get back a specific set of docs by id with the ids filter:
curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
"filter": {
"ids": {
"type": "docs",
"values": [ 1, 8, 2, 5 ]
}
}
}'
but these will not necessarily be in the order you want them. Using script based scoring, you can define your own ordering based on document ids.
Here I pass in a parameter that is a list of objects that relate ids to score. The scoring script simply loops through them until it finds the current document id and returns the predetermined score for that document (or 0 if it isn't listed).
curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
"filter": {
"ids": {
"type": "docs",
"values": [ 1, 8, 2, 5 ]
}
},
"sort" : {
"_script" : {
"script" : "for(i:scoring) { if(doc[\"id\"].value == i.id) return i.score; } return 0;",
"type" : "number",
"params" : {
"scoring" : [
{ "id": 1, "score": 1 },
{ "id": 8, "score": 2 },
{ "id": 2, "score": 3 },
{ "id": 5, "score": 4 }
]
},
"order" : "asc"
}
}
}'
and the documents are returned in the proper order:
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 4,
"max_score": null,
"hits": [
{
"_index": "test_index",
"_type": "docs",
"_id": "1",
"_score": null,
"_source": {
"name": "Document 1",
"id": 1
},
"sort": [
1
]
},
{
"_index": "test_index",
"_type": "docs",
"_id": "8",
"_score": null,
"_source": {
"name": "Document 8",
"id": 8
},
"sort": [
2
]
},
{
"_index": "test_index",
"_type": "docs",
"_id": "2",
"_score": null,
"_source": {
"name": "Document 2",
"id": 2
},
"sort": [
3
]
},
{
"_index": "test_index",
"_type": "docs",
"_id": "5",
"_score": null,
"_source": {
"name": "Document 5",
"id": 5
},
"sort": [
4
]
}
]
}
}
Here is a runnable example: http://sense.qbox.io/gist/01b28e5c038c785f0844abb7c01a71d69a32a2f4
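The "scoring" parameter does not have to be written by hand: it can be derived from the ordered id array coming from the backend. The Python sketch below builds that list and emulates what the sort script does with it; scoring_params and sort_like_script are illustrative names, and nothing here talks to Elasticsearch.

```python
def scoring_params(ordered_ids):
    """Turn an ordered id list into the `scoring` script parameter."""
    return [
        {"id": doc_id, "score": rank}
        for rank, doc_id in enumerate(ordered_ids, start=1)
    ]


def sort_like_script(hits, scoring):
    """Emulate the sort script: listed ids get their score, all others get 0."""
    scores = {entry["id"]: entry["score"] for entry in scoring}
    return sorted(hits, key=lambda hit: scores.get(hit["id"], 0))


scoring = scoring_params([1, 8, 2, 5])
print(scoring)

# Hits in an arbitrary order, as the ids filter alone might return them.
hits = [{"id": 5}, {"id": 2}, {"id": 8}, {"id": 1}]
print(sort_like_script(hits, scoring))
```

Since the ordering is applied inside Elasticsearch by the script, pagination with from and size then follows the backend order.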