MongoDB Ordering by average combined numbers or nested sub arrays - php

Having some issues working out the best way to do this in MongoDB. Arguably it's a relational data set, so I will probably be slated, but it's still a challenge to see if it's possible.
I currently need to order Logistics Managers by the daily average miles across the vans in their department and also, in a separate list, by a combined weekly average.
My first setup in the database was as follows:
{
"_id" : ObjectId("555cf04fa3ed8cc2347b23d7"),
"name" : "My Manager 1",
"vans" : [
{
"name" : "van1",
"miles" : NumberLong(56)
},
{
"name" : "van2",
"miles" : NumberLong(34)
}
]
}
But I can't see how to order by a nested array value without knowing the parent array keys (these will be standard 0-x).
So my next choice was to scrap that idea and just have the name in the first collection and the vans in a second collection, each with the ID of the manager.
So, removing vans from the above example and adding this collection (vans):
{
"_id" : ObjectId("555cf04fa3ed8cc2347b23d9"),
"name" : "van1",
"miles" : NumberLong(56),
"manager_id" : "555cf04fa3ed8cc2347b23d7"
}
But because I need to show the results by manager, how do I order in a query (if possible) by the average miles in this collection where id=x and then display the manager by his ID?
Thanks for your help

If each Manager is going to have a limited number of vans, then your first approach is better, as you do not have to make two separate queries to the database to collect your information.
Then comes the question of how to calculate the average mileage per Manager, where the Aggregation Framework will help you a lot. Here is a query that will get you the desired data:
db.manager.aggregate([
{$unwind: "$vans"},
{$group:
{_id:
{
_id: "$_id",
name: "$name"
},
avg_milage: {$avg: "$vans.miles"}
}
},
{$sort: {"avg_milage": -1}},
{$project:
{_id: "$_id._id",
name: "$_id.name",
avg_milage: "$avg_milage"
}
}
])
The first $unwind stage unwraps the vans array and creates a separate document for each element of the array.
Then the $group stage collects all documents with the same (_id, name) pair and, in the avg_milage field, computes the average of the miles field across those documents.
The $sort stage sorts the documents in descending order, using the new avg_milage field as the sort key.
And finally, the last $project stage just cleans up the documents by flattening the grouped _id back into top-level _id and name fields, just for beauty :)
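To make the four stages concrete, here is a plain JavaScript sketch that reproduces the same computation on an in-memory array, without touching MongoDB. The function name averageMilesPerManager, the shortened _id values and the second manager are made up for illustration; only the first manager's vans come from the question.

```javascript
// In-memory stand-in for the manager collection (the second manager is hypothetical).
const managers = [
  { _id: "m1", name: "My Manager 1", vans: [{ name: "van1", miles: 56 }, { name: "van2", miles: 34 }] },
  { _id: "m2", name: "My Manager 2", vans: [{ name: "van3", miles: 80 }, { name: "van4", miles: 70 }] },
];

function averageMilesPerManager(docs) {
  // $unwind: one row per (manager, van) pair
  const rows = docs.flatMap(d => d.vans.map(v => ({ _id: d._id, name: d.name, miles: v.miles })));

  // $group: accumulate a running sum and count per manager
  const groups = new Map();
  for (const r of rows) {
    const g = groups.get(r._id) || { _id: r._id, name: r.name, sum: 0, count: 0 };
    g.sum += r.miles;
    g.count += 1;
    groups.set(r._id, g);
  }

  // $project + $sort: flatten to the final shape, highest average first
  return [...groups.values()]
    .map(g => ({ _id: g._id, name: g.name, avg_milage: g.sum / g.count }))
    .sort((a, b) => b.avg_milage - a.avg_milage);
}

const averages = averageMilesPerManager(managers);
```

On a real collection the aggregation pipeline does this work on the server, which is the whole point; the sketch is only meant to show what each stage contributes.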
A similar thing is needed for your second desired result:
db.manager.aggregate([
{$unwind: "$vans"},
{$group:
{_id:
{
_id: "$_id",
name: "$name"
},
total_milage: {$sum: "$vans.miles"}
}
},
{$sort: {"total_milage": -1}},
{$project:
{_id: "$_id._id",
name: "$_id.name",
weekly_milage: {
$multiply: [
"$total_milage",
7
]
}
}
}
])
This will produce the list of Managers with their weekly mileage, sorted in descending order. So you can $limit the result and get, for instance, the Manager with the highest mileage.
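As a rough sketch of the $limit idea, the pipeline can be produced by a small function that takes the number of managers to keep. The helper name topManagersPipeline is hypothetical; the stages themselves are the ones from the query above, with $limit placed right after $sort.

```javascript
// Hypothetical helper: the weekly-mileage pipeline, truncated to the top `n` managers.
function topManagersPipeline(n) {
  return [
    { $unwind: "$vans" },
    { $group: { _id: { _id: "$_id", name: "$name" }, total_milage: { $sum: "$vans.miles" } } },
    { $sort: { total_milage: -1 } },
    // Keep only the first n documents after sorting
    { $limit: n },
    { $project: { _id: "$_id._id", name: "$_id.name", weekly_milage: { $multiply: ["$total_milage", 7] } } },
  ];
}

const topOne = topManagersPipeline(1);
// In the mongo shell this would be run as: db.manager.aggregate(topManagersPipeline(1))
```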
And in much the same way, you can grab the same info for your vans:
db.manager.aggregate([
{$unwind: "$vans"},
{$group:
{_id: "$vans.name",
total_milage: {$sum: "$vans.miles"}
}
},
{$sort: {"total_milage": -1}},
{$project:
{van_name: "$_id",
weekly_milage: {
$multiply: [
"$total_milage",
7
]
}
}
}
])

First, do you require average miles for a single day, average miles over a given time period, or average miles over the life of the manager? I would consider adding a timestamp field. Yes, _id has a timestamp, but this only reflects the time the document was created, not necessarily the time of the initial day's log.
Considerations for the first data model:
Does each document represent one day, or one manager?
How many "vans" do you expect to have in the array? Does this list grow over time? Do you need to consider the 16MB max doc size in a year or two from now?
Considerations for the second data model:
Can you store the manager's name as the "manager_id" field? Can this be used as a possible unique ID for a secondary meta lookup? Doing so would limit the necessity of a secondary manager meta-data lookup just to get their name.
As @n9code has pointed out, the aggregation framework is the answer in both cases.
For the first data model, assuming each document represents one day and you want to retrieve an average for a given day or a range of days:
db.collection.aggregate([
{ $match: {
name: 'My Manager 1',
timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
} },
{ $unwind: '$vans' },
{ $group: {
_id: {
_id: '$_id',
name: '$name',
timestamp: '$timestamp'
},
avg_mileage: {
$avg: '$vans.miles'
}
} },
{ $sort: {
avg_mileage: -1
} },
{ $project: {
_id: '$_id._id',
name: '$_id.name',
timestamp: '$_id.timestamp',
avg_mileage: 1
} }
]);
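The date range in the $match stage is left open above. One possible way to build it for a single day is sketched below; the helper name utcDayRange is hypothetical, and it assumes the timestamp field (itself an assumption of this answer) is stored as a date in UTC.

```javascript
// Hypothetical helper: half-open [start, end) range covering the UTC day that contains `date`.
function utcDayRange(date) {
  const start = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
  // Same shape as the `timestamp` condition in the $match stage
  return { $gte: start, $lt: end };
}

const range = utcDayRange(new Date("2015-05-20T13:45:00Z"));
```

Using $gte for the start and $lt for the end keeps consecutive days from overlapping at midnight.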
If, for the first data model, each document represents a manager and the "vans" array grows daily, this particular data model is not ideal for two reasons:
"vans" array may grow beyond max document size... eventually, although that would be a lot of data
It is more difficult and memory intensive to limit a certain date range since timestamp at this point would be nested within an item of "vans" and not in the root of the document
For the sake of completeness, here is the query:
/*
Assuming data model is:
{
_id: ...,
name: ...,
vans: [
{ name: ..., miles: ..., timestamp: ... }
]
}
*/
db.collection.aggregate([
{ $match: {
name: 'My Manager 1'
} },
{ $unwind: '$vans' },
{ $match: {
'vans.timestamp': { $gte: ISODate(...), $lt: ISODate(...) }
} },
{ $group: {
_id: {
_id: '$_id',
name: '$name'
},
avg_mileage: {
$avg: '$vans.miles'
}
} },
{ $sort: {
avg_mileage: -1
} },
{ $project: {
_id: '$_id._id',
name: '$_id.name',
avg_mileage: 1
} }
]);
For the second data model, aggregation is more straightforward. I'm assuming the inclusion of a timestamp:
db.collection.aggregate([
{ $match: {
manager_id: ObjectId('555cf04fa3ed8cc2347b23d7'),
timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
} },
{ $group: {
_id: '$manager_id',
avg_mileage: {
$avg: '$miles'
},
names: {
$addToSet: '$name'
}
} },
{ $sort: {
avg_mileage: -1
} },
{ $project: {
manager_id: '$_id',
avg_mileage: 1,
names: 1
} }
]);
I have added an array of names (vehicles?) used during the average computation.
Relevant documentation:
$match, $unwind, $group, $sort, $project - Pipeline Aggregation Stages
$avg, $addToSet - Group Accumulator Operators
Date types
ObjectId.getTimestamp


How do I sum multiple fields in doctrine odm?

I want to use doctrine odm's aggregation builder to build this query:
db.TeamStandings.aggregate(
// Pipeline
[
// Stage 1
{
$match: {
"team.$id": ObjectId("5a1643fdf5d8741a883c2aeb")
}
},
// Stage 2
{
$group: {
"_id": { "team": "team.$id" },
// This is the sum of multiple fields
"games": { $sum: { $sum: ["$wins", "$losses", "$ties"] } },
"wins": { $sum: "$wins" },
"losses": { $sum: "$losses" },
"ties": { $sum: "$ties" },
"homeWins" : { $sum: "$homeRecord.wins" },
"homeLosses" : { $sum: "$homeRecord.losses" },
"homeTies" : { $sum: "$homeRecord.ties" },
"roadWins" : { $sum: "$roadRecord.wins" },
"roadLosses" : { $sum: "$roadRecord.losses" },
"roadTies" : { $sum: "$roadRecord.ties" },
}
},
]
);
I executed this in Studio3T and got the following:
{
"_id" : {
"team" : "team.$id"
},
"games" : NumberInt(776),
"wins" : NumberInt(377),
"losses" : NumberInt(398),
"ties" : NumberInt(1),
"homeWins" : NumberInt(218),
"homeLosses" : NumberInt(170),
"homeTies" : NumberInt(1),
"roadWins" : NumberInt(159),
"roadLosses" : NumberInt(228),
"roadTies" : NumberInt(0)
}
How do I write this exact query using doctrine odm's aggregation builder?
This one is tricky because the documentation is not quite clear.
In theory, you create a nested sub-expression and use the sum operator there:
$builder = $this->dm->createAggregationBuilder(\Documents\BlogPost::class);
$builder->group()
->field('id')
->expression(null)
->field('games')
->sum($builder->expr()->sum('$wins', '$losses', '$ties'))
;
However, due to the way the aggregation builder is built, it doesn't quite understand the sum syntax with multiple expressions outside a $project stage, resulting in the following (wrong) result:
[{
"$group": {
"_id": null,
"games": { "$sum": { "$sum": "$wins" } }
}
}]
To work around this problem, use the $add operator instead of $sum in the nested expression:
$builder = $this->dm->createAggregationBuilder(\Documents\BlogPost::class);
$builder->group()
->field('id')
->expression(null)
->field('games')
->sum($builder->expr()->add('$wins', '$losses', '$ties'))
;
This creates the aggregation pipeline you want it to create.
The reason this is weird is that the documentation defines different behavior for $sum in $group and $project stages: in $group it accepts a single argument, while in $project it accepts multiple arguments. However, it is not perfectly clear how it behaves in a nested expression within $group. The fact that the aggregation pipeline you posted works suggests that the sub-expression is not treated as being within a $group stage, thus allowing multiple arguments. When I built the operator, I assumed the opposite: $sum should accept multiple arguments only in a $project stage and default to the one-argument syntax otherwise.
I'll create a ticket for this in MongoDB ODM and I'll see if this can easily be fixed.
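For reference, the $group stage that the $add workaround is meant to produce can be written out by hand, together with a plain JavaScript equivalent of what it computes. The two sample records below are made up; they were chosen only so that their totals line up with the output shown in the question (377 wins, 398 losses, 1 tie, 776 games).

```javascript
// The stage the workaround should generate: $add combines the fields of one
// document, and the outer $sum accumulates that value across the group.
const groupStage = {
  $group: { _id: null, games: { $sum: { $add: ["$wins", "$losses", "$ties"] } } },
};

// In-memory equivalent of the "games" accumulator
function totalGames(docs) {
  return docs.reduce((acc, d) => acc + (d.wins + d.losses + d.ties), 0);
}

// Hypothetical standings records
const standings = [
  { wins: 200, losses: 198, ties: 1 },
  { wins: 177, losses: 200, ties: 0 },
];
const games = totalGames(standings);
```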

ElasticSearch sort on multiple fields with summation

I am doing a search using a "filtered" query and then want to sort on the sum of 3 columns.
Eg:
"query": {
...
},
"sort": {
view_count + comment_count + like_count
order: DESC
}
The result should be in descending order of the sum of the above 3 counts.
How can I sum the columns and then order the results by that sum?
Any help is appreciated.
Use script sorting if you can't change the data, don't have control over the mapping, or the three fields keep changing (i.e. they are not insert-and-forget).
{
"sort": {
"_script": {
"type": "number",
"script": "return doc['view_count'].value + doc['comment_count'].value + doc['like_count'].value",
"lang": "groovy",
"order": "desc"
}
}
}
If you update the documents frequently, you can also add another field - let's call it sum - where you already index the sum of the three fields. And then you simply sort on the sum field.
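A minimal sketch of the precomputed-field approach: compute the total in application code before the document is indexed, then sort on that field with an ordinary sort clause. The helper name withSum is hypothetical and the counts are made-up sample values; the field name sum is the one suggested above.

```javascript
// Hypothetical helper: attach the precomputed total before the document is indexed.
function withSum(doc) {
  return { ...doc, sum: doc.view_count + doc.comment_count + doc.like_count };
}

// Search body fragment that sorts on the stored field, so no script is needed
const searchBody = { sort: [{ sum: { order: "desc" } }] };

const indexed = withSum({ view_count: 10, comment_count: 2, like_count: 5 });
```

The trade-off is that every update to one of the three counters also has to rewrite sum, which is why this suits data that is under your control.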
In my case (Elasticsearch 5.6), the code below sorts by the sum of multiple fields.
GET /_search
{
"query" : {
"term" : { "user" : "kimchy" }
},
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"lang": "painless",
"source": "doc['retweet_count'].value + doc['favorite_count'].value"
},
"order" : "desc"
}
}}
Source: Script-based Sorting

Percentage of OR conditions matched in mongodb

I have got my data in following format..
{
"_id" : ObjectId("534fd4662d22a05415000000"),
"product_id" : "50862224",
"ean" : "8808992479390",
"brand" : "LG",
"model" : "37LH3000",
"features" : [
{
"key" : "Screen Format",
"value" : "16:9"
}, {
"key" : "DVD Player / Recorder",
"value" : "No"
}, {
"key" : "Weight in kg",
"value" : "12.6"
}
... so on
]
}
I need to compare the features of one product with others and divide the results into separate categories (100% match, 50-99% match) based on the percentage of features that match.
My initial thought was to prepare a dynamic query with an OR condition for each feature and compute the percentage in PHP, but that means MongoDB will return even those products that have only 1 feature matching. And I think nearly all products of a category might have some feature in common, so I fear I might be processing a lot of products in PHP.
I have two questions basically:
Are there any alternative ways?
And is the data structure I am using good enough to support the functionality I am looking for, or should I consider changing it?
Well, your solution really should be MongoDB-specific; otherwise you will end up doing your calculations and possibly the matching on the client side, and that is not going to be good for performance.
So of course what you really want is a way to have that processing done on the server side:
db.products.aggregate([
// Match the documents that meet your conditions
{ "$match": {
"$or": [
{
"features": {
"$elemMatch": {
"key": "Screen Format",
"value": "16:9"
}
}
},
{
"features": {
"$elemMatch": {
"key" : "Weight in kg",
"value" : { "$gt": "5", "$lt": "8" }
}
}
},
]
}},
// Keep the document and a copy of the features array
{ "$project": {
"_id": {
"_id": "$_id",
"product_id": "$product_id",
"ean": "$ean",
"brand": "$brand",
"model": "$model",
"features": "$features"
},
"features": 1
}},
// Unwind the array
{ "$unwind": "$features" },
// Find the actual elements that match the conditions
{ "$match": {
"$or": [
{
"features.key": "Screen Format",
"features.value": "16:9"
},
{
"features.key" : "Weight in kg",
"features.value" : { "$gt": "5", "$lt": "8" }
},
]
}},
// Count those matched elements
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}},
// Restore the document and divide the matched elements by the
// number of elements in the "or" condition
{ "$project": {
"_id": "$_id._id",
"product_id": "$_id.product_id",
"ean": "$_id.ean",
"brand": "$_id.brand",
"model": "$_id.model",
"features": "$_id.features",
"matched": { "$divide": [ "$count", 2 ] }
}},
// Sort by the matched percentage
{ "$sort": { "matched": -1 } }
])
Since you know the "length" of the $or condition being applied, you simply need to find out how many of the elements in the "features" array match those conditions. That is what the second $match in the pipeline is all about.
Once you have that count, you simply divide by the number of conditions that were passed in as your $or. The beauty here is that now you can do something useful with this, like sorting by that relevance and even "paging" the results server side.
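Because the divisor has to track the number of $or conditions, one option is to generate the whole pipeline from a list of conditions so the two can never drift apart. This is only a sketch: the helper name buildMatchPipeline and the { key, value } condition shape are assumptions, and the ean, brand and model fields are left out of the projections to keep it short.

```javascript
// Hypothetical helper: `conditions` is an array of { key, value } pairs.
function buildMatchPipeline(conditions) {
  // Document-level filter: at least one array element satisfies one condition
  const elemOr = conditions.map(c => ({ features: { $elemMatch: { key: c.key, value: c.value } } }));
  // Element-level filter, applied after $unwind
  const flatOr = conditions.map(c => ({ "features.key": c.key, "features.value": c.value }));
  return [
    { $match: { $or: elemOr } },
    { $project: { _id: { _id: "$_id", product_id: "$product_id", features: "$features" }, features: 1 } },
    { $unwind: "$features" },
    { $match: { $or: flatOr } },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $project: {
        _id: "$_id._id",
        product_id: "$_id.product_id",
        features: "$_id.features",
        // The divisor is derived from the condition list itself
        matched: { $divide: ["$count", conditions.length] },
    } },
    { $sort: { matched: -1 } },
  ];
}

const pipeline = buildMatchPipeline([
  { key: "Screen Format", value: "16:9" },
  { key: "Weight in kg", value: { $gt: "5", $lt: "8" } },
]);
```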
Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:
{ "$project": {
"product_id": 1,
"ean": 1,
"brand": 1,
"model": 1,
"features": 1,
"matched": 1,
"category": { "$cond": [
{ "$eq": [ "$matched", 1 ] },
"100",
{ "$cond": [
{ "$gte": [ "$matched", .7 ] },
"70-99",
{ "$cond": [
{ "$gte": [ "$matched", .4 ] },
"40-69",
"under 40"
]}
]}
]}
}}
Or something similar. The $cond operator is what helps you here.
The architecture should be fine as it is: you can have a compound index on the "key" and "value" of the entries in your features array, and this should scale well for queries.
Of course if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or elastic search. But the full implementation of that would be a bit lengthy for here.
I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:
lookingat = db.products.findOne({product_id:'50862224'})
matches = db.products.aggregate([
{ $unwind: '$features' },
{ $match: { features: { $in: lookingat.features }}},
{ $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
{ $sort: { matchedfeatures: -1 }},
{ $limit: 5 },
{ $project: { _id:0, product_id: '$_id',
pctmatch: { $multiply: [ '$matchedfeatures',
100/lookingat.features.length ]}
}}
])
Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:
$unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
$match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
$group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
$sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
$project lets you rename the _id from the $group step back to product_id and also turn the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)
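The walkthrough above (6 features, 4 in the target, 3 matching) can be checked with a small in-memory sketch. The function name pctMatch and all feature values below are made up for illustration, and features are compared by key and value only, which is a simplification of the exact embedded-document equality that $in applies.

```javascript
// Percentage of the target's features that also appear in the candidate.
function pctMatch(target, candidate) {
  const wanted = new Set(target.features.map(f => JSON.stringify([f.key, f.value])));
  const matched = candidate.features.filter(f => wanted.has(JSON.stringify([f.key, f.value]))).length;
  // Same arithmetic as the $project stage: matched * (100 / number of target features)
  return matched * (100 / target.features.length);
}

// Hypothetical target with 4 features
const lookingat = { features: [
  { key: "Screen Format", value: "16:9" },
  { key: "DVD Player / Recorder", value: "No" },
  { key: "Weight in kg", value: "12.6" },
  { key: "HDMI ports", value: "3" },
] };

// Hypothetical candidate with 6 features, 3 of which match the target
const candidate = { features: [
  { key: "Screen Format", value: "16:9" },
  { key: "DVD Player / Recorder", value: "No" },
  { key: "Weight in kg", value: "12.6" },
  { key: "HDMI ports", value: "2" },
  { key: "USB ports", value: "1" },
  { key: "Wall mountable", value: "Yes" },
] };

const pct = pctMatch(lookingat, candidate);
```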

Mongodb $in not return correct sequence

I have this query:
$cursor = $collection->find(array('id' =>array('$in'=>array(4,3,2,1))), array('name'));
foreach($cursor as $fild)
{
echo $fild['name'].'<br>';
}
It returns:
Need for speed: Most Wanted
Pro Evolution Soccer 2014
Fifa 2014
Star Craft 2
If I change the order in the array to (3,2,4,1), it still returns:
Need for speed: Most Wanted
Pro Evolution Soccer 2014
Fifa 2014
Star Craft 2
But it should return:
Fifa 2014
Pro Evolution Soccer 2014
Star Craft 2
Need for speed: Most Wanted
What am I doing wrong?
Essentially this is not how the $in operator works, in MongoDB or in any equivalent form in any other database. The order in which you list the arguments is not maintained in your results, as you seem to expect.
There are, however, a couple of approaches you can take to achieve this. The first is by some creative use of the aggregation pipeline:
var selections = [ 4, 2, 3, 1 ];
db.collection.aggregate([
// $match is like a standard "find" query
{ "$match": {
"id": { "$in": selections }
}},
// Project your field(s) and a sorting field
{ "$project": {
"_id": 0,
"name": 1,
"order": { "$cond": [
{ "$eq": [ "$id", 4 ] },
1,
{ "$cond": [
{ "$eq": [ "$id", 2 ] },
2,
{ "$cond": [
{ "$eq": [ "$id", 3 ] },
3,
4
]}
]}
]}
}},
// Sort on the generated field
{ "$sort": { "order": 1 } }
])
So by using the $cond operator, which is a ternary operator, you are evaluating the current value of id in order to determine which sort order to assign. Of course you would actually generate the pipeline contents for this condition in code, using a method similar to what is shown here.
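One possible way to generate that nested $cond expression from the selections array is sketched below. The helper name buildOrderExpr is hypothetical; it walks the array from the end so that the last selection becomes the final "else" branch, which reproduces the hand-written expression above for [ 4, 2, 3, 1 ].

```javascript
// Hypothetical helper: build the nested $cond that maps each id to its position.
function buildOrderExpr(field, selections) {
  // The last selection needs no test; it is the fallback value
  let expr = selections.length;
  for (let i = selections.length - 2; i >= 0; i--) {
    expr = { $cond: [{ $eq: [field, selections[i]] }, i + 1, expr] };
  }
  return expr;
}

const selections = [4, 2, 3, 1];
const orderExpr = buildOrderExpr("$id", selections);

// The expression then slots into the $project stage
const projectStage = { $project: { _id: 0, name: 1, order: orderExpr } };
```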
Of course, if that all seems a little too complex, even though you would be best off doing it that way, then you can approach the problem in a similar way using mapReduce. But as this runs JavaScript code in an interpreter rather than native code as aggregate does, it will run slower:
var selections = [ 4, 2, 3, 1 ];
db.collection.mapReduce(
function() {
emit(
selections.indexOf( this.id ),
this.name
);
},
function(){},
{
"query": { "id": { "$in": selections } },
"scope": { "selections": selections },
"out": { "inline": 1 }
}
)
And this way takes advantage of how mapReduce sorts the emitted key values from the mapper, so by positioning by the index value in the array you maintain the sort order.
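The same positional idea, sorting by each id's index in the selections array, can also be applied on the client after a plain $in query. That is a different technique from the two server-side approaches above and is shown only as a sketch; the helper name orderBySelection is hypothetical, and the id-to-name mapping is inferred from the outputs in the question.

```javascript
// Hypothetical helper: reorder fetched documents to follow the order of `selections`.
function orderBySelection(docs, selections) {
  const position = new Map(selections.map((id, i) => [id, i]));
  // Copy before sorting so the caller's array is left untouched
  return [...docs].sort((a, b) => position.get(a.id) - position.get(b.id));
}

// Documents as an $in query might return them (mapping inferred from the question)
const fetched = [
  { id: 1, name: "Need for speed: Most Wanted" },
  { id: 2, name: "Pro Evolution Soccer 2014" },
  { id: 3, name: "Fifa 2014" },
  { id: 4, name: "Star Craft 2" },
];

const orderedNames = orderBySelection(fetched, [3, 2, 4, 1]).map(g => g.name);
```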
So that gives you a couple of approaches you can use to generate a sort order "on the fly" from something like the order of arguments in an array.

How to conditionally find items by Tag in mongodb

I am new to NoSQL and MongoDB, and I am a little puzzled about what type of queries I can do and how to do them. My knowledge is limited to simpler queries.
I would like to make what I think is a complicated query within MongoDB instead of using PHP to sort the results, but I do not know if it is possible or how to do it.
I have a tag field within my collection that is an array: {tag: ["blue","red","yellow","green","violet"]}.
First level problem: Let's say I want to find all birds that have the tags blue, yellow and green, where blue is a must-have tag and any other colours are optional.
Second level problem: Then I would like to order the results so that the birds that have all the queried colours appear first.
Is it possible to create this query in MongoDB? And if it is, how could I do it?
You can use the aggregation framework. For the following dataset:
{ "_id":ObjectId(...), "bird":1, "tags":["blue","red","yellow","green","violet"]}
{ "_id":ObjectId(...), "bird":2, "tags":["red","yellow","green","violet"] }
{ "_id":ObjectId(...), "bird":3, "tags":["blue","yellow","violet"] }
{ "_id":ObjectId(...), "bird":4, "tags":["blue","yellow","red","violet"] }
{ "_id":ObjectId(...), "bird":5, "tags":["blue"] }
we can apply the following query:
colors = ["blue","red","yellow","green"];
db.birds.aggregate([
{ $match: {tags: 'blue'} },
{ $project: {_id:0, bird:1, tags:1} },
{ $unwind: '$tags' },
{ $match: {tags: {$in: colors}} },
{ $group: {_id:'$bird', score: {$sum:1}} },
{ $sort: {score:-1} },
{ $project: {bird:'$_id', score:1, _id:0} }
])
and will get a result like this:
{
"result" : [
{ "score" : 4, "bird" : 1 },
{ "score" : 3, "bird" : 4 },
{ "score" : 2, "bird" : 3 },
{ "score" : 1, "bird" : 5 }
],
"ok" : 1
}
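The scoring can be reproduced in plain JavaScript to see where each number comes from. This is an in-memory sketch over the same five birds (with the _id fields dropped), not a replacement for the pipeline; the function name scoreBirds is hypothetical.

```javascript
const birds = [
  { bird: 1, tags: ["blue", "red", "yellow", "green", "violet"] },
  { bird: 2, tags: ["red", "yellow", "green", "violet"] },
  { bird: 3, tags: ["blue", "yellow", "violet"] },
  { bird: 4, tags: ["blue", "yellow", "red", "violet"] },
  { bird: 5, tags: ["blue"] },
];

function scoreBirds(docs, required, colors) {
  return docs
    // first $match: the required tag must be present
    .filter(d => d.tags.includes(required))
    // $unwind + second $match + $group: count the tags that are in `colors`
    .map(d => ({ score: d.tags.filter(t => colors.includes(t)).length, bird: d.bird }))
    // $sort: highest score first
    .sort((a, b) => b.score - a.score);
}

const scores = scoreBirds(birds, "blue", ["blue", "red", "yellow", "green"]);
```

Bird 2 drops out at the first step because it has no blue tag, which is how the "must have" requirement is enforced.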
Most of this you will have to do in your application. In order to find all documents where a bird has the tag "blue", you can do this:
db.collection.find( { tag: "blue" } );
Which colours are optional doesn't matter, as you have to find by the required tag anyway.
After finding them, you need to do a sort. But sorting like you want (by their 3 colours) is not something you can do in MongoDB, and something you will have to do in PHP instead.
