PHP MongoDB selecting fields by key (distinct)

I have a database with lottery games across the world.
This is how each document looks:
Each game can appear under a different country_code or state_code if the country has states (Canada, USA).
Selecting all game_ids, and then all the countries and/or states each one belongs to, is done like this:
// get all games
// $colCurrent = MongoCollection Object
$gamesRes = $colCurrent->distinct('game_id');
foreach ($gamesRes as $gameId) {
    $disCountries = $colCurrent->distinct('country_code', array('game_id' => $gameId));
    $disStates = $colCurrent->distinct('state_code', array('game_id' => $gameId));
}
I believe this is an inefficient way to do it, as it issues many queries against the database.
I've tried the aggregate function, but it only selects one field, like distinct does.
Can anyone help optimize this query?
Thanks a lot!

Depending on what you are trying to achieve and the size of your data set, there are a few different approaches you can take.
Some examples using the Aggregation Framework in the mongo shell (MongoDB 2.2+):
1) Find all games and for each game create the set of unique country_code and state_code values:
db.games.aggregate(
    { $group: {
        _id: { gameId: "$game_id" },
        countries: { $addToSet: "$country_code" },
        states: { $addToSet: "$state_code" }
    }}
)
2) Find all games, and group by the unique combination of gameId, country_code, and state_code including a count:
db.games.aggregate(
    { $group: {
        _id: {
            gameId: "$game_id",
            country_code: "$country_code",
            state_code: "$state_code"
        },
        total: { $sum: 1 }
    }}
)
In this second example, note that the _id used for grouping can include multiple fields.
If you don't want to group on all the documents in the collection, you could make these aggregations more efficient by starting with the $match operator to limit the pipeline to the data you need ($match can also take advantage of a suitable index).
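To see what the first $group stage produces, its logic can be emulated in a few lines of plain Python. The sample documents below are made up for illustration; they are not from the question:

```python
from collections import defaultdict

# Hypothetical sample documents standing in for the games collection.
docs = [
    {"game_id": "lotto", "country_code": "CA", "state_code": "ON"},
    {"game_id": "lotto", "country_code": "CA", "state_code": "QC"},
    {"game_id": "lotto", "country_code": "FR", "state_code": None},
    {"game_id": "keno",  "country_code": "US", "state_code": "NY"},
]

# Emulate { $group: { _id: "$game_id", countries: {$addToSet: ...}, states: {$addToSet: ...} } }
grouped = defaultdict(lambda: {"countries": set(), "states": set()})
for d in docs:
    g = grouped[d["game_id"]]
    g["countries"].add(d["country_code"])
    if d["state_code"] is not None:  # treat None like an absent field and skip it
        g["states"].add(d["state_code"])

print(sorted(grouped["lotto"]["countries"]))  # ['CA', 'FR']
print(sorted(grouped["lotto"]["states"]))     # ['ON', 'QC']
```

The server does this in one round trip, which is the whole point versus the N+1 distinct() calls in the question.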

Assuming you mean the total number of distinct countries and the total number of distinct states for each game_name (assuming one game_name per game_id, and that the name is more readable; interchange the fields if needed).
Posting as mongo shell for general clarity; adapt to your driver and language as required:
db.lottery.aggregate([
    { $project: { country_code: 1, state_code: 1, game_name: 1 } },
    { $group: {
        _id: "$game_name",
        countries: { $addToSet: "$country_code" },
        states: { $addToSet: "$state_code" }
    }},
    { $unwind: "$countries" },
    { $group: { _id: { id: "$_id", states: "$states" }, country_count: { $sum: 1 } } },
    { $project: { _id: 0, game: "$_id.id", countries: "$country_count", states: "$_id.states" } },
    { $unwind: "$states" },
    { $group: { _id: { id: "$game", countries: "$countries" }, state_count: { $sum: 1 } } },
    { $project: { _id: 0, game: "$_id.id", countries: "$_id.countries", states: "$state_count" } },
    { $sort: { game: 1 } }
])
So there are a few fancy stages here:
1. Project only the fields that are needed.
2. Group on the game and add each country and state to a set of its own.
3. Unwind the countries to get one record per country.
4. Group a sum of the countries while retaining the game and the states array.
5. Project (optional) to make the records appear more natural, since $group moves everything under _id.
6. Unwind the states to get one record per state.
7. Group a sum of the states while retaining the game and the countries count.
8. Project into something more natural.
9. Sort (optional) by whatever you like; in this case the name of the game.
Phew! A reasonably hefty aggregate but it does show a way to work out the problem.
DISCLAIMER:
I have made the huge assumption here that your data already makes some sense and that there are not multiple records of a game per country and/or per state. The additional "I didn't do it" part is that your code did not discern states within countries, so "I didn't do it either" :-P
You can add in $group stages to do that though. Part of the fun of programming is learning and working out how to do things by yourself. So this should be a good place to start if not a perfect fit.
The reference is a really good place for learning how to apply all the operators used here. Apply one stage at a time ( data size permitting ) to get a good idea of what is going on in each step.
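As a sanity check on the pipeline above, its net effect (a distinct-country count and a distinct-state count per game) can be emulated in plain Python. The sample documents here are hypothetical:

```python
# Emulate the pipeline's end result: per game_name, the number of
# distinct countries and distinct states, sorted by game name.
docs = [
    {"game_name": "Lotto Max", "country_code": "CA", "state_code": "ON"},
    {"game_name": "Lotto Max", "country_code": "CA", "state_code": "QC"},
    {"game_name": "Powerball", "country_code": "US", "state_code": "NY"},
    {"game_name": "Powerball", "country_code": "US", "state_code": "CA"},
    {"game_name": "Powerball", "country_code": "US", "state_code": "NY"},
]

counts = {}
for d in docs:  # the two $addToSet accumulators, as Python sets
    entry = counts.setdefault(d["game_name"], {"countries": set(), "states": set()})
    entry["countries"].add(d["country_code"])
    entry["states"].add(d["state_code"])

# the unwind/count/sort tail of the pipeline collapses to len() + sorted()
result = sorted(
    ({"game": g, "countries": len(v["countries"]), "states": len(v["states"])}
     for g, v in counts.items()),
    key=lambda r: r["game"],
)
print(result[0])  # {'game': 'Lotto Max', 'countries': 1, 'states': 2}
```

Note how the duplicate Powerball/NY row is absorbed by the set, which is exactly what $addToSet does.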

Related

Elasticsearch query date sorting parent-child relation (recurring events)

I’m currently working on an app where we are handling events.
So, in Elasticsearch, we do have a document named Event.
Previously, we only had one kind of event (a unique event happening on 13 May from 9 AM to 11 AM), so sorting was simple (sort by start_date with an order).
We recently added a new feature that allows us to create recurring events, that means that we now have 2 levels inside Elasticsearch (parent-child relation).
We can have a parent event that runs from 12 May 2 PM to 14 May 6 PM; linked to that event, we have children that are daily, for example: 12 May 2PM-6PM, 13 May 2PM-6PM, 14 May 2PM-6PM.
The problem with the current sort is that when it is 12 May at 10 PM, we find the recurring event at the top of the list, with the unique event after it.
I’d like to have a sorting where the nearest date has a higher priority. In that case, the unique event should have been the first on the list.
To make that happen, I have indexed the children nodes on the recurring-event parent, in order to have the children's start_date.
The idea would be to get the nearest date out of the children nodes for every recurring event and sort it together with the start_date of every unique event.
I do not have much experience with Elasticsearch, so I'm kind of stuck. I saw a lot of information in the documentation (parent-child, nested objects, scripts, etc.) but I don't know how to handle this case.
I hope that I have explained myself correctly; if you have any questions, feel free to ask them, and I will be happy to provide additional information.
For future googlers, here's how I fixed it.
I had to use a script-based sort; here's a partial example of the request I'm using:
GET /event/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "params": {
          "currentDate": 1560230000
        },
        "source": """
          def isRecurrenceParent = params._source.is_recurrence_parent;
          def countChildren = params._source.children.length;
          def currentDate = params.currentDate;
          if (isRecurrenceParent === false) {
            return params._source.timestamp;
          }
          def nearest = 0;
          def lowestDiff = currentDate;
          for (int i = 0; i < countChildren; i++) {
            def child = params._source.children[i];
            def diff = child.timestamp - currentDate;
            if (diff > 0 && diff < lowestDiff) {
              lowestDiff = diff;
              nearest = child.timestamp;
            }
          }
          return nearest;
        """
      },
      "order": "asc"
    }
  }
}
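The script's logic can be sketched in plain Python, which may help when reasoning about edge cases. The timestamps below are made up:

```python
def sort_key(event, current_date):
    """Mirror the Painless script: unique events sort by their own
    timestamp; recurring parents sort by the nearest upcoming child."""
    if not event["is_recurrence_parent"]:
        return event["timestamp"]
    nearest = 0
    lowest_diff = current_date
    for child in event["children"]:
        diff = child["timestamp"] - current_date
        if 0 < diff < lowest_diff:  # upcoming, and closer than the best so far
            lowest_diff = diff
            nearest = child["timestamp"]
    return nearest

events = [
    {"is_recurrence_parent": False, "timestamp": 1560300000, "children": []},
    {"is_recurrence_parent": True, "timestamp": 1560100000,
     "children": [{"timestamp": 1560250000}, {"timestamp": 1560400000}]},
]
ordered = sorted(events, key=lambda e: sort_key(e, current_date=1560230000))
print([e["timestamp"] for e in ordered])  # [1560100000, 1560300000]
```

Here the recurring parent sorts first because its nearest upcoming child (1560250000) precedes the unique event's timestamp (1560300000), even though the parent's own timestamp is earlier still.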
The first thing you should consider is that parent and child docs are saved separately. Parent-Event::1 and Child-Event::1 are stored in the same shard (ES routes a child to the shard where its parent lives, by the parent id hash), but the document types are different. So you should fetch parent and child documents with separate queries and sort each by date.
(You can issue the following queries from PHP if that works for you.)
P.S.: I had the same situation but had to implement it in Java, so I made an ES query builder (https://github.com/mashhur/java-elasticsearch-querybuilder) which supports parent-child relationship queries too; you can take a look for reference.
// search child events and sort by date
GET events/_search
{
  "query": {
    "has_parent": {
      "parent_type": "parent-event",
      "query": {
        "match_all": {}
      }
    }
  },
  "sort": [{ "start_date": { "order": "desc" } }]
}
// search parent events and sort by date
GET events/_search
{
  "query": {
    "has_child": {
      "type": "child-event",
      "query": {
        "match_all": {}
      }
    }
  },
  "sort": [{ "start_date": { "order": "desc" } }]
}

update value using nested element match in mongo [duplicate]

I have a document in mongodb with 2 level deep nested array of objects that I need to update, something like this:
{
  id: 1,
  items: [
    {
      id: 2,
      blocks: [
        {
          id: 3,
          txt: 'hello'
        }
      ]
    }
  ]
}
If there were only one level of nesting I could use the positional operator to update objects in the array, but for the second level the only option I've come up with is to use the positional operator together with the nested object's index, like this:
db.objects.update({'items.id': 2}, {'$set': {'items.$.blocks.0.txt': 'hi'}})
This approach works, but it seems dangerous to me: I'm building a web service, and the index number would come from the client, which could send, say, 100000 as the index, forcing MongoDB to pad the array with 100000 null entries.
Are there any other ways to update such nested objects where I can refer to an object's ID instead of its position, or ways to check whether the supplied index is out of bounds before using it in the query?
Here's the big question, do you need to leverage Mongo's "addToSet" and "push" operations? If you really plan to modify just individual items in the array, then you should probably build these arrays as objects.
Here's how I would structure this:
{
  id: 1,
  items: {
    "2": { "blocks": { "3": { txt: 'hello' } } },
    "5": { "blocks": { "1": { txt: 'foo' }, "2": { txt: 'bar' } } }
  }
}
This basically transforms everything in to JSON objects instead of arrays. You lose the ability to use $push and $addToSet but I think this makes everything easier. For example, your query would look like this:
db.objects.update({'items.2': {$exists: true}}, {'$set': {'items.2.blocks.3.txt': 'hi'}})
You'll also notice that I've dumped the "IDs". When you're nesting things like this you can generally replace "ID" with simply using that number as an index. The "ID" concept is now implied.
This feature was added in 3.6 with expressive updates.
db.objects.update(
  { id: 1 },
  { $set: { 'items.$[itm].blocks.$[blk].txt': "hi" } },
  { multi: false, arrayFilters: [ { 'itm.id': 2 }, { 'blk.id': 3 } ] }
)
The ids you are using are linear numbers, so they have to come from somewhere, such as an additional field like 'max_idx' or similar.
That means one lookup for the id and then the update. A UUID/ObjectId can be used for the ids instead, which also ensures that distributed CRUD works.
Building on Gates' answer, I came up with this solution which works with nested object arrays:
db.objects.updateOne(
  { "items.id": 2 },
  { $set: { "items.$.blocks.$[block].txt": "hi" } },
  { arrayFilters: [{ "block.id": 3 }] }
);
MongoDB 3.6 added the all-positional operator $[]. Note that the plain $ operator cannot be used after $[], so if you know the id of the block that needs updating, combine $[] with a filtered positional operator:
db.objects.update(
  { 'items.blocks.id': id_here },
  { $set: { 'items.$[].blocks.$[b].txt': 'hi' } },
  { arrayFilters: [{ 'b.id': id_here }] }
)
db.col.update(
  { "items.blocks.id": 3 },
  { $set: { "items.$[].blocks.$[b].txt": "bonjour" } },
  { arrayFilters: [{ "b.id": 3 }] }
)
https://docs.mongodb.com/manual/reference/operator/update/positional-filtered/#update-nested-arrays-in-conjunction-with
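Outside of MongoDB, the semantics of that filtered-positional update can be pictured as a plain dictionary walk. This is only an illustration of what arrayFilters match (the helper name is made up), not a driver call:

```python
def set_block_txt(doc, item_id, block_id, txt):
    """Emulate: $set items.$[itm].blocks.$[blk].txt with
    arrayFilters [{'itm.id': item_id}, {'blk.id': block_id}]."""
    for item in doc["items"]:          # $[itm] scans every item...
        if item["id"] != item_id:      # ...but the filter keeps only matching ids
            continue
        for block in item["blocks"]:   # $[blk] likewise scans blocks
            if block["id"] == block_id:
                block["txt"] = txt     # the $set payload

doc = {"id": 1, "items": [{"id": 2, "blocks": [{"id": 3, "txt": "hello"}]}]}
set_block_txt(doc, item_id=2, block_id=3, txt="hi")
print(doc["items"][0]["blocks"][0]["txt"])  # hi
```

The key property is that everything is addressed by id, never by array index, so a hostile client-supplied index can't pad the array with nulls.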
This is the pymongo signature for find_one_and_update; note the array_filters parameter. I searched a lot to find this, so I hope it will be useful:
find_one_and_update(filter, update, projection=None, sort=None, return_document=ReturnDocument.BEFORE, array_filters=None, hint=None, session=None, **kwargs)
Reference and pymongo documentation links added in the comments.

Load specific relations in a nested eager loading in laravel

I have the following related tables:
tableA
- id
- value
tableB
- id
- tableA_id
- value
tableC
- id
- tableB_id
- value
tableD
- id
- tableC_id
- value
I normally use nested eager loading to get the tableA object from tableD, for example:
$table_d = TableD::with('TableC.TableB.TableA')->find($id);
And I get an object like this:
{
  "id": 1,
  "value": "value",
  "tableC_id": 1,
  "tablec": {
    "id": 1,
    "value": "value",
    "tableB_id": 1,
    "tableb": {
      "id": 1,
      "value": "value",
      "tableA_id": 1,
      "tablea": {
        "id": 1,
        "value": "value"
      }
    }
  }
}
What I want to achieve is to obtain only the object of table D, with its related object from table A, without having table C and table B in the final object, something like this:
{
  "id": 1,
  "value": "value",
  "tablea": {
    "id": 1,
    "value": "value"
  }
}
I tried adding this function in the model file of Table D:
public function TableA()
{
    return $this->belongsTo('App\Models\TableC', 'tableC_id')
        ->join('tableB', 'tableC.tableB_id', '=', 'tableB.id')
        ->join('tableA', 'tableB.tableA_id', '=', 'tableA.id')
        ->select('tableA.id', 'tableA.value');
}
but it does not work, because when I run the following query, it returns some correct objects and others with tablea = null:
$tables_d = TableD::with('TableA')->get();
Am I doing something wrong or is there another way to achieve what I want?
You may be able to skip a table with $this->hasManyThrough(), but depending on what you really want as future features, you may want to have multiple relations with whatever code you need. Query scopes as well.
One can generally use a has-many-through relationship when it is just two tables with a linking table between them. You have yet another join beyond that, so it won't really be much better than what you have currently.
Have you considered another mapping table from D to A directly, or a bit of denormalization? If you always need to load it like that, you might benefit from a few duplicated FKs to save on the joins.
This will really depend on your needs, and it is not 3NF (third normal form), maybe not even 2NF. But that's why denormalization is like comma use: follow the rules generally, but break them for specific reasons; in this case, to reduce the number of required joins by duplicating an FK reference in a table.
https://laravel.com/docs/5.6/eloquent-relationships#has-many-through
You can try to do this:
- add a method in the TableD model:
public function table_a()
{
    return $this->TableC->TableB->TableA();
}
then use: TableD::with('table_a');

MongoDB Ordering by average combined numbers or nested sub arrays

Having some issues working out the best way to do this in MongoDB; arguably it's a relational data set, so I will probably be slated. Still, it's a challenge to see if it's possible.
I currently need to order by a Logistics Manager's daily average miles across the vans in their department, and also, in a separate list, a combined weekly average.
My first setup in the database was as follows:
{
  "_id" : ObjectId("555cf04fa3ed8cc2347b23d7"),
  "name" : "My Manager 1",
  "vans" : [
    {
      "name" : "van1",
      "miles" : NumberLong(56)
    },
    {
      "name" : "van2",
      "miles" : NumberLong(34)
    }
  ]
}
But I can't see how to order by a nested array value without knowing the parent array keys (these will be a standard 0-x).
So my next choice was to scrap that idea, keep just the name in the first collection, and put the vans in a second collection with the id of the manager.
So removing vans from the above example and adding this collection (vans)
{
  "_id" : ObjectId("555cf04fa3ed8cc2347b23d9"),
  "name" : "van1",
  "miles" : NumberLong(56),
  "manager_id" : "555cf04fa3ed8cc2347b23d7"
}
But because I need to show the results by manager, how do I order, in a query (if possible), by the average miles in this collection where id=x, and then display the manager by his id?
Thanks for your help
If the manager is going to have a limited number of vans, then your first approach is better, as you do not have to make two separate calls/queries to the database to collect your information.
Then comes the question of how to calculate the average mileage per manager, where the Aggregation Framework will help you a lot. Here is a query that will get you the desired data:
db.manager.aggregate([
  { $unwind: "$vans" },
  { $group: {
    _id: {
      _id: "$_id",
      name: "$name"
    },
    avg_milage: { $avg: "$vans.miles" }
  } },
  { $sort: { "avg_milage": -1 } },
  { $project: {
    _id: "$_id._id",
    name: "$_id.name",
    avg_milage: "$avg_milage"
  } }
])
The first $unwind step simply unwraps the vans array and creates a separate document for each element of the array.
Then the $group stage gathers all documents with the same (_id, name) pair and, in the avg_milage field, computes the average of the miles field across those documents.
The $sort stage is obvious; it just sorts the documents in descending order, using the new avg_milage field as the sort key.
And finally, the last $project step cleans up the documents with appropriate projections, just for beauty :)
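Those four stages can be emulated in plain Python to see the intermediate shapes, using the manager document from the question:

```python
# Emulate $unwind + $group/$avg + $sort over the question's manager document.
managers = [
    {"_id": "555cf04fa3ed8cc2347b23d7", "name": "My Manager 1",
     "vans": [{"name": "van1", "miles": 56}, {"name": "van2", "miles": 34}]},
]

rows = []
for m in managers:
    for van in m["vans"]:  # $unwind: "$vans" -> one row per van
        rows.append({"_id": m["_id"], "name": m["name"], "miles": van["miles"]})

groups = {}
for r in rows:  # $group on the (_id, name) pair, collecting miles
    groups.setdefault((r["_id"], r["name"]), []).append(r["miles"])

result = sorted(
    ({"_id": k[0], "name": k[1], "avg_milage": sum(v) / len(v)}
     for k, v in groups.items()),
    key=lambda d: d["avg_milage"], reverse=True,  # $sort: {avg_milage: -1}
)
print(result[0]["avg_milage"])  # 45.0
```

With van1 at 56 miles and van2 at 34, the manager's average comes out to (56 + 34) / 2 = 45.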
A similar pipeline produces your second desired result:
db.manager.aggregate([
  { $unwind: "$vans" },
  { $group: {
    _id: {
      _id: "$_id",
      name: "$name"
    },
    total_milage: { $sum: "$vans.miles" }
  } },
  { $sort: { "total_milage": -1 } },
  { $project: {
    _id: "$_id._id",
    name: "$_id.name",
    weekly_milage: { $multiply: ["$total_milage", 7] }
  } }
])
This will produce the list of managers with their weekly mileage, sorted in descending order, so you can $limit the result and get the manager with the highest mileage, for instance.
And in much the same way, you can grab info per van:
db.manager.aggregate([
  { $unwind: "$vans" },
  { $group: {
    _id: "$vans.name",
    total_milage: { $sum: "$vans.miles" }
  } },
  { $sort: { "total_milage": -1 } },
  { $project: {
    van_name: "$_id",
    weekly_milage: { $multiply: ["$total_milage", 7] }
  } }
])
First, do you require average miles for a single day, average miles over a given time period, or average miles over the life of the manager? I would consider adding a timestamp field. Yes, _id has a timestamp, but this only reflects the time the document was created, not necessarily the time of the initial day's log.
Considerations for the first data model:
Does each document represent one day, or one manager?
How many "vans" do you expect to have in the array? Does this list grow over time? Do you need to consider the 16MB max doc size in a year or two from now?
Considerations for the second data model:
Can you store the manager's name as the "manager_id" field, and use it as a unique ID? Doing so would remove the need for a secondary manager metadata lookup just to get the name.
As #n9code has pointed out, the aggregation framework is the answer in both cases.
For the first data model, assuming each document represents one day and you want to retrieve an average for a given day or a range of days:
db.collection.aggregate([
  { $match: {
    name: 'My Manager 1',
    timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
  } },
  { $unwind: '$vans' },
  { $group: {
    _id: {
      _id: '$_id',
      name: '$name',
      timestamp: '$timestamp'
    },
    avg_mileage: { $avg: '$vans.miles' }
  } },
  { $sort: { avg_mileage: -1 } },
  { $project: {
    _id: '$_id._id',
    name: '$_id.name',
    timestamp: '$_id.timestamp',
    avg_mileage: 1
  } }
]);
If, for the first data model, each document represents a manager and the "vans" array grows daily, this particular data model is not ideal, for two reasons:
The "vans" array may grow beyond the max document size... eventually, although that would be a lot of data.
It is more difficult and memory-intensive to limit to a certain date range, since the timestamp would then be nested within an item of "vans" rather than in the root of the document.
For the sake of completeness, here is the query:
/*
Assuming the data model is:
{
  _id: ...,
  name: ...,
  vans: [
    { name: ..., miles: ..., timestamp: ... }
  ]
}
*/
db.collection.aggregate([
  { $match: {
    name: 'My Manager 1'
  } },
  { $unwind: '$vans' },
  { $match: {
    'vans.timestamp': { $gte: ISODate(...), $lt: ISODate(...) }
  } },
  { $group: {
    _id: {
      _id: '$_id',
      name: '$name'
    },
    avg_mileage: { $avg: '$vans.miles' }
  } },
  { $sort: { avg_mileage: -1 } },
  { $project: {
    _id: '$_id._id',
    name: '$_id.name',
    avg_mileage: 1
  } }
]);
For the second data model, aggregation is more straightforward. I'm assuming the inclusion of a timestamp:
db.collection.aggregate([
  { $match: {
    manager_id: ObjectId('555cf04fa3ed8cc2347b23d7'),
    timestamp: { $gte: ISODate(...), $lt: ISODate(...) }
  } },
  { $group: {
    _id: '$manager_id',
    avg_mileage: { $avg: '$miles' },
    names: { $addToSet: '$name' }
  } },
  { $sort: { avg_mileage: -1 } },
  { $project: {
    manager_id: '$_id',
    avg_mileage: 1,
    names: 1
  } }
]);
I have added an array of names (vehicles?) used during the average computation.
Relevant documentation:
$match, $unwind, $group, $sort, $project - Pipeline Aggregation Stages
$avg, $addToSet - Group Accumulator Operators
Date types
ObjectId.getTimestamp

Elasticsearch - Create report filters using Bool (MUST & AND) DSL query

I am trying to create some report filters where the user can search for profiles using any fields on the report. For example: search for any profile with a firstname that starts with ann and a grade that starts with vi, etc.
Here is a query I have written so far:
{
  from: 20,
  size: 20,
  query: {
    filtered: {
      query: {
        match_all: {}
      },
      filter: {
        bool: {
          must: [
            {
              prefix: {
                firstname: "ann"
              }
            },
            {
              prefix: {
                grade: "vi"
              }
            }
          ]
        }
      }
    }
  },
  sort: {
    grade: {
      order: "asc"
    }
  }
}
If I remove one child of must (in the bool filter), it works. But it doesn't return any results once I use more than one filter, and I need to be able to use any number of entries in there.
Also, if I use should instead of must, it works. I'm not sure if I'm misunderstanding the logic, but to my understanding, in this case must should return ONLY results with a firstname that starts with ann and a grade that starts with vi.
Such documents do exist, but this query just doesn't find them.
Am I missing something here?
Thanks
Since I cannot post comments yet, I'm answering with some assumptions.
First of all, I'm using ES version 0.90.2, and your query works fine for my inputs. However, depending on your input size and the platform where you executed your query, my answer may not be the right one.
Assumption: the number of documents in the index is less than 20.
I've added following inputs to my index:
'{"name": "ann", "grade": "vi"}'
'{"name": "ann", "grade": "ii"}'
'{"name": "johan", "grade": "vi"}'
'{"name": "johan", "grade": "ii"}'
And my test query was the same as yours, and here is the result:
"hits" : {
  "total" : 2,
  "max_score" : null,
  "hits" : [ ] // <-- see this part is blank
}
As you can see, it didn't list the hits, even though there are two of them. That's because of the from: 20 segment, which skips the first 20 results. If you change that value you can see some results; if you want to see all results, just delete that part.
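The from/size behaviour is plain offset pagination, which a short Python sketch makes obvious (the hit list here is hypothetical):

```python
hits = [{"name": "ann", "grade": "vi"}, {"name": "ann", "grade": "ii"}]  # total: 2

def page(results, frm, size):
    """Offset pagination, as Elasticsearch applies from/size after matching."""
    return results[frm:frm + size]

print(page(hits, 20, 20))       # [] -- from=20 skips past both hits
print(len(page(hits, 0, 20)))   # 2
```

So the "total" in the response counts all matches, while the "hits" array only contains the requested window.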
Note: Well if this is not the case, sorry for bothering :(
