I have a query like
'aggs' => [
    'deadline' => [
        'date_histogram' => [
            'field' => 'deadline',
            'interval' => 'month',
            'keyed' => true,
            'format' => 'MMM'
        ]
    ]
]
The result I am getting is buckets keyed by month name.
The problem I am facing is that buckets keyed by a month name from a previous year are overwritten by the same month of the following year (because the key is obviously the same).
I want results where the doc_count of the overwritten buckets from the previous year is merged with the doc_count of the corresponding buckets from the next year.
You can either add a separate month field during indexing and aggregate on that, or use the script below:
{
    "size": 0,
    "aggs": {
        "deadline": {
            "histogram": {
                "script": { "inline": "return doc['deadline'].value.getMonthOfYear()" },
                "interval": 1
            }
        }
    }
}
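If you are sending the request through the same PHP client syntax as in the question, the body above might look roughly like this (a sketch only; adjust to your client version):
$params['body'] = [
    'size' => 0,
    'aggs' => [
        'deadline' => [
            'histogram' => [
                'script' => ['inline' => "return doc['deadline'].value.getMonthOfYear()"],
                'interval' => 1
            ]
        ]
    ]
];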
Creating a separate month field will give better performance.
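For illustration, a minimal sketch of the pre-indexed approach, assuming a keyword field named deadline_month (e.g. "Jan", "Feb", ...) is written at index time:
'aggs' => [
    'deadline_months' => [
        'terms' => [
            'field' => 'deadline_month', // hypothetical keyword field populated during indexing
            'size' => 12
        ]
    ]
]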
Replace the format MMM with YYYY-MMM, as below:
'aggs' => [
    'deadline' => [
        'date_histogram' => [
            'field' => 'deadline',
            'interval' => 'month',
            'keyed' => true,
            'format' => 'YYYY-MMM'
        ]
    ]
]
After this you can handle the merging at the application level.
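For example, a minimal sketch of that merging step, assuming $response holds the decoded search response and 'keyed' => true is still set:
$monthTotals = [];
foreach ($response['aggregations']['deadline']['buckets'] as $key => $bucket) {
    $month = substr($key, 5); // "2017-Mar" -> "Mar"
    $monthTotals[$month] = ($monthTotals[$month] ?? 0) + $bucket['doc_count'];
}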
To filter by date, I use the following query:
'body' => [
    'query' => [
        'bool' => [
            'filter' => [
                'range' => [
                    'expire_at' => [
                        'gte' => now()
                    ]
                ]
            ]
        ]
    ]
]
UPD: All records have another date field - last_checked. The question is how to select records in which, for example, (expire_at - 7 days) > last_checked?
Check the range query documentation: you can use date math in the range parameters. E.g., expire_at - 7 days < now() means that the expiration will be within the next 7 days. Then you can do:
"range": {
"expire_at": {
"lt": "now+7d/d"
}
}
Note that this will also include items that have already expired. If you want to avoid that, add the condition that the expiration date has not passed yet:
"range": {
"expire_at": {
"lt": "now+7d/d",
"gte": "now/d"
}
}
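If it helps, the same condition in the PHP array syntax used in the question would look roughly like this (a sketch based on the bounds above):
'body' => [
    'query' => [
        'bool' => [
            'filter' => [
                'range' => [
                    'expire_at' => [
                        'gte' => 'now/d',
                        'lt' => 'now+7d/d'
                    ]
                ]
            ]
        ]
    ]
]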
Use this code:
"expire_at" => array(
    "lt" => "now+7d/d"
)
I have data stored in the following format in MongoDB:
[
    {
        meta: {
            id: 1
        },
        data: {
            date: "03/01/2020"
        }
    }
],
[
    {
        meta: {
            id: 1
        },
        data: {
            date: "12/19/2019"
        }
    }
]
I want to use PHP to get all entries whose date is greater than or equal to the current date and less than or equal to the current date plus 6 months.
To do that, my query looks like this:
$query = [
    'meta.id' => 1,
    '$expr' => [
        '$and' => [
            [
                '$gte' => [
                    ['$dateFromString' => ['dateString' => '$data.date']],
                    time()
                ]
            ],
            [
                '$lte' => [
                    ['$dateFromString' => ['dateString' => '$data.date']],
                    strtotime('+6 months')
                ]
            ]
        ]
    ]
];
With this query I should get back the first entry, with a date of "03/01/2020"; however, I get back no results.
I've also tried using 'format' in the dateString, but it doesn't appear to be supported yet, as it gives me an error.
How can I get this query to work correctly?
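One possible direction, sketched here as an assumption rather than a verified fix: time() and strtotime() return plain integers, while $dateFromString produces BSON dates, so the two sides of the comparison are different types. Passing the bounds as MongoDB\BSON\UTCDateTime values (which take milliseconds) keeps everything as dates:
$now = new MongoDB\BSON\UTCDateTime(time() * 1000); // current time as a BSON date
$sixMonths = new MongoDB\BSON\UTCDateTime(strtotime('+6 months') * 1000);

$query = [
    'meta.id' => 1,
    '$expr' => [
        '$and' => [
            ['$gte' => [['$dateFromString' => ['dateString' => '$data.date']], $now]],
            ['$lte' => [['$dateFromString' => ['dateString' => '$data.date']], $sixMonths]]
        ]
    ]
];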
I have the array below and need to append a new array inside $newData['_embedded']['settings']['web/vacation/filters']['data']. How can I access and append inside it?
$newData = [
    "id" => "47964173",
    "email" => "abced#gmail.com",
    "firstName" => "Muhammad",
    "lastName" => "Taqi",
    "type" => "employee",
    "_embedded" => [
        "settings" => [
            [
                "alias" => "web/essentials",
                "data" => [],
                "dateUpdated" => "2017-08-16T08:54:11Z"
            ],
            [
                "alias" => "web/personalization",
                "data" => [],
                "dateUpdated" => "2016-07-14T10:31:46Z"
            ],
            [
                "alias" => "wizard/login",
                "data" => [],
                "dateUpdated" => "2016-09-26T07:56:43Z"
            ],
            [
                "alias" => "web/vacation/filters",
                "data" => [
                    "test" => [
                        "type" => "teams",
                        "value" => [
                            0 => "09b285ec-7687-fc95-2630-82d321764ea7",
                            1 => "0bf117b4-668b-a9da-72d4-66407be64a56",
                            2 => "16f30bfb-060b-360f-168e-1ddff04ef5cd"
                        ]
                    ],
                    "multiple teams" => [
                        "type" => "teams",
                        "value" => [
                            0 => "359c0f53-c9c3-3f88-87e3-aa9ec2748313"
                        ]
                    ]
                ],
                "dateUpdated" => "2017-07-03T09:10:36Z"
            ],
            [
                "alias" => "web/vacation/state",
                "data" => [],
                "dateUpdated" => "2016-12-08T06:58:57Z"
            ]
        ]
    ]
];
$newData['_embedded']['settings']['web/vacation/filters']['data'] = $newArray;
Any hint on how to append it quickly? I don't want to loop in and check for keys inside loops.
The settings subarray is "indexed". You first need to search the alias column of the subarray for web/vacation/filters to find the correct index. Using a foreach loop without a break means your code keeps iterating even after the index is found (bad coding practice).
There is a cleaner way that avoids a loop, a condition, and a break: use array_search() on array_column(). It will seek your associative element, return the index, and immediately stop seeking.
You can use the + operator to add the new data to the subarray. This avoids calling a function like array_merge().
Code: (Demo)
if (($index = array_search('web/vacation/filters', array_column($newData['_embedded']['settings'], 'alias'))) !== false) {
    $newData['_embedded']['settings'][$index]['data'] += $newArray;
}
var_export($newData);
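For clarity, the + union operator keeps keys that already exist on the left-hand side and only adds new ones, so an existing entry such as "test" would not be overwritten; a tiny illustration:
$a = ['test' => 1];
$b = ['test' => 99, 'extra' => 2];
var_export($a + $b); // array('test' => 1, 'extra' => 2)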
Perhaps a more considered process would be to force the insert of the new data when the search returns no match, rather than just flagging the process as unsuccessful. You may have to tweak the date generation for your specific timezone or whatever... (Demo Link)
$newArray=["test2"=>[
"type" =>"teams2",
"value" => [
0 => "09b285ec-7687-fc95-2630-82d321764ea7",
1 => "0bf117b4-668b-a9da-72d4-66407be64a56",
2 => "16f30bfb-060b-360f-168e-1ddff04ef5cd"
],
]
];
if(($index=array_search('web/vacation/filters',array_column($newData['_embedded']['settings'],'alias')))!==false){
//echo $index;
$newData['_embedded']['settings'][$index]['data']+=$newArray;
}else{
//echo "couldn't find index, inserting new subarray";
$dt = new DateTime();
$dt->setTimeZone(new DateTimeZone('UTC')); // or whatever you are using
$stamp=$dt->format('Y-m-d\TH-i-s\Z');
$newData['_embedded']['settings'][]=[
"alias" => "web/vacation/filters",
"data" => $newArray,
"dateUpdated" => $stamp
];
}
You need to find the key that corresponds to web/vacation/filters. For example, you could use this:
foreach ($newData['_embedded']['settings'] as $key => $value) {
    if ($value["alias"] === 'web/vacation/filters') {
        $indexOfWVF = $key;
    }
}
$newData['_embedded']['settings'][$indexOfWVF]['data'][] = $newArray;
From the comments: then you want to merge the arrays, not append them.
$newData['_embedded']['settings'][$indexOfWVF]['data'] = array_merge($newData['_embedded']['settings'][$indexOfWVF]['data'],$newArray);
Or (if it's always Filter1):
$newData['_embedded']['settings'][$indexOfWVF]['data']['Filter1'] = $newArray['Filter1'];
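To make the difference concrete: appending with [] nests the new array under a fresh numeric key, while array_merge() adds its keys at the top level of data. A small illustration with a made-up filter:
$data = ['existing' => ['type' => 'teams']];

$data[] = ['Filter1' => 'x'];                    // $data[0]['Filter1'] === 'x' (nested under a numeric key)
$data = array_merge($data, ['Filter1' => 'x']);  // $data['Filter1'] === 'x' (top-level key)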
I have 2 collections: words and phrases.
Each word document has an array of phrase ids, and each phrase can be active or inactive.
For example:
words : {"word" => "hello", phrases => [1,2]}{"word" => "table", phrases => [2]}
phrases :{"id" => 1, "phrase" => "hello world!", "active" => 1}{"id" => 2, "phrase" => "hello, i have already bought new table", "active" => 0}
I need to get the count of active phrases for each word.
In PHP I do it like this:
1. Get all words.
2. For each word, get the count of active phrases with the condition ['active' => 1].
Question: how can I get the words together with their active phrase counts in one request? I tried to use MapReduce, but I still need to make a request for each word to get its count of active phrases.
UPD:
In my test collection there are 92 000 phrases and 23 000 words.
I have already tested both variants: a PHP loop over each word in which I get the phrase count, and an aggregation pipeline in Mongo.
But I changed the aggregation pipeline (shown below) because of phrases_data: it is an array, so I can't use $match on it directly. I use $unwind after $lookup.
[ '$unwind' => '$5' ],
[
    '$lookup' => [
        'from' => 'phrases_926ee3bc9fa72b029e028ec90e282072ea0721d1',
        'localField' => '5',
        'foreignField' => '0',
        'as' => 'phrases_data'
    ]
],
[ '$unwind' => '$phrases_data' ],
[ '$match' => [ 'phrases_data.3' => 77 ] ], // phrases_data.3 => 77 is equivalent to phrases_data.active => 1
[
    '$group' => [
        '_id' => ['word' => '$1', 'id' => '$0'],
        'active_count' => [ '$sum' => 1 ]
    ]
],
[ '$match' => [ 'active_count' => ['$gt' => 0] ] ],
[
    '$sort' => [
        'active_count' => -1
    ]
]
The problem is that the $group stage takes 80% of the processing time, and overall it is much slower than the PHP loop. Here are my results for the test collection:
1. PHP loop (get words -> get phrase count for each word): 10 seconds
2. Aggregation pipeline: 20 seconds
db.words.aggregate([
    { "$unwind": "$phrases" },
    {
        "$lookup": {
            "from": "phrases",
            "localField": "phrases",
            "foreignField": "id",
            "as": "phrases_data"
        }
    },
    { "$match": { "phrases_data.active": 1 } },
    {
        "$group": {
            "_id": "$word",
            "active_count": { "$sum": 1 }
        }
    }
]);
You can use the above aggregation pipeline:
Unwind the phrases array of each words collection document into separate documents.
Do a $lookup (join) on the phrases collection using the unwound phrases.
Filter the phrases, checking for active, with $match.
Finally, group the phrases by word and count them using $sum: 1.
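For reference, a sketch of the same pipeline in PHP, assuming the mongodb/mongodb library (the connection string and database name are placeholders):
$client = new MongoDB\Client('mongodb://localhost:27017'); // placeholder connection string
$words = $client->selectDatabase('mydb')->selectCollection('words'); // placeholder database name

$cursor = $words->aggregate([
    ['$unwind' => '$phrases'],
    ['$lookup' => [
        'from' => 'phrases',
        'localField' => 'phrases',
        'foreignField' => 'id',
        'as' => 'phrases_data'
    ]],
    ['$match' => ['phrases_data.active' => 1]],
    ['$group' => [
        '_id' => '$word',
        'active_count' => ['$sum' => 1]
    ]]
]);

foreach ($cursor as $row) {
    echo $row['_id'] . ': ' . $row['active_count'] . PHP_EOL;
}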
I am currently in the process of trying to form an algorithm that will calculate the relevance of a user to another user based on certain bits of data.
Unfortunately, my Maths skills have deteriorated since leaving school almost a decade ago, and as such, I am very much struggling with this. I have found an algorithm online that pushes 'hot' posts to the top of a newsfeed and figure this is a good place to start. This is the algorithm/calculation I found online (in MySQL):
LOG10(ABS(activity) + 1) * SIGN(activity) + (UNIX_TIMESTAMP(created_at) / 300000)
What I am hoping to do is adapt the above concept to work with the data and models I have in my own application. Consider this user object (trimmed down):
{
    "id": 1,
    "first_name": "Joe",
    "last_name": "Bloggs",
    "counts": {
        "connections": 21,
        "mutual_connections": 16
    },
    "mutual_objects": [
        {
            "created_at": "2017-03-26 13:30:47"
        },
        {
            "created_at": "2017-03-26 14:25:32"
        }
    ],
    "last_seen": "2017-03-26 14:25:32"
}
There are three bits of relevant information above that need to be considered in the algorithm:
mutual_connections
mutual_objects but taking into account that older objects should not drive up the relevance as much as newer objects, hence the created_at field.
last_seen
Can anyone suggest a fairly simple (if that's possible) way of doing this?
This was my idea, but in all honesty I have no idea what it is doing, so I cannot be sure it is a good solution; I have also left out last_seen as I could not find a way to include it:
$mutual_date_sum = 0;
foreach ($user->mutual_objects as $mutual_object) {
    $mutual_date_sum =+ strtotime($mutual_object->created_at);
}
$mutual_date_thing = $mutual_date_sum / (300000 * count($user->mutual_objects));
$relevance = log10($user->counts->mutual_connections + 1) + $mutual_date_thing;
Just to be clear, I am not looking to implement some sort of government-level AI or a 50,000-line algorithm from a mathematical genius. I am merely looking for a relatively simple solution that will do the trick for the moment.
UPDATE
I have had a little play and managed to build the following test. It seems mutual_objects very much carries the weight in this particular algorithm, as I would expect to see users 4 and 5 higher up the results list given their large number of mutual_connections.
I don't know if this makes it easier to amend/play with, but this is probably the best I can do. Please help if you have any suggestions :-)
$users = [
[
'id' => 1,
'mutual_connections' => 15,
'mutual_objects' => [
[
'created_at' => '2017-03-26 14:25:32'
],
[
'created_at' => '2017-03-26 14:25:32'
],
[
'created_at' => '2017-02-26 14:25:32'
],
[
'created_at' => '2017-03-15 14:25:32'
],
[
'created_at' => '2017-01-26 14:25:32'
],
[
'created_at' => '2017-03-26 14:25:32'
],
[
'created_at' => '2016-03-26 14:25:32'
],
[
'created_at' => '2017-03-26 14:25:32'
]
],
'last_seen' => '2017-03-01 14:25:32'
],
[
'id' => 2,
'mutual_connections' => 2,
'mutual_objects' => [
[
'created_at' => '2016-03-26 14:25:32'
],
[
'created_at' => '2015-03-26 14:25:32'
],
[
'created_at' => '2017-02-26 14:25:32'
],
[
'created_at' => '2017-03-15 14:25:32'
],
[
'created_at' => '2017-01-26 14:25:32'
],
[
'created_at' => '2017-03-26 14:25:32'
],
[
'created_at' => '2016-03-26 14:25:32'
],
[
'created_at' => '2016-03-26 14:25:32'
],
[
'created_at' => '2016-03-26 14:25:32'
],
[
'created_at' => '2017-03-15 14:25:32'
],
[
'created_at' => '2017-02-26 14:25:32'
],
[
'created_at' => '2017-03-15 14:25:32'
],
[
'created_at' => '2017-01-26 14:25:32'
],
[
'created_at' => '2017-03-12 14:25:32'
],
[
'created_at' => '2016-03-13 14:25:32'
],
[
'created_at' => '2017-03-17 14:25:32'
]
],
'last_seen' => '2015-03-25 14:25:32'
],
[
'id' => 3,
'mutual_connections' => 30,
'mutual_objects' => [
[
'created_at' => '2017-02-26 14:25:32'
],
[
'created_at' => '2017-03-26 14:25:32'
]
],
'last_seen' => '2017-03-25 14:25:32'
],
[
'id' => 4,
'mutual_connections' => 107,
'mutual_objects' => [],
'last_seen' => '2017-03-26 14:25:32'
],
[
'id' => 5,
'mutual_connections' => 500,
'mutual_objects' => [],
'last_seen' => '2017-03-26 20:25:32'
],
[
'id' => 6,
'mutual_connections' => 5,
'mutual_objects' => [
[
'created_at' => '2017-03-26 20:55:32'
],
[
'created_at' => '2017-03-25 14:25:32'
]
],
'last_seen' => '2017-03-25 14:25:32'
]
];
$relevance = [];
foreach ($users as $user) {
    $mutual_date_sum = 0;
    foreach ($user['mutual_objects'] as $bubble) {
        $mutual_date_sum =+ strtotime($bubble['created_at']);
    }
    $mutual_date_thing = empty($mutual_date_sum) ? 1 : $mutual_date_sum / (300000 * count($user['mutual_objects']));
    $relevance[] = [
        'id' => $user['id'],
        'relevance' => log10($user['mutual_connections'] + 1) + $mutual_date_thing
    ];
}
$relevance = collect($relevance)->sortByDesc('relevance');
print_r($relevance->values()->all());
This prints out:
Array
(
[0] => Array
(
[id] => 3
[relevance] => 2485.7219150272
)
[1] => Array
(
[id] => 6
[relevance] => 2484.8647045837
)
[2] => Array
(
[id] => 1
[relevance] => 622.26175831599
)
[3] => Array
(
[id] => 2
[relevance] => 310.84394042139
)
[4] => Array
(
[id] => 5
[relevance] => 3.6998377258672
)
[5] => Array
(
[id] => 4
[relevance] => 3.0334237554869
)
)
This problem is a candidate for machine learning. Look for an introductory book; I think it is not very complex and you could do it. If not, depending on the income your website makes, you might consider hiring someone to do it for you.
If you prefer to do it "manually", you will build your own model with specific weights for different factors. Be aware that our brains deceive us very often, and what you think is a perfect model might be far from optimal.
I would suggest that you start storing data right away on which users each user interacts with most, so you can compare your results with real data. It will also give you a foundation for building a proper machine learning system in the future.
Having said that, here is my proposal:
In the end, you want a list like this (with 3 users):
A->B: relevance
----------------
User1->User2: 0.59
User1->User3: 0.17
User2->User1: 0.78
User2->User3: 0.63
User3->User1: 0.76
User3->User2: 0.45
1) For each user
1.1) Compute and cache the age of every user's 'last_seen', in days, integer rounding down (floor).
1.2) Store max(age(last_seen)); let's call it just max. This is one value, not one per user, but you can only compute it once you have computed the age of every user.
1.3) For each user, change the stored age value with the result of (max-age)/max to get a value between 0 and 1.
1.4) Also compute and cache the age of every object's 'created_at', in days.
2) For each user, comparing with every other user
2.1) Regarding mutual connections, think of this: if A has 100 connections, 10 of them shared with B, and C has 500 connections, 10 of them shared with D, do you really take 10 as the value for the calculation in both cases? I would take the percentage. For A->B it would be 10 and for C->D it would be 2. And then /100 to have a value between 0 and 1.
2.2) Pick a maximum age for mutual objects to be relevant. Let's take 365 days.
2.3) In user A, remove objects older than 365 days. Do not really remove them, just filter them out for the sake of these calculations.
2.4) From the remaining objects, compute the percentage of mutual objects with each of the other users.
2.5) For each one of these other users, compute the average age of the objects in common from the previous step. Take the maximum age (365), subtract the computed average and /365 to have a value between 0 and 1.
2.6) Retrieve the age value of the other user.
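As a rough sketch of the normalisations in steps 1.3, 2.1 and 2.5 (all variable names here are made up for illustration):
// Step 1.3: last_seen age scaled to 0-1 (recently seen users end up near 1).
$ba = ($maxAgeDays - $ageDays) / $maxAgeDays;

// Step 2.1: mutual connections as a fraction of A's own connections, scaled to 0-1.
$mc = $mutualConnections / $totalConnectionsOfA;

// Step 2.5: average age of the mutual objects within the 365-day window, scaled to 0-1.
$oa = (365 - $avgMutualObjectAgeDays) / 365;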
So, for each combination of A->B, you have four values between 0 and 1:
MC: mutual connections A-B
MO: mutual objects A-B
OA: avg mutual object age A-B
BA: age of B
Now you have to assign weights to each one of them in order to find the optimal solution. Assign percentages which sum 100 to make your life easier:
Relevance = 40 * MC + 30 * MO + 10 * OA + 20 * BA
In this case, since OA is so related to MO, you can mix them:
Relevance = 40 * MC + 20 * MO + 20 * MO * OA + 20 * BA
I would suggest running this overnight, every day. There are many ways to improve and optimize the process... have fun!
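As a rough illustration only, a minimal PHP sketch of the final weighting step, assuming the four inputs have already been normalised to the 0-1 range as described above:
// Hypothetical helper; the weights are taken from the mixed formula above.
function relevance(float $mc, float $mo, float $oa, float $ba): float
{
    return 40 * $mc + 20 * $mo + 20 * $mo * $oa + 20 * $ba;
}

// Example: strong mutual connections, few but recent mutual objects, recently seen.
echo relevance(0.8, 0.2, 0.5, 0.9); // 56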