mongodb aggregation subquery : Mongodb php adapter - php

I have below collection:
**S_SERVER** – **S_PORT** – **D_PORT** – **D_SERVER** – **MBYTES**
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 10
MRMCASTLE – 1904 – 445 – MRMCRUNCHAPP – 25
MRMXPSCRUNCH01 – 54769 – 445 – MRMCRUNCHAPP – 2
MRMCASTLE – 2254 – 139 – MRMCRUNCHAPP – 4
MRMCASTLE – 2253 – 445 – MRMCRUNCHAPP – 35
MRMCASTLE – 987 – 445 – MRMCRUNCHAPP – 100
MRMCASTLE – 2447 – 445 – MRMCRUNCHAPP – 12
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 90
MRMCRUNCHAPP – 61191 – 1640 – OEMGCPDB – 10
First, I need the top 30 S_SERVER values by total MBYTES transferred from each S_SERVER. I am able to get this with the following query:
$sourcePipeline = array(
    array(
        '$group' => array(
            '_id' => array('sourceServer' => '$S_SERVER'),
            'MBYTES' => array('$sum' => '$MBYTES')
        ),
    ),
    array(
        '$sort' => array("MBYTES" => -1),
    ),
    array(
        '$limit' => 30
    )
);
$sourceServers = $collection->aggregate($sourcePipeline);
I also need the top 30 D_PORT values by total MBYTES for each individual S_SERVER. I am doing this by looping over the servers returned above and running the following query once per S_SERVER:
$targetPortPipeline = array(
    array(
        '$project' => array('S_SERVER' => '$S_SERVER', 'D_PORT' => '$D_PORT', 'MBYTES' => '$MBYTES')
    ),
    array(
        // $sourceServer: one S_SERVER value from the query above, set on each loop iteration
        '$match' => array('S_SERVER' => $sourceServer),
    ),
    array(
        '$group' => array(
            '_id' => array('D_PORT' => '$D_PORT'),
            'MBYTES' => array('$sum' => '$MBYTES')
        ),
    ),
    array(
        '$sort' => array("MBYTES" => -1),
    ),
    array(
        '$limit' => $limit
    )
);
$targetPorts = $collection->aggregate($targetPortPipeline);
But this process takes too much time. I need an efficient way to get the required results in a single query. I am using the MongoDB PHP adapter, but you can also give me the aggregation in JavaScript format and I will convert it to PHP.

Your problem here essentially seems to be that you are issuing 30 more queries for your initial 30 results. There is no simple solution to this, and a single query seems out of the question at the moment, but there are a few things you can consider.
As an additional note, you are not alone in this as this is a question I have seen before, which we can refer to as a "top N results problem". Essentially what you really want is some way to combine the two result sets so that each grouping boundary (source server) itself only has a maximum N results, while at that top level you are also restricting those results again to the top N result values.
Your first aggregation query gives you the results for the top 30 "source servers" you want, and that is just fine. But rather than looping additional queries from this, you could try creating an array with just the "source server" values from this result and passing that to your second query using the $in operator instead:
db.collection.aggregate([
    // Match should be first
    { "$match": { "S_SERVER": { "$in": sourceServers } } },
    // Group both keys
    { "$group": {
        "_id": {
            "S_SERVER": "$S_SERVER",
            "D_SERVER": "$D_SERVER"
        },
        "value": { "$sum": "$MBYTES" }
    }},
    // Sort in order of key and largest "MBYTES"
    // (after $group the server name lives under "_id")
    { "$sort": { "_id.S_SERVER": 1, "value": -1 } }
])
Note that you cannot "limit" here, as the result contains every "source server" from the initial match. You also cannot "limit" on the grouping boundary, which is essentially the feature missing from the aggregation framework that would otherwise make this a two-query result.
As this contains every "dest server" result and possibly beyond the "top 30" you would be processing the result in code and skipping returned results after the "top 30" are retrieved at each grouping (source server) level. Depending on how many results you have, this may or may not be the most practical solution.
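The client-side trimming described above can be sketched in plain JavaScript. This is only an illustration, assuming the grouped rows have already been fetched into an array sorted by `_id.S_SERVER` and then by `value` descending; `serverNames` and `topNPerServer` are hypothetical helper names, not driver functions.

```javascript
// Hypothetical helper: pull the plain server names out of the first
// aggregation's result (documents shaped like { _id: { sourceServer }, MBYTES }),
// giving the array that is passed to "$in".
function serverNames(firstResult) {
  return firstResult.map(function (doc) { return doc._id.sourceServer; });
}

// Hypothetical helper: keep at most `n` rows per source server.
// Assumes `rows` is already sorted by _id.S_SERVER, then by value descending.
function topNPerServer(rows, n) {
  var kept = [];
  var current = null;
  var count = 0;
  rows.forEach(function (row) {
    if (row._id.S_SERVER !== current) {
      // New grouping boundary: restart the per-server counter
      current = row._id.S_SERVER;
      count = 0;
    }
    if (count < n) kept.push(row);
    count++;
  });
  return kept;
}
```

Every row still crosses the wire with this approach; only the rows kept for display are reduced.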
Where that is not so practical, you are pretty much stuck with getting that output into another collection as an interim step. If you have MongoDB 2.6 or above, this can be as simple as adding an $out pipeline stage at the end of the statement. For earlier versions, you can do the equivalent using mapReduce:
db.collection.mapReduce(
    function() {
        emit(
            {
                "S_SERVER": this["S_SERVER"],
                "D_SERVER": this["D_SERVER"]
            },
            this.MBYTES
        );
    },
    function(key,values) {
        return Array.sum( values );
    },
    {
        "query": { "S_SERVER": { "$in": sourceServers } },
        "out": { "replace": "output" }
    }
)
That is essentially the same process as the previous aggregation statement, while also noting that mapReduce does not sort the output. That is what is covered by an additional mapReduce operation on the resulting collection:
db.output.mapReduce(
    function() {
        if ( cServer != this._id["S_SERVER"] ) {
            cServer = this._id["S_SERVER"];
            counter = 0;
        }
        if ( counter < 30 )
            emit( this._id, this.value );
        counter++;
    },
    function(){}, // reducer is not actually called
    {
        "sort": { "_id.S_SERVER": 1, "value": -1 },
        "scope": { "cServer": "", "counter": 0 },
        "out": { "inline": 1 } // mapReduce needs an "out" option; inline returns the results directly
    }
)
The implementation here is the "server side" version of the "cursor skipping" that was mentioned earlier. So you are still processing every result, but the returned results over the wire are limited to the top 30 results under each "source server".
As stated, this is still pretty horrible in that it must scan "programmatically" through the results to discard the ones you do not want. Depending again on your volume, you might be better off simply issuing a .find() for each "source server" value in these results, sorting and limiting each one:
sourceServers.forEach(function(source) {
    var cur = db.output.find({ "_id.S_SERVER": source })
        .sort({ "value": -1 }).limit(30);
    // do something with those results
});
And that is still 30 additional queries but at least you are not "aggregating" every time as that work is already done.
As a final note, and one with too much detail to cover in full, you could possibly approach this using a modified form of the initial aggregation query shown above. This is more of a footnote: if you have read this far and none of the other approaches seem reasonable, this one is likely the worst fit because of the memory constraints it is likely to hit.
The best way to introduce this is with the "ideal" case for a "top N results" aggregation, which of course does not actually exist but ideally the end of the pipeline would look something like this:
{ "$group": {
    "_id": "$S_SERVER",
    "results": {
        "$push": {
            "D_SERVER": "$_id.D_SERVER",
            "MBYTES": "$value"
        },
        "$limit": 30
    }
}}
So the "non-existing" factor here is the ability to "limit" the number of results that were "pushed" into the resulting array for each "source server" value. The world would certainly be a better place if this or similar functionality were implemented as it makes solving problems such as this quite easy.
Since it does not exist, you are left with resorting to other methods to get the same result and end up with listings like this example, except actually a lot more complex in this case.
Considering that code you would be doing something along the lines of:
1. Group all the results back into an array per server
2. Group all of that back into a single document
3. Unwind the first server results
4. Get the first entry and group back again
5. Unwind the results again
6. Project and match the found entry
7. Discard the matching result
8. Rinse and repeat steps 4 - 7 30 times
9. Store back the first server document of 30 results
10. Rinse and repeat steps 3 - 9 for each server, so 30 times
You would never actually code this directly, and would have to code generate the pipeline stages. It is very likely to blow up the 16MB limit, probably not on the pipeline document itself but very likely to do so on the actual result set as you are pushing everything into arrays.
You might also note how easy this would be to blow up completely if your actual results did not contain at least 30 top values for each server.
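For readers on a newer server than the ones discussed here: the aggregation `$slice` operator (available from MongoDB 3.2) can trim the pushed array after the fact, which approximates the "ideal" form sketched earlier without generating stages. The following is only a sketch of the pipeline document, written as a plain JavaScript function so its shape is easy to inspect; `buildTopNPipeline` is a hypothetical name, and the field names follow the earlier statements. Every grouped row is still pushed into the array before it is trimmed, so the memory concern does not fully go away.

```javascript
// Hypothetical helper: returns a pipeline (array of stage documents) that could
// be passed to db.collection.aggregate(). Assumes MongoDB 3.2+ for "$slice".
function buildTopNPipeline(sourceServers, n) {
  return [
    { "$match": { "S_SERVER": { "$in": sourceServers } } },
    { "$group": {
      "_id": { "S_SERVER": "$S_SERVER", "D_SERVER": "$D_SERVER" },
      "value": { "$sum": "$MBYTES" }
    }},
    // Largest totals first; this relies on "$push" receiving documents in sorted order
    { "$sort": { "_id.S_SERVER": 1, "value": -1 } },
    { "$group": {
      "_id": "$_id.S_SERVER",
      "results": { "$push": { "D_SERVER": "$_id.D_SERVER", "MBYTES": "$value" } }
    }},
    // Keep only the first n entries of each per-server array
    { "$project": { "results": { "$slice": [ "$results", n ] } } }
  ];
}

var pipeline = buildTopNPipeline(["MRMCASTLE", "L0T410R84LDYL"], 30);
```

Whether this is practical depends on how many distinct destinations each source server has, since the full array is built before it is cut down.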
The whole scenario comes down to a trade-off on which method suits your data and performance considerations the best:
1. Live with 30 aggregation queries from your initial result.
2. Reduce to 2 queries and discard unwanted results in client code.
3. Output to a temporary collection and use a server cursor skip to discard results.
4. Issue the 30 queries from a pre-aggregated collection.
5. Actually go to the trouble of implementing a pipeline that produces the results.
In any case, due to the complexities of aggregating this in general, I would go for producing your result set periodically and storing it in its own collection rather than trying to do this in real time.
The data would not be the "latest" result and only as fresh as how often you update it. But actually retrieving those results for display becomes a simple query, returning at maximum 900 fairly compact results without the overhead of aggregating for every request.

Related

php Mongo driver cursor traveling take long so much

I have a query like this
$results = $collection->find([
        'status' => "pending",
        'short_code' => intval($shortCode),
        'create_time' => ['$lte' => time()],
    ])
    ->limit(self::BATCH_NUMBER)
    ->sort(["priority" => -1, "create_time" => 1]);
Where BATCH_NUMBER is 70.
I use the result of the query like below:
foreach ($results as $mongoId => $result) {
}
or convert it to an array like:
iterator_to_array($results);
The timings for fetching the data from mongo and for iterating over it are:
FetchTime: 0.003173828125 ms
IteratorTime: 4065.1459960938 ms
As you can see, fetching the data from mongo is very fast, but iterating (whether with iterator_to_array or with foreach) is slow.
It is a queue for sending messages to another server. The destination server accepts fewer than 70 documents per request, so I am forced to fetch 70 documents at a time. I want to fetch 70 documents out of 1,300,000, and that is where the problem appears.
The query fetches the first 70 documents that match the conditions, sends them, and finally deletes them from the collection.
Can anybody help? Why does it take so long? Is there any configuration for PHP or mongo that would speed this up?
Another thing: when the total number of documents is around 100,000 (instead of 1,300,000), the iteration is fast. The iteration time increases as the total number of documents increases.
That was because of the sorting.
The problem:
Fetching data from mongo looked fast, but traversing the iterator with foreach was slow. The find() call only builds the cursor; the query, including the sort, actually runs on the server once iteration starts, which is why the time showed up in the iterator.
Solution:
The query sorts by priority DESC and create_time ASC. These fields were each indexed ASC separately. Creating a single compound index on priority DESC and create_time ASC fixed the problem.
db.queue_1.createIndex( { "priority" : -1, "create_time" : 1 } )
The order of the fields in the index is important: priority must come first, then create_time,
because that is the order in which the query sorts:
.sort({priority : -1, create_time : 1});
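Why the key order and directions matter can be illustrated with a simplified check. This sketch assumes the basic rule that an index can return documents already sorted when the sort keys form a prefix of the index keys, with directions either identical or all inverted. `indexSupportsSort` is a hypothetical helper written for illustration, not a driver or server function, and it ignores the case where equality filters cover leading index keys.

```javascript
// Hypothetical helper: simplified test of whether an index key pattern
// can provide the order requested by a sort specification.
function indexSupportsSort(indexKeys, sortSpec) {
  var idxFields = Object.keys(indexKeys);
  var sortFields = Object.keys(sortSpec);
  if (sortFields.length === 0 || sortFields.length > idxFields.length) return false;
  var same = true;
  var inverted = true;
  for (var i = 0; i < sortFields.length; i++) {
    // Sort fields must match the leading index fields in the same order
    if (sortFields[i] !== idxFields[i]) return false;
    var idxDir = indexKeys[idxFields[i]];
    var sortDir = sortSpec[sortFields[i]];
    if (idxDir !== sortDir) same = false;
    if (idxDir !== -sortDir) inverted = false;
  }
  // Either walk the index forwards (same) or backwards (inverted)
  return same || inverted;
}

var sort = { priority: -1, create_time: 1 };
console.log(indexSupportsSort({ priority: -1, create_time: 1 }, sort)); // true
console.log(indexSupportsSort({ priority: 1 }, sort));                  // false
console.log(indexSupportsSort({ create_time: 1, priority: -1 }, sort)); // false
```

Under this rule the two separate single-field indexes could not serve the combined sort, while the compound index in the matching order can.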

Continues mongo db querying gets slower

I'm running a tool that runs on a mongo collection with 7,766,558 documents in following template.
{
"_id" : ObjectId("53f602685a38d0bf5e8c6df1"),
"ids" : "5667",
"h" : [
{
"s" : "Briefly, 36 h following transfection, the cells were harvested and lysed in 500 µL HTE buffer containing calcium or calcium chelating reagents according to the experimental condition (150 mM NaCl, 20 mM HEPES (pH 7.4), 50 mM NaF, 1 mM Na3VO4, 1% Triton X-100 and 5 mM EDTA + 5 mM EGTA or 100 µM CaCl2 and complete protease inhibitor cocktail (Roche Applied Science)).",
"p" : "roche"
}
],
"ct" : 1445951888.0000000000000000
}
The query filters the documents on "ids" (supplier id) and "ct" (created date) > [given date] and then runs a mongo regex match on "h.s". Three instances of this job run on 3 threads at any given time, each processing different suppliers. At the start, all 3 threads run their queries quite fast. But the longer they run (after approximately 12 hours or so), the slower the querying tends to get.
The query that i'm running is
db.collection.find({'ids':4296, 'ct':{"$gte":1455062400}, 'h.s':{ '$regex': "[^a-zA-Z]" + catalogId + "[^0-9a-zA-Z]", '$options': 'i' }})
The solution doesn't throw any errors. The problem I'm facing is that the query rate drops drastically over time. If the tool runs 10 queries per second at the start, after 10-12 hours it manages only around 5 per second.
Any ideas for performance improvements I could make to overcome this?
Thanks

MongoDB query to get number of times a Key occurs

So I have a MongoDB document that tracks logons to our app. Basic structure appears thusly:
[_id] => MongoId Object
(
[$id] => 50f6da28686ba94b49000003
)
[userId] => 50ef542a686ba95971000004
[action] => login
[time] => 1358354984
Now- the challenge is this: there are about 20,000 of these entries. I have been challenged to look at the number of times each user logged in (as defined by userId)...so I am looking for a good way to do this. There are a couple of possible approaches that I've seen (in SQL, for example, I might pull down the number of logins by grouping by UserID and doing a count on it- something like SELECT userID, count(*) from....group by UserId...and then sub-selecting on that with CASE WHEN or something in the top select).
Anyways- wondering if anyone has any suggestions on the best way to do this. Worst case scenario I can limit the result set and do the grouping in memory- but ideally would like to get the full answer directly from Mongo.
The other limitation (even after I get past the first set) is that I am looking to do a unique count by date...which will be even tougher!
Now- the challenge is this: there are about 20,000 of these entries.
At 20,000 you will probably be better off with the aggregation framework ( http://docs.mongodb.org/manual/applications/aggregation/ ):
$db->user->aggregate(array(
array( '$group' => array( '_id' => '$userId', 'num_logins' => array( '$sum' => 1 ) ) )
));
That will group ( http://docs.mongodb.org/manual/reference/aggregation/#_S_group ) by userId and count (sum: http://docs.mongodb.org/manual/reference/aggregation/sum/#_S_sum ) the number of logins in each group.
Note: As stated in the comments, the aggregate helper is in version 1.3+ of the PHP driver. Before version 1.3 you must use the command function directly.
You can use MapReduce to group the results by user ID
http://docs.mongodb.org/manual/applications/map-reduce/#map-reduce-examples
Or you can use the $group stage of the aggregation framework:
db.logins.aggregate(
    { $group : {
        _id : "$userId",
        loginsPerUser : { $sum : 1 }
    }}
);
For MongoDB, walking and combining 20K documents or even more won't be a problem, so there are no worries about performance.
http://docs.mongodb.org/manual/reference/command/group/
db.user.group({key: {userId: 1}, $reduce: function ( curr, result ) { result.total++ }, initial: {total: 0}});
I ran this on 191000 rows in just a couple seconds but group is limited to 20,000 unique entries so it really isn't a solution for you.

MongoDB PHP findAndModify Multiple Performance

I have documents in a collection called Reports that are to be processed. I do a query like
$collectionReports->find(array('processed' => 0))
(anywhere between 50 and 2000 items). I process them how I need to and insert the results into another collection, but I need to update the original Report to set processed to the current system time. Right now it looks something like:
$reports = $collectionReports->find(array('processed' => 0));
$toUpdate = array();
foreach ($reports as $report) {
    //Perform the operations on them now
    $toUpdate[] = $report['_id']; // append each id instead of overwriting the array
}
foreach ($toUpdate as $reportID) {
    $criteria = array('_id' => new MongoId($reportID));
    $data = array('$set' => array('processed' => round(microtime(true)*1000)));
    $collectionReports->findAndModify($criteria, $data);
}
My problem with this is that it is horribly inefficient. Processing the reports and inserting them into the collection takes maybe 700ms for 2000 reports, but just updating the processed times takes at least 1500ms for those same 2000 reports. Any tips to speed this up? Thanks in advance.
EDIT: The processed time doesn't have to be exact; it can just be the time the script is run (+/- 10 seconds or so). If it were possible to take the object ($report) and update the time directly, that would be better than searching again after the first foreach.
Thanks Sammaye, changing from findAndModify() to update() seems to work much better and faster.
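Since the processed time does not need to be exact, the per-document calls could in principle be collapsed further into one multi-document update. The sketch below only builds the filter and update documents in plain JavaScript; `buildProcessedUpdate` is a hypothetical name, and how the documents are passed on (for example to an update call with its multi-document option enabled) depends on the driver in use.

```javascript
// Hypothetical helper: builds the pieces of a single update that marks
// every collected report as processed at the same timestamp.
function buildProcessedUpdate(reportIds, nowMillis) {
  return {
    // Match all reports collected during processing in one go
    filter: { "_id": { "$in": reportIds } },
    // Same timestamp for the whole batch (acceptable per the +/- 10 s tolerance)
    update: { "$set": { "processed": nowMillis } },
    // Assumption: the driver exposes a flag for updating multiple documents
    options: { "multi": true }
  };
}

var spec = buildProcessedUpdate(["id1", "id2"], Date.now());
```

The trade-off is that every report in the batch receives the same timestamp, which the question states is acceptable.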

MongoDB count very slow

I have a collection with 1.5 million documents. I'm counting using PHP:
$db->some->ensureIndex(array("sometext" => 1));
$db->some->ensureIndex(array("datsbla" => 1));
$arr["sometext"] = $string;
$arr["datsbla"] = array('$gte' => $some, '$lte' => $thing);
$count = $db->some->count($arr);
I turned on the profiler and every count like that is like 4500 ms. I have 20 counters like that in my page, so it makes my web page VERY VERY SLOW.
What should I do to make it faster (< 100 ms) ? Is it even possible using MongoDB?
Thanks.
You have two separate single-field indexes. A query will generally use only one index at a time, so you are not taking full advantage of indexing. Try a compound index on both fields and you should see a significant improvement.
