I have a query like this
$results = $collection->find([
'status' => "pending",
'short_code' => intval($shortCode),
'create_time' => ['$lte' => time()],
])
->limit(self::BATCH_NUMBER)
->sort(["priority" => -1, "create_time" => 1]);
where BATCH_NUMBER is 70. I use the result of the query like this:
foreach ($results as $mongoId => $result) {
}
or try to convert it into an array like this:
iterator_to_array($results);
The MongoDB fetch time and the iteration time are:
FetchTime: 0.003173828125 ms
IteratorTime: 4065.1459960938 ms
As you can see, fetching the data from MongoDB is very fast, but iterating (in both cases, whether using iterator_to_array or foreach) is slow.
This is a queue for sending messages to another server. The destination server accepts fewer than 70 documents per request, so I am forced to fetch 70 documents at a time. In short, I want to fetch 70 documents out of 1,300,000, and that is where the problem is.
The query fetches the first 70 documents that match the conditions; we send them and finally delete them from the collection.
Can anybody help? Why does it take so long? Or is there any configuration for PHP or MongoDB that would speed this up?
One more thing: when the total number of documents is around 100,000 (instead of 1,300,000), iteration is fast. The iteration time increases as the total number of documents grows.
That was because of the sorting.
The problem:
Fetching data from MongoDB was fast, but iterating over the cursor with foreach was slow.
The solution:
We were sorting by priority DESC and create_time ASC, but these fields were only indexed ASC separately. Creating a compound index on priority DESC and create_time ASC together fixed the problem.
db.queue_1.createIndex( { "priority" : -1, "create_time" : 1 } )
The order of the fields in the index is important: you should put priority first, then create_time, because that is the order used when sorting the query:
.sort({priority : -1, create_time : 1});
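For reference, a minimal sketch of creating that compound index from PHP with the legacy driver (assuming $collection is the same MongoCollection as in the question; drivers older than 1.5 use ensureIndex() instead of createIndex()):
// Compound index matching the query's sort: priority descending, create_time ascending
$collection->createIndex(array("priority" => -1, "create_time" => 1));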
Related
I'm retrieving some traffic data for a website using the "scan" option in DynamoDB. I have used a FilterExpression to filter the results.
I will be scanning against a large table which will have more than 20 GB of data.
I found that DynamoDB scans through the entire table and filters the results afterwards. The documentation says it only returns 1 MB of data per request, and then I have to loop through again to get the rest. That seems like a bad way to make this work.
I got the reference from here: Dynamodb filter expression not returning all results
For a small table that should be fine.
MySQL does the same, I guess. I'm not sure.
Which is faster to read on a large set of data: a MySQL SELECT or a DynamoDB scan?
Is there any other alternative? What are your thoughts and suggestions?
I'm trying to migrate that traffic data into a DynamoDB table and then query it out. It seems like a bad idea to me now.
$params = [
'TableName' => $tableName,
'FilterExpression' => $this->filter.'=:'.$this->filter.' AND #dy > :since AND #dy < :now',
'ExpressionAttributeNames'=> [ '#dy' => 'day' ],
'ExpressionAttributeValues'=> $eav
];
var_dump($params);
try {
$result = $dynamodb->scan($params);
After considering the suggestions, this is what worked for me:
$params = [
'TableName' => $tableName,
'IndexName' => self::GLOBAL_SECONDARY_INDEX_NAME,
'ProjectionExpression' => '#dy, t_counter, traffic_type_id',
'KeyConditionExpression' => 'country=:country AND #dy between :since AND :to',
'FilterExpression' => 'traffic_type_id=:traffic_type_id',
'ExpressionAttributeNames' => ['#dy' => 'day'],
'ExpressionAttributeValues' => $eav
];
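For completeness, a hedged sketch of paging through those query results, since each response is capped at 1 MB as noted above; this assumes $dynamodb is an Aws\DynamoDb\DynamoDbClient and $params/$eav are built as in the snippet above:
$items = [];
do {
    $result = $dynamodb->query($params);
    foreach ($result['Items'] as $item) {
        $items[] = $item;
    }
    // Continue from where the previous page stopped, if there is more data
    $lastKey = isset($result['LastEvaluatedKey']) ? $result['LastEvaluatedKey'] : null;
    if ($lastKey) {
        $params['ExclusiveStartKey'] = $lastKey;
    }
} while ($lastKey);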
If your data is like key-value pairs and you have fixed fields on which you want to index, use DynamoDB: you can create indexes on all the fields you want to query, and it will work great.
If you require complex querying across multiple indexes, then any RDBMS is good.
If you need to query on just about anything, think about Elasticsearch.
If your queries are very simple but you have a large amount of data to retrieve in each query, think about S3. Maybe you can index the metadata in DynamoDB and keep the actual data in S3.
I have 3 columns, id, msg and created_at, in my Model table. created_at is a timestamp and id is the primary key.
I also have 5 records: world => time4, hello => time2, haha => time1, hihio => time5 and dunno => time3, and these records are arranged in ascending order (as listed here) based on their id.
In Laravel 4, I want to fetch these records, arrange them in ascending order and take the last n (in this case, 3) records. So I want the dunno, world and hihio rows displayed like this in a div:
dunno,time3
world,time4
hihio,time5
What I have tried
Model::orderBy('created_at','asc')->take(3);
undesired result :
haha,time1
hello,time2
dunno,time3
Also tried
Model::orderBy('created_at','desc')->take(3);
undesired result :
hihio,time5
world,time4
dunno,time3
I have also tried the reverse with no luck
Model::take(3)->orderBy('created_at','asc');
This problem seems fairly simple but I just can't seem to get my logic right. I'm still fairly new to Laravel 4, so I would give bonus points for better solutions than using orderBy() and take(), if there are any. Thank you very much!
You are very close.
It sounds like you want to first order the results in descending order
Model::orderBy('created_at','desc')->take(3);
but then reverse them. You can do this in one of two ways: either the traditional PHP way, using array_reverse:
$_dates = Model::orderBy('created_at', 'desc')->take(3)->get();
$dates = array_reverse($_dates->all());
Or the Laravel way, using the reverse method on Laravel's Collection class:
$_dates = Model::orderBy('created_at', 'desc')->take(3)->get()->reverse();
Check out Laravel's Collection documentation on their API site: http://laravel.com/api/class-Illuminate.Support.Collection.html
Now $dates will contain the output you desire.
dunno,time3
world,time4
hihio,time5
You're pretty close with your second attempt. After retrieving the rows from the database, you just need to reverse the results. Assuming you have an instance of Illuminate\Support\Collection, do the following:
$expectedResult = $collection->reverse();
To get the last three rows in ascending order:
$_dates = Model::orderBy('created_at', 'desc')->take(3)->get()->reverse();
Now, the JSON output of $_dates will give you an object of objects.
To get an array of objects instead, use:
$_dates = Model::orderBy('created_at', 'desc')->take(3)->get()->reverse()->values();
$reverse = Model::orderBy('created_at', 'desc')->take(3)->get();
$show = $reverse->reverse();
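Putting it together, a small usage sketch (assuming the columns are msg and created_at as described in the question):
// Last three rows, shown oldest-first
$show = Model::orderBy('created_at', 'desc')->take(3)->get()->reverse();

foreach ($show as $row) {
    echo $row->msg . ',' . $row->created_at . PHP_EOL; // e.g. dunno,time3
}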
I have the collection below:
**S_SERVER** – **S_PORT** – **D_PORT** – **D_SERVER** – **MBYTES**
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 10
MRMCASTLE – 1904 – 445 – MRMCRUNCHAPP – 25
MRMXPSCRUNCH01 – 54769 – 445 – MRMCRUNCHAPP – 2
MRMCASTLE – 2254 – 139 – MRMCRUNCHAPP – 4
MRMCASTLE – 2253 – 445 – MRMCRUNCHAPP – 35
MRMCASTLE – 987 – 445 – MRMCRUNCHAPP – 100
MRMCASTLE – 2447 – 445 – MRMCRUNCHAPP – 12
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 90
MRMCRUNCHAPP – 61191 – 1640 – OEMGCPDB – 10
Firstly, I need the top 30 S_SERVER values by total MBYTES transferred from each S_SERVER. This I am able to get with the following query:
$sourcePipeline = array(
array(
'$group' => array(
'_id' => array('sourceServer' => '$S_SERVER'),
'MBYTES' => array('$sum' => '$MBYTES')
),
),
array(
'$sort' => array("MBYTES" => -1),
),
array(
'$limit' => 30
)
);
$sourceServers = $collection->aggregate($sourcePipeline);
I also need the top 30 D_PORT values by total MBYTES transferred through each D_PORT for an individual S_SERVER. I am doing this by running a for loop over the server results above and querying them one by one for each S_SERVER:
$targetPortPipeline = array(
array(
'$project' => array('S_SERVER' => '$S_SERVER', 'D_PORT' => '$D_PORT', 'MBYTES' => '$MBYTES')
),
array(
'$match' => array('S_SERVER' => $sourceServer), // value found by the query above, passed in one by one by the for loop
),
array(
'$group' => array(
'_id' => array('D_PORT' => '$D_PORT'),
'MBYTES' => array('$sum' => '$MBYTES')
),
),
array(
'$sort' => array("MBYTES" => -1),
),
array(
'$limit' => $limit
)
);
$targetPorts = $collection->aggregate($targetPortPipeline);
But this process is taking too much time. I need an efficient way to get the required results in a single query. I know I am using the MongoDB PHP adapter to accomplish this. You can also give me the aggregation in JavaScript format; I will convert it into PHP.
Your problem here essentially seems to be that you are issuing 30 more queries for your initial 30 results. There is no simple solution to this, and a single query seems right out at the moment, but there are a few things you can consider.
As an additional note, you are not alone in this; it is a question I have seen before, which we can refer to as a "top N results" problem. Essentially what you really want is some way to combine the two result sets so that each grouping boundary (source server) itself has at most N results, while at the top level you are also restricting those results to the top N values.
Your first aggregation query gives you the results for the top 30 "source servers" you want, and that is just fine. But rather than looping additional queries from this, you could try creating an array with just the "source server" values from that result and passing it to your second query using the $in operator instead:
db.collection.aggregate([
// Match should be first
{ "$match": { "S_SERVER": { "$in": sourceServers } } },
// Group both keys
{ "$group": {
"_id": {
"S_SERVER": "$S_SERVER",
"D_SERVER": "$D_SERVER"
},
"value": { "$sum": "$MBYTES" }
}},
// Sort in order of key and largest "MBYTES"
{ "$sort": { "S_SERVER": 1, "value": -1 } }
])
Note that you cannot "limit" here, as this contains every "source server" from the initial match. You also cannot "limit" at the grouping boundary, which is essentially the feature missing from the aggregation framework that would otherwise make this a two-query solution.
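A rough PHP translation of that two-query approach, assuming the legacy MongoCollection API used in the question; note the shell pipeline above groups on D_SERVER, so swap in D_PORT if that is the field you actually need:
// Pull the top 30 source server names out of the first aggregation result;
// with the legacy driver the documents sit under the 'result' key
$serverNames = array();
foreach ($sourceServers['result'] as $doc) {
    $serverNames[] = $doc['_id']['sourceServer'];
}

// Second query: match those servers, group on both keys, sort within each server
$secondPipeline = array(
    array('$match' => array('S_SERVER' => array('$in' => $serverNames))),
    array('$group' => array(
        '_id' => array('S_SERVER' => '$S_SERVER', 'D_SERVER' => '$D_SERVER'),
        'value' => array('$sum' => '$MBYTES'),
    )),
    array('$sort' => array('_id.S_SERVER' => 1, 'value' => -1)),
);
$grouped = $collection->aggregate($secondPipeline);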
As this contains every "dest server" result and possibly beyond the "top 30" you would be processing the result in code and skipping returned results after the "top 30" are retrieved at each grouping (source server) level. Depending on how many results you have, this may or may not be the most practical solution.
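A sketch of that client-side skipping, continuing from the $grouped result in the previous sketch (keep at most 30 entries per source server, discard the rest):
$topPerServer = array();
$currentServer = null;
$counter = 0;

foreach ($grouped['result'] as $doc) {
    $server = $doc['_id']['S_SERVER'];
    if ($server !== $currentServer) {
        // New grouping boundary: reset the per-server counter
        $currentServer = $server;
        $counter = 0;
    }
    if ($counter < 30) {
        $topPerServer[$server][] = $doc;
    }
    $counter++;
}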
Moving on to where this is not so practical, you are then pretty much stuck with getting that output into another collection as an interim step. If you have MongoDB 2.6 or above, this can be as simple as adding an $out pipeline stage at the end of the statement. For earlier versions, you can do the equivalent using mapReduce:
db.collection.mapReduce(
function() {
emit(
{
"S_SERVER": this["S_SERVER"],
"D_SERVER": this["D_SERVER"]
},
this.MBYTES
);
},
function(key,values) {
return Array.sum( values );
},
{
"query": { "S_SERVER": { "$in": sourceServers } },
"out": { "replace": "output" }
}
)
That is essentially the same process as the previous aggregation statement, noting that mapReduce does not sort the output. The sorting is covered by an additional mapReduce operation on the resulting collection:
db.output.mapReduce(
function() {
if ( cServer != this._id["S_SERVER"] ) {
cServer = this._id["S_SERVER"];
counter = 0;
}
if ( counter < 30 )
emit( this._id, this.value );
counter++;
},
function(){}, // reducer is not actually called
{
"sort": { "_id.S_SERVER": 1, "value": -1 },
"scope": { "cServer": "", "counter": 0 }
}
)
The implementation here is the "server side" version of the "cursor skipping" that was mentioned earlier. So you are still processing every result, but the returned results over the wire are limited to the top 30 results under each "source server".
As stated, this is still pretty horrible in that it must scan "programmatically" through the results to discard the ones you do not want. Depending again on your volume, you might be better off simply issuing a .find() for each "source server" value in these results while sorting and limiting the results:
sourceServers.forEach(function(source) {
var cur = db.output.find({ "_id.S_SERVER": source })
.sort({ "value": -1 }).limit(30);
// do something with those results
});
And that is still 30 additional queries but at least you are not "aggregating" every time as that work is already done.
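As mentioned above, on MongoDB 2.6+ the interim collection can instead be produced by appending an $out stage to the aggregation; a sketch, reusing $secondPipeline from the earlier PHP sketch and the collection name "output":
// MongoDB 2.6+: write the grouped result straight to the interim collection
// by appending an $out stage, instead of running the first mapReduce
$secondPipeline[] = array('$out' => 'output');
$collection->aggregate($secondPipeline);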
As a final note, with too much detail to go into in full, you could possibly approach this using a modified form of the initial aggregation query that was shown. This comes more as a footnote: if you have read this far without the other approaches seeming reasonable, then this is likely the worst fit due to the memory constraints it is likely to hit.
The best way to introduce this is with the "ideal" case for a "top N results" aggregation, which of course does not actually exist; ideally the end of the pipeline would look something like this:
{ "$group": {
"_id": "$S_SERVER",
"results": {
"$push": {
"D_SERVER": "$_id.D_SERVER",
"MBYTES": "$value"
},
"$limit": 30
}
}}
So the "non-existing" factor here is the ability to "limit" the number of results that were "pushed" into the resulting array for each "source server" value. The world would certainly be a better place if this or similar functionality were implemented as it makes solving problems such as this quite easy.
Since it does not exist, you are left with resorting to other methods to get the same result and end up with listings like this example, except actually a lot more complex in this case.
Considering that code, you would be doing something along the lines of:
1. Group all the results back into an array per server.
2. Group all of that back into a single document.
3. Unwind the first server results.
4. Get the first entry and group back again.
5. Unwind the results again.
6. Project and match the found entry.
7. Discard the matching result.
8. Rinse and repeat steps 4 - 7, 30 times.
9. Store back the first server document of 30 results.
10. Rinse and repeat steps 3 - 9 for each server, so 30 times.
You would never actually code this directly and would have to generate the pipeline stages in code. It is very likely to blow through the 16 MB limit, probably not on the pipeline document itself, but very likely to do so on the actual result set, as you are pushing everything into arrays.
You might also note how easy this would be to blow up completely if your actual results did not contain at least 30 top values for each server.
The whole scenario comes down to a trade-off on which method suits your data and performance considerations the best:
Live with 30 aggregation queries from your initial result.
Reduce to 2 queries and discard unwanted results in client code.
Output to a temporary collection and use a server cursor skip to discard results.
Issue the 30 queries from a pre-aggregated collection.
Actually go to the trouble of implementing a pipeline that produces the results.
In any case, due to the complexities of aggregating this in general, I would go for producing your result set periodically and storing it in its own collection rather than trying to do this in real time.
The data would not be the "latest" result and only as fresh as how often you update it. But actually retrieving those results for display becomes a simple query, returning at maximum 900 fairly compact results without the overhead of aggregating for every request.
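A sketch of what the read side might look like once such a pre-aggregated collection exists (the collection name top_ports here is hypothetical):
// Reading the periodically rebuilt summary is then a plain query
$summary = $db->selectCollection('top_ports'); // hypothetical collection name
$cursor = $summary->find()
    ->sort(array('S_SERVER' => 1, 'MBYTES' => -1));

foreach ($cursor as $doc) {
    // at most 30 rows per source server, roughly 900 documents in total
}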
Are there any performance issues with PHP MongoDB query cursor handling?
My code:
$cursor = $collection->find($searchCriteria)->limit($limit_rows);
// Sort ascending based on S_DTTM
$cursor->sort(array('S_DTTM' => 1 , 'SYMBOL' => 1 ));
// How many results found?
$num_docs = $cursor->count();
if( $num_docs > 0 )
{
// loop over the results
foreach ($cursor as $ticks)
{
I have also seen code like:
// request data
$result = $cursor->getNext();
My issue is that after the first query returns (in full, with a limit of 100 rows), the next query just keeps looping. Millions of rows would be returned, so I wanted to cap the results with "limit".
I did re-index just in case; still no difference.
What am I doing wrong? Does getNext() work better?
Using mongod version 2.5.4 and the latest PHP mongo driver, downloaded a week ago.
The collection size is 100 GB, including 2 additional indexes.
The mongo log shows all the queries executing in less than 200 ms.
It turns out to be a query issue and not a PHP mongo driver issue.
Use of count() and sort() may decrease performance.
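For example, a sketch that drops the separate count() round trip (if a number is still needed, MongoCursor::count(true) in the legacy driver respects limit and skip):
$cursor = $collection->find($searchCriteria)
    ->sort(array('S_DTTM' => 1, 'SYMBOL' => 1))
    ->limit($limit_rows);

foreach ($cursor as $ticks) {
    // Process each document; the loop simply does nothing when there
    // are no results, so the separate count() round trip is not needed.
}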
So I have a MongoDB collection that tracks logons to our app. The basic structure of a document appears thusly:
[_id] => MongoId Object
(
[$id] => 50f6da28686ba94b49000003
)
[userId] => 50ef542a686ba95971000004
[action] => login
[time] => 1358354984
Now, the challenge is this: there are about 20,000 of these entries. I have been challenged to look at the number of times each user logged in (as defined by userId), so I am looking for a good way to do this. There are a couple of possible approaches that I've seen. In SQL, for example, I might pull down the number of logins by grouping by userId and doing a count on it, something like SELECT userId, count(*) FROM ... GROUP BY userId, and then sub-selecting on that (CASE WHEN or something in the top select).
Anyway, I'm wondering if anyone has any suggestions on the best way to do this. Worst case scenario, I can limit the result set and do the grouping in memory, but ideally I would like to get the full answer directly from Mongo.
The other limitation (even after I get past the first set) is that I am looking to do a unique count by date...which will be even tougher!
Now, the challenge is this: there are about 20,000 of these entries.
At 20,000 you will probably be better off with the aggregation framework ( http://docs.mongodb.org/manual/applications/aggregation/ ):
$db->user->aggregate(array(
array( '$group' => array( '_id' => '$userId', 'num_logins' => array( '$sum' => 1 ) ) )
));
That will group ( http://docs.mongodb.org/manual/reference/aggregation/#_S_group ) by userId and count (sum: http://docs.mongodb.org/manual/reference/aggregation/sum/#_S_sum ) the number of logins in each group.
Note: As stated in the comments, the aggregate helper is in version 1.3+ of the PHP driver. Before version 1.3 you must use the command function directly.
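A sketch of that command-based form for pre-1.3 drivers (assuming $db is a MongoDB instance and the collection is named user, as above):
// Pre-1.3 driver: run the aggregate command directly
$result = $db->command(array(
    'aggregate' => 'user',
    'pipeline' => array(
        array('$group' => array(
            '_id' => '$userId',
            'num_logins' => array('$sum' => 1),
        )),
    ),
));
// The grouped documents are returned under $result['result']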
You can use MapReduce to group the results by user ID
http://docs.mongodb.org/manual/applications/map-reduce/#map-reduce-examples
Or you can use the $group stage of the aggregation pipeline from the shell:
db.logins.aggregate(
{ $group : {
_id : "$userId",
loginsPerUser : { $sum : 1 }
}}
);
For MongoDB, 20K entries or even more won't be a problem to walk through and combine, so no worries about performance.
http://docs.mongodb.org/manual/reference/command/group/
db.user.group({key: {userId: 1}, $reduce: function ( curr, result ) { result.total++ }, initial: {total: 0}});
I ran this on 191,000 rows in just a couple of seconds, but group is limited to 20,000 unique entries, so it really isn't a solution for you.