I have a collection with 1.5 million documents. I'm counting using PHP:
$db->some->ensureIndex(array("sometext" => 1));
$db->some->ensureIndex(array("datsbla" => 1));
$arr["sometext"] = $string;
$arr["datsbla"] = array('$gte' => $some, '$lte' => $thing);
$count = $db->some->count($arr);
I turned on the profiler and every count like that is like 4500 ms. I have 20 counters like that in my page, so it makes my web page VERY VERY SLOW.
What should I do to make it faster (< 100 ms) ? Is it even possible using MongoDB?
Thanks.
You have two separate individual indexes - a query can only use 1 index at a time, so you are not taking advantage of the indexing fully. Try a compound index on both fields and you should see a significant improvement.
Related
I have a query like this
$results = $collection->find([
'status' => "pending",
'short_code' => intval($shortCode),
'create_time' => ['$lte' => time()],
])
->limit(self::BATCH_NUMBER)
->sort(["priority" => -1, "create_time" => 1]);
Where BATCH_SIZE is 70.
and i use the result of query like below :
foreach ($results as $mongoId => $result) {
}
or trying to convert in to array like :
iterator_to_array($results);
mongo fetch data and traveling on iterate timing is :
FetchTime: 0.003173828125 ms
IteratorTime: 4065.1459960938 ms
As you can see, fetching data by mongo is too fast, but iterating (in both case of using iterator_to_array or using foreach) is slow.
It is a queue for sending messages to another server. Destination server accept less than 70 documents per each request. So i forced to fetch 70 document. anyway. I want to fetch 70 documents from 1,300,000 documents and we have problem here.
query try to fetch first 70 documents which have query conditions, send them and finally delete them from collection.
can anybody help? why it takes long? or is there any config for accelerating for php or mongo?
Another thing, when total number of data is like 100,000 (isntead of 1,300,000) the traveling is fast. traveling time will increase by increasing number of total documents.
That was because of sorting.
The problem :
Fetching data from mongo was fast, but traveling in iterator was slow using foreach.
solution :
there was a sort which we used. sorting by priority DESC and create_time ASC. These fileds was index ASC seperatly. Indexing priority DESC and create_time ASC together fixed problem.
db.queue_1.createIndex( { "priority" : -1, "create_time" : 1 } )
order of fileds on indexing is important. means you should use priority at first. then use create_time.
because when you try to sort your query, you sort them like below :
.sort({priority : -1, create_time : 1});
I have below collection:
**S_SERVER** – **S_PORT** – **D_PORT** – **D_SERVER** – **MBYTES**
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 10
MRMCASTLE – 1904 – 445 – MRMCRUNCHAPP – 25
MRMXPSCRUNCH01 – 54769 – 445 – MRMCRUNCHAPP - 2
MRMCASTLE – 2254 – 139 – MRMCRUNCHAPP - 4
MRMCASTLE – 2253 – 445 – MRMCRUNCHAPP -35
MRMCASTLE – 987 – 445 – MRMCRUNCHAPP – 100
MRMCASTLE – 2447 – 445 – MRMCRUNCHAPP – 12
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP - 90
MRMCRUNCHAPP – 61191 – 1640 – OEMGCPDB – 10
Firstly, I need the top 30 S_SERVER as per total MBYTES transferred from each S_SERVER. This is I am able to get with following query :
$sourcePipeline = array(
array(
'$group' => array(
'_id' => array('sourceServer' => '$S_SERVER'),
'MBYTES' => array('$sum' => '$MBYTES')
),
),
array(
'$sort' => array("MBYTES" => -1),
),
array(
'$limit' => 30
)
);
$sourceServers = $collection->aggregate($sourcePipeline);
I also need Top 30 D_PORT as per total MBYTES transferred from each D_PORT for individual S_SERVER. I am doing this by running for loop from above servers results and getting them individually one by one for each S_SERVER.
$targetPortPipeline = array(
array(
'$project' => array('S_SERVER' => '$S_SERVER', 'D_PORT' => '$D_PORT', 'MBYTES' => '$MBYTES')
),
array(
'$match' => array('S_SERVER' => S_SERVER(find from above query, passed one by one in for loop)),
),
array(
'$group' => array(
'_id' => array('D_PORT' => '$D_PORT'),
'MBYTES' => array('$sum' => '$MBYTES')
),
),
array(
'$sort' => array("MBYTES" => -1),
),
array(
'$limit' => $limit
)
);
$targetPorts = $collection->aggregate($targetPortPipeline);
But this process is taking too much time. I need an efficient way to get required results in same query. I know i am using Mongodb php adapter to accomplish this. You can let me know the aggregation function in javascript format also. I will convert it into php.
You problem here essentially seems to be that you are issuing 30 more queries for your initial 30 results. There is no simple solution to this, and a single query seems right out at the moment, but there are a few things you can consider.
As an additional note, you are not alone in this as this is a question I have seen before, which we can refer to as a "top N results problem". Essentially what you really want is some way to combine the two result sets so that each grouping boundary (source server) itself only has a maximum N results, while at that top level you are also restricting those results again to the top N result values.
Your first aggregation query you the results for the top 30 "source servers" you want, and that is just fine. But rather than looping additional queries from this you could try creating an array with just the "source server" values from this result and passing that to your second query using the $in operator instead:
db.collection.aggregate([
// Match should be first
{ "$match": { "S_SERVER": { "$in": sourceServers } } },
// Group both keys
{ "$group": {
"_id": {
"S_SERVER": "$S_SERVER",
"D_SERVER": "$D_SERVER"
},
"value": { "$sum": "$MBYTES" }
}},
// Sort in order of key and largest "MBYTES"
{ "$sort": { "S_SERVER": 1, "value": -1 } }
])
Noting that you cannot "limit" here as this contains every "source server" from the initial match. You can also not "limit" on the grouping boundary, which is essentially what is missing from the aggregation framework to make this a two query result otherwise.
As this contains every "dest server" result and possibly beyond the "top 30" you would be processing the result in code and skipping returned results after the "top 30" are retrieved at each grouping (source server) level. Depending on how many results you have, this may or may not be the most practical solution.
Moving on where this is not so practical, you are then pretty much stuck with getting that output into another collection as interim step. If you have a MongoDB 2.6 version or above, this can be as simple as adding an $out pipeline stage at the end of the statement. For earlier versions, you can do the equivalent statement using mapReduce:
db.collection.mapReduce(
function() {
emit(
{
"S_SERVER": this["S_SERVER"],
"D_SERVER": this["D_SERVER"]
},
this.MBYTES
);
},
function(key,values) {
return Array.sum( values );
},
{
"query": { "S_SERVER": { "$in": sourceServers } },
"out": { "replace": "output" }
}
)
That is essentially the same process as the previous aggregation statement, while also noting that mapReduce does not sort the output. That is what is covered by an additional mapReduce operation on the resulting collection:
db.output.mapReduce(
function() {
if ( cServer != this._id["S_SERVER"] ) {
cServer = this._id["S_SERVER"];
counter = 0;
}
if ( counter < 30 )
emit( this._id, this.value );
counter++;
},
function(){}, // reducer is not actually called
{
"sort": { "_id.S_SERVER": 1, "value": -1 },
"scope": { "cServer": "", "counter": 0 }
}
)
The implementation here is the "server side" version of the "cursor skipping" that was mentioned earlier. So you are still processing every result, but the returned results over the wire are limited to the top 30 results under each "source server".
As stated, still pretty horrible in that this must scan "programmatically" through the results to discard the ones you do not want, and depending again on your volume, you might be better of simply issuing a a .find() for each "source server" value in these results while sorting and limiting the results
sourceServers.forEach(function(source) {
var cur = db.output.find({ "_id.S_SERVER": source })
.sort({ "value": -1 }).limit(30);
// do something with those results
);
And that is still 30 additional queries but at least you are not "aggregating" every time as that work is already done.
As a final note that is actually too much detail to go into in full, you could possibly approach this using a modified form of the initial aggregation query that was shown. This comes more as a footnote for if you have read this far without the other approaches seeming reasonable then this is likely the worst fit due to the memory constraints this is likely to hit.
The best way to introduce this is with the "ideal" case for a "top N results" aggregation, which of course does not actually exist but ideally the end of the pipeline would look something like this:
{ "$group": {
"_id": "$S_SERVER",
"results": {
"$push": {
"D_SERVER": "$_id.D_SERVER",
"MBYTES": "$value"
},
"$limit": 30
}
}}
So the "non-existing" factor here is the ability to "limit" the number of results that were "pushed" into the resulting array for each "source server" value. The world would certainly be a better place if this or similar functionality were implemented as it makes solving problems such as this quite easy.
Since it does not exist, you are left with resorting to other methods to get the same result and end up with listings like this example, except actually a lot more complex in this case.
Considering that code you would be doing something along the lines of:
Group all the results back into an array per server
Group all of that back into a single document
Unwind the first server results
Get the first entry and group back again.
Unwind the results again
Project and match the found entry
Discard the matching result
Rinse and repeat steps 4 - 7 30 times
Store back the first server document of 30 results
Rinse and repeat for 3 - 9 for each server, so 30 times
You would never actually code this directly, and would have to code generate the pipeline stages. It is very likely to blow up the 16MB limit, probably not on the pipeline document itself but very likely to do so on the actual result set as you are pushing everything into arrays.
You might also note how easy this would be to blow up completely if your actual results did not contain at least 30 top values for each server.
The whole scenario comes down to a trade-off on which method suits your data and performance considerations the best:
Live with 30 aggregation queries from your initial result.
Reduce to 2 queries and discard unwanted results in client code.
Output to a temporary collection and use a server cursor skip to discard results.
Issue the 30 queries from a pre-aggregated collection.
Actually go to the trouble of implementing a pipeline that produces the results.
In any case, due to the complexities of aggregating this in general, I would go for producing your result set periodically and storing it in it's own collection rather that trying to do this in real time.
The data would not be the "latest" result and only as fresh as how often you update it. But actually retrieving those results for display becomes a simple query, returning at maximum 900 fairly compact results without the overhead of aggregating for every request.
Is there any performance issues with php mongo query cursor handling?
My code:
$cursor = $collection->find($searchCriteria)->limit($limit_rows);
// Sort ascending based on S_DTTM
$cursor->sort(array('S_DTTM' => 1 , 'SYMBOL' => 1 ));
// How many results found?
$num_docs = $cursor->count();
if( $num_docs > 0 )
{
// loop over the results
foreach ($cursor as $ticks)
{
See codes like
// request data
$result = $cursor->getNext();
My issue is after the first query returns ( full with limit of 100 rows ) the next query just goes on looping. Have millions of rows returning, so I wanted to put the limits with "limit".
I did do re-index just in case, still no difference.
What am I doing wrong? Does the getNext works better?
Using mongod ver 2.5.4 and the latest php mongo driver downloaded a week ago.
Collection size is 100Gb including 2 additional indexes.
mongo log shows all the query executing in less than 200ms.
Turns out to be Query Issue and not php mongo driver issue ..
Use of count() and sort() may decrease performance.
I have a documents in a collection called Reports that are to be processed. I do a query like
$collectionReports->find(array('processed' => 0))
(anywhere between 50 and 2000 items). I process them how I need to and insert the results into another collection, but I need to update the original Report to set processed to the current system time. Right now it looks something like:
$reports = $collectionReports->find(array('processed' => 0));
$toUpdate = array();
foreach ($reports as $report) {
//Perform the operations on them now
$toUpdate = $report['_id'];
}
foreach ($toUpdate as $reportID) {
$criteria = array('_id' => new MongoId($reportID));
$data = array('$set' => array('processed' => round(microtime(true)*1000)));
$collectionReports->findAndModify($criteria, $data);
}
My problem with this is that it is horribly inefficient. Processing the reports and inserting them into the collection takes maybe 700ms for 2000 reports, but just updating the processed times takes at least 1500ms for those same 2000 reports. Any tips to speed this up? Thanks in advance.
EDIT: The processed time doesn't have to be exact, it can just be the time the script is ran (+/- 10 seconds or so), if it would be possible to take the object ($report) and update the time directly like that, it would be better than just searching after the first foreach.
Thanks Sammaye, changing from findAndModify() to update() seems to work much better and faster.
I'm having a strange time dealing with selecting from a table with about 30,000 rows.
It seems my script is using an outrageous amount of memory for what is a simple, forward only walk over a query result.
Please note that this example is a somewhat contrived, absolute bare minimum example which bears very little resemblance to the real code and it cannot be replaced with a simple database aggregation. It is intended to illustrate the point that each row does not need to be retained on each iteration.
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
PDO::ATTR_ERRMODE=>PDO::ERRMODE_EXCEPTION,
));
$stmt = $pdo->prepare('SELECT * FROM round');
$stmt->execute();
function do_stuff($row) {}
$c = 0;
while ($row = $stmt->fetch()) {
// do something with the object that doesn't involve keeping
// it around and can't be done in SQL
do_stuff($row);
$row = null;
++$c;
}
var_dump($c);
var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(43005064)
int(43018120)
I don't understand why 40 meg of memory is used when hardly any data needs to be held at any one time. I have already worked out I can reduce the memory by a factor of about 6 by replacing "SELECT *" with "SELECT home, away", however I consider even this usage to be insanely high and the table is only going to get bigger.
Is there a setting I'm missing, or is there some limitation in PDO that I should be aware of? I'm happy to get rid of PDO in favour of mysqli if it can not support this, so if that's my only option, how would I perform this using mysqli instead?
After creating the connection, you need to set PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false:
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
PDO::ATTR_ERRMODE=>PDO::ERRMODE_EXCEPTION,
));
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
// snip
var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(653920)
int(668136)
Regardless of the result size, the memory usage remains pretty much static.
Another option would be to do something like:
$i = $c = 0;
$query = 'SELECT home, away FROM round LIMIT 2048 OFFSET %u;';
while ($c += count($rows = codeThatFetches(sprintf($query, $i++ * 2048))) > 0)
{
foreach ($rows as $row)
{
do_stuff($row);
}
}
The whole result set (all 30,000 rows) is buffered into memory before you can start looking at it.
You should be letting the database do the aggregation and only asking it for the two numbers you need.
SELECT SUM(home) AS home, SUM(away) AS away, COUNT(*) AS c FROM round
The reality of the situation is that if you fetch all rows and expect to be able to iterate over all of them in PHP, at once, they will exist in memory.
If you really don't think using SQL powered expressions and aggregation is the solution you could consider limiting/chunking your data processing. Instead of fetching all rows at once do something like:
1) Fetch 5,000 rows
2) Aggregate/Calculate intermediary results
3) unset variables to free memory
4) Back to step 1 (fetch next set of rows)
Just an idea...
I haven't done this before in PHP, but you may consider fetching the rows using a scrollable cursor - see the fetch documentation for an example.
Instead of returning all the results of your query at once back to your PHP script, it holds the results on the server side and you use a cursor to iterate through them getting one at a time.
Whilst I have not tested this, it is bound to have other drawbacks such as utilising more server resources and most likely reduced performance due to additional communication with the server.
Altering the fetch style may also have an impact as by default the documentation indicates it will store both an associative array and well as a numerical indexed array which is bound to increase memory usage.
As others have suggested, reducing the number of results in the first place is most likely a better option if possible.