Continuous MongoDB querying gets slower - PHP

I'm running a tool against a MongoDB collection of 7,766,558 documents that follow this template:
{
"_id" : ObjectId("53f602685a38d0bf5e8c6df1"),
"ids" : "5667",
"h" : [
{
"s" : "Briefly, 36 h following transfection, the cells were harvested and lysed in 500 µL HTE buffer containing calcium or calcium chelating reagents according to the experimental condition (150 mM NaCl, 20 mM HEPES (pH 7.4), 50 mM NaF, 1 mM Na3VO4, 1% Triton X-100 and 5 mM EDTA + 5 mM EGTA or 100 µM CaCl2 and complete protease inhibitor cocktail (Roche Applied Science)).",
"p" : "roche"
}
],
"ct" : 1445951888.0000000000000000
}
The query filters the documents on "ids" (supplier id) and "ct" (created date) > [given date], and then runs a MongoDB regex match on "h.s". Three instances of this job run on three threads at any given time, each processing a different supplier. When the threads start, all three run their queries quite fast. But the longer they run (approximately after 12 hours or so), the slower the querying gets.
The query that I'm running is:
db.collection.find({'ids':4296, 'ct':{"$gte":1455062400}, 'h.s':{ '$regex': "/[^a-zA-Z]" . $catalogId . "[^0-9a-zA-Z]/i" }})
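For reference, from PHP the same query is issued roughly like the sketch below; this assumes the legacy driver (MongoRegex), that $collection is the MongoCollection handle, and that $catalogId holds the catalog number:
<?php
// Sketch only: escape the catalog id so any regex metacharacters in it are literal
$regex = new MongoRegex('/[^a-zA-Z]' . preg_quote($catalogId) . '[^0-9a-zA-Z]/i');
$cursor = $collection->find(array(
    'ids' => 4296,
    'ct'  => array('$gte' => 1455062400),
    'h.s' => $regex,
));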
The tool doesn't throw any errors. The problem I'm facing is that throughput drops drastically over time: if the tool runs 10 queries per second at the start, after 10-12 hours it manages only around 5 per second.
Any ideas on performance improvements I could make to overcome this?
Thanks

Related

What is the correct format to use in PHP when displaying a "money" attribute fetched from a MySQL database?

I'm using the "number_format" function to format a "money" attribute in PHP/MySQL.
The attribute itself is stored in my database.
SELECT account_balance FROM my_table WHERE login = 'xxxxxx';
$correct_account_balance = number_format($account_balance['my_balance'], 2);
In other words, the second argument "2" keeps two digits after the decimal point, as follows: 10.00 (for example).
This code works fine, except for one small problem: if the amount after the decimal point ends in a zero, the zero does not display!
For example, if the amount is, say, 10 dollars and 35 cents, it displays correctly: 10.35.
However, if the amount is 10 dollars and 30 cents, it displays as 10.3 (instead of 10.30).
The reason is: my program also performs arithmetical operations on the account_balance AFTER I have converted it using the "number_format" function.
For example:
$correct_account_balance -= 0.25 (this will subtract 0.25 each time the program is executed)
This is why, any time there is a "zero" at the end of the actual amount (like 10.30), it displays as 10.3.
Is there any way to get around this? Google doesn't seem to know.
"The reason is: my program also performs arithmetical operations on the account_balance AFTER I have converted it using the "number_format" function."
You'll need to re-run number_format after doing the operations on it.
You really shouldn't run it at all until the value is ready for display, either: the commas it adds to larger numbers will hugely mess up your calculations. As an example, the following:
<?php
$number = 100000.30;
$number = number_format($number, 2);
$number -= 0.25;
echo number_format($number, 2);
results in the output:
99.75
Which means you've just stolen $99,900.55 from your customers with a type conversion error.
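A minimal sketch of the safer pattern, with an illustrative balance value: do all arithmetic on the raw number and only format at the moment of display:
<?php
$account_balance = 100000.30;   // raw value straight from the database
$account_balance -= 0.25;       // arithmetic on the unformatted number
echo number_format($account_balance, 2); // format only for display: 100,000.05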

Get and update column of all rows in a table

I'm developing a game and I want a script to run every minute (with the help of cron job, I guess).
The script I want to run takes the maximum HP of a character (from a column in a database table), calculates 10% of that value, and then adds that 10% to the character's current HP (another column in the table). I then want to iterate this over all rows in the table.
E.g. consider the following table:
charname current_hp max_hp
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
player1 20 30
player2 15 64
player3 38 38
After the script has been run, I want the table to look like this:
charname current_hp max_hp
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
player1 23 30
player2 21 64
player3 38 38
So, I know how I could technically implement this, e.g.:
$characters = $db->getCharacterList();
foreach ($characters as $character) {
    $maxHp = $character['max_hp'];
    $curHp = $character['current_hp'];
    $hpToAdd = $maxHp / 10;
    if (($curHp + $hpToAdd) >= $maxHp) {
        $db->update('characters', $character['id'], array(
            'current_hp' => $maxHp
        ));
    } else {
        $db->update('characters', $character['id'], array(
            'current_hp' => ($curHp + $hpToAdd)
        ));
    }
}
My only question is: Is the solution posted above an efficient way to implement this? Will this work on a table with, say, 10 000 rows, or will it take too long?
Thanks in advance.
Are you using a SQL database? If you have 10,000 rows, your solution will hit the database 10,000 times. A SQL statement like this will only make one database hit:
UPDATE characters
SET current_hp = LEAST(max_hp, current_hp + (max_hp / 10))
The database will still have to do the I/O necessary to update all 10,000 rows, but it will happen more efficiently than thousands of individual queries.
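For the cron side, a minimal sketch of running that one statement from PHP; the PDO DSN and credentials here are placeholders, not your real config:
<?php
// regen_hp.php - run from cron, e.g.: * * * * * php /path/to/regen_hp.php
$pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
$pdo->exec('UPDATE characters SET current_hp = LEAST(max_hp, current_hp + (max_hp / 10))');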
One SQL statement will do everything you require. For more performance, remove any indexes on this table to keep the update fast:
UPDATE test
SET current_hp = CASE
    WHEN ((max_hp / 10) + current_hp) >= max_hp THEN max_hp
    ELSE ((max_hp / 10) + current_hp)
END

MongoDB aggregation subquery: MongoDB PHP adapter

I have the collection below:
S_SERVER – S_PORT – D_PORT – D_SERVER – MBYTES
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 10
MRMCASTLE – 1904 – 445 – MRMCRUNCHAPP – 25
MRMXPSCRUNCH01 – 54769 – 445 – MRMCRUNCHAPP – 2
MRMCASTLE – 2254 – 139 – MRMCRUNCHAPP – 4
MRMCASTLE – 2253 – 445 – MRMCRUNCHAPP – 35
MRMCASTLE – 987 – 445 – MRMCRUNCHAPP – 100
MRMCASTLE – 2447 – 445 – MRMCRUNCHAPP – 12
L0T410R84LDYL – 2481 – 139 – MRMCRUNCHAPP – 90
MRMCRUNCHAPP – 61191 – 1640 – OEMGCPDB – 10
First, I need the top 30 S_SERVER values by total MBYTES transferred from each S_SERVER. This I am able to get with the following query:
$sourcePipeline = array(
    array(
        '$group' => array(
            '_id' => array('sourceServer' => '$S_SERVER'),
            'MBYTES' => array('$sum' => '$MBYTES')
        ),
    ),
    array(
        '$sort' => array("MBYTES" => -1),
    ),
    array(
        '$limit' => 30
    )
);
$sourceServers = $collection->aggregate($sourcePipeline);
I also need the top 30 D_PORT values by total MBYTES transferred through each D_PORT, for each individual S_SERVER. I am doing this by looping over the server results from above and running the query below once per S_SERVER.
$targetPortPipeline = array(
    array(
        '$project' => array('S_SERVER' => '$S_SERVER', 'D_PORT' => '$D_PORT', 'MBYTES' => '$MBYTES')
    ),
    array(
        // $sourceServer comes from the query above, one value per loop iteration
        '$match' => array('S_SERVER' => $sourceServer),
    ),
    array(
        '$group' => array(
            '_id' => array('D_PORT' => '$D_PORT'),
            'MBYTES' => array('$sum' => '$MBYTES')
        ),
    ),
    array(
        '$sort' => array("MBYTES" => -1),
    ),
    array(
        '$limit' => $limit
    )
);
$targetPorts = $collection->aggregate($targetPortPipeline);
But this process is taking too much time. I need an efficient way to get the required results in the same query. I know I am using the MongoDB PHP adapter to accomplish this; you can give me the aggregation in JavaScript format as well and I will convert it into PHP.
Your problem here essentially seems to be that you are issuing 30 more queries for your initial 30 results. There is no simple solution to this, and a single query seems right out at the moment, but there are a few things you can consider.
As an additional note, you are not alone in this; it is a question I have seen before, which we can refer to as a "top N results" problem. Essentially what you really want is some way to combine the two result sets so that each grouping boundary (source server) itself has at most N results, while at the top level you also restrict those results to the top N values.
Your first aggregation query gives you the results for the top 30 "source servers" you want, and that is just fine. But rather than looping additional queries from this, you could try creating an array with just the "source server" values from the result and passing that to your second query using the $in operator instead:
db.collection.aggregate([
    // Match should be first
    { "$match": { "S_SERVER": { "$in": sourceServers } } },
    // Group on both keys
    { "$group": {
        "_id": {
            "S_SERVER": "$S_SERVER",
            "D_SERVER": "$D_SERVER"
        },
        "value": { "$sum": "$MBYTES" }
    }},
    // Sort in order of key and largest "MBYTES" first
    { "$sort": { "_id.S_SERVER": 1, "value": -1 } }
])
Noting that you cannot "limit" here as this contains every "source server" from the initial match. You can also not "limit" on the grouping boundary, which is essentially what is missing from the aggregation framework to make this a two query result otherwise.
As this contains every "dest server" result and possibly beyond the "top 30" you would be processing the result in code and skipping returned results after the "top 30" are retrieved at each grouping (source server) level. Depending on how many results you have, this may or may not be the most practical solution.
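To make that concrete, here is a rough PHP sketch of the client-side skipping. It assumes the legacy driver (where aggregate() returns the documents under a 'result' key), and $sourceResult and $pipeline are stand-in names for the first query's result and the combined pipeline above:
<?php
// Build the $in list from the first query's result
$sourceServers = array();
foreach ($sourceResult['result'] as $doc) {
    $sourceServers[] = $doc['_id']['sourceServer'];
}

// Run the combined aggregation, then keep only the top 30 per server;
// the documents arrive sorted by server, then by value descending.
$combined = $collection->aggregate($pipeline);
$top = array();
foreach ($combined['result'] as $doc) {
    $server = $doc['_id']['S_SERVER'];
    if (!isset($top[$server])) {
        $top[$server] = array();
    }
    if (count($top[$server]) < 30) {
        $top[$server][] = $doc;
    }
}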
Moving on: where this is not so practical, you are then pretty much stuck with getting that output into another collection as an interim step. If you have MongoDB 2.6 or above, this can be as simple as adding an $out pipeline stage at the end of the statement. For earlier versions, you can do the equivalent with mapReduce:
db.collection.mapReduce(
    function() {
        emit(
            {
                "S_SERVER": this["S_SERVER"],
                "D_SERVER": this["D_SERVER"]
            },
            this.MBYTES
        );
    },
    function(key,values) {
        return Array.sum( values );
    },
    {
        "query": { "S_SERVER": { "$in": sourceServers } },
        "out": { "replace": "output" }
    }
)
That is essentially the same process as the previous aggregation statement, while also noting that mapReduce does not sort the output. That is what is covered by an additional mapReduce operation on the resulting collection:
db.output.mapReduce(
    function() {
        if ( cServer != this._id["S_SERVER"] ) {
            cServer = this._id["S_SERVER"];
            counter = 0;
        }
        if ( counter < 30 )
            emit( this._id, this.value );
        counter++;
    },
    function(){}, // reducer is not actually called
    {
        "sort": { "_id.S_SERVER": 1, "value": -1 },
        "scope": { "cServer": "", "counter": 0 }
    }
)
The implementation here is the "server side" version of the "cursor skipping" that was mentioned earlier. So you are still processing every result, but the returned results over the wire are limited to the top 30 results under each "source server".
As stated, this is still pretty horrible in that it must scan "programmatically" through the results to discard the ones you do not want. Depending again on your volume, you might be better off simply issuing a .find() for each "source server" value in these results, while sorting and limiting:
sourceServers.forEach(function(source) {
    var cur = db.output.find({ "_id.S_SERVER": source })
        .sort({ "value": -1 }).limit(30);
    // do something with those results
});
And that is still 30 additional queries but at least you are not "aggregating" every time as that work is already done.
As a final note, there is actually too much detail here to go into in full, but you could possibly approach this using a modified form of the initial aggregation query. This comes more as a footnote: if you have read this far without the other approaches seeming reasonable, then this is likely the worst fit, due to the memory constraints it is likely to hit.
The best way to introduce this is with the "ideal" case for a "top N results" aggregation, which of course does not actually exist but ideally the end of the pipeline would look something like this:
{ "$group": {
"_id": "$S_SERVER",
"results": {
"$push": {
"D_SERVER": "$_id.D_SERVER",
"MBYTES": "$value"
},
"$limit": 30
}
}}
So the "non-existing" factor here is the ability to "limit" the number of results that were "pushed" into the resulting array for each "source server" value. The world would certainly be a better place if this or similar functionality were implemented as it makes solving problems such as this quite easy.
Since it does not exist, you are left with resorting to other methods to get the same result and end up with listings like this example, except actually a lot more complex in this case.
Considering that code, you would be doing something along the lines of:
1. Group all the results back into an array per server
2. Group all of that back into a single document
3. Unwind the first server results
4. Get the first entry and group back again
5. Unwind the results again
6. Project and match the found entry
7. Discard the matching result
8. Rinse and repeat steps 4-7, 30 times
9. Store back the first server document of 30 results
10. Rinse and repeat steps 3-9 for each server, so 30 times
You would never actually code this directly; you would have to generate the pipeline stages in code. It is very likely to blow up the 16MB limit, probably not on the pipeline document itself, but very likely to do so on the actual result set, as you are pushing everything into arrays.
You might also note how easily this would blow up completely if your actual results did not contain at least 30 top values for each server.
The whole scenario comes down to a trade-off on which method suits your data and performance considerations the best:
Live with 30 aggregation queries from your initial result.
Reduce to 2 queries and discard unwanted results in client code.
Output to a temporary collection and use a server cursor skip to discard results.
Issue the 30 queries from a pre-aggregated collection.
Actually go to the trouble of implementing a pipeline that produces the results.
In any case, due to the complexities of aggregating this in general, I would go for producing your result set periodically and storing it in its own collection rather than trying to do this in real time.
The data would not be the "latest" result, only as fresh as how often you update it. But actually retrieving those results for display becomes a simple query, returning at most 900 fairly compact results without the overhead of aggregating for every request.

MongoDB count very slow

I have a collection with 1.5 million documents. I'm counting using PHP:
$db->some->ensureIndex(array("sometext" => 1));
$db->some->ensureIndex(array("datsbla" => 1));
$arr["sometext"] = $string;
$arr["datsbla"] = array('$gte' => $some, '$lte' => $thing);
$count = $db->some->count($arr);
I turned on the profiler, and every count like that takes around 4500 ms. I have 20 counters like that on my page, which makes the page VERY VERY SLOW.
What should I do to make it faster (< 100 ms) ? Is it even possible using MongoDB?
Thanks.
You have two separate, individual indexes, and a query can only use one index at a time, so you are not taking full advantage of indexing. Try a compound index on both fields and you should see a significant improvement.
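A sketch of that compound index, using the same legacy-driver calls as above; the equality field goes first, the range field second:
$db->some->ensureIndex(array("sometext" => 1, "datsbla" => 1));
$count = $db->some->count(array(
    "sometext" => $string,
    "datsbla"  => array('$gte' => $some, '$lte' => $thing),
));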

How do I find all peaks and troughs of tidal data?

I'm working with some ocean tide data that's structured like this:
$data = array('date' => array('time' => array('predicted','observed')));
Here's a sample of real data that I'm using: http://pastebin.com/raw.php?i=bRc2rmpG
And this is my attempt at finding the high/low values: http://pastebin.com/8PS1frc0
Current issues with my code:
When the readings fluctuate (as seen in the 11/14/2010=>11:30:00 to 11/14/2010=>11:54:00 span in the sample data), it creates a "wobble" in the direction logic, which produces an erroneous peak and trough. How can I avoid/correct this?
Note: My method is very "ad-hoc".. I assumed I wouldn't need any awesome math stuff since I'm not trying to find any averages, approximations, or future estimations. I'd really appreciate a code example of a better method, even if it means throwing away the code I've written so far.
I've had to perform similar tasks on noisy physiological data. In my opinion, you have a signal conditioning problem. Here is a process that worked for me.
Convert your time values to seconds, i.e. (HH*3600)+(MM*60)+(SS), to generate a numeric "X" value.
Smooth the resulting X and Y arrays with a sliding window, say 10 points in width. You might also consider filtering data with redundant and/or bogus timestamps in this step.
Perform an initial phase detection by comparing the smoothed Y[1] and Y[0]. Similar to the post above, if (Y[1] > Y[0]), you may assume the data are climbing to a peak. If (Y[1] < Y[0]), you may assume the data are descending to a trough.
Once you know the initial phase, peak and trough detection may be performed as described above: if Y[i] > Y[i+1] and Y[i] > Y[i-1], you have encountered a peak.
You can estimate the peak/trough time by projecting the smoothed X value back to the original X data, taking the sliding window size into account (to compensate for the "signal lag" induced by the sliding window). The resulting time value (in seconds) can then be converted back to an HH:MM:SS format for reporting.
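A minimal PHP sketch of the smoothing step; $values is an assumed flat array of readings, and the window width of 10 matches the suggestion above:
<?php
// Sliding-window (moving average) smoothing
function smooth(array $values, $width = 10)
{
    $smoothed = array();
    for ($i = 0; $i + $width <= count($values); $i++) {
        $window = array_slice($values, $i, $width);
        $smoothed[] = array_sum($window) / $width;
    }
    return $smoothed;
}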
You're looking for local minima and maxima, I presume? That's really easy to do:
<?php
$data = array(1, 9, 4, 5, 6, 9, 9, 1);

function minima($data, $radius = 2)
{
    $minima = array();
    for ($i = 0; $i < count($data); $i += $radius)
    {
        $minima[] = min(array_slice($data, $i, $radius));
    }
    return $minima;
}

function maxima($data, $radius = 2)
{
    $maxima = array();
    for ($i = 0; $i < count($data); $i += $radius)
    {
        $maxima[] = max(array_slice($data, $i, $radius));
    }
    return $maxima;
}

print_r(minima($data));
print_r(maxima($data));
?>
You just have to specify a radius of search, and it will give you back an array of local minima and maxima of the data. It works in a simple way: it cuts the array into segments of length $radius and finds the minimum of that segment. This process is repeated for the whole set of data.
Be careful with the radius: usually, you want to select the radius to be the average distance from peak to trough of the data, but you will have to find that manually. It defaults to 2, which will only search for minima/maxima within a radius of 2 and will probably give false positives with your data set. Select the radius wisely.
You'll have to hack it into your script, but that shouldn't be too hard at all.
I haven't read it in detail, but your approach seems very ad-hoc. A more correct way would probably be to fit it to a function
f(A,B,w,p; t) = A sin(wt + p) + B
using a method such as non-linear least squares (which unfortunately has to be solved using an iterative method). Looking at your sample data, it seems like it would be a good fit. When you have calculated w and p, it's easy to locate the peaks and valleys by just taking the time derivative of the function and solving for zero:
t = (pi(1 + 2n) - 2p) / (2w)
But I suppose, that if your code really does what you want, there's no use to complicate things. Stop second-guessing yourself. :)
A problem, I think, is that the observations are just that, observations, and can contain small errors that at least need to be accounted for. For example:
Only change direction if at least the next 2 entries are also in the same direction.
Don't let decisions be made on too small a difference; throw away insignificant digits. It will probably work a lot better when you say $error = 0.10; and change your conditions to if ($previous - $error > $current), etcetera.
How accurate does the peak/valley detection have to be? If you just need to find the exact record where a peak or valley occurs, isn't it enough to check for inflection points?
E.g., considering a record at position i: if record[i-1] and record[i+1] are both "higher" than record[i], you've got a valley; if record[i-1] and record[i+1] are both lower than record[i], you've got a peak. As long as your sampling rate is faster than the tide changes (look up the Nyquist frequency), that process should get you your data's peaks/troughs.
If you need to generate a graph from this and try to extrapolate more accurate time points for the peaks/troughs, then you're in for more work.
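A sketch of that neighbor test in PHP; $values is an assumed flat array of observed readings:
<?php
$peaks = $troughs = array();
for ($i = 1; $i < count($values) - 1; $i++) {
    if ($values[$i] > $values[$i - 1] && $values[$i] > $values[$i + 1]) {
        $peaks[] = $i;    // strictly higher than both neighbours
    } elseif ($values[$i] < $values[$i - 1] && $values[$i] < $values[$i + 1]) {
        $troughs[] = $i;  // strictly lower than both neighbours
    }
}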
One way may be to define an absolute or relative deviation past which you classify further peaks/troughs as new ones rather than fluctuations around an existing peak/trough.
Currently, $direction determines whether you are finding a peak or a trough, so instead of transiting to the other state (finding the trough or peak) as soon as the derivative changes sign, you could change state only when the deviation from the current peak/trough is "large" enough. A sketch of this follows.
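Here is that idea in PHP, with an assumed threshold $delta and an assumed input array $values: the state only flips once the series has moved at least $delta away from the current candidate extreme:
<?php
$delta = 0.10;         // assumed significance threshold
$direction = 1;        // 1 = climbing, -1 = descending
$extreme = $values[0]; // current candidate peak/trough value
$extremeIdx = 0;
$peaks = $troughs = array();
foreach ($values as $i => $v) {
    if ($direction === 1) {
        if ($v > $extreme) {
            $extreme = $v; $extremeIdx = $i;        // new candidate peak
        } elseif ($extreme - $v > $delta) {
            $peaks[] = $extremeIdx;                 // confirmed peak
            $direction = -1; $extreme = $v; $extremeIdx = $i;
        }
    } else {
        if ($v < $extreme) {
            $extreme = $v; $extremeIdx = $i;        // new candidate trough
        } elseif ($v - $extreme > $delta) {
            $troughs[] = $extremeIdx;               // confirmed trough
            $direction = 1; $extreme = $v; $extremeIdx = $i;
        }
    }
}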
Given that you should never see two maxima or two minima less than about 12 hours apart, a simple solution would be to use a sliding window of 3-5 hours or so and find the max and min. If the extreme falls in the first or last 30 minutes of the window, ignore it.
As an example, given the following data:
1 2 3 4 5 6 5 6 7 8 7 6 5 4 3 2 1 2
and a window of size 8, with the first and last 2 ignored, looking only at peaks, you would see:
1 2 | 3 4 5 6 | 5 6, max = 6, ignore = Y
2 3 | 4 5 6 5 | 6 7, max = 7, ignore = Y
3 4 | 5 6 5 6 | 7 8, max = 8, ignore = Y
4 5 | 6 5 6 7 | 8 7, max = 8, ignore = Y
5 6 | 5 6 7 8 | 7 6, max = 8, ignore = N
6 5 | 6 7 8 7 | 6 5, max = 8, ignore = N
5 6 | 7 8 7 6 | 5 4, max = 8, ignore = N
6 7 | 8 7 6 5 | 4 3, max = 8, ignore = N
7 8 | 7 6 5 4 | 3 2, max = 8, ignore = Y
8 7 | 6 5 4 3 | 2 1, max = 8, ignore = Y
7 6 | 5 4 3 2 | 1 2, max = 7, ignore = Y
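A rough PHP sketch of that windowing idea; $values, the window width, and the edge margin (both in samples) are illustrative stand-ins for real hour-based sizes:
<?php
function window_peaks(array $values, $width = 8, $margin = 2)
{
    $peaks = array();
    for ($i = 0; $i + $width <= count($values); $i++) {
        $window = array_slice($values, $i, $width);
        $max = max($window);
        $pos = $i + array_search($max, $window);
        // ignore maxima falling in the leading/trailing margin of the window
        if ($pos >= $i + $margin && $pos < $i + $width - $margin) {
            $peaks[$pos] = $max;
        }
    }
    return $peaks;
}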
