I have a MongoDB collection with ~4M elements.
I want to grab X number of those elements, evenly spaced through the entire collection.
E.g., Get 1000 elements from the collection - one every 4000 rows.
Right now, I am getting the whole collection in a cursor and then only writing every Nth element. This gives me what I need but the original load of the huge collection takes a long time.
Is there an easy way to do this? Right now my best guess is to query on an incremented index property with a modulus. In the mongo shell that would look something like:
db.collection.find({i:{$mod:[10000,0]}})
But this seems like it will probably take just as much time for the query to run.
Jer
Use $sample.
This returns a random sample that is roughly "every Nth document".
To receive exactly every Nth document in a result set, you would have to provide a sort order and iterate the entire result set discarding all unwanted documents in your application.
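$sample needs MongoDB 3.2 or newer. A minimal PHP sketch using the mongodb/mongodb library (the connection string, database and collection names are assumptions):

$client = new MongoDB\Client('mongodb://localhost:27017');
$collection = $client->mydb->example;

// ask the server for ~1000 randomly chosen documents instead of
// streaming all ~4M of them to the client
$cursor = $collection->aggregate([
    ['$sample' => ['size' => 1000]],
]);

foreach ($cursor as $document) {
    // process each sampled document
}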
I think the main problem is that the collection can be distributed over servers, and thus you have to iterate over the entire collection.
Do not put the whole dataset in a cursor. Since row order is not important, just collect X random rows out of your total, return that as a result, and then modify those records.
Personally, I would design in a "modulus" value and populate it with something that is a function representative of the data - so if your data was inserted at regular intervals throughout the day, you could take a modulus of the time; if there's nothing predictable, then you could use a random value. With a collection of that size it would tend toward an even distribution pretty quickly.
An example using a random value...
// add the index
db.example.ensureIndex({modulus: 1});
// insert a load of data
db.example.insert({ your: 'data', modulus: Math.floor(Math.random() * 1000) });
// Get a 1/1000 of the set
db.example.find({modulus: 1});
// Get 1/3 of the set
db.example.find({modulus: { $gt: 0, $lt: 333 }});
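A rough PHP translation of the same idea, again with the mongodb/mongodb library ($collection and the 1000-bucket split are assumptions):

// insert a load of data, each document carrying a random modulus bucket
$collection->insertOne([
    'your' => 'data',
    'modulus' => mt_rand(0, 999),
]);

// get ~1/1000 of the set
$cursor = $collection->find(['modulus' => 1]);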
A simple (inefficient) way to do this is with a stream.
var stream = collection.find({}).stream();
var counter = 0;
stream.on("data", function (document) {
    counter++;
    if (counter % 10000 == 0) {
        console.log(JSON.stringify(document, null, 2));
        // do something with every 10,000th document
    }
});
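If you are doing this from PHP rather than Node, the same lazy iteration over the driver's cursor might look roughly like this ($collection is assumed to be a MongoDB\Collection):

$counter = 0;
foreach ($collection->find() as $document) {
    $counter++;
    if ($counter % 10000 === 0) {
        // do something with every 10,000th document
    }
}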
If only your data were in a SQL database, as it should be ... this question wouldn't be in PHP and the answer would be so easy and quick ...
Loading everything into a cursor instead of computing the result directly in the database is definitely a bad idea. Is it not possible to do this directly in the MongoDB thingy?
Related
So we have an existing system which we are trying to scale up, and we are running out of memory retrieving close to 3M records.
I was trying to determine how viable increasing server memory is as a stop-gap solution, by ascertaining the data size returned by the query, doing something like:
select sum(row_size)
from (
SELECT
ifnull(LENGTH(qr.id), 0)+
ifnull(LENGTH(qr.question_id), 0)+
ifnull(LENGTH(qr.form_response_id), 0)+
ifnull(LENGTH(qr.`value`), 0)+
ifnull(LENGTH(qr.deleted_at), 0)+
ifnull(LENGTH(qr.created_at), 0)+
ifnull(LENGTH(qr.updated_at), 0)
as row_size
FROM
....
LIMIT 500000
) as tbl1;
This returns 30512865, which is roughly 30 MB of data.
However, when I cross-check what PHP actually uses to store the results, using:
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter);
I am getting 315377552 and 22403248, which means 292974304 bytes, or roughly 300 MB of memory usage, to store a simple array!
I would like to understand why the memory footprint is 10 times the data retrieved, and whether there is anything I could do to reduce that footprint, short of modifying the back-end API response and the front end to not need the entire result set, which will take time.
For context, the current implementation takes the above results (returned by getQuestionResponses) and transforms them into an associative array grouped by question_id using Laravel Collections:
collect($this->questionResponseRepo->getQuestionResponses($questions))->groupBy('question_id')->toArray();
I am thinking of replacing collect() with my own, more memory-efficient implementation that works directly on the array returned from the query, to avoid the memory inflation from converting that array into a Laravel Collection, but that still doesn't help with the array itself taking 300 MB for 500k response records instead of 30 MB.
One of the solutions online is to use SplFixedArray, but I am not sure how to force DB::select to use that instead of a plain array.
Another possible solution involves ensuring it returns a simple associative array instead of an array of stdClass objects: https://stackoverflow.com/a/37532052/373091
But when I try that, as in:
// get the original fetch mode
$fetchMode = DB::getFetchMode();
// set mode to custom
DB::setFetchMode(\PDO::FETCH_ASSOC);
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
DB::setFetchMode($fetchMode);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter, $formResponses);
I get the error Call to undefined method Illuminate\Database\MySqlConnection::getFetchMode(), which apparently means this can no longer be done in Laravel > 5.4 :(
Any suggestions?
I think the real problem is that you're loading all 3 million records into memory at once. You should instead either process them in chunks or use a cursor.
Chunking
To chunk records into batches, you can use Laravel's chunk method. This method accepts two parameters: the chunk size and a callback that gets passed the subset of models or objects for processing. This will execute one query per chunk.
Here's the example taken from the documentation:
Flight::chunk(200, function ($flights) {
    foreach ($flights as $flight) {
        //
    }
});
Cursor
Alternatively, you can also use the cursor method if you only want to execute a single query. In this case, Laravel hydrates just one model at a time, so you never have more than one model (or object, if you're not using Eloquent) in memory.
foreach (Flight::where('destination', 'Zurich')->cursor() as $flight) {
    //
}
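Since the question uses DB::select rather than Eloquent, a rough query-builder adaptation of the chunking idea might look like this (the table name, column names, variable names and chunk size are assumptions based on the question; the orderBy keeps the chunks stable):

$grouped = [];

DB::table('question_responses')
    ->whereIn('question_id', $questionIDs)
    ->orderBy('id')
    ->chunk(5000, function ($responses) use (&$grouped) {
        foreach ($responses as $response) {
            // group incrementally so the full 500k-row array never exists at once
            $grouped[$response->question_id][] = (array) $response;
        }
    });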
I need to display all the elements/objects of a JSON file. I can currently only call an endpoint that takes an offset (starting index) and limit. The maximum number of elements you could get is 100 (the limit) at one call. I was wondering how could I get all of the elements of a JSON file and store them in an array without knowing how many elements there are in the JSON file.
Initially I tried to save the first 500 elements in an array. The problem with that was that the output size of that array was 5 and not 500, because the getElements endpoint returns a list of 100 elements, so what the array actually stored was 100 elements at each index. So, for example, json_array[0] contains the first 100 elements, json_array[1] contains the next 100 elements, etc.
$offset = 0;
$limit = 100;
$json_array = array();
while ($offset < 500) {
    array_push($json_array, getElements($token, "api/Elements?offset=" . $offset . "&limit=" . $limit));
    $offset += 100;
}
echo count($json_array);
I am expecting to find a way to loop through the entire json file without knowing the number of elements that the file has. My final expectation is to find a way to display the number of all of these elements. Thank you!
I work with a similar API - there is a per_page option and a page option.
Fortunately, the API I'm hitting is set up to return everything with no error if there are fewer results than the per_page value, so what I do is simply loop for as long as the last fetch returned 100 records.
Something like:
$fetched = 100;
$page = 0;
$per_page = 100;
$total_result = array();
while ($fetched == 100) {
    // request one page; the API is assumed to take offset/limit parameters as in the question
    $res = json_decode(file_get_contents($API_URL . "?offset=" . ($page * $per_page) . "&limit=" . $per_page));
    $fetched = count($res);
    // add res to the big result set
    for ($i = 0; $i < $fetched; $i++) {
        $total_result[] = $res[$i];
    }
    $page++;
}
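And since the final goal in the question is the total count, something like this afterwards should do (variable name as above):

echo count($total_result);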
I'm struggling with my lack of operational knowledge of handling arrays and variables within a private function that is within a class, in an application I've "inherited" control of.
The function loops over X days; within that loop, a certain number of MySQL rows will be returned from a range of different queries. Each row from each query adds +1 to a counter.
Some maths is then performed on the value of the counter (not really relevant).
Once X days have been iterated through, I need to discard every day's calculated counter value EXCEPT the value that was the highest. I had planned on using max($array); for this.
I will greatly strip down the code for the purpose of this example:
class Calendar {
    public $heights;

    private function dayLoop($cellNumber) {
        $heights = []; // array
        $block_count = 0; // counter
        while (mysqlrowdata) {
            [code for mysql operations]
            $block_count++; // increment the count
        }
        $day_height = ($block_count * 16) + 18; // do some math specific to my application
        $this->heights[] = $day_height; // commit calc'd value to array
        // array_push($heights, $day_height); // this was a previous attempt, I don't think I should use array_push here..??
    }
}
This function is called on other "front end" pages, and if I perform a var_dump($heights); at the bottom of one of those pages, the array returns empty with
Array ( )
Essentially, at that front-end page, I need to be able to A) inspect the array with values from each looped day of the X days, and B) pull the largest calculated counter value from any of the X days that were iterated through in the loop.
I have tried a myriad of things, like changing the function to be public, or defining the start of the array on the front-end pages instead of anywhere in the class or function. I declared the array name as a variable in the class because I read that I needed to do that.
Overall, I just don't really understand how I'm MEANT to be handling this, or if I'm going about it in completely the wrong way? Solutions very welcome, but advice and words of wisdom also appreciated greatly. Thanks.
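Roughly, what I'm hoping to end up with on a front-end page is something like this (the call that actually triggers dayLoop() for each day is simplified/assumed):

$calendar = new Calendar();
// ... something on this page runs dayLoop() once per day for the X days ...

var_dump($calendar->heights); // should show one calculated height per day
echo max($calendar->heights); // the single highest value I want to keep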
I've made a script that pretty much loads a huge array of objects from a MySQL database, and then loads a huge (but smaller) list of objects from the same MySQL database.
I want to iterate over each list to check for irregular behaviour, using PHP. BUT every time I run the script it takes forever to execute (so far I haven't seen it complete). Are there any optimizations I can make so it doesn't take this long to execute? There are roughly 64150 entries in the first list, and about 1748 entries in the second list.
This is what the code generally looks like in pseudo code.
// an array of size 64000 containing objects in the form of {"id": 1, "unique_id": "kqiweyu21a)_"}
$items_list = [];
// an array of size 5000 containing objects in the form of {"inventory": "a long string that might have the unique_id", "name": "SomeName", "id": 1}
$user_list = [];
Up until this point the results are instant... But when I do this it takes forever to execute, seems like it never ends...
foreach ($items_list as $item)
{
    foreach ($user_list as $user)
    {
        if (strpos($user["inventory"], $item["unique_id"]) !== false)
        {
            echo("Found a version of the item");
        }
    }
}
Note that the echo should rarely happen... The issue isn't with MySQL, as the $items_list and $user_list arrays populate almost instantly. It only starts to take forever when I try to iterate over the lists...
With roughly 112M iterations (64150 x 1748), adding a break will help somewhat, even though the match rarely happens...
foreach ($items_list as $item)
{
    foreach ($user_list as $user)
    {
        if (strpos($user["inventory"], $item["unique_id"]) !== false) {
            echo("Found a version of the item");
            break;
        }
    }
}
Alternate solution 1, with PHP 5.6: you could also use pthreads and split your big array into chunks to pool them into threads... combined with the break, this will certainly improve things.
Alternate solution 2: use PHP 7; the performance improvements for array manipulation and loops are big.
Also try to sort your arrays before the loop. It depends on what you are looking for, but very often sorting the arrays beforehand will keep the loop time as short as possible once the condition is found.
Your example is almost impossible to reproduce. You need to provide an example that can be replicated: the two loops as given, if they only access arrays, will complete extremely quickly, i.e. in 1-2 seconds. This means that either the string you're searching is kilobytes or larger (not provided in the question), or something else is happening while the loops are running, e.g. a database access.
You can let SQL do the searching for you. Since you don't share the columns you need I'll only pull the ones I see.
SELECT i.unique_id, u.inventory
FROM items i, users u
WHERE LOCATE(i.unique_id, u.inventory)
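For completeness, a minimal sketch of consuming that query from PHP with PDO (the connection details, table and column names are assumptions, reusing the guesses above):

$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');

$stmt = $pdo->query(
    'SELECT i.unique_id, u.inventory
     FROM items i, users u
     WHERE LOCATE(i.unique_id, u.inventory)'
);

foreach ($stmt as $row) {
    // each row is one (item, user) pair whose inventory contains the unique_id
    echo "Found a version of item " . $row['unique_id'] . "\n";
}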
I need help finding a workaround to get around the memory_limit. My limit is 128 MB; from the database I'm getting about 80k rows, and the script stops at 66k. Thanks for the help.
Code:
$possibilities = [];
foreach ($result as $item) {
    $domainWord = str_replace("." . $item->tld, "", $item->address);
    for ($i = 0; $i + 2 < strlen($domainWord); $i++) {
        $tri = $domainWord[$i] . $domainWord[$i + 1] . $domainWord[$i + 2];
        if (array_key_exists($tri, $possibilities)) {
            $possibilities[$tri] += 1;
        } else {
            $possibilities[$tri] = 1;
        }
    }
}
Your bottleneck, given your algorithm, is most probably not the database query, but the $possibilities array you're building.
If I read your code correctly, you get a list of domain names from the database. From each of the domain names you strip off the top-level-domain at the end first.
Then you walk character-by-character from left to right of the resulting string and collect triplets of the characters from that string, like this:
example.com => ['exa', 'xam', 'amp', 'mpl', 'ple']
You store those triplets in the keys of the array, which is a nice idea, and you also count them, which doesn't have much effect on the memory consumption. However, my guess is that the sheer number of possible triplets - which for 26 letters and 10 digits is 36^3 = 46656 possibilities, each taking at least 3 bytes just for the key inside the array, plus whatever per-element overhead PHP adds - takes quite a lot out of your memory limit.
Probably someone will tell you how PHP uses memory with its database cursors - I don't know that - but there is one trick you can use to profile your memory consumption.
Put calls to memory_get_usage():
- before and after each iteration, so you'll know how much memory was used on each cursor advancement,
- before and after each addition to $possibilities.
And just print them right away, so you can run your code and see in real time what is using your memory, and how seriously.
Also, try to unset the $item after each iteration. It may actually help.
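A rough sketch of that instrumentation, dropped into the loop from the question (the echo format is just an example):

$possibilities = [];
foreach ($result as $item) {
    $beforeRow = memory_get_usage();

    $domainWord = str_replace("." . $item->tld, "", $item->address);
    for ($i = 0; $i + 2 < strlen($domainWord); $i++) {
        $tri = $domainWord[$i] . $domainWord[$i + 1] . $domainWord[$i + 2];

        $beforeAdd = memory_get_usage();
        if (!isset($possibilities[$tri])) {
            $possibilities[$tri] = 0;
        }
        $possibilities[$tri] += 1;
        echo "add: " . (memory_get_usage() - $beforeAdd) . " bytes\n";
    }

    unset($item);
    echo "row: " . (memory_get_usage() - $beforeRow) . " bytes\n";
}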
Knowing which specific database access library you are using to obtain the $result iterator would help immensely.
Given the tiny (pretty useless) code snippet you've provided, I want to give you a MySQL answer, but I'm not certain you're using MySQL?
But
- Optimise your table.
- Use EXPLAIN to optimise your query.
- Rewrite your query to put as much of the logic as possible in the query rather than in the PHP code.
Edit: if you're using MySQL, then prepend EXPLAIN before your SELECT keyword, and the result will show you an explanation of how the query you give MySQL actually turns into results.
Do not call PHP's strlen on every pass through the loop, as this is inefficient - instead, you can detect the end of the string by treating it as an array of characters, thus:
for ($i = 0; isset($domainWord[$i + 2]); $i++) {
(isset is used rather than empty so that a literal '0' character doesn't end the loop early.)
In your MySQL (if that's what you're using), add a LIMIT clause that will break the query into 3 or 4 chunks, say of 25k rows per chunk, which will fit comfortably within your maximum operating capacity of 66k rows. Burki had this good idea.
At the end of each chunk, clean up all the strings and restart, wrapped in a loop:
$z = 0;
while ($z < 4) {
    // grab one chunk of data from the database, e.g. "... LIMIT 25000 OFFSET " . ($z * 25000)
    // process it and preserve only your output (the $possibilities counts), freeing the raw rows
    $z++;
}
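A rough PHP sketch of that chunked approach with PDO (the connection details, table and column names are assumptions; only the aggregated counts are kept between chunks):

$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');

$possibilities = [];
$chunkSize = 25000;

for ($z = 0; $z < 4; $z++) {
    // fetch one chunk of rows
    $stmt = $pdo->query(
        "SELECT address, tld FROM domains LIMIT $chunkSize OFFSET " . ($z * $chunkSize)
    );

    while ($item = $stmt->fetch(PDO::FETCH_OBJ)) {
        // ... the trigram counting from the question goes here,
        //     adding into $possibilities ...
    }

    // only the counts survive to the next pass; the statement and its rows are released here
    unset($stmt);
}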
But probably more important than any of these: provide enough detail in your question!!
- What is the data you want to get?
- What are you storing your data in?
- What are the criteria for finding the data?
These answers will help people far more knowledgeable than me show you how to properly optimise your database.