MongoDB deferred update while running the script - PHP

I do not understand what is going on with my migration script. I have a collection with 40M+ records in it, and historically that collection did not have a strict model, so I'm working on adding default values for some optional fields; for example, if a document does not have deleted_at, I'll add it with a null value.
Basically, I'm taking documents in batches of 300, checking whether each document should be updated, and updating it if so. All was fine at first: I was able to update 12M documents in 9 hours. But after that, something weird started to happen. First, it became much, much slower, around 100k documents per hour, which is ~10x slower than before. Second, the logs show the script updating documents quickly (I see a bunch of log entries about updated documents every second), but when I run a count query for the number of modified documents, the number does not increase at the same pace. For example, according to the logs, 400 rows were updated within 10 seconds, yet the count of modified documents did not change. The count only increases once in a while: it can stay the same for 2-3 minutes and then suddenly jump by 4k rows.
So I do not understand why at some point Mongo starts applying updates with a delay, as if it were scheduling or buffering them, and why it starts to work slower.
The script is pretty big, but I'll share a simplified version so you can see how I'm looping through the documents:
use Illuminate\Support\Facades\Log;
use MongoDB\BSON\ObjectId;

class Migration
{
    private Connection $connection;

    public function __construct(Connection $connection)
    {
        $this->connection = $connection;
    }

    public function migrate(): void
    {
        $totalAmount = $this->connection->collection('collection')->count();
        $chunkSize = 300;
        $lastIdInBatch = null;

        for ($i = 0; $i < $totalAmount; $i += $chunkSize) {
            // Page through the collection by _id so we never re-read documents.
            $aggregation = [];
            $aggregation[] = [
                '$sort' => ['_id' => 1],
            ];
            if ($lastIdInBatch !== null) {
                $aggregation[] = [
                    '$match' => [
                        '_id' => [
                            '$gt' => new ObjectId($lastIdInBatch),
                        ],
                    ],
                ];
            }
            $aggregation[] = [
                '$limit' => $chunkSize,
            ];

            $documents = $this->connection->collection('collection')->raw()->aggregate(
                $aggregation
            )->toArray();

            $lastIdInBatch = $documents[array_key_last($documents)]['_id'];

            foreach ($documents as $document) {
                // checks to see if we need to update the document
                // ....
                if (!empty($changes)) {
                    $updated = $this->connection
                        ->collection('collection')
                        ->where('_id', $document['_id'])
                        ->update($changes);
                    if ($updated) {
                        // I see several of these log entries every second, but no changes in the database
                        Log::info('row updated', ['product_id' => $document['_id']]);
                    }
                }
            }
        }
    }
}

The issue self-healed after a restart of the Kubernetes pod, so it seems it wasn't an issue with Mongo itself.
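As an aside (not from the original post): if the only changes are backfilling defaults for missing fields, the per-document loop can in principle be replaced by a single server-side updateMany, which avoids one round trip per document. A minimal sketch, assuming the mongodb/mongodb library and that deleted_at is the only field being defaulted (connection string and database/collection names are placeholders):

$client = new MongoDB\Client('mongodb://localhost:27017'); // placeholder connection string
$collection = $client->selectCollection('mydb', 'collection'); // placeholder names

// Only touch documents that are missing the field, and let the server do the work.
$result = $collection->updateMany(
    ['deleted_at' => ['$exists' => false]],
    ['$set' => ['deleted_at' => null]]
);

echo $result->getModifiedCount();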

Related

CakePHP: prevent retrieving the same model rows from the database with multiple cron jobs

I'm working inside a CakePHP 2 web application. I have a database table called jobs where data is stored, and a console command which runs on a cron every minute; when it runs, it grabs data from my jobs table in a function called getJobsByQueuePriority and then does something with it.
The issue I'm facing is that I have multiple cron jobs that need to run every minute, at the same time. When they run, they both grab the same sets of data from the database table. How can I prevent this and ensure that if a row was already retrieved by one cron, the other cron picks a different row?
I initially tried adding 'lock' => true to my queries as per the docs, but this isn't achieving the result I need: when logging data to a file, both running crons are pulling the same database entry IDs.
I then tried using transactions: I put a begin before the queries and a commit afterwards. Maybe this is what I need to use, but I am using it slightly wrong?
The function which performs the required query with my attempt of transactions is:
/**
 * Get queues in order of priority
 */
public function getJobsByQueuePriority($maxWorkers = 0)
{
    $jobs = [];
    $queues = explode(',', $this->param('queue'));
    // how many queues have been set for processing?
    $queueCount = count($queues);

    $this->QueueManagerJob = ClassRegistry::init('QueueManagerJob');
    $this->QueueManagerJob->begin();

    // let's first figure out how many jobs are in each of our queues,
    // so that if a queue has no jobs we can reassign how many jobs can
    // be allocated based on our maximum worker count.
    foreach ($queues as $queue) {
        // count jobs in this queue
        $jobCountInQueue = $this->QueueManagerJob->find('count', array(
            'conditions' => array(
                'QueueManagerJob.reserved_at' => null,
                'QueueManagerJob.queue' => $queue
            )
        ));
        // if there are no jobs in the queue, subtract a queue
        // from our queue count.
        if ($jobCountInQueue <= 0) {
            $queueCount = $queueCount - 1;
        }
    }

    // just in case we end up on zero.
    if ($queueCount <= 0) {
        $queueCount = 1;
    }

    // the amount of jobs we should grab
    $limit = round($maxWorkers / $queueCount);

    // now let's get all of the jobs in each queue with our
    // queue count limit.
    foreach ($queues as $queue) {
        $job = $this->QueueManagerJob->find('all', array(
            'conditions' => array(
                'QueueManagerJob.reserved_at' => null,
                'QueueManagerJob.queue' => $queue
            ),
            'order' => array(
                'QueueManagerJob.available_at' => 'desc'
            ),
            'limit' => $limit
        ));
        // if there's no job for this queue, skip to the next so
        // that we don't add an empty item to our jobs array.
        if (!$job) {
            continue;
        }
        // add the job to the list of jobs
        array_push($jobs, $job);
    }

    $this->QueueManagerJob->commit();

    // return the jobs
    return $jobs[0];
}
What am I missing, or is there a small change I need to make in my function to prevent multiple crons picking the same entries?
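One common pattern (a sketch, not part of the question) is claim-then-read: each cron atomically marks a batch of rows with its own token in a single UPDATE, then selects only the rows it claimed, so two workers can never pick the same entry. Assuming MySQL, a hypothetical reserved_by column, and a guessed table name, in CakePHP 2 it could look roughly like this:

// Generate a token unique to this cron run.
$token = uniqid('worker_', true);

// Atomically claim up to 10 unreserved rows; once marked, no other
// worker's UPDATE can match them. reserved_by is a hypothetical extra
// column, and the table name is assumed.
$this->QueueManagerJob->query(
    "UPDATE queue_manager_jobs
     SET reserved_at = NOW(), reserved_by = '" . $token . "'
     WHERE reserved_at IS NULL
     ORDER BY available_at DESC
     LIMIT 10"
);

// Read back only the rows this worker claimed.
$jobs = $this->QueueManagerJob->find('all', array(
    'conditions' => array('QueueManagerJob.reserved_by' => $token)
));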

How do I optimise Laravel seeds beyond mass insertion for faster seeding?

So I am developing a Laravel application and I am trying to optimise my seeds so that they run faster.
http://bensmith.io/speeding-up-laravel-seeders
This guide helped a ton. According to it, I should minimise the number of SQL queries by doing mass insertions; that cut the seeding time down to 10% of the original, which is awesome.
So now I am doing something like:
$comments = [];
for ($i = 0; $i < 50; $i++) {
    $bar->advance();
    $comments[] = factory(Comment::class)->make([
        'created_at' => Carbon\Carbon::now(),
        'updated_at' => Carbon\Carbon::now(),
        'comment_type_id' => $comment_types->shuffle()->first()->id,
        'user_id' => $users->shuffle()->first()->id,
        'commentable_id' => $documents->shuffle()->first()->id,
    ])->toArray();
}
Comment::insert($comments);
This works like a charm. It gets the queries down to a single one.
But then I have other seeders where I have to work with dumps, and they are more complex:
$dump = file_get_contents(database_path('seeds/dumps/serverdump.txt'));
DB::table('server')->truncate();
DB::statement($dump);
$taxonomies = DB::table('server')->get();

foreach ($taxonomies as $taxonomy) {
    $bar->advance();
    $group = PatentClassGroup::create(['name' => $taxonomy->name]);
    preg_match_all('/([a-zA-Z0-9]+)/', $taxonomy->classes, $classes);

    foreach (array_pop($classes) as $key => $class) {
        $type = strlen($class) == 4 ? 'GROUP' : 'MAIN';
        $inserted_taxonomies[] = PatentClassTaxonomy::where('name', $class)->get()->count()
            ? PatentClassTaxonomy::where('name', $class)->get()->first()
            : PatentClassTaxonomy::create(['name' => $class, 'type' => $type]);
    }

    foreach ($inserted_taxonomies as $inserted_taxonomy) {
        try {
            $group->taxonomies()->attach($inserted_taxonomy->id);
        } catch (\Exception $e) {
            //
        }
    }
}
So here, when I am attaching taxonomies to groups, I use native Eloquent code, so collecting the records and mass inserting them is difficult.
Yes, I can fiddle around and figure out a way to mass insert that too, but my problem is that I would have to rewrite and optimise every seed, and every part of those seeds, to use mass insertion.
Is there a way to listen to the DB queries Laravel is trying to execute while seeding? I know I can do something like this:
DB::listen(function ($query) {
    //
});
But the queries would still be executed, right? What I would like to do is somehow catch each query in a variable, add it to a stack, and then execute the whole stack when the seed is coming to an end (or in between, too, since I might need some IDs for some seeds). What is a good workaround for this? And how do I really optimise Laravel seeds with a smart solution?
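For the attach() loop specifically, one hedged workaround is to skip the relation method and batch the pivot rows into a single insert. The pivot table and column names below are guesses and would need to match the actual schema:

$pivotRows = [];
foreach ($inserted_taxonomies as $inserted_taxonomy) {
    $pivotRows[] = [
        'patent_class_group_id' => $group->id,                   // assumed pivot column name
        'patent_class_taxonomy_id' => $inserted_taxonomy->id,    // assumed pivot column name
    ];
}

// One query instead of one attach() per taxonomy; table name is assumed.
DB::table('patent_class_group_taxonomy')->insert($pivotRows);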

Laravel "Trying to get property of non-object" Setting Array with Database Values

I'm trying to solve this error I'm having with PHP. I'm not completely familiar with the language, so it would be nice if you could help me out; I can't figure out this error.
I have this code here:
public function index()
{
    $counterino = ClientsJobs::all()->count();
    $MasterArray = array();

    /* Go through all of the records in the Client-Jobs table and resolve their columns to desired names */
    for ($i = 1; $i <= $counterino; $i++) {
        // Temporary array for one name-resolved row of the table.
        $tempArray = array(
            'id' => ClientsJobs::find($i)->id, // id
            'client_name' => ClientsJobs::find($i)->clients->fname, // get the first name (based on fk)
            'job_name' => ClientsJobs::find($i)->jobs->name, // get the name of the job (based on fk)
            'wage' => ClientsJobs::find($i)->wage, // wage for the job
            'productivity' => ClientsJobs::find($i)->producivity // productivity level for the job
        );
        $MasterArray[] = $tempArray; // add the row
    }

    return $MasterArray;
}
This code changes the names of the columns in the ClientsJobs junction table.
public function up()
{
    Schema::create('clients-jobs', function (Blueprint $table) {
        $table->increments('id')->unsigned();
        $table->integer('client_id')->unsigned();
        $table->foreign('client_id')->references('id')->on('clients');
        $table->integer('job_id')->unsigned();
        $table->foreign('job_id')->references('id')->on('jobs');
        $table->decimal('wage', 4, 2);
        $table->decimal('productivity', 5, 2); // 0.00 - 100.00 (PERCENT)
        $table->timestamps();
    });
}
The Jobs and Clients Table are very simple.
I am having the error in the index() function I posted above. It says

'Trying to get property of non-object'

starting on the line

'client_name' => ClientsJobs::find( $i )->clients->fname,

It's also mad at me for the other parts of setting the array.
I have tested the individual functions I am using to set the array, and they all work; fname should also return a string (I used dd() to check the value).
I have tried:
- Using findOrFail
- Setting the array without the for loop and setting each element manually
- Dumping out multiple parts of the function to make sure it works (counterino, all of the functions for the array, ...)
My guess is that it has to do with PHP's type deduction. I actually only need a string array, but I would still like to use the name mappings, because I am going to pass this to a view I use for some of my other stuff. The code was actually working earlier, but I broke it somehow (adding a new record or running a composer update?); anyway, there's some serious voodoo going on.
Thanks in advance for the help; I am working on this project for a non-profit organization for free.
P.S. I am using Laravel 4.2 and Platform 2.0.
First off, this is a horrible practice:
$tempArray = array(
    'id' => ClientsJobs::find($i)->id, // id
    'client_name' => ClientsJobs::find($i)->clients->fname, // get the first name (based on fk)
    'job_name' => ClientsJobs::find($i)->jobs->name, // get the name of the job (based on fk)
    'wage' => ClientsJobs::find($i)->wage, // wage for the job
    'productivity' => ClientsJobs::find($i)->producivity // productivity level for the job
);
By calling ClientJobs::find($i) multiple times, you are doing the same lookup multiple times, either against your DB or against your cache layer if you have one configured.
Secondly, the answer to your question depends on your ClientJobs model. For your example to work, it needs a valid clients relation, defined as follows:
public function clients()
{
    return $this->hasOne(...);
}
clients also needs to be a valid 1:1, always-existing relation, i.e. there must always be one client. If there isn't, you are susceptible to the error you just got (as the clients magic would end up being null).
The same applies to jobs.
In every case, it is better to make sure everything is set first. Check using the following:
$clientJob = ClientJobs::find($i);
if (!$clientJob->clients || !$clientJob->jobs) {
    throw new \RangeException("No client or job defined for ClientJob $i");
}
And then catch the exception at whichever level you prefer.
Best approach
public function index()
{
    $masterArray = array();
    ClientsJobs::with('clients', 'jobs')->chunk(200, function ($records) use (&$masterArray) {
        foreach ($records as $record) {
            $masterArray[] = array(
                'id' => $record->id, // id
                'client_name' => !empty($record->clients) ? $record->clients->fname : null,
                'job_name' => !empty($record->jobs) ? $record->jobs->name : null,
                'wage' => $record->wage,
                'productivity' => $record->productivity,
            );
        }
    });
    return $masterArray;
}
Your approach is very wrong. If you want to return an array, you can do it like this:
$counterino = ClientsJobs::all()->toArray();
This will fetch all rows from the table, and toArray() will convert the object into an array.
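For completeness, a sketch of the same idea that also pulls in the relations, so the resulting array already contains the client and job fields (Laravel 4.2 style; relation names follow the question):

$rows = ClientsJobs::with('clients', 'jobs')->get()->toArray();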

MongoDB PHP count $within method extremely slow for large datasets

Hi guys, I have the method below for counting documents within polygons in MongoDB:
public function countWithinPolygon($polygon, $tags = array())
{
    // var_dump($polygon);
    // var_dump($polygon->getPoints());exit();
    $query = array(
        'point' => array(
            '$within' => array(
                '$polygon' => $polygon->getPoints()->first()->toArray(true)
            )
        )
    );
    if ($tags) {
        $query['tags'] = array(
            '$all' => $tags
        );
    }
    return parent::count($query);
}
For some queries with small amounts of data it is fine. On larger datasets involving 4000+ calls, the execution time is truly pathetic and can take hours; on average it takes three hours to execute. Any ideas or hints on a better way to write this to save time and optimise the query?
The issue was fixed by ensuring an index, like so: db.polygon.ensureIndex({'GeoJSON.geometry':'2dsphere'});
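For reference (not from the original post): on newer MongoDB versions $within has been replaced by $geoWithin, and with a 2dsphere index on a GeoJSON field the query takes a $geometry document instead of $polygon. A sketch with illustrative field names and coordinates:

$query = array(
    'GeoJSON.geometry' => array(
        '$geoWithin' => array(
            '$geometry' => array(
                'type' => 'Polygon',
                // Coordinates are illustrative; a polygon ring must close on its first point.
                'coordinates' => array(array(
                    array(0, 0), array(0, 10), array(10, 10), array(10, 0), array(0, 0)
                ))
            )
        )
    )
);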

Map Reduce To Get Most popular tags

I have a problem that I need some help on, but I feel I'm close. It involves Lithium and MongoDB. The code looks like this:
http://pastium.org/view/0403d3e4f560e3f790b32053c71d0f2b
$db = PopularTags::connection();

$map = new \MongoCode("function() {
    if (!this.saved_terms) {
        return;
    }
    for (index in this.saved_terms) {
        emit(this.saved_terms[index], 1);
    }
}");

$reduce = new \MongoCode("function(previous, current) {
    var count = 0;
    for (index in current) {
        count += current[index];
    }
    return count;
}");

$metrics = $db->connection->command(array(
    'mapreduce' => 'users',
    'map' => $map,
    'reduce' => $reduce,
    'out' => 'terms'
));

$cursor = $db->connection->selectCollection($metrics['result'])->find()->limit(1);
print_r($cursor);
/**
User data in Mongo:
{
    "_id" : ObjectId("4e789f954c734cc95b000012"),
    "email" : "example#bob.com",
    "saved_terms" : [
        null,
        [
            "technology",
            " apple",
            " iphone"
        ],
        [
            "apple",
            " water",
            " beryy"
        ]
    ]
}
**/
I have users saving the terms they search on, and I am trying to get the most popular terms, but I keep getting errors like: Uncaught exception 'Exception' with message 'MongoDB::__construct( invalid name '. Does anyone have any idea how to do this, or some direction?
First off, I would not store this in the user object. MongoDB objects have an upper size limit of 4/16MB (depending on version). This limit is normally not a problem, but when logging inline in one object you might actually reach it. A more real problem, however, is that every time you need to act on these objects you have to load them into RAM, which gets expensive. I don't think you want that on your user objects.
Secondly, arrays in objects are not sortable and have other limitations that might come back to bite you later.
But if you want to keep it like this (a low volume of searches should not really be a problem), you can solve it most easily by using a group query.
A group query is pretty much like a group query in SQL, so there is a slight trick: you need to group on something most objects share (an active field on users, maybe).
So, here's a working group example that will sum the words used, based on your structure.
Just put this method in your model and call MyModel::searchTermUsage() to get a Document object back.
public static function searchTermUsage() {
    $reduce = 'function(obj, prev) {
        obj.terms.forEach(function(terms) {
            terms.forEach(function(term) {
                if (!(term in prev)) prev[term] = 0;
                prev[term]++;
            });
        });
    }';
    return static::all(array(
        'initial' => new \stdclass,
        'reduce' => $reduce,
        'group' => 'common-value-key' // Change this
    ));
}
There is no protection against non-array types in the terms field (you had a null value in your example); I removed it for simplicity. It's probably better to strip that out before it ends up in the database.
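As a side note (not part of the original answer): on MongoDB 2.2+ the same count can be expressed as an aggregation pipeline, which avoids map-reduce and group entirely. Field names follow the user document shown above; the double $unwind is needed because saved_terms is an array of arrays:

$pipeline = array(
    array('$unwind' => '$saved_terms'),
    array('$unwind' => '$saved_terms'), // second level: the inner term arrays
    array('$group' => array('_id' => '$saved_terms', 'count' => array('$sum' => 1))),
    array('$sort' => array('count' => -1)),
    array('$limit' => 10)
);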
