Hi guys, I have the method below for counting points within polygons in MongoDB:
public function countWithinPolygon($polygon, $tags = array())
{
    // var_dump($polygon);
    // var_dump($polygon->getPoints()); exit();
    $query = array(
        'point' => array(
            '$within' => array(
                '$polygon' => $polygon->getPoints()->first()->toArray(true)
            )
        )
    );

    if ($tags) {
        $query['tags'] = array(
            '$all' => $tags
        );
    }

    return parent::count($query);
}
For queries over small amounts of data it is fine, but on larger datasets involving 4000+ calls the execution time is truly dreadful and can take hours; on average it takes three hours to execute. Any ideas or hints on a better way to write this so I can save time and optimize the query?
The issue was fixed by ensuring an index like so : db.polygon.ensureIndex({'GeoJSON.geometry':'2dsphere'});
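For completeness, the same count can also be expressed with the newer $geoWithin/$geometry operators so a 2dsphere index is actually used. This is only a rough sketch against the mongodb/mongodb library; the collection name ('places') and the exact shape of $polygonPoints (a closed ring of [lng, lat] pairs) are assumptions, not taken from the original code:

use MongoDB\Client;

$collection = (new Client)->mydb->places;

// create the 2dsphere index once, the PHP-side equivalent of the ensureIndex call above
$collection->createIndex(['point' => '2dsphere']);

$filter = [
    'point' => [
        '$geoWithin' => [
            '$geometry' => [
                'type'        => 'Polygon',
                'coordinates' => [$polygonPoints], // outer ring, [lng, lat] pairs, first point repeated last
            ],
        ],
    ],
];
if ($tags) {
    $filter['tags'] = ['$all' => $tags];
}

$count = $collection->countDocuments($filter);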
I do not understand what is going on with my migration script. I have a collection with 40+ million records in it, and historically that collection did not have a strict model, so I'm working on adding default values for some optional fields; for example, if a document does not have deleted_at, I'll add it with a null value.
Basically, I'm taking documents in batches of 300, checking whether each document should be updated and, if so, updating it. All was fine: I was able to update 12M documents in 9 hours. But after that, something weird started to happen. First of all, it started to work much, much slower, around 100k documents per hour, which is ~10x slower than before. Also, from the logs I can see the script updating documents pretty fast (I get a bunch of log entries about updated documents every second), but when I run a count query for the number of modified documents, the number does not increase nearly as often. For example, according to the logs, 400 rows were updated within 10 seconds, but the count of modified documents did not change. The count only jumps once in a while: it can stay the same for 2-3 minutes and then suddenly increase by 4k rows.
So I do not understand why at some point Mongo starts applying the updates with a delay, as if it were scheduling or buffering them, and why it starts to work slower.
The script is pretty big, but I'll try to share the simplified version, so you can see how I'm looping through documents:
class Migration
{
    private Connection $connection;

    public function __construct(Connection $connection)
    {
        $this->connection = $connection;
    }

    public function migrate(): void
    {
        $totalAmount = $this->connection->collection('collection')->count();
        $chunkSize = 300;
        $lastIdInBatch = null;

        for ($i = 0; $i < $totalAmount; $i += $chunkSize) {
            $aggregation = [];
            $aggregation[] = [
                '$sort' => ['_id' => 1],
            ];
            if ($lastIdInBatch !== null) {
                $aggregation[] = [
                    '$match' => [
                        '_id' => [
                            '$gt' => new ObjectId($lastIdInBatch),
                        ],
                    ],
                ];
            }
            $aggregation[] = [
                '$limit' => $chunkSize,
            ];

            // aggregate() returns a cursor, so materialize the batch before indexing into it
            $documents = iterator_to_array(
                $this->connection->collection('collection')->raw()->aggregate($aggregation)
            );
            $lastIdInBatch = $documents[array_key_last($documents)]['_id'];

            foreach ($documents as $document) {
                // checks to see if we need to update the document
                // ....
                if (!empty($changes)) {
                    $updated = $this->connection
                        ->collection('collection')
                        ->where('_id', $document['_id'])
                        ->update($changes);
                    if ($updated) {
                        // I see many of these log entries every second, but no changes in the database
                        Log::info('row updated', ['product_id' => $document['_id']]);
                    }
                }
            }
        }
    }
}
The issue self-healed after a restart of the Kubernetes pod, so it seems it wasn't an issue with Mongo after all.
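Independent of the pod restart, one change that usually helps with this update pattern is batching the writes so there is one round trip per batch instead of one per document. A sketch only, assuming the wrapper's raw() returns a MongoDB\Collection and with buildChanges() as a hypothetical stand-in for the per-document checks:

$operations = [];
foreach ($documents as $document) {
    $changes = $this->buildChanges($document); // hypothetical helper for the checks elided above
    if (!empty($changes)) {
        $operations[] = [
            'updateOne' => [
                ['_id' => $document['_id']],
                ['$set' => $changes],
            ],
        ];
    }
}
if ($operations) {
    // unordered: independent updates don't have to wait on each other
    $this->connection->collection('collection')->raw()->bulkWrite(
        $operations,
        ['ordered' => false]
    );
}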
Good day. Basically I want to insert some related data all at once using Eloquent. My current code is:
$allStudies = Study::chunk(50, function ($studies) use ($request, $questionData, $answerData) {
    foreach ($studies as $study) {
        $evaluationInsert = Evaluation::create([
            'study_id' => $study->id,
            'questionnaire' => $request->questionnaire,
            'description' => $request->description
        ]);

        $evaluationQuestions = $evaluationInsert
            ->question()
            ->createMany($questionData);

        foreach ($evaluationQuestions as $question) {
            $question->answer()->createMany($answerData);
        }
    }
});
The result of $allStudies is a collection of Study models, currently around 150 records. $questionData is a static array of arrays with 38 elements, and $answerData is an array of arrays with 4 elements holding the answer options for each question. The code does work, but it takes far too long to execute because of the big loops, and simply increasing the PHP timeout does not seem like the right solution. What is an elegant way to solve this kind of case?
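One common way to shrink this from thousands of INSERTs to a handful per study is to drop to plain bulk inserts inside a transaction. A rough sketch, assuming conventional table and foreign-key names ('questions', 'answers', 'evaluation_id', 'question_id') that may not match your schema; note that insert() bypasses Eloquent, so timestamps and casts have to be added by hand:

use Illuminate\Support\Facades\DB;

DB::transaction(function () use ($request, $questionData, $answerData) {
    Study::chunk(50, function ($studies) use ($request, $questionData, $answerData) {
        foreach ($studies as $study) {
            $evaluation = Evaluation::create([
                'study_id'      => $study->id,
                'questionnaire' => $request->questionnaire,
                'description'   => $request->description,
            ]);

            // one INSERT for all 38 questions of this evaluation
            $questionRows = array_map(function ($q) use ($evaluation) {
                return $q + ['evaluation_id' => $evaluation->id];
            }, $questionData);
            DB::table('questions')->insert($questionRows);

            // fetch the generated ids once, then one INSERT for all answers
            $questionIds = DB::table('questions')
                ->where('evaluation_id', $evaluation->id)
                ->pluck('id');

            $answerRows = [];
            foreach ($questionIds as $questionId) {
                foreach ($answerData as $answer) {
                    $answerRows[] = $answer + ['question_id' => $questionId];
                }
            }
            DB::table('answers')->insert($answerRows);
        }
    });
});

This trades per-model events and mutators for roughly three queries per study instead of ~80, which is usually the right trade for seeding-style workloads.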
I'm working on a project that connects to an external API. I have already made the connection and I've implemented several functions to retrieve the data, that's all working fine.
The following function, however, works exactly like it should, only it slows down my website significantly (25+ seconds).
Is this because of the nested foreach loop? And what can i do to refactor the code?
/**
 * @param $acti
 */
function getBijeenkomstenFromAct($acti) {
    $acties = array();

    foreach ($acti as $act) {
        $bijeenkomsten = $this->getBijeenkomstenFromID($act['id']);

        if (in_array('Doorlopende activiteit', $act['type'])) {
            foreach ($bijeenkomsten as $bijeenkomst) {
                $acties[] = array(
                    'id' => $act['id'],
                    'Naam' => $act['titel'],
                    'interval' => $act['interval'],
                    'activiteit' => $bijeenkomst['activiteit'],
                    'datum' => $bijeenkomst['datum']
                );
            }
        } else {
            $acties[] = array(
                'id' => $act['id'],
                'type' => $act['type'],
                'activiteit' => $act['titel'],
                'interval' => $act['interval'],
                'dag' => $act['dag'],
                'starttijd' => $act['starttijd'],
                'eindtijd' => $act['eindtijd']
            );
        }
    }

    return $acties;
}
The function "getBijeenkomstenfromID" is working fine and on it's own not slow at all. Just to be sure, here is the function:
/**
 * @param $activitieitID
 *
 * @return mixed
 */
public function getBijeenkomstenFromID($activitieitID) {
    $options = array(
        //'datumVan' => date('Y-m-d'),
        'activiteit' => array(
            'activiteit' => $activitieitID
        ),
        'limit' => 5,
        'datumVan' => date('Y-m-d')
    );

    $bijeenkomsten = $this->webservice->activiteitBijeenkomstOverzicht($this->api_key, $options);

    return $bijeenkomsten;
}
It looks like you're calling the API from within the first foreach loop, which is not efficient.
Every time you do this:
$bijeenkomsten = $this->getBijeenkomstenFromID($act['id']);
you're adding a lot of "dead" time to your script, since you have to put up with network latency plus the time the API needs to actually do the work and transmit the result back to you. Even if each call is quick (say 100 ms total), if your first foreach loop iterates 100 times you have already accumulated 10 seconds of waiting, and that's before getBijeenkomstenFromAct($acti) has done any real processing.
The best practice here would be to split this if possible. My suggestion:
Make getBijeenkomstenFromID($activitieitID) run asynchronously on its own for all the IDs you need to look up in the API. The key is for it to run as a separate process and then pass the array it constructs to getBijeenkomstenFromAct, so that the latter can loop over it and process it happily.
So yes, basically I'm suggesting that you orchestrate your process backwards for efficiency's sake.
Look into curl_multi: http://php.net/manual/en/function.curl-multi-exec.php
It will let you call an external API asynchronously and process the returns all at once. Be aware that APIs often have their own limitations on asynchronous calls, and common sense dictates that you probably shouldn't be hammering a website with 200 separate calls. But if your number of calls is under a dozen or two (and the API allows it), curl_multi will do nicely.
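To make the suggestion concrete, here is a minimal curl_multi sketch. The URL and the JSON response handling are placeholders; the API in the question is reached through a webservice wrapper, so this only illustrates the concurrency pattern:

// fire all API requests concurrently and collect the responses once they are all done
function fetchAllBijeenkomsten(array $ids): array
{
    $multi   = curl_multi_init();
    $handles = [];

    foreach ($ids as $id) {
        $ch = curl_init('https://api.example.com/bijeenkomsten?activiteit=' . urlencode($id));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$id] = $ch;
    }

    // run all handles until every transfer has finished
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi);
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $id => $ch) {
        $results[$id] = json_decode(curl_multi_getcontent($ch), true);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}

With something like this, the total wait time is roughly that of the slowest single call rather than the sum of all of them.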
So I am developing a Laravel application and I am trying to get my seeds optimised so that they run faster.
http://bensmith.io/speeding-up-laravel-seeders
This guide helped a ton. According to it, I should minimise the number of SQL queries by doing mass insertions, and that cut the seeding time down to 10% of the original, which is awesome.
So now I am doing something like:
$comments = [];

for ($i = 0; $i < 50; $i++) {
    $bar->advance();
    $comments[] = factory(Comment::class)->make([
        'created_at' => Carbon\Carbon::now(),
        'updated_at' => Carbon\Carbon::now(),
        'comment_type_id' => $comment_types->shuffle()->first()->id,
        'user_id' => $users->shuffle()->first()->id,
        'commentable_id' => $documents->shuffle()->first()->id,
    ])->toArray();
}

Comment::insert($comments);
This works like a charm. It gets the queries down to a single one.
But then I have other seeders where I have to work with dumps, and they are more complex:
$dump = file_get_contents(database_path('seeds/dumps/serverdump.txt'));
DB::table('server')->truncate();
DB::statement($dump);

$taxonomies = DB::table('server')->get();

foreach ($taxonomies as $taxonomy) {
    $bar->advance();

    $group = PatentClassGroup::create(['name' => $taxonomy->name]);

    preg_match_all('/([a-zA-Z0-9]+)/', $taxonomy->classes, $classes);

    foreach (array_pop($classes) as $key => $class) {
        $type = strlen($class) == 4 ? 'GROUP' : 'MAIN';

        $inserted_taxonomies[] = PatentClassTaxonomy::where('name', $class)->get()->count()
            ? PatentClassTaxonomy::where('name', $class)->get()->first()
            : PatentClassTaxonomy::create(['name' => $class, 'type' => $type]);
    }

    foreach ($inserted_taxonomies as $inserted_taxonomy) {
        try {
            $group->taxonomies()->attach($inserted_taxonomy->id);
        } catch (\Exception $e) {
            //
        }
    }
}
So here, when attaching taxonomies to groups, I use the native Eloquent calls, which makes collecting the records and mass inserting them difficult.
Yes, I could fiddle around and figure out a way to mass insert that too, but my problem is that I would have to rewrite and optimise every seeder, and every part of those seeders, to use mass inserts.
Is there a way I can listen to the DB queries Laravel is trying to execute while seeding? I know I can do something like this:
DB::listen(function ($query) {
    //
});
But it would still be executed, right? What I would like to do is somehow capture each query in a variable, push it onto a stack, and then execute the whole stack when the seed is coming to an end (or in between, since I might need some IDs for other seeds). What is a good workaround for this? And how do you really optimise seeds in Laravel with a smart solution?
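Short of buffering raw SQL, one practical middle ground is to collect the pivot rows yourself and flush them with a single insert at the end of the seeder. This is only a sketch of that idea applied to the dump seeder above; the pivot table and column names ('patent_class_group_taxonomy', 'patent_class_group_id', 'patent_class_taxonomy_id') are guesses at the conventional schema, not taken from the question:

$pivotRows = [];

foreach ($taxonomies as $taxonomy) {
    $group = PatentClassGroup::create(['name' => $taxonomy->name]);
    preg_match_all('/([a-zA-Z0-9]+)/', $taxonomy->classes, $classes);

    foreach (array_pop($classes) as $class) {
        $type = strlen($class) == 4 ? 'GROUP' : 'MAIN';
        // firstOrCreate replaces the where()->get()->count() dance with one call
        $inserted = PatentClassTaxonomy::firstOrCreate(
            ['name' => $class],
            ['type' => $type]
        );
        $pivotRows[] = [
            'patent_class_group_id'    => $group->id,
            'patent_class_taxonomy_id' => $inserted->id,
        ];
    }
}

// one query for all pivot rows instead of one attach() per pair
DB::table('patent_class_group_taxonomy')->insert($pivotRows);

If duplicate pairs are possible, deduplicate $pivotRows before the insert instead of relying on the try/catch around attach().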
I want to import an Excel sheet into my DB using Symfony/Doctrine (I imported the ddeboer data-import bundle).
What is the best practice for importing the data while first checking whether it has already been imported?
I was thinking of two possibilities:
1)
$numarray = $repo->findAllAccounts();

foreach ($reader as $readerobjectkey => $readervalue) {
    $import = true;
    foreach ($numarray as $numkey) {
        if ($numkey->getNum() == $readervalue['number']) {
            $import = false;
        }
    }
    if ($import) {
        $doctrineWriter
            ->disableTruncate()
            ->prepare()
            ->writeItem(
                array(
                    'num' => $readervalue['number'],
                    'name' => $readervalue['name'],
                    'company' => $companyid
                )
            )
            ->finish();
    }
}
2)
foreach ($reader as $row => $value) {
    // check if already imported
    $check = $this->checkIfExists($repo, 'num', $value['number']);
    if ($check) {
        echo $value['number'] . " Exists <br>";
    } else {
        echo $value['number'] . " new Imported <br>";
        $doctrineWriter
            ->disableTruncate()
            ->prepare()
            ->writeItem(
                array(
                    'num' => $value['number'],
                    'name' => $value['name'],
                    'company' => $companyid
                )
            )
            ->finish();
    }
}

public function checkIfExists($repo, $field, $value)
{
    $check = $repo->findOneBy(array($field => $value));
    return $check;
}
The problem is that with big Excel sheets (3000+ rows) I get a timeout with both solutions:
Error: Maximum execution time of 30 seconds exceeded
In general, for performance: is it preferable to generate 1000 queries to check whether a value exists (findOneBy), or to use two nested foreach loops to compare the values?
Any help would be awesome!
Thx in advance...
You can try to check the filemtime of the file: http://php.net/manual/en/function.filemtime.php
I'm not sure if it would work properly, but it's worth a shot to see if the modified date works as expected.
Otherwise you should think of another way than checking the data like this; it takes a lot of resources. Maybe add some metadata to the Excel file:
http://docs.typo3.org/typo3cms/extensions/phpexcel_library/1.7.4/manual.html#_Toc237519906
Any approach other than looping over, or querying the database for, every single row of a large dataset is going to be better.
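On the "1000 queries vs. two nested loops" question specifically: both can be avoided. A sketch that keeps the same writer calls as in the question and only changes the lookup strategy, loading the existing numbers once into a keyed array so each row check is a constant-time isset():

$existing = [];
foreach ($repo->findAllAccounts() as $account) {
    $existing[$account->getNum()] = true; // getNum() as used in option 1
}

foreach ($reader as $row) {
    if (isset($existing[$row['number']])) {
        continue; // already imported
    }
    $doctrineWriter
        ->disableTruncate()
        ->prepare()
        ->writeItem(
            array(
                'num' => $row['number'],
                'name' => $row['name'],
                'company' => $companyid
            )
        )
        ->finish();
    $existing[$row['number']] = true; // guard against duplicates within the sheet itself
}

This keeps the query count at one read plus the writes, which should comfortably fit within the 30-second limit for a few thousand rows.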