Symfony: Doctrine data fixtures: how to handle a large CSV file?

I am trying to insert data from a "large" CSV file (3 MB / 37,000 lines / 7 columns) into a MySQL database using Doctrine data fixtures.
The process is very slow, and so far I have not managed to get it to finish (maybe I just had to wait a bit longer).
I suppose Doctrine data fixtures are not intended to handle this amount of data? Maybe the solution is to import the CSV directly into the database instead?
Any idea how to proceed?
Here is the code:
<?php

namespace FBN\GuideBundle\DataFixtures\ORM;

use Doctrine\Common\DataFixtures\AbstractFixture;
use Doctrine\Common\DataFixtures\OrderedFixtureInterface;
use Doctrine\Common\Persistence\ObjectManager;
use FBN\GuideBundle\Entity\CoordinatesFRCity as CoordFRCity;

class CoordinatesFRCity extends AbstractFixture implements OrderedFixtureInterface
{
    public function load(ObjectManager $manager)
    {
        $csv = fopen(dirname(__FILE__).'/Resources/Coordinates/CoordinatesFRCity.csv', 'r');

        $i = 0;
        // fgetcsv() returns false at end of file, so check it directly instead of feof().
        while (($line = fgetcsv($csv)) !== false) {
            $coordinatesfrcity[$i] = new CoordFRCity();
            $coordinatesfrcity[$i]->setAreaPre2016($line[0]);
            $coordinatesfrcity[$i]->setAreaPost2016($line[1]);
            $coordinatesfrcity[$i]->setDeptNum($line[2]);
            $coordinatesfrcity[$i]->setDeptName($line[3]);
            $coordinatesfrcity[$i]->setDistrict($line[4]);
            $coordinatesfrcity[$i]->setPostCode($line[5]);
            $coordinatesfrcity[$i]->setCity($line[6]);

            $manager->persist($coordinatesfrcity[$i]);
            $this->addReference('coordinatesfrcity-'.$i, $coordinatesfrcity[$i]);

            $i = $i + 1;
        }

        fclose($csv);

        $manager->flush();
    }

    public function getOrder()
    {
        return 1;
    }
}

Two rules to follow when you create big batch imports like this:
Disable the SQL logger ($manager->getConnection()->getConfiguration()->setSQLLogger(null);) to stop the logger from accumulating every query in memory.
Flush and clear frequently instead of only once at the end. I suggest you add if ($i % 25 === 0) { $manager->flush(); $manager->clear(); } inside your loop, to flush every 25 inserts.
EDIT: One last thing I forgot: don't keep your entities in variables once you no longer need them. Here, in your loop, you only need the entity currently being processed, so don't store the previous entities in the $coordinatesfrcity array. Keeping them around can lead to a memory overflow.
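Putting those points together, the load() method could look roughly like the sketch below. This is a minimal sketch assuming the same entity and CSV layout as in the question; the batch size of 25 is an assumption, and the addReference() calls are omitted because the reference repository would keep every entity in memory anyway (put them back only if later fixtures really need those references).

public function load(ObjectManager $manager)
{
    // Rule 1: disable the SQL logger so queries are not accumulated in memory.
    $manager->getConnection()->getConfiguration()->setSQLLogger(null);

    $csv = fopen(dirname(__FILE__).'/Resources/Coordinates/CoordinatesFRCity.csv', 'r');

    $batchSize = 25; // assumed batch size, tune for your server
    $i = 0;
    while (($line = fgetcsv($csv)) !== false) {
        // Rule 3: only keep the entity currently being processed, no growing array.
        $city = new CoordFRCity();
        $city->setAreaPre2016($line[0]);
        $city->setAreaPost2016($line[1]);
        $city->setDeptNum($line[2]);
        $city->setDeptName($line[3]);
        $city->setDistrict($line[4]);
        $city->setPostCode($line[5]);
        $city->setCity($line[6]);

        $manager->persist($city);

        // Rule 2: flush and clear in small batches instead of once at the end.
        if ((++$i % $batchSize) === 0) {
            $manager->flush();
            $manager->clear(); // detaches all entities, keeping memory usage flat
        }
    }

    fclose($csv);

    $manager->flush(); // flush the last, incomplete batch
    $manager->clear();
}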

There is a great example in the docs: http://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
Use a modulo expression (x % y) to implement batch processing; this example inserts 20 rows at a time. You may be able to optimise the batch size for your server.
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
        $em->clear(); // Detaches all objects from Doctrine!
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
$em->clear();

For fixtures which need lots of memory but don't depend on each other, I get around this problem by using the --append flag to insert one entity (or a smaller group of entities) at a time:
bin/console doctrine:fixtures:load --fixtures="memory_hungry_fixture.file" --append
Then I write a Bash script which runs that command as many times as I need.
In your case, you could extend the fixtures command and add a flag which processes batches of rows: the first 1,000 rows, then the second 1,000, and so on.

Related

Doctrine bulk insert - How to fix "Out of memory" with bulk insert using Doctrine / Symfony 4

When I try to get metadata from a supplier, I convert the data to our own metadata format. But because of the sheer size of the imported data, the application gets an OutOfMemoryException.
I have tried several things, like raising the memory limit, and I also tried Doctrine batch processing, but there is a small problem with that approach: the batch processing example from the docs is built around a 'for' loop with an index.
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
        $em->clear(); // Detaches all objects from Doctrine!
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
$em->clear();
But the data I import is a multi-layered array which I walk through with three nested 'foreach' loops:
$this->index = 0;
$batchSize = 100;
foreach ($response as $itemData) {
    $item = new Item;
    $item->setName($itemData->name);
    $item->setStatus($itemData->status);
    $em->persist($item);
    $this->index++;
    if (($this->index % $batchSize) === 0) {
        $em->flush();
        $em->clear();
    }
    foreach ($itemData->category as $categoryData) {
        $category = new Category;
        $category->setName($categoryData->name);
        $category->setStatus($categoryData->status);
        $em->persist($category);
        $this->index++;
        if (($this->index % $batchSize) === 0) {
            $em->flush();
            $em->clear();
        }
        foreach ($categoryData->suppliers as $supplierData) {
            $supplier = new Supplier;
            $supplier->setName($supplierData->name);
            $supplier->setStatus($supplierData->status);
            $em->persist($supplier);
            $this->index++;
            if (($this->index % $batchSize) === 0) {
                $em->flush();
                $em->clear();
            }
        }
    }
}
$em->flush();
This is fictional code to illustrate my problem. Even with the batching in place, the application still gets an OutOfMemoryException, and I have the feeling that the batching method isn't working properly.
I would like to get the memory usage down so the application works properly, or I would like advice on another approach, such as a background process that takes care of the import.
With your nested foreach loops written this way you will obviously consume resources very quickly. I also suspect it's not going to achieve what you really want, since you will end up with a LOT of duplicate Supplier and Category entities.
Working with full entities in Doctrine also carries a tremendous amount of overhead, but it does have some advantages, so I'll assume that is what you want to do.
My approach to bulk imports like this has been to work from the bottom up. In your case it might be a variant of what I describe below. The assumption is that you have data in an existing database, and each existing "entity" in the old database has its own unique id.
1- Import all suppliers from the old db to the new db; in the new db, add a column named oldId that references the unique id from the old db. Stop to clear cache/memory.
2- Pull all suppliers from the new database into an array indexed by their oldId. I use code like so:
$suppliers = [];
$_suppliers = $this->em->getRepository(Supplier::class)->findAll();
foreach ($_suppliers as $supplier) {
    $suppliers[$supplier->getOldId()] = $supplier;
}
3- Repeat step 1 for categories. During the import, your old db will have a reference to the oldId of the linked suppliers. Although your code does not do this, I assume you want to maintain the link between supplier and category, so you can now reference each supplier by its oldId within a loop over the linked "old" suppliers (see the sketch after this list):
$category->addSupplier($suppliers[ <<oldSupplier Id>> ]);
4- Repeat above for individual items, only this time saving the linked categories.
Obviously there are a lot of tweaks that can improve on this. The main point is that touching each supplier once, then each category once, and then each item once, sequentially, will be orders of magnitude faster and less resource-intensive than trying to do it all in one deeply nested loop.
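Here is a rough sketch of steps 2 and 3 under the assumptions of this answer: an oldId column with a getOldId() getter, an addSupplier() method on Category, and hypothetical fetchOldCategories() / fetchOldSupplierIdsFor() helpers that read from the old database.

// Step 2: index the already-imported suppliers by their old id.
$suppliers = [];
foreach ($this->em->getRepository(Supplier::class)->findAll() as $supplier) {
    $suppliers[$supplier->getOldId()] = $supplier;
}

// Step 3: import categories and re-link them to suppliers via oldId.
$batchSize = 100; // assumed batch size
$i = 0;
foreach ($this->fetchOldCategories() as $oldCategory) { // hypothetical helper
    $category = new Category();
    $category->setName($oldCategory['name']);
    $category->setStatus($oldCategory['status']);
    $category->setOldId($oldCategory['id']); // assumes an oldId column on Category too

    foreach ($this->fetchOldSupplierIdsFor($oldCategory['id']) as $oldSupplierId) { // hypothetical helper
        if (isset($suppliers[$oldSupplierId])) {
            $category->addSupplier($suppliers[$oldSupplierId]);
        }
    }

    $this->em->persist($category);
    if ((++$i % $batchSize) === 0) {
        $this->em->flush();
        // Note: calling clear() here would also detach the suppliers in $suppliers,
        // so either skip clear() or rebuild the supplier map after clearing.
    }
}
$this->em->flush();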

Migrating data with Doctrine becomes slow

I need to import data from one table in db A to another table in db B (same server), and I've chosen Doctrine to do the import.
I'm using a Symfony command, and everything is fine for the first loops, each taking just 0.04 seconds, but then the loops get slower and slower and the whole run takes almost half an hour.
I'm considering building a shell script that calls this Symfony command with an offset (I tried that manually and the speed stays the same). This runs inside Docker; the PHP service sits at around 100% CPU while the MySQL service is at about 10%.
Here is part of the script:
class UserCommand extends Command
{
    ...

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $container = $this->getApplication()->getKernel()->getContainer();
        $this->doctrine = $container->get('doctrine');
        $this->em = $this->doctrine->getManager();
        $this->source = $this->doctrine->getConnection('source');

        $limit = self::SQL_LIMIT;
        $numRecords = 22690; // Hardcoded for debugging
        $loops = intval($numRecords / $limit);
        $numAccount = 0;

        for ($i = 0; $i < $loops; $i++) {
            $offset = self::SQL_LIMIT * $i;
            $users = $this->fetchSourceUsers($offset);

            foreach ($users as $user) {
                try {
                    $numAccount++;
                    $this->persistSourceUser($user);
                    if (0 === ($numAccount % self::FLUSH_FREQUENCY)) {
                        $this->flushEntities($output);
                    }
                } catch (\Exception $e) {
                    //
                }
            }
        }

        $this->flushEntities($output);
    }

    private function fetchSourceUsers(int $offset = 0): array
    {
        $sql = <<<'SQL'
SELECT email, password, first_name
FROM source.users
ORDER BY id ASC LIMIT ? OFFSET ?
SQL;

        $stmt = $this->source->prepare($sql);
        $stmt->bindValue(1, self::SQL_LIMIT, ParameterType::INTEGER);
        $stmt->bindValue(2, $offset, ParameterType::INTEGER);
        $stmt->execute();
        $users = $stmt->fetchAll();

        return $users;
    }
}
If the time it takes to flush gets longer with every flush, then you forgot to clear the entity manager (which, for batch jobs, should happen right after each flush). The reason is that you keep accumulating entities in the entity manager, and during every commit Doctrine checks each and every one of them for changes (I assume you're using the default change tracking).
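For example, the flushEntities() helper from the command above (its body is not shown in the question, so this is an assumed sketch) should clear as well as flush:

private function flushEntities(OutputInterface $output): void
{
    $this->em->flush();
    $this->em->clear(); // detach everything so change tracking does not grow with each batch
}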
I need to import data from one table in db A to another table in db B (same server) and I've chosen doctrine to import it.
Unless you have some complex logic attached to adding users (i.e. application events, something happening elsewhere in the app, basically other PHP code that needs to run) then you've chosen poorly: Doctrine is not designed for batch processing (although it can do just fine if you really know what you're doing). For a "simple" migration the best choice would be to go with DBAL.
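A minimal sketch of the DBAL-only variant, assuming a plain users target table and no need for ORM events (the column names come from the SELECT in the question; fetching the target connection from the registry without a name is an assumption):

$target = $this->doctrine->getConnection(); // assumed default connection pointing at db B

foreach ($this->fetchSourceUsers($offset) as $user) {
    // DBAL insert() issues a plain INSERT: no entity hydration, no unit of work to track.
    $target->insert('users', [
        'email'      => $user['email'],
        'password'   => $user['password'],
        'first_name' => $user['first_name'],
    ]);
}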

How to use DataMapper when data loading aspect needs to be optimized?

I have a DataMapper that creates an object and loads it with the same data from the DB quite often. The DataMapper is used inside a loop, so the object being created essentially keeps triggering the same SQL over and over again.
How can I cache or reuse the data to ease the load on the database?
Code
$initData = '...';
$result = [];
foreach ($models as $model) {
    $plot = (new PlotDataMapper())->loadData($model, $initData);
    $plot->compute();
    $result[] = $plot->result();
}

class PlotDataMapper
{
    function loadData($model, $initData)
    {
        $plot = new Plot($initData);

        // If the loop above executes 100 times, this SQL
        // executes 100 times as well, even if $model is the same every time.
        $data = $db->query("SELECT * FROM .. WHERE .. $model");
        $plot->setData($data);

        return $plot;
    }
}
My Thoughts
My line of thought is that I can use the DataMapper itself as the caching object. If a particular $model number has already been used, I store its result in some table of the PlotDataMapper object and retrieve it from there when I need it again. Does that sound reasonable? It is essentially memoizing data from the DB.
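A minimal sketch of that idea, assuming $model can be used as an array key and that the query depends only on $model (the $cache property and the injected $db connection are illustrative assumptions):

class PlotDataMapper
{
    /** In-memory cache of query results, keyed by model. */
    private $cache = [];

    private $db;

    public function __construct($db)
    {
        $this->db = $db;
    }

    public function loadData($model, $initData)
    {
        $plot = new Plot($initData);

        // Memoize: hit the database only the first time a given $model is seen.
        if (!isset($this->cache[$model])) {
            $this->cache[$model] = $this->db->query("SELECT * FROM .. WHERE .. $model");
        }

        $plot->setData($this->cache[$model]);

        return $plot;
    }
}

Note that this only pays off if the same PlotDataMapper instance is reused across the loop, so create the mapper once outside the foreach instead of calling (new PlotDataMapper()) on every iteration.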

Determining which field causes Doctrine to re-query the database

I'm using Doctrine with Symfony in a couple of web app projects.
I've optimised many of the queries in these projects to select just the fields needed from the database. But over time new features have been added and, in a couple of cases, additional fields are now used in the code, causing the Doctrine lazy loader to re-query the database and driving the number of queries on some pages from 3 to 100+.
So I need to update the original queries to include all of the required fields. However, there doesn't seem to be an easy way to get Doctrine to log which field caused the additional query to be issued, so it becomes a painstaking job to sift through the code looking for usage of fields which aren't in the original query.
Is there a way to have Doctrine log when a getter accesses a field that hasn't been hydrated?
I have not had this issue myself, but I just looked at the Doctrine_Record class. Have you tried adding some debug output to the _get() method? I think this part is where you should look for a solution:

if (array_key_exists($fieldName, $this->_data)) {
    // check if the value is the Doctrine_Null object located in self::$_null
    if ($this->_data[$fieldName] === self::$_null && $load) {
        $this->load();
    }
Just turn on SQL logging and you can deduce the guilty field from the alias names. For how to do it in Doctrine 1.2, see this post.
Basically: create a class which extends Doctrine_EventListener:
class QueryDebuggerListener extends Doctrine_EventListener
{
    protected $queries;

    public function preStmtExecute(Doctrine_Event $event)
    {
        $query  = $event->getQuery();
        $params = $event->getParams();

        // the below makes some naive assumptions about the queries being logged
        while (sizeof($params) > 0) {
            $param = array_shift($params);
            if (!is_numeric($param)) {
                $param = sprintf("'%s'", $param);
            }
            $query = substr_replace($query, $param, strpos($query, '?'), 1);
        }

        $this->queries[] = $query;
    }

    public function getQueries()
    {
        return $this->queries;
    }
}
And add the event listener:
$c = Doctrine_Manager::connection($conn);
$queryDbg = new QueryDebuggerListener();
$c->addListener($queryDbg);
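Once the listener is registered, the collected queries can be dumped wherever is convenient; error_log() below is just for illustration:

// After the page or task has run, inspect what was actually executed:
foreach ($queryDbg->getQueries() as $query) {
    error_log($query);
}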

Doctrine 2: weird behavior while batch processing inserts of entities that reference other entities

I am trying out the batch processing method described here:
http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
My code looks like this:
$limit = 10000;
$batchSize = 20;
$role = $this->em->getRepository('userRole')->find(1);

for ($i = 0; $i <= $limit; $i++) {
    $user = new \Entity\User;
    $user->setName('name'.$i);
    $user->setEmail('email'.$i.'#email.blah');
    $user->setPassword('pwd'.$i);
    $user->setRole($role);
    $this->em->persist($user);

    if (($i % $batchSize) == 0) {
        $this->em->flush();
        $this->em->clear();
    }
}
The problem is that after the first flush/clear cycle, $role also gets detached, and for every 20 users a new role with a new id is created, which is not what I want.
Is there any workaround for this situation? The only one I could get to work is fetching the user role entity every time inside the loop.
Thanks.
clear() detaches all entities managed by the entity manager, so $role is detached too, and trying to persist a detached entity creates a new entity.
You should fetch the role again after clear:
$this->em->clear();
$role = $this->em->getRepository('userRole')->find(1);
Or just create a reference instead:
$this->em->clear();
$role = $this->em->getReference('userRole', 1);
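Putting that together with the loop from the question, the batch block would look roughly like this (a sketch; the 'userRole' repository name and the id 1 are taken from the question):

if (($i % $batchSize) == 0) {
    $this->em->flush();
    $this->em->clear();
    // Re-acquire the role after clear(); getReference() avoids an extra SELECT.
    $role = $this->em->getReference('userRole', 1);
}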
As an alternative to arnaud576875's answer, you could detach the $user from the entity manager so that it can be garbage-collected immediately, like so:
$this->em->flush();
$this->em->detach($user);
Edit:
As pointed out by Geoff, this will only detach the most recently created user object, so this method is not recommended.
