Migrating data with Doctrine becomes slow - php

I need to import data from one table in db A to another table in db B (same server) and I've chosen Doctrine to import it.
I'm using a Symfony Command and everything is fine for the first loop, which takes only about 0.04 seconds, but then it gets slower and slower and the whole import takes almost half an hour...
I'm considering building a shell script that calls this Symfony command with an offset (I tried that manually and the speed stays constant). This runs in a Docker setup where the PHP service sits at around 100% CPU while the MySQL service is at about 10%.
Here part of the script:
class UserCommand extends Command
{
    ...

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $container = $this->getApplication()->getKernel()->getContainer();
        $this->doctrine = $container->get('doctrine');
        $this->em = $this->doctrine->getManager();
        $this->source = $this->doctrine->getConnection('source');

        $limit = self::SQL_LIMIT;
        $numRecords = 22690; // Hardcoded for debugging
        $loops = intval($numRecords / $limit);
        $numAccount = 0;

        for ($i = 0; $i < $loops; $i++) {
            $offset = self::SQL_LIMIT * $i;
            $users = $this->fetchSourceUsers($offset);

            foreach ($users as $user) {
                try {
                    $numAccount++;
                    $this->persistSourceUser($user);
                    if (0 === ($numAccount % self::FLUSH_FREQUENCY)) {
                        $this->flushEntities($output);
                    }
                } catch (\Exception $e) {
                    //
                }
            }
        }

        $this->flushEntities($output);
    }

    private function fetchSourceUsers(int $offset = 0): array
    {
        $sql = <<<'SQL'
SELECT email, password, first_name
FROM source.users
ORDER BY id ASC LIMIT ? OFFSET ?
SQL;

        $stmt = $this->source->prepare($sql);
        $stmt->bindValue(1, self::SQL_LIMIT, ParameterType::INTEGER);
        $stmt->bindValue(2, $offset, ParameterType::INTEGER);
        $stmt->execute();

        return $stmt->fetchAll();
    }
}

If the time it takes to flush gets longer with every flush, then you forgot to clear the entity manager (which, for batch jobs, should happen after each flush). The reason is that you keep accumulating entities in the entity manager, and during every commit Doctrine checks each and every one of them for changes (I assume you're using the default change tracking policy).
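For illustration, a minimal sketch of that flush-and-clear pattern in the loop above (assuming $this->em is the entity manager from the question; the clear() call is the missing piece):

if (0 === ($numAccount % self::FLUSH_FREQUENCY)) {
    $this->em->flush();  // write the pending inserts to db B
    $this->em->clear();  // detach everything so change tracking does not rescan old entities
}

With clear() in place, each flush only has to inspect the entities persisted since the previous batch, so the time per flush stays roughly constant.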
I need to import data from one table in db A to another table in db B (same server) and I've chosen doctrine to import it.
Unless you have some complex logic tied to adding users (e.g. application events, something happening elsewhere in the app, basically a need for other PHP code to run for each record), then you've chosen poorly: Doctrine is not designed for batch processing (although it can do just fine if you really know what you're doing). For a "simple" migration the best choice would be to go with DBAL.
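As a rough, hedged sketch of what a DBAL-only version could look like (the target table and column names are assumptions based on the question, not confirmed details):

$target = $this->em->getConnection();
foreach ($this->fetchSourceUsers($offset) as $user) {
    // Plain INSERT through DBAL; no entities, no unit of work, no change tracking.
    $target->insert('users', [
        'email'      => $user['email'],
        'password'   => $user['password'],
        'first_name' => $user['first_name'],
    ]);
}

Since nothing accumulates in memory between rows, the cost per row stays flat no matter how many records have already been imported.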

Related

PHP Multi Threading - Synchronizing a cache file between threads

I created a script that, for a game situation, tries to find the best possible solution. It does this by simulating each and every possible move and quantifying them, thus deciding which is the best move to take (the one that will result in the fastest victory). To make it faster, I've implemented PHP's pthreads in the following way: each time the main thread needs to find a possible move (let's call this a JOB), it calculates all the possible moves at the current depth, then starts a Pool and adds each possible move to it (let's call these TASKS), so that the threads develop the game tree for each move separately, for all the additional depths.
This would look something like this:
(1) Got a new job with 10 possible moves
(1) Created a new pool
(1) Added all jobs as tasks to the pool
(1) The tasks work concurrently, and each returns an integer as a result, stored in a Volatile object
(1) The main thread selects a single move, and performs it
.... the same gets repeated at (1) until the fight is complete
Right now the TASKS use their own caches, meaning that while they work they save caches and reuse them, but they do not share caches between themselves, and they do not carry caches over from one JOB to another. I tried to resolve this, and in a way I managed, but I don't think this is the intended way, because it makes everything WAY slower.
What I tried to do is as follows: create a class that stores all the cache hashes in arrays, then, before creating the pool, add it to a Volatile object. Before a task is run, it retrieves this cache and uses it for read/write operations, and when the task finishes, it merges its own copy back into the instance stored in the Volatile object. This works, as in, the caches made in JOB 1 can be seen in JOB 2, but it makes the whole process much slower than it was when each thread only used its own cache, which was built while building the tree and then destroyed when the thread finished. Am I doing this wrong, or is the thing I want simply not achievable? Here's my code:
class BattlefieldWork extends Threaded {
    public $taskId;
    public $innerIterator;
    public $thinkAhead;
    public $originalBattlefield;
    public $iteratedBattlefield;
    public $hashes;

    public function __construct($taskId, $thinkAhead, $innerIterator, Battlefield $originalBattlefield, Battlefield $iteratedBattlefield) {
        $this->taskId = $taskId;
        $this->innerIterator = $innerIterator;
        $this->thinkAhead = $thinkAhead;
        $this->originalBattlefield = $originalBattlefield;
        $this->iteratedBattlefield = $iteratedBattlefield;
    }

    public function run() {
        $result = 0;
        $dataSet = $this->worker->getDataSet();
        $HashClassShared = null;
        $dataSet->synchronized(function ($dataSet) use (&$HashClassShared) {
            $HashClassShared = $dataSet['hashes'];
        }, $dataSet);
        $myHashClass = clone $HashClassShared;

        $thinkAhead = $this->thinkAhead;
        $innerIterator = $this->innerIterator;
        $originalBattlefield = $this->originalBattlefield;
        $iteratedBattlefield = $this->iteratedBattlefield;

        // the actual recursive function that builds the tree and calculates a quantify value for the move; it uses the hash cache I've created
        $result = $this->performThinkAheadMoves($thinkAhead, $innerIterator, $originalBattlefield, $iteratedBattlefield, $myHashClass);

        // retrieve the common cache here and upload the result of this thread
        $HashClassShared = null;
        $dataSet->synchronized(function ($dataSet) use ($result, $myHashClass, &$HashClassShared) {
            // store the result of this thread
            $dataSet['results'][$this->taskId] = $result;
            // merge the data collected in this thread with the data stored in the `Volatile` object
            $HashClassShared = $dataSet['hashes'];
            $HashClassShared = $HashClassShared->merge($myHashClass);
        }, $dataSet);
    }
}
This is how I create my tasks, my Volatile, and my Pool:
class Battlefield {
    /* ... */

    public function step() {
        /* ... */
        /* get the possible moves for the current depth (depth 0) and store them in an array named $moves */
        // $nextInnerIterator is an int which shows which hero must take an action after the current move
        // $StartingBattlefield is the zero-point Battlefield, which will be used in the quantification
        foreach ($moves as $moveid => $move) {
            $moves[$moveid]['quantify'] = new BattlefieldWork($moveid, self::$thinkAhead, $nextInnerIterator, $StartingBattlefield, $this);
        }

        $Volatile = new Volatile();
        $Volatile['results'] = array();
        $Volatile['hashes'] = $this->HashClass;

        $pool = new Pool(6, 'BattlefieldWorker', [$Volatile]);
        foreach ($moves as $moveid => $move) {
            if (is_a($moves[$moveid]['quantify'], 'BattlefieldWork')) {
                $pool->submit($moves[$moveid]['quantify']);
            }
        }
        while ($pool->collect());
        $pool->shutdown();

        $HashClass = $Volatile['hashes'];
        $this->HashClass = $Volatile['hashes'];

        foreach ($Volatile['results'] as $moveid => $partialResult) {
            $moves[$moveid]['quantify'] = $partialResult;
        }

        /* The moves are ordered based on quantify, one is selected, and then if the battle is not yet finished, step() is called again */
    }
}
And here is how I am merging two hash classes:
class HashClass {
    public $id = null;
    public $cacheDir;

    public $battlefieldHashes = array();
    public $battlefieldCleanupHashes = array();
    public $battlefieldMoveHashes = array();

    public function merge(HashClass $HashClass) {
        $this->battlefieldCleanupHashes = array_merge($this->battlefieldCleanupHashes, $HashClass->battlefieldCleanupHashes);
        $this->battlefieldMoveHashes = array_merge($this->battlefieldMoveHashes, $HashClass->battlefieldMoveHashes);

        return $this;
    }
}
I've benchmarked each part of the code to see where I am losing time, but everything seems fast enough not to warrant the time increase I am experiencing. What I am thinking is that the problem lies in the threads: sometimes it seems that no work is being done at all, as if they are waiting for some other thread. Any insights on what could be the problem would be greatly appreciated.

Delete expired tokens in oauth2-server-php

I'm using the bshaffer/oauth2-server-php module to authenticate my REST API. Everything works fine, but meanwhile I have over 20,000 access tokens in the database.
As I read, the framework will not delete expired tokens automatically or via a config parameter. So I'm trying to do the job on my own.
I know the tables which hold the tokens and I have already built the delete statements. But I can't find the right place (the right class/method) to hook in my cleanup routine.
I think a good option here is to create a command (shown below as a small Symfony snippet, but it could also be a plain PHP script since you know the table names) and just execute it every hour:
$doctrine = $this->getContainer()->get('doctrine');
$entityManager = $doctrine->getEntityManager();

$qb = $entityManager->createQueryBuilder();
$qb->select('t')
    ->from('OAuth2ServerBundle:AccessToken', 't')
    ->where('t.expires < :now')
    ->setParameter('now', new \DateTime(), Type::DATETIME);
$accessTokens = $qb->getQuery()->getResult();

$cleanedTokens = 0;
foreach ($accessTokens as $token) {
    $entityManager->remove($token);
    $cleanedTokens++;
}
$entityManager->flush(); // actually issue the DELETEs
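Since the tokens are only being thrown away, the same cleanup could also be done without hydrating entities at all, via a single DQL bulk DELETE. A hedged sketch, reusing the entity name from the snippet above:

$deleted = $entityManager->createQuery(
        'DELETE FROM OAuth2ServerBundle:AccessToken t WHERE t.expires < :now'
    )
    ->setParameter('now', new \DateTime())
    ->execute(); // returns the number of rows removed

This skips the object hydration and the remove()/flush() cycle, which matters once the table holds tens of thousands of rows.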
This just covers the access_token table as an example. By the way, I still haven't figured out how to change the token expiration with this library ;)
UPDATE: To change the lifetimes just edit parameters.yml and add
oauth2.server.config:
    auth_code_lifetime: 30
    access_lifetime: 120
    refresh_token_lifetime: 432000
I didn't read the complete source of the bshaffer oauth server.
But what you can try is to create your own class extending the Server class, and use the __destruct() method, which is executed when the customServer object is destroyed by PHP:
<?php
include 'src/OAuth2/Server.php'; // make sure the path is correct

class customServer extends Server
{
    public function __construct($storage = array(), array $config = array(), array $grantTypes = array(), array $responseTypes = array(), TokenTypeInterface $tokenType = null, ScopeInterface $scopeUtil = null, ClientAssertionTypeInterface $clientAssertionType = null)
    {
        parent::__construct($storage, $config, $grantTypes, $responseTypes, $tokenType, $scopeUtil, $clientAssertionType);
    }

    public function __destruct()
    {
        // run your cleanup SQL from here
    }
}

Symfony : Doctrine data fixture : how to handle large csv file?

I am trying to insert (into a MySQL database) data from a "large" CSV file (3 MB / 37,000 lines / 7 columns) using Doctrine data fixtures.
The process is very slow and so far I have not been able to make it work (maybe I just had to wait a bit longer).
I suppose Doctrine data fixtures are not intended to manage such an amount of data? Maybe the solution would be to import my CSV directly into the database?
Any idea of how to proceed ?
Here is the code :
<?php

namespace FBN\GuideBundle\DataFixtures\ORM;

use Doctrine\Common\DataFixtures\AbstractFixture;
use Doctrine\Common\DataFixtures\OrderedFixtureInterface;
use Doctrine\Common\Persistence\ObjectManager;
use FBN\GuideBundle\Entity\CoordinatesFRCity as CoordFRCity;

class CoordinatesFRCity extends AbstractFixture implements OrderedFixtureInterface
{
    public function load(ObjectManager $manager)
    {
        $csv = fopen(dirname(__FILE__).'/Resources/Coordinates/CoordinatesFRCity.csv', 'r');

        $i = 0;
        while (!feof($csv)) {
            $line = fgetcsv($csv);

            $coordinatesfrcity[$i] = new CoordFRCity();
            $coordinatesfrcity[$i]->setAreaPre2016($line[0]);
            $coordinatesfrcity[$i]->setAreaPost2016($line[1]);
            $coordinatesfrcity[$i]->setDeptNum($line[2]);
            $coordinatesfrcity[$i]->setDeptName($line[3]);
            $coordinatesfrcity[$i]->setdistrict($line[4]);
            $coordinatesfrcity[$i]->setpostCode($line[5]);
            $coordinatesfrcity[$i]->setCity($line[6]);

            $manager->persist($coordinatesfrcity[$i]);
            $this->addReference('coordinatesfrcity-'.$i, $coordinatesfrcity[$i]);

            $i = $i + 1;
        }

        fclose($csv);
        $manager->flush();
    }

    public function getOrder()
    {
        return 1;
    }
}
Two rules to follow when you create big batch imports like this:
Disable SQL Logging: ($manager->getConnection()->getConfiguration()->setSQLLogger(null);) to avoid huge memory loss.
Flush and clear frequently instead of only once at the end. I suggest you add if ($i % 25 == 0) { $manager->flush(); $manager->clear(); } inside your loop, to flush every 25 INSERTs (see the sketch below).
EDIT: One last thing I forgot: don't keep your entities inside variables when you don't need them anymore. Here, in your loop, you only need the current entity being processed, so don't store previous entities in a $coordinatesfrcity array. Keeping them around can lead to a memory overflow.
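Putting both rules together with the fixture from the question, the inner loop could look roughly like this (a sketch only; the batch size of 25 is just the suggestion above, not a tuned value):

$manager->getConnection()->getConfiguration()->setSQLLogger(null); // rule 1: disable SQL logging

$i = 0;
while (!feof($csv)) {
    $line = fgetcsv($csv);

    $city = new CoordFRCity();           // keep only the current entity in a local variable
    // ... call the setters as in the original fixture ...
    $manager->persist($city);

    $i = $i + 1;
    if ($i % 25 === 0) {                 // rule 2: flush and clear in small batches
        $manager->flush();
        $manager->clear();
    }
}

$manager->flush();                       // flush the final, partial batch
$manager->clear();

If the addReference() calls from the original fixture are still needed, keep in mind that clear() detaches the referenced entities, so that part may need extra care.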
There is a great example in the Docs: http://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
Use a modulo (x % y) expression to implement batch processing; this example will insert 20 rows at a time. You may be able to optimise this depending on your server.
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
        $em->clear(); // Detaches all objects from Doctrine!
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
$em->clear();
For fixtures which need lots of memory but don't depend on each other, I get around this problem by using the append flag to insert one entity (or smaller group of entities) at a time:
bin/console doctrine:fixtures:load --fixtures="memory_hungry_fixture.file" --append
Then I write a Bash script which runs that command as many times as I need.
In your case, you could extend the Fixtures command and add a flag which loads batches of entities - the first 1000 rows, then the second 1000, and so on - as sketched below.
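A rough illustration of that idea (the FIXTURE_OFFSET environment variable and the batch size are made up for this sketch, not part of the original answer): the fixture only processes one slice of the CSV per run, and the wrapper script bumps the offset between runs.

$offset    = (int) getenv('FIXTURE_OFFSET'); // hypothetical variable exported by the wrapper script
$batchSize = 1000;

$csv = fopen(__DIR__.'/Resources/Coordinates/CoordinatesFRCity.csv', 'r');
$row = 0;
while (($line = fgetcsv($csv)) !== false) {
    $row++;
    if ($row <= $offset) {
        continue;                            // already handled by a previous run
    }
    if ($row > $offset + $batchSize) {
        break;                               // leave the rest for the next run
    }
    // ... build and persist the entity for $line exactly as in the original fixture ...
}
fclose($csv);

$manager->flush();
$manager->clear();

Each invocation with --append then adds only its own slice on top of what the previous runs inserted.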

How to use DataMapper when data loading aspect needs to be optimized?

I have a DataMapper that creates an object and loads it with the same data from the DB quite often. The DataMapper is used inside a loop, so the object being created essentially keeps running the same SQL over and over again.
How can I cache or reuse the data to ease the load on the database?
Code
$initData = '...';
$result = [];

foreach ($models as $model) {
    $plot = (new PlotDataMapper())->loadData($model, $initData);
    $plot->compute();
    $result[] = $plot->result();
}

class PlotDataMapper
{
    function loadData($model, $initData)
    {
        $plot = new Plot($initData);

        // If the loop above executes 100 times, this SQL
        // executes 100 times as well, even if $model is the same every time
        $data = $db->query("SELECT * FROM .. WHERE .. $model");
        $plot->setData($data);

        return $plot;
    }
}
My Thoughts
My line of thought is that I can use the DataMapper itself as a caching object. If a particular $model number has already been used, I store its result in some table of the PlotDataMapper object and retrieve it when I need it. Does that sound good? Kind of like memoizing data from the DB.
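That is essentially memoization, and it can live inside the mapper. A minimal sketch of the idea, keeping the query style from the question (the $cache property name is made up for illustration):

class PlotDataMapper
{
    private $cache = []; // query results keyed by $model, illustrative only

    function loadData($model, $initData)
    {
        $plot = new Plot($initData);
        if (!isset($this->cache[$model])) {
            // only hit the database the first time this $model is seen
            $this->cache[$model] = $db->query("SELECT * FROM .. WHERE .. $model");
        }
        $plot->setData($this->cache[$model]);
        return $plot;
    }
}

For the cache to pay off, the same PlotDataMapper instance has to be reused across iterations, i.e. created once before the loop instead of with (new PlotDataMapper()) inside it.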

Determining which field causes Doctrine to re-query the database

I'm using Doctrine with Symfony in a couple of web app projects.
I've optimised many of the queries in these projects to select just the fields needed from the database. But over time new features have been added and - in a couple of cases - additional fields are used in the code, causing the Doctrine lazy loader to re-query the database and driving the number of queries on some pages from 3 to 100+.
So I need to update the original query to include all of the required fields. However, there doesn't seem to be an easy way to get Doctrine to log which field causes the additional query to be issued - so it becomes a painstaking job to sift through the code looking for usage of fields which aren't in the original query.
Is there a way to have Doctrine log when a getter accesses a field that hasn't been hydrated?
I have not had this issue, but I just looked at the Doctrine_Record class. Have you tried adding some debug output to the _get() method? I think this part is where you should look for a solution:
if (array_key_exists($fieldName, $this->_data)) {
    // check if the value is the Doctrine_Null object located in self::$_null
    if ($this->_data[$fieldName] === self::$_null && $load) {
        $this->load();
    }
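For example, a hedged sketch of the kind of debug output that suggestion could look like, patched into Doctrine_Record::_get() for debugging only (the log format is made up):

if ($this->_data[$fieldName] === self::$_null && $load) {
    // log which record class and field triggered the lazy load before it happens
    error_log(sprintf('Lazy load triggered by %s::%s', get_class($this), $fieldName));
    $this->load();
}

Running the page once with this in place should show exactly which fields are missing from the original query.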
Just turn on SQL logging and you can deduce the guilty one from alias names. For how to do it in Doctrine 1.2 see this post.
Basically: create a class which extends Doctrine_EventListener:
class QueryDebuggerListener extends Doctrine_EventListener
{
    protected $queries;

    public function preStmtExecute(Doctrine_Event $event)
    {
        $query = $event->getQuery();
        $params = $event->getParams();

        // the below makes some naive assumptions about the queries being logged
        while (sizeof($params) > 0) {
            $param = array_shift($params);
            if (!is_numeric($param)) {
                $param = sprintf("'%s'", $param);
            }
            $query = substr_replace($query, $param, strpos($query, '?'), 1);
        }

        $this->queries[] = $query;
    }

    public function getQueries()
    {
        return $this->queries;
    }
}
And add the event listener:
$c = Doctrine_Manager::connection($conn);
$queryDbg = new QueryDebuggerListener();
$c->addListener($queryDbg);
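After the page has run its queries, the collected SQL can then simply be dumped; a small usage sketch:

foreach ($queryDbg->getQueries() as $query) {
    echo $query . PHP_EOL; // alias names in the SQL point back to the lazily loaded fields
}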
