Let's say I have a script which inserts rows into the database and it looks like this:
/* $start is some GET parameter. Any number between 0 and 9900 */
/* Select all objects with ids between $start and $start+99
and put their ids into $ids array */
$qb->select('object');
$qb->from('AppBundle:Object', 'object');
// Bind $start as a parameter rather than interpolating it, since it comes from a GET parameter
$qb->where('object.id >= :start');
$qb->andWhere('object.id < :end');
$qb->setParameter('start', $start);
$qb->setParameter('end', $start + 100);
$objects = $qb->getQuery()->getResult();
$ids = array();
foreach ($objects as $object) {
    $ids[] = $object->getId();
}
/* Create missing objects and insert them into the database */
for ($id = $start; $id < $start + 100; ++$id) {
    if (in_array($id, $ids)) continue;
    /* Some calculations */
    $createdObject = new Object($id, $some, $data);
    $em->persist($createdObject);
}
$em->flush();
Now imagine there are no objects yet (the table is empty) and one user enters the site with start=0. The script takes about 2 seconds to complete. Before it finishes, another user enters the site with start=50.
I'm not sure what exactly would happen in such a scenario, but I presume that:
The first user enters - the $ids array is empty, and the script starts generating objects with ids 0-99.
The second user enters - the $em->flush() from the first request has not been called yet, which means the $ids array is still empty (I guess?). The script starts generating objects with ids 50-149.
The first $em->flush() call comes in from the first request. It inserts objects 0-99 into the database.
The second $em->flush() call comes in from the second request. It tries to insert objects 50-149 into the database. It fails, because the object with id=50 already exists in the database. As a result it doesn't actually insert anything into the database.
Is that what would really happen? If so, how do I prevent it, and what is the best way to insert only the missing objects into the database?
#edit: This is just example code, but in the real script the id is actually 3 columns (x, y, z) and the object is a field on a map. The purpose of this script is that I want to have a huge map, and it would take too much time to generate it all at once. So I want to generate only a little of it and then create the missing parts only when some user tries to access them. At some point the whole map will be created, but the process will be staggered.
You have 2 bad practices here:
You should avoid INSERT or UPDATE operations when users enter your site (they are slow/costly), especially if it's about adding many objects to the database like in this script. This should run in some kind of cron script, independently from your website's users.
You shouldn't assign IDs to your objects beforehand. Leave the ID as null and Doctrine will handle it for you. Why would you need to set the ID in advance?
To answer your question: if you call $em->persist() on an object with a pre-assigned ID, and another object already exists in the database with the same ID, the INSERT won't happen. Instead, the already existing object will be UPDATED with the data from your newer object (when you call $em->flush() afterwards). So instead of 2 objects (as expected), you will have only 1 in the database. That's why I really doubt you need to pre-assign IDs in advance. You should tell me more about the purpose of this :)
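If you do end up keeping pre-assigned IDs, one defensive option is to let the database enforce uniqueness and treat a duplicate-key failure as "another request got there first". A minimal sketch, assuming a unique/primary key on id; $missingIds and $doctrine (the ManagerRegistry) are illustrative names, and flushing per object is slower but race-safe:
use Doctrine\DBAL\Exception\UniqueConstraintViolationException;
foreach ($missingIds as $id) {
    try {
        $em->persist(new Object($id, $some, $data));
        $em->flush();
    } catch (UniqueConstraintViolationException $e) {
        // A concurrent request inserted this id first - safe to skip it.
        // The EntityManager closes after this exception, so reset it.
        $em = $doctrine->resetManager();
    }
}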
Related
I am trying to do the following.
I am querying an external database through a web service. The web service returns all the products from an ERP system my client uses. As the server and the connection are not really fast, what I decided to do is synchronize the database on my web server and handle most operations there, so that the website can run smoothly.
Everything works fine; I just need one last step to guarantee that the inventory on the website matches the one available in the ERP. The only issue comes when the client deletes something in the ERP system.
At the moment I am thinking about the ideal strategy (the least resource- and time-consuming) to remove products from my Products table if I don't receive them in the web service result.
So I basically have the following process:
I query the web service for all the products, format them a little, and store them in an array. The final size is about 600 entries.
Then I run a foreach loop with the following subprocess:
I query my database to check if product_id is present.
If the product is present, I just update it with the latest info, stock data.
If the product is not present, I just insert it.
So, I was thinking of doing the following, but I don't think it's the ideal way:
Do a SELECT * FROM Products and generate an array that has all the products.
Do a foreach loop over the resulting array, and in each iteration scan the ERP array to check whether the specific product exists. If not, I delete it; if yes, I continue with the next product.
Considering that, after all the previous steps, this would involve a couple of nested foreach loops, I am a little worried that it might consume too much memory and also take longer to process.
I was thinking that maybe something like array_diff or array_map could solve the issue, but I am not really experienced with these functions, and the structure of the two arrays differs a lot, so I am not sure it would work that easily.
What would you guys recommend?
It's actually quite simple:
SELECT id FROM Products
Then you have an array of your product Ids, for example:
[123,5679,345]
Then, as you go and do your updates or inserts, remove each id from the array.
"I query my database to check if product_id is present." [for updates]
This is redundant now.
There are a few ways to remove the value from the array (when you do an update); this is the way I would probably do it.
// Find the index of the product id in our list of ids from the local DB.
// Note the !== comparison: array_search() can return 0 for the first index,
// so we must check for boolean false specifically.
if (false !== ($index = array_search($data['product_id'], $myids))) {
    // The incoming product_id is in the local list, so we do an UPDATE
    unset($myids[$index]);
} else {
    // Otherwise we do an INSERT
}
As I mentioned above, when doing your updates/inserts you no longer have to check whether the ID exists, because you already know this from your array of IDs from the database. This alone saves you n queries (approximately 600).
Then it's very simple to deal with any ids left over.
// I wouldn't normally concatenate variables into SQL, but in this case it's a
// list of integer IDs that came from our own database.
// You can of course build a prepared statement instead; see the sketch below.
'DELETE FROM Products WHERE id IN(' . implode(',', $myids) . ')'
And because you unset these when updating, the only things left over are products that no longer exist.
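For completeness, that prepared statement could look roughly like this (a sketch, assuming a PDO connection in $pdo; adapt to your own DB layer):
if (!empty($myids)) {
    // One ? placeholder per leftover id
    $placeholders = implode(',', array_fill(0, count($myids), '?'));
    $stmt = $pdo->prepare("DELETE FROM Products WHERE id IN($placeholders)");
    $stmt->execute(array_values($myids));
}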
Conclusion:
You have no choice (other than using an ON DUPLICATE KEY query, or ignoring exceptions) but to pull out the product IDs. You're already doing this on a row-by-row basis, so we can effectively kill two birds with one stone.
If you need more data than just the ID - for example, to check whether the product was changed before doing an update - then pull that data out too, but I would recommend using PDO and the FETCH_GROUP option. I won't go into the specifics of that, other than to say it lets you easily build your array this way:
[{product_id} => [ {product_name}, {product_price} etc..]];
Basically, product_id is the key, with a nested array of the row data; this will make lookups easier.
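A sketch of that fetch (assuming a PDO connection in $pdo; since product IDs are unique, each group will hold a single row):
$stmt = $pdo->query('SELECT product_id, product_name, product_price FROM Products');
// FETCH_GROUP uses the first selected column (product_id) as the array key
$myids = $stmt->fetchAll(PDO::FETCH_GROUP | PDO::FETCH_ASSOC);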
This way you can look it up like this.
// Then, instead of array_search:
// if (false !== ($index = array_search($data['product_id'], $myids))) {
if (isset($myids[$data['product_id']])) {
    unset($myids[$data['product_id']]);
    // do your checks, then your update
} else {
    // do inserts
}
References:
http://php.net/manual/en/function.array-search.php
array_search — Searches the array for a given value and returns the first corresponding key if successful
WARNING This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE. Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.
UPDATE
There is one other really good way to do this: add a field called sync_date. When you do your insert or update, set sync_date to the current date.
This way, when you are done, the products with a sync date older than this run can be deleted. It's best to cache the time at the start so the whole script uses the exact same value.
$time = date('Y-m-d H:i:s'); // or time() if you prefer a timestamp
// use this same variable for the whole course of the script
Then you can do
"DELETE FROM Products WHERE sync_date != '$time'"
This may actually be a bit better because it has more utility: now you also know when the sync was last run.
I have a complex record stored in session. I want to update parts of this record in session, then update the DB row with the session data. So far I have only managed to overwrite the entire structure rather than just the part of it I intend to.
Example:
base <-- whole record
base.field1 <-- single field
base.field2 <-- single field
base.field3 <-- single field
base.users <-- array of objects (users), stored as JSON column
base.details <-- single dimensional array of fields
base.cards <-- array of objects (cards), stored as JSON column
base.regions <-- array of objects (regions), stored as JSON column
This whole structure is stored as a row in a progress table.
I load and store this data in session like this:
session()->put('base', Progress::find($id));
I then update some of the fields in the cards array:
$cards = session()->get('base.cards');
$cards[$index]['points'] = 100;
I then try (unsuccessfully) to update the session variable, having tried each of the following:
session()->put('base.cards', $cards);
session()->push('base.cards', $cards);
session('base.cards', $cards);
Then lastly I want to store this record in the progress table by updating its instance, like this:
Progress::find($id)->update(session()->get('base'));
How do I manipulate then update in session just one of the JSON/array fields?
UPDATE: I added:
session()->put('base.cards', $cards);
session()->save();
But still, when I dd(session()->get('base')), the changes to the $cards array are not there?!
I would suggest your problem is caused by using Session in a way that was not intended. You seem to be storing an Eloquent model called Progress in the session under the key 'base', and then expecting the get, put and push methods of Session to understand how to interact with your model's attributes.
How about you make life simple and just store the ID in the Session...
$progress = Progress::find($id);
session()->put('progress_id', $progress->id );
session()->save();
And then, when you want to pull it out to manipulate it, save it to the database, etc., do this...
$progress_id = session()->get('progress_id');
$progress = Progress::find($progress_id);
$progress->cards = $cards->asJSON(); // Give your cards model a method that serialises itself
$progress->save();
Overall, your code is now easier to read and everything is being used in a very 'normal' way: the session just holds an ID, and the model exists as a proper Eloquent model accessed in the documented way.
Future you will be grateful when they come to review this code :-)
Oh, and if you were storing the model in the session because you're worried about speed: pulling a single record from a properly indexed database is probably no slower than pulling the data from a session. We're talking thousandths of a second, but please do run tests to put your mind at ease.
Session storage supports complex structures, but the dot syntax in key names only works on arrays.
To benefit from the . syntax, you need to convert the model to an array, e.g.:
session()->put('base', Progress::find($id)->toArray());
so that you can later do
session()->put('base.cards', $cards);
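A sketch of the full round trip with that approach (assuming the Progress model casts its JSON columns to arrays and allows mass assignment of these attributes):
session()->put('base', Progress::find($id)->toArray());
$cards = session()->get('base.cards');
$cards[$index]['points'] = 100;
session()->put('base.cards', $cards); // the dot syntax now reaches into the array
Progress::find($id)->update(session()->get('base'));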
I like the technique described by Marco Pivetta at PHP UK Conference 2016 (https://youtu.be/rzGeNYC3oz0?t=2011): he recommends favouring immutable entities and, instead of changing data structures, appending to them. A history of changes is a nice thing to have for many different reasons, so I would like to apply this approach to my projects. Let's have a look at the following use case:
class Task {
    protected $id;
    /**
     * @var Status[]
     */
    protected $statusChanges;

    public function __construct()
    {
        $this->id = Uuid::uuid4();
        $this->statusChanges = new ArrayCollection();
    }

    public function changeStatus($status, $user)
    {
        $this->statusChanges->add(new Status($status, $user, $this));
    }

    public function getStatus()
    {
        return $this->statusChanges->last();
    }
}

class Status {
    protected $id;
    protected $value;
    protected $changedBy;
    protected $task;
    protected $created;

    const DONE = 'Done';

    public function __construct($value, User $changedBy, Task $task)
    {
        $this->id = Uuid::uuid4();
        $this->value = $value;
        $this->changedBy = $changedBy;
        $this->task = $task;
        $this->created = new \DateTime();
    }
}
$user = $this->getUser();
$task = new Task();
$task->changeStatus(Status::DONE, $user);
$taskRepository->add($task, $persistChanges = true);
I'm planning to persist all status changes in a MySQL database, so the association will be One(Task)-To-Many(Status).
1) What is the recommended way of getting tasks by current status? I.e. all currently open, finished, pending tasks.
$taskRepository->getByStatus(Status::DONE);
2) What is your opinion on this technique? Are there disadvantages which may appear in the future as the project grows?
3) Where is it more practical to save status changes (as a serialized array in a Task field, or in a separate table)?
Thanks for your opinions!
I imagine this is going to get closed since some of it is based on opinion, just so you're aware.
That being said, I've been quite interested in this idea, though I've not really looked into it a huge amount, so here's my thinking...
1. Find By Status
I think you would need to do some sort of subquery in the join to get the latest state for each task and match on that. (I would like to point out that this is just guesswork from looking at SO rather than actual knowledge, so it could be well off.)
SELECT t, s
FROM Task t
LEFT JOIN t.statusChanges s WITH s.id = (
    SELECT s2.id
    FROM Status s2
    WHERE s2.created = (
        SELECT MAX(s3.created)
        FROM Status s3
        WHERE s3.task = t
    )
)
WHERE s.value = :status
Or maybe just (provided the combined task & created fields are unique)...
SELECT t, s
FROM Task t
LEFT JOIN t.statusChanges s WITH s.created = (
    SELECT MAX(s2.created)
    FROM Status s2
    WHERE s2.task = t
)
WHERE s.value = :status
2. Disadvantages
I would imagine that having to use the above type of query for each repository call would require more work and would therefore be easier to get wrong. And as you are only ever appending to the database, it will only get bigger, so storage/cache space may become an issue depending on how much data you have.
3. Where To Save Status
The main benefit of immutable entities is that they can be cached forever, as they will never change. If you saved state changes in a serialized field, the entity would need to be mutable, which would defeat the purpose.
Here's what I do:
All the types of tables involved in my business
I organize my database in 4 types of tables:
Log_xxx
Data_xxx
Document_xxx
Cache_xxx
Any data I store falls into one of those 4 types of tables.
Document_xxx and Data_xxx just store binary files (like PDFs of tariffs that the providers send to me) and static or super-slow-changing data (like the airports, countries, or currencies of the world). They are not involved in the main part of this explanation, but they are worth mentioning.
Log tables
All my "domain events" and also the "application events" go to a Log_xxx table.
Log tables are write-once and never deletable, and I must keep backups of them. This is where the "history of the business" is stored.
For example, for a "task" domain object as you mention in your question, say that the task can be "created" and then altered later; I'd use:
Log_Task_CreatedEvents
Log_Task_ChangedEvents
I also save all the "application events": each HTTP request with some contextual data, each command run... They go to:
Log_Application_Events
The domain can never change unless there is an application that changes it (a command line, a cron, a controller serving an HTTP request, etc.). All the "domain events" carry a reference to the application event that created them.
All the events, whether domain events (like TaskChangedEvent) or application events, are absolutely immutable and carry several standard things, like the timestamp at creation.
The "Doctrine entities" have no setters, so they can only be created and read. Never changed.
In the database I have only one relevant field. It is of type TEXT and holds the event as JSON. I have another field, writeIndex, which is auto-numeric, is the primary key, and is NEVER used as a key by my software. It exists only for backups and database control: when you have GBs of data, sometimes you need to dump only "events starting at index XX".
Then, for convenience, I have an extra field which I call cachedEventId. It contains the very same "id" as the JSON, redundantly; this is why the field is named "cached...": it holds no original data and could be rebuilt from the event field. It is there only for simplicity.
Although Doctrine calls them "entities", these are not domain entities; they are domain value objects.
So, Log tables look like this:
INT writeIndex;         // Never used by my program.
TEXT event;             // Stores the event as JSON.
CHAR(40) cachedEventId; // Unique key; acts as the primary key from the point of view of my program. Rebuildable from the event field.
Sometimes I opt in to having more cached fields, like the creation timestamp. None of those are needed; they are only there for convenience. Everything should be extractable from the event.
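As concrete DDL, such a log table might look roughly like this (a sketch; MySQL syntax, with names and sizes taken from the description above):
CREATE TABLE Log_Task_ChangedEvents (
    writeIndex INT AUTO_INCREMENT PRIMARY KEY, -- backup/dump bookkeeping only, never used as a key by the application
    event TEXT NOT NULL,                       -- the full event serialized as JSON
    cachedEventId CHAR(40) NOT NULL UNIQUE     -- redundant copy of the JSON "id" field, kept for convenience
);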
The Cache tables
Then, in the Cache_xxx tables, I keep the "accumulated data" that can be rebuilt from the logs.
For example, if I have a "task" domain object with a "title" field, a "creator", and a "due date", where by definition the creator cannot be overwritten but the title and due date can be re-set, then I'd have a table that looks like:
Cache_Tasks
* CHAR(40) taskId
* VARCHAR(255) title
* VARCHAR(255) creatorName
* DATE dueDate
Write model
Then, when I create a task, it writes to 2 tables:
* Log_Task_CreatedEvents // Store the creation event here as JSON
* Cache_Tasks // Store the creation event as one field per column
Then, when I modify a task, it also writes to 2 tables:
* Log_Task_ChangedEvents // Store the event of change here as JSON
* Cache_Tasks // Read the entity, change its property, flush.
Read model
To read the tasks, always use Cache_Tasks.
It always represents the "latest state" of the objects.
Deletability
All the Cache_xxx tables are deletable and do not need to be backed up. Just replay the events in date order and you'll get the cached entities back.
Sample code
I wrote this answer in terms of "Task" since that was the question, but today, for instance, I've been working on assigning a "state" to the client's form submissions. Clients just ask for something via the web, and now I want to be able to "mark" each request as "new", "processed", "answered", "mailValidated", etc...
I just created a new change() method on my FormSubmissionManager. It looks like this:
public function change( Id $formSubmissionId, array $arrayOfPropertiesToSet ) : ChangedEvent
{
$eventId = $this->idGenerator->generateNewId();
$applicationExecutionId = $this->application->getExecutionId();
$timeStamp = $this->systemClock->getNow();
$changedEvent = new ChangedEvent( $eventId, $applicationExecutionId, $timeStamp, $formSubmissionId, $arrayOfPropertiesToSet );
$this->entityManager->persist( $changedEvent );
$this->entityManager->flush();
$this->cacheManager->applyEventToCachedEntity( $changedEvent );
$this->entityManager->flush();
return $changedEvent;
}
Note that I do 2 flushes. This is on purpose: in case the "write-to-the-cache" fails, I don't want to lose the changedEvent.
So I "store" the event, then I apply it to the cached entity.
The Log_FormSubmission_ChangedEvents.event field looks like this:
{
"id":"5093ecd53d5cca81d477c845973add91e31a1dd9",
"type":"hellotrip.formSubmission.change",
"applicationExecutionId":"ff7ad4bd5ec6cebacc048650c866812ac0127ac2",
"timeStamp":"2018-04-04T02:03:11.637266Z",
"formSubmissionId":"758d3b3cf864d711d330c4e0d5c679cbf9370d9e",
"set":
{
"state":"quotationSent"
}
}
In the "row" of the cache I'll have the "quotationSent" in the column state so it can be queried normally from Doctrine even without the need of any Join.
I sell trips. You can see there many de-normalized data coming from several sources, like for example the number of adults, kids and infants travelling (coming from the creation of the form submission itself), the name of the trip he requests (coming from a repository of trips) and others.
You can also see the latest-added field "state" at the right of the image. There may be like 20 de-mapped fields in the cached row.
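Reading the current state is then an ordinary query against the cache table (a sketch; Cache_FormSubmissions is an illustrative name following the naming convention above):
SELECT * FROM Cache_FormSubmissions WHERE state = 'quotationSent';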
Answers to your questions
Q1) What is the recommended way of getting tasks by current status? I.e. all currently open, finished, pending tasks.
Query the cached table.
Q2) Are there disadvantages which may appear in the future as the project grows?
When the project grows, instead of rebuilding the cache at write time, which may be slow, you set up a queue system (for example RabbitMQ or AWS SNS) and just send the queue a signal saying "hey, this entity needs to be re-cached". Then you can return very quickly, as saving a JSON document and sending a signal to the queue takes almost no effort.
A listener on the queue will then process each and every change you make, and if re-caching is slow, it doesn't matter.
Q3) Where is it more practical to save status changes (as a serialized array in a Task field, or in a separate table)?
Separate tables: a table for "status changes" (=log =events =value_objects, not entities), and another table for "tasks" (=cache =domain_entities).
When you make backups, put the backups of the logs in a super-secure place.
Upon a critical failure, restore the logs (the events) and replay them to rebuild the cache.
In Symfony I usually create a hellotrip:cache:rebuild command that accepts as a parameter the cache I need to reconstruct. It truncates the table (deleting all cached data for that table) and builds it all over again.
This is costly, so you only need to rebuild "all" when necessary. Under normal conditions, your app should take care of keeping the caches up to date whenever there is a new event.
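The core of such a rebuild command could look roughly like this (a sketch; it assumes the log table layout above, and ChangedEvent::fromJson() and applyEventToCachedEntity() are illustrative names in the spirit of the change() example):
$conn = $this->entityManager->getConnection();
$conn->executeStatement('TRUNCATE TABLE Cache_Tasks');
$rows = $conn->fetchAllAssociative('SELECT event FROM Log_Task_ChangedEvents ORDER BY writeIndex ASC');
foreach ($rows as $row) {
    // Replay each change event, in write order, against the cache.
    $this->cacheManager->applyEventToCachedEntity(ChangedEvent::fromJson($row['event']));
}
$this->entityManager->flush();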
Documents and Data
At the very beginning I mentioned the Documents and Data tables.
Time for them now: you can use that information when rebuilding the caches. For example, you can "de-map" the airport name into the cached entity, while the events may only contain the airport code.
You can rather safely change the cache format as your business needs more complex queries with pre-calculated data: just change the schema, drop it, and rebuild the cache.
The change events, instead, remain "exactly the same", so the code that takes the data and saves the event does not change, reducing the risk of regression bugs.
Hope this helps!
My controller needs to create/save one or more records by looping over the request data it receives and creating the corresponding records. You may be wondering why I don't just use saveAll() and save them all at once. The short answer is that certain records need to reference the IDs of other records created in the same loop (and those IDs don't exist yet).
My loop creates the first record successfully, but subsequent iterations of the loop are unable to "see" that newly created record when I use find(). If I echo the returned array, results are there, but the newly created one is missing. Why? Is CakePHP's magic making the new record unavailable due to some sort of caching?
Here is my code that doesn't include the newest record:
$newest_parent_question = $this->Question->find('first', array(
    'conditions' => array('Question.perm_id' => $parent_question['Question']['perm_id']),
    'order' => array('Question.created DESC')
));
However, the new record IS returned with this:
$newest_parent_question = $this->Question->find('all');
There's something wrong with your query.
Most likely $parent_question['Question']['perm_id'] is wrong. There's nothing CakePHP does that would make your record un-find-able.
Just debug your variables to make sure you're building the correct query with the correct id, and you'll be good to go.
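For example (a quick sanity check using CakePHP's debug() helper; the find('count') call just confirms the id actually matches rows):
debug($parent_question['Question']['perm_id']);
debug($this->Question->find('count', array(
    'conditions' => array('Question.perm_id' => $parent_question['Question']['perm_id'])
)));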
(without seeing the actual way you're building the id, it's impossible to help beyond that)
I have a feeling there might be something I am missing here, but here I go anyway.
Consider this: I have a 'Booking' class that has a user_id field and the Versionable behaviour, and I run the following code:
$booking = new Booking();
$booking->user_id = 1;
$booking->isValid();
$booking->user_id = 2;
$booking->save();
This results in the correct record being inserted into the booking table, BUT the record that is inserted into the booking_version table is out of date! The user_id is set to 1 because the data is pulled off the event invoker that was created during the first isValid() call. Furthermore, the id field is set to 0 for the same reason (which means the version record cannot be linked back to the booking).
I can get around this problem by calling $booking->clearInvokedSaveHooks() before the save(), but I don't really want to do this because I don't want to run all my save hooks again on saving.
Is there a better way to get around this?
There is nothing to work around here. The Versionable behaviour is designed to store the previous values of the record.
If you want to store every single temporary state of the object, you'll have to save it each time.
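For example (a sketch in the Doctrine 1 style of the question; each save() produces one version row):
$booking = new Booking();
$booking->user_id = 1;
$booking->save(); // version 1 records user_id = 1
$booking->user_id = 2;
$booking->save(); // version 2 records user_id = 2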