php gearman worker function not synchronized with doctrine - php

I observed a strange behavior of my doctrine object. In my symfony project I'm using the doctrine ORM to save my data in a mysql database. This works normally in most situations. I'm also using gearman in my project, a framework that allows applications to complete tasks in parallel. I have a gearman job-server running on the same machine as my apache, and I have registered a gearman worker on the same machine in a separate 'screen' session using the screen window manager. This way I always have access to the standard console output of the function registered for the gearman worker.
In the gearman-worker function I'm invoking, I have access to the doctrine object via $doctrine = $this->getContainer()->get('doctrine') and it works almost normally. But when I change some data in my database, doctrine still uses the old data that was stored in the database before. I'm totally confused, because I expected that by calling:
$repo = $doctrine->getRepository("PackageManagerBundle:myRepo");
$dbElement = $repo->findOneById($Id);
I would always get the current data entries from my database. This looks like a strange caching behavior, but I have no clue what I've done wrong.
I can solve this problem by registering the gearman worker and function anew:
$worker = new \GearmanWorker();
$worker->addServer();
$worker->addFunction
After that I get back the current state of my database, until I change something else. I'm observing this behavior only in my gearman worker function. In the rest of the application everything is synchronized with my database and normal.

This is what I think may be happening. Could be wrong though.
A gearman worker is going to be a long-running process that picks up jobs to do. The first job it gets will cause doctrine to load the entity into its object map from the database. But for the second job the worker receives, doctrine will not perform a database lookup; it will instead check its identity map, find that it already has the object loaded, and simply return the one from memory. If something else, external to the worker process, has altered the database record, then you'll end up with an object that is out of date.
You can tell doctrine to drop objects from its identity map; it will then perform a database lookup again. To force objects to be loaded from the database instead of served from the identity map, use EntityManager#clear().
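As a rough illustration, here is a minimal sketch of a worker that clears the entity manager before each job. The function name 'update_package', the $container variable and the JSON job payload are assumptions for illustration; only the repository lookup comes from the question.
$worker = new \GearmanWorker();
$worker->addServer();
$worker->addFunction('update_package', function (\GearmanJob $job) use ($container) {
    $doctrine = $container->get('doctrine');

    // detach all previously loaded entities so the next lookup hits the database
    $doctrine->getManager()->clear();

    $payload = json_decode($job->workload(), true);
    $repo = $doctrine->getRepository("PackageManagerBundle:myRepo");
    $dbElement = $repo->findOneById($payload['id']);

    // ... work with the fresh $dbElement ...
});

while ($worker->work()) {
    // one job per iteration; clear() keeps each job's view of the data fresh
}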
More info here:
https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/working-with-objects.html#entities-and-the-identity-map

Related

php laravel update or create massive data

I have a case in which I need to sync an external existing table with the website table every few minutes.
I previously had a simple foreach which would loop through every record; as the table grew it became slower and slower, and now it takes a long time for around 20.000 records.
I want to make sure it creates a new record or updates an existing one.
This is what I got but it doesn't seem to update the existing rows.
$no_of_data = RemoteUser::count(); // 20.000 (example)
$webUserData = array();
for ($i = 0; $i < $no_of_data; $i++) {
    // I check the external user so I can match it.
    $externalUser = RemoteUser::where('UserID', $i)->first();
    if ($externalUser) {
        $webUserData[$i]['username'] = $externalUser->username;
        $webUserData[$i]['user_id'] = $externalUser->UserID;
    }
}
$chunk_data = array_chunk($webUserData, 1000);
if (isset($chunk_data) && !empty($chunk_data)) {
    foreach ($chunk_data as $chunk_data_val) {
        \DB::table('WebUser')->updateOrInsert($chunk_data_val);
    }
}
Is there something I am missing or is this the wrong approach?
Thanks in advance
I'll try to make a complete all-in-one answer on some possible event-driven solutions. The ideal scenario would be to move from the current static check of each and every row to an event-driven setup where each entry notifies its own change.
I won't list solutions per database here and use MySQL by default.
I see three possible solutions:
an internal solution using triggers, if one and the same database instance is at play
Eloquent events, if the creation or modification of the Eloquent models happens in one place
alternatively, MySQL replication to catch the events if modifications occur outside of the application (multiple applications modify the same database)
Using triggers
If you are syncing data on the same database instance (different databases) or on the same database process (same database), and the data you copy doesn't need intervention by an external interpreter, you can use SQL or any extension of SQL supported by your database, in the form of triggers or prepared statements.
I assume you’re using MySQL, if not, SQL triggers are quite similar across all known databases supporting SQL.
A trigger structure has a simple layout like:
CREATE TRIGGER trigger_name
AFTER UPDATE ON table_name
FOR EACH ROW
body_to_execute
Where AFTER UPDATE is the event to catch in this example.
So for an update event, we would like to know the data that has been changed AFTER it has been updated, so we’ll use the AFTER UPDATE trigger.
So an AFTER UPDATE trigger for your tables, calling the original remote_user and the copy web_user, with user_id and username as fields, would look something like:
CREATE TRIGGER user_updated
AFTER UPDATE ON remote_user
FOR EACH ROW
UPDATE web_user
SET username = NEW.username
WHERE user_id = NEW.user_id;
The variables NEW and OLD are available in triggers, where NEW owns the data after the update and OLD before the update.
For a new user that has been inserted, we have the same procedure, we just need to create the entry in web_user.
CREATE TRIGGER user_created
AFTER INSERT on remote_user
FOR EACH ROW
INSERT INTO web_user(user_id, username)
VALUES(NEW.user_id, NEW.username);
Hope this gives you a clear idea of how to use triggers with SQL. There is a lot of information to be found: guides, tutorials, you name it. SQL might be an old boring language created by old people with long beards, but knowing its features gives you a great advantage in solving complicated problems with simple methods.
Using Eloquent events
Laravel has a bunch of Eloquent events that get triggered when models do stuff. If the creation or modification of a model (entry in the database) only occur in one place (e.g. on entry point or application), the use of Eloquent events could be an option.
This means that you have to guarantee that the modification and/or creation takes place using Eloquents model:
Model::create([...]);
Model::find(1)->update([...])
$model->save();
// etc
And not indirectly using DB or similar:
// won't trigger any event
DB::table('remote_users')->update()->where();
Also avoid using saveQuietly() or any method on the model that's been built deliberately to suppress events.
The simplest solution would be to register events directly in the model itself using the protected static booted method.
namespace App\Models;

use bunch\of\classes;

class SomeModel extends Model
{
    protected static function booted()
    {
        static::updated(function ($model) {
            // access any database or service
        });

        static::created(function ($model) {
            // access any database or service
        });
    }
}
To push the callback onto a queue, Laravel 8 and up offer the queueable helper function.
static::updated(queueable(function ($ThisModel) {
    // access any database or service
}));
For Laravel 7 or lower, it would be wise to create an observer and push everything onto a queue using jobs.
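A rough sketch of that observer approach, reusing the SomeModel example above; the observer and job class names (SomeModelObserver, SyncModelJob) are made up for illustration:
namespace App\Observers;

use App\Jobs\SyncModelJob;
use App\Models\SomeModel;

class SomeModelObserver
{
    public function created(SomeModel $model)
    {
        // hand the sync work to the queue instead of doing it inline
        SyncModelJob::dispatch($model);
    }

    public function updated(SomeModel $model)
    {
        SyncModelJob::dispatch($model);
    }
}
The observer then needs to be registered once, e.g. in AppServiceProvider::boot() with SomeModel::observe(SomeModelObserver::class).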
Example based on your comment
If a model exists for both databases, the Eloquent events could be used as follows, where InternalModel represents the main model which triggers the events (the source) and ExternalModel the model for the database to be synced (sync table or replica).
namespace App\Models;

use App\Models\ExternalModel;

class InternalModel extends Model
{
    protected static function booted()
    {
        static::updated(function ($InternalModel) {
            ExternalModel::find($InternalModel->id)->update([
                'whatever-needs' => 'to-be-updated'
            ]);
        });

        static::created(function ($InternalModel) {
            ExternalModel::create([
                'whatever-is' => 'required-to-create',
                'the-external' => 'model'
            ]);
        });

        static::deleted(function ($InternalModel) {
            // note: we only have the $InternalModel object left; the entry in the database doesn't exist anymore.
            ExternalModel::destroy($InternalModel->id);
        });
    }
}
And remember to use queueable() to put the work on the queue if it might take longer than expected.
If for some reason the InternalModel table does get updated without using Eloquent, you can trigger each Eloquent event manually via the event() helper to keep the sync process functional, e.g.
$model = InternalModel::find($updated_id);
// trigger the update manually
event('eloquent.updated: ' . $model::class, $model);
All Eloquent events related to the models can be triggered in such a way, so: retrieved, creating, created, updating, updated, saving, saved, deleting and so on.
I would also suggest creating an additional console command to run the sync process once in full, before switching over to the Eloquent model events; see the sketch below. Such a command is like the foreach you already used, checking once whether all data is synced, something like php artisan users:sync. This also helps if events occasionally don't fire because of exceptions; this is rare, but it does happen once in a while.
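A minimal sketch of such a command, assuming the RemoteUser model and WebUser table from your question and Laravel 8+ for the upsert() call:
namespace App\Console\Commands;

use App\Models\RemoteUser;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class SyncUsers extends Command
{
    protected $signature = 'users:sync';
    protected $description = 'One-off full sync of remote users into the WebUser table';

    public function handle()
    {
        // walk the remote table in chunks instead of loading everything at once
        RemoteUser::query()->orderBy('UserID')->chunk(1000, function ($users) {
            $rows = $users->map(fn ($user) => [
                'user_id'  => $user->UserID,
                'username' => $user->username,
            ])->all();

            // insert new rows, update existing ones matched on user_id
            DB::table('WebUser')->upsert($rows, ['user_id'], ['username']);
        });

        return 0;
    }
}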
MySQL Replication
If triggers aren't a solution and you can't guarantee the data is modified from one single source, replication would be my final suggestion.
Someone created a package for Laravel which uses the krowinski/php-mysql-replication or the more up to date fork moln/php-mysql-replication called huangdijia/laravel-trigger.
A few things need to be configured though:
Firstly, MySQL should be configured to write all events to a binary log file that can be read:
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 1
max_binlog_size = 100M
binlog_row_image = full
binlog-format = row
Secondly, the database user connected with the database should be granted replication privileges:
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'user'@'host';
GRANT SELECT ON `dbName`.* TO 'user'@'host';
The general idea here is to read out the log file MySQL generates about events that occur. Writing this answer took me a while longer because I couldn't get this package up and running within a few minutes. Though I have used it in the past and know it worked flawlessly; back then I wrote a smaller package on top of it to minimize the traffic and filter out events I didn't use.
I've already opened an issue and I'm going to open several over time to get this thing up and running again.
But to grasp the idea of its usefulness, I'm going to explain how it works anyway.
To configure an event, listeners are put in a routes file called routes/trigger.php, where you have access to the $trigger instance (manager) to bind your listeners.
If we put this into the context of your tables, a listener would look like:
$trigger->on('database_name.remote_users', 'update', function ($event) {
    // event will contain an `EventInfo` object with the changed entry data.
});
The same goes for create (write) events on the table:
$trigger->on('database_name.remote_users', 'write', function ($event) {
    // event will contain an `EventInfo` object with the changed entry data.
});
To start listening for database events, use:
php artisan trigger:start
To get a list of all listeners recognized from routes/trigger.php, use:
php artisan trigger:list
To get the status of which binlog file has been recognized and its current position, use:
php artisan trigger:status
In an ideal situation you would use supervisor to run the listener (artisan trigger:start) in the background. If the service needs to boot again due to updates made to your application, you can simply use php artisan trigger:terminate to restart the service; supervisor will notice and start it again with a freshly booted application.
Update on package status
They seem to respond very well and some things have already been fixed. I can definitely say for sure that this package will be up and running again in a few weeks.
Normally I don't put anything in my answers about stuff I didn't use or test myself, but since I know this worked before, I'm giving it the benefit of the doubt that it's going to work again in the next several weeks. It's at least something to watch, or even to test, to grasp ideas on how to implement this in a real-case scenario.
Hope you enjoyed reading.

Persistent multi-node events in a stateless web application in PHP

I'm building an OO PHP application that will run across multiple nodes and will be relatively stateless in nature, and I need to implement proper publisher-subscriber (http://en.wikipedia.org/wiki/Observer_pattern / http://sourcemaking.com/design_patterns/Observer/php) style events.
My question is, how can I handle events?
In my application we are using technologies like Cassandra, Redis, Mongo and RabbitMQ.
I know PHP has an event EXTENSION available, but from what I can tell it sticks within state - or if something like memcached is leveraged it can possibly be used within that node... but my application will be distributed across multiple nodes.
So let's look at an example:
On Node 1, a metric (Metric ID 37) is updated and anything that subscribes to that metric needs to be updated. This publishes Changing and Changed as it does the update.
I have something that is subscribed to Metric ID 37 being updated, for example Metric 38, may need to recalculate itself when Metric 37's value changes.
Metric 38 is currently instantiated and being used on Node 2 in Process ID 1011... How does Metric 37 tell Metric 38 on Node 2 (Process ID 1011 in this case) to run the subscribed function?
Metric 39 subscribes to Metric 38 being updated, but is not instantiated anywhere... How does Metric 39 update when Metric 38 finishes updating?
I was thinking of something like using RabbitMQ as my event queue manager, and on each node have a daemon style 'event consumer' application that reads events in the event queue (for sake of load balancing/distribution of the work).
Then, when the consumer sees "Metric:38:Updated", it checks something like Redis for anything subscribed to "Metric:38:Updated", gets the value ("What:Function:Values") and does something like call_user_func_array(array($what, $function), $values); .... but this seems like it may cause a crapload of overhead and some level of synchronization issues...
I'm using Doctrine MongoDB ODM to persist my objects... To handle synchronization issues I was thinking of something like this:
Objects could have a version number... (version=1.0)
And redis could be used to maintain a quick reference to the latest version of the object (ObjectVersion:ObjectType:ObjectId)=1.1
And when a getter is called on an object property that is marked as #critical(things like isDeleted, monetary balances etc) it could check if the instance's version ID is equal to the version # in redis and update its values from mongo if it needs to...
An alternate setup is using amphp/amp (http://amphp.org/docs/amp/reactor-concepts.html) and some form of RPC to synchronize the nodes
Since I'm fairly new to web development (moving from C#), and to stateless and distributed design, I thought it would be a good idea to ask the community if anyone has better suggestions.
My question is, how can I handle events?
If you want to use an event loop implementation, there are multiple choices available:
Amp
Icicle
React
You can use a PubSub system like Redis offers: http://redis.io/topics/pubsub. Amp offers a package for Redis, other event libraries might already have an implementation available.
Redis will send an event notification to all connected and listening clients. You may not want that, because you want to synchronize your calculations and execute them only once.
You could push the actual data to a Redis list and use the event system only to poll in case of a new job, so the workers can otherwise sleep. A better solution might be to use blocking list operations, which block a Redis connection until there's new data available in a Redis list. When that event happens, you can recalculate the value and push the update to an event.
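As a rough sketch of that blocking-list idea with the phpredis extension (the list and channel names, the JSON payload and the connection details are made up for illustration):
// worker.php - a minimal blocking-list consumer
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

while (true) {
    // blPop blocks until an item is pushed onto the 'metric:updates' list
    // (timeout 0 = wait forever), so the worker sleeps while the list is empty
    $item = $redis->blPop(['metric:updates'], 0);
    if (empty($item[1])) {
        continue;
    }

    $payload = json_decode($item[1], true); // e.g. ['metric' => 37, 'value' => 42.0]

    // ... recalculate whatever subscribes to this metric ...

    // then notify other interested parties about the result
    $redis->publish('metric:updated', json_encode(['metric' => $payload['metric']]));
}
The producer side would simply rPush a JSON payload onto the same list after updating a metric.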
That's basically building a message queue with Redis, but essentially you will just want to look at the features of different message queue implementations and see if they suit your needs. If you want to use any of the event loop libraries, you may also want to look at the available clients and other features you need from them, because they're generally not compatible (yet).
Maybe a middleware is needed, like Redis Pub/Sub (http://redis.io/topics/pubsub) or something else like a message queue, to support your application.

PHP MongoDB execute() locking collection

I'm using MongoDB over the command line to loop through a bunch of documents matching a particular condition, move them from one collection to another collection, and remove them from the original collection.
db.coll1.find({'status' : 'DELETED'}).forEach(function(e) {
    db.deleted.insert(e);
    db.coll1.remove({_id: e._id});
});
This works, however I need to script this so it moves all the documents in coll1 to the deleted collection every day (or every hour) via a cron script. I'm using PHP, so I figured I would write a script using the Mongo PHP library:
$db->execute("db.coll1.find({'status' : 'DELETED'}).forEach(function(e) {
    db.deleted.insert(e);
    db.coll1.remove({_id: e._id});
})");
This works, but unlike on the Mongo command line, $db->execute() is eval'd, which causes a lock until the execution block is finished and holds off all writes to the collection. I can't do that in my production environment.
Is there a way (without manually logging into Mongo and running the command) of executing this via a PHP script without locking?
If I use:
$db->selectCollection('coll1')->find(array('status' => 'DELETED'))
and iterate through that, I can select the documents, save them to the deleted collection and delete them from the coll1 collection. However, this seems like a lot of bandwidth to pull everything to the client and save it back to the server.
Any suggestions?
Is there a way (without manually logging into Mongo and running the command) of executing this via a PHP script without locking?
As you stated, the best thing is to do it client side. As for the bandwidth: unless you've got a pre-90's network, it will most likely be a very small amount of bandwidth in comparison to how much you use for everything else, including replica sets etc.
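For reference, a client-side version of that move with the legacy Mongo PHP driver could look roughly like this (the connection string and database name are placeholders):
// move-deleted.php - run from cron; moves DELETED documents client side
$client = new MongoClient('mongodb://localhost:27017');
$db = $client->selectDB('mydb');

$cursor = $db->coll1->find(array('status' => 'DELETED'));
foreach ($cursor as $doc) {
    // copy the document into the archive collection first ...
    $db->deleted->insert($doc);
    // ... then remove it from the original collection
    $db->coll1->remove(array('_id' => $doc['_id']));
}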
What you could do is warehouse your deletes upon their actual deletion (in your app) instead of once every day, and then, once a day, go back through your original collection removing all deleted rows. That way the bandwidth is spread throughout the day, and when it comes to cleaning your production collection you just do a single delete command.
Another alternative would be to use a map-reduce (MR) job and make its output be that collection.
Though in general warehousing deletes in this manner is normally more work than it is worth. It is normally better to just keep them in your main collection and work your queries around the deleted flag (as you probably already do to not warehouse these immediately).

Efficient cronjob recommendation

Brief overview of my use case: consider a database (most probably MongoDB) having a million entries. The value for each entry needs to be updated every day by calling an API. How do you design such a cron job? I know Facebook does something similar. The only thing I can think of is to have multiple jobs which divide the database entries into batches, and each job updates a batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that will intentionally maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can execute scripts in the MongoDB shell and run them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and add an attribute (yes, unrealistic example), you could enter the following in your .js script file.
person = db.people.findOne( { name : "sara" } );
person.validated = "true";
db.people.save( person );
Then every time the cron runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands and examples can be found in the MongoDB docs.
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?

Object oriented coding in a multi threaded request environment - PHP

I am writing a web application with an object-oriented design. This application will be interacting with the database pretty often; a few regular operations are verifying a user's ACL permissions for the function/method requested, performing certain functions, etc. In a nutshell, the database will be used a lot. So my question here is: if I develop my application using OOP and declare class-level variables which are used to hold the incoming input, and a parallel or concurrent request comes in from another user, would the input data be changed?
Would I have to do something separate to make sure that the application handles concurrent requests and the incoming input is not changed until the process is finished?
ex:
class myProces {
    var $input1;
    var $input2;

    function process1($ip1, $ip2) {
        $this->input1 = $ip1;
        $this->input2 = $ip2;
        $this->getDataDB();
    }

    function getDataDB() {
        // do some database activity with the class level variables;
        // I would pass the values in the class level variables;
        $query = "select column from table where col1 = $this->input1 and col2 = $this->input2";
        mysql_query($query);
        return something;
    }
}
Now if I have two users hitting my application at the same time, each making a call to the functions in this class:
user1:
$obj = new myProces();
$obj->process1(1,2);
user2:
$obj = new myProces();
$obj->process1(5,6);
Now, since I do have class-level variables, would their values change when concurrent requests come in? Does PHP do any kind of handling for multithreading? I am not sure if Apache can act as a message queue, where requests can be queued.
Can anybody explain whether OOP for web applications with a heavy number of users is fine, or whether any kind of multithreading has to be done by developers?
A couple of things:
This has nothing to do with OOP.
PHP doesn't support user threads
Each request will be using its own memory, so you don't have to worry about concurrent usage updating variables behind your back.
However, you do have to take care when dealing with data from a database. User 1 may read something, then User 2 may read the same thing and update it before User 1 finishes. Then when User 1 updates it, he may be accidentally overwriting something User 2 did.
These sorts of things can be handled with transactions, locks, etc. Again, it has nothing to do with OOP or multithreading.
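To illustrate that last point, here is a minimal sketch of handling a read-modify-write race with a transaction and a row lock via PDO; the accounts table and its columns are hypothetical:
// assumes an InnoDB table `accounts` with columns id and balance (hypothetical)
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
try {
    // lock the row so a concurrent request waits until this transaction finishes
    $stmt = $pdo->prepare('SELECT balance FROM accounts WHERE id = ? FOR UPDATE');
    $stmt->execute([42]);
    $balance = (int) $stmt->fetchColumn();

    // read-modify-write without another request sneaking in between
    $update = $pdo->prepare('UPDATE accounts SET balance = ? WHERE id = ?');
    $update->execute([$balance - 10, 42]);

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}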
First: try to learn about PDO (unless that var before the variables means that you're using PHP4).
Second: As konforce and Grossman said, each user gets a different instance of PHP.
Third: This problem may occur in Java projects (and others) that use static objects or static methods. Don't worry about this in PHP.
There is no need to worry about mixing things up on the PHP side, but when you need to update or insert data, having several users able to modify the same subset of data can lead to unwanted consequences, such as inserting duplicate rows or modifying the same row. Thus, you need to use SQL features such as locking tables or rows.
This isn't a problem you have to worry about. Each connection to your web server spawns a totally separate instance of the PHP interpreter, with totally separate memory and resource handles. No objects in one will be affected by the other, no database connections in one will be affected by the other. Your class properties in one process are not ever modified by a request in another process.
Many of the top sites on the web run on Apache and PHP, with hundreds of concurrent requests happening simultaneously all day long, and they do not have to write any special code to handle it.
