PHP/MySQL job queueing system doing jobs more than once

I came up with a very simple job queueing system using PHP, MySQL and cron. Cron will call a page, which has a function that calls function A() every 2 seconds.
1. A() searches for and retrieves a row from table A.
2. Upon retrieving a row, A() updates that row with the value 1 in the column working.
3. A() then does something with the data in the retrieved row.
4. A() then inserts a row into table B with the value obtained during the processing in step 3.
Problem: I notice that there are sometimes duplicate values in table B, due to function A() retrieving the same row from table A multiple times.
Which part of the design above is allowing the duplicate processing, and how should it be fixed?
Please don't suggest something like RabbitMQ without at least showing how it can be implemented in more detail. I read some of their docs and did not understand how to implement it. Thanks!
Update: I have a cron job that calls a page (which calls function c()) every minute. Function c() loops 30 times, calling function A() and using sleep() to delay between calls.
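For reference, c() presumably has roughly this shape (my reconstruction from the description above, not the actual code):
function c() {
    for ($i = 0; $i < 30; $i++) {
        A();
        sleep(2);
    }
}
Note that the sleeps alone already add up to a full minute, which becomes relevant in the answers below.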

The supplied answer is good, and file locks work well, but since you're using MySQL, I thought I'd answer as well. With MySQL you can implement cooperative advisory locking using GET_LOCK and RELEASE_LOCK.
DISCLAIMER: The examples below are untested. I have successfully implemented something very close to this before, and the below is the general idea.
Let's say you've wrapped this GET_LOCK function in a PHP class called Mutex:
class Mutex {
    private $_db = null;
    private $_resource = '';

    public function __construct($resource, Zend_Db_Adapter $db) {
        $this->_resource = $resource;
        $this->_db = $db;
    }

    // Gets a lock for $this->_resource. GET_LOCK requires a timeout
    // (in seconds) as its 2nd parameter; 0 means "don't wait". You
    // could make that a parameter of this method, but I'm leaving it
    // hard-coded for now.
    public function getLock() {
        return (bool)$this->_db->fetchOne(
            'SELECT GET_LOCK(:resource, 0)',
            array(
                ':resource' => $this->_resource
            ));
    }

    public function releaseLock() {
        // Using DO because I really don't care if this succeeds;
        // when the PHP process terminates, the lock is released,
        // so there is no worry about deadlock.
        $this->_db->query(
            'DO RELEASE_LOCK(:resource)',
            array(
                ':resource' => $this->_resource
            ));
    }
}
Before A() fetches rows from the table, have it ask for a lock. You can use any string as the resource name.
class JobA {
    public function __construct(Zend_Db_Adapter $db) {
        $this->_db = $db;
    }

    public function A() {
        // I'm assuming A() is a class method and that the class somehow
        // acquired access to a MySQL database - pretend $this->_db is a
        // Zend_Db instance. The resource name can be an arbitrary
        // string - I chose the class name in this case but it could be
        // 'barglefarglenarg' or something.
        $mutex = new Mutex(get_class($this), $this->_db);

        // I choose to throw an exception, but you could just as easily
        // die silently and get out of the way for the next process,
        // which often works better depending on the job.
        if (!$mutex->getLock()) {
            throw new Exception('Unable to obtain lock.');
        }

        // Got a lock; now select the rows you need without fear of
        // any other process running A() getting the same rows as this
        // process - presumably you would update/flag the rows so that
        // the next A() process will not select them when it finally
        // gets a lock. Once we have our data we release the lock.
        $mutex->releaseLock();

        // Now we do whatever we need to do with the rows we selected
        // while we had the lock.
    }
}
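As a rough illustration, the claim step between getLock() and releaseLock() might look like this (untested; I'm assuming table A has an id primary key, plus the working flag column from the question):
// Claim one unprocessed row while holding the lock.
$row = $this->_db->fetchRow(
    'SELECT * FROM tableA WHERE working = 0 LIMIT 1');
if ($row) {
    // Flag the row so the next A() process will not select it.
    $this->_db->update(
        'tableA',
        array('working' => 1),
        array('id = ?' => $row['id'])
    );
}
$mutex->releaseLock();
// ...process $row and insert the result into table B...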
When you engineer a scenario in which multiple processes select and modify the same data, this kind of thing comes in very handy. When using MySQL, I prefer this database approach to the file-locking mechanism for portability: it's easier to host your app on different platforms if the locking mechanism is external to the filesystem. Sure, file locking can be done and works fine, but in my personal experience I found this easier to use.
If you plan on your app being portable across database engines, then this approach will probably not work for you.

One problem could be the processing time. You wrote:
Cron will call a function A() that searches and retrieves a row from table A every 2 seconds.
On a table without indexes, this part of the script could take longer than two seconds to run, and as a result you could pick up the same rows more than once.
You could remedy this with an exclusive file lock; for example:
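// Sketch: allow only one run of A() at a time (the lock file path is just an example).
$fp = fopen('/tmp/jobA.lock', 'c');
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    // Another run still holds the lock; skip this cycle.
    exit;
}
// ...fetch, flag and process the row here...
flock($fp, LOCK_UN);
fclose($fp);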
I have a feeling there is more to this than just the workflow; if you can attach some basic code, there may be a problem in the code as well.
edit
I think it is timing, judging by your last update:
Update: I have a cron job that calls a page (which calls function c()) every minute. Function c() loops 30 times, calling function A() and using sleep() to delay between calls.
That's a lot of jumping through hoops, and I think you have a timing problem where cron runs overlap: 30 iterations with a 2-second sleep already take a full minute before any processing time is added, so each run is still going when the next one starts.

Related

DDD - how to deal with get-or-create logic in Application Layer?

I have a DailyReport entity in my Domain Layer. There are some fields in this object:
reportId
userId
date
tasks - collection of things that the user did on a given day;
mood - how the user felt during the whole day.
Also, there are some methods in my Application Service:
DailyReportService::addTaskToDailyReport
DailyReportService::setUserMoodInDailyReport
The thing is that both of these methods require the DailyReport to have been created earlier, or to be created during function execution. How should I deal with this situation?
I have found two solutions:
1. Create the new DailyReport object before dispatching the method, and then pass reportId to it:
//PHP, simplified
public function __invoke() {
    $taskData = getTaskData();

    /** @var $dailyReport DailyReport|null **/
    $dailyReport = $dailyReportRepository->getOneByDateAndUser('1234-12-12', $user);

    // there was no report created today, so create a new one
    if ($dailyReport === null) {
        $dailyReport = new DailyReport('1234-12-12', $user);
        $dailyReportRepository->store($dailyReport);
    }

    $result = $dailyReportService->addTaskToDailyReport($taskData, $dailyReport->reportId);
    //[...]
}
This one requires putting more business logic in my Controller, which I want to avoid.
2. Verify in the method that the DailyReport exists, and create a new one if needed:
//my controller method
public function __invoke() {
    $taskData = getTaskData();
    $result = $dailyReportService->addTaskToDailyReport($taskData, '1234-12-12', $user);
    //[...]
}

//in my service:
public function addTaskToDailyReport($taskData, $date, $user) {
    // Ensure that a daily report for the given day and user exists:
    /** @var $dailyReport DailyReport|null **/
    $dailyReport = $dailyReportRepository->getOneByDateAndUser($date, $user);

    // there was no report created today, so create a new one
    if ($dailyReport === null) {
        $dailyReport = new DailyReport($date, $user);
        $dailyReportRepository->store($dailyReport);
    }

    // perform the rest of the domain logic here
}
This one reduces the complexity of my UI layer and does not expose business logic above the Application Layer.
Maybe this example is more CRUD-ish than DDD, but I wanted to present one of my use cases in a simpler way.
Which solution should be used in this case? Is there a better way to handle get-or-create logic in DDD?
EDIT 2020-03-05 16:21:
A third example; this is what I am talking about in my first comment on Savvas' answer:
//a method that listens to new requests
public function onKernelRequest() {
    //assume that the user is logged in
    $dailyReportService->ensureThereIsAUserReportForGivenDay(
        $userObject,
        $currentDateObject
    );
}
// in my dailyReportService:
public function ensureThereIsAUserReportForGivenDay($user, $date) {
    $report = getReportFromDB($user, $date);
    if ($report === null) {
        $report = createNewReport($user, $date);
        storeNewReport($report);
    }
    return $report;
}
//in my controllers
public function __invoke() {
    $taskData = getTaskData();
    //addTaskToDailyReport() only adds the task to the report; it does not create a new one
    $result = $dailyReportService->addTaskToDailyReport($taskData, '1234-12-12', $user);
    //[...]
}
This will be executed only when the user logs in for the first time, or when the user was logged in yesterday and this is their first request of the new day.
There will be less complexity in my business logic: I do not need to constantly check in services/controllers whether a report has been created, because that has already been done earlier in the day.
I'm not sure if this is the answer you want to hear, but basically I think you're dealing with accidental complexity, and you're trying to solve the wrong problem.
Before continuing I'd strongly suggest you consider the following questions:
What happens if someone submits the same report twice?
What happens if someone submits a report at two different times, but the second one is slightly different?
What is the impact of actually storing the same report from the same person twice?
The answers to the above questions should guide your decision.
IMPORTANT: Also, please note that both of your methods above have a small window in which two concurrent requests to store the report would succeed.
From personal experience I would suggest:
If having duplicates isn't that big a problem (for example, you may have a script that you run manually or automatically every so often to clear duplicates), then follow your option 1. It's not that bad, and for human-scale errors it should work OK.
If duplicates are somewhat of a problem, have a process that runs asynchronously after reports are submitted and tries to find duplicates, then deal with them according to what your domain experts want (for example, maybe duplicates are deleted; if one is newer, the old one is either deleted or flagged for a human decision).
If this is part of an invariant-level constraint in the business (although I highly doubt it, given that we're speaking about reports), and at no point in time should there ever be two reports, then there should be an aggregate in place to enforce this. Maybe this is UserMonthlyReport or whatever, and you can enforce it at runtime. Of course this is more complicated and potentially a lot more work, but if there is a business case for an invariant, then this is what you should do. (Again, I doubt it's needed for reports, but I write it here in case reports were used only as an example, or for future readers.)
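One way to shrink that concurrency window, which is my own suggestion rather than anything from the question, is to back the get-or-create with a UNIQUE database key on (user_id, report_date) and treat a duplicate-key error as "another request created it first". A rough sketch with PDO (table and column names are assumptions):
// Assumes a UNIQUE KEY (user_id, report_date) on the daily_reports table.
try {
    $stmt = $pdo->prepare(
        'INSERT INTO daily_reports (user_id, report_date) VALUES (:u, :d)');
    $stmt->execute(array(':u' => $userId, ':d' => $date));
} catch (PDOException $e) {
    if ($e->getCode() != 23000) { // SQLSTATE 23000: integrity constraint violation
        throw $e;
    }
    // A concurrent request inserted the row first; that is fine.
}
$dailyReport = $dailyReportRepository->getOneByDateAndUser($date, $user);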

Check Laravel jobs inside Queue::before to delete them before they are processed

I was given the idea to look in the AppServiceProvider with Queue::before as a way to add a check for jobs I no longer want to run and delete them, without having to add checks to every job I write.
Background: I am working on a SaaS that does audits, so an audit can run for hours and consist of thousands of jobs. If I can look for an audit ID inside the jobs as they come through and compare it with a cached array of audit IDs that have been cancelled, I can save time.
So where I have got to is: how do I unwrap the job inside Queue::before to get an ID to check? (Normal Laravel queue code, using RabbitMQ.)
The jobs are wrapped in a layer or two of event classes, and I cannot dump the data to the screen to inspect it, only to log files, since it is running in the queue.
in app/Providers/AppServiceProvider.php:
Queue::before(function (JobProcessing $event) {
    // $event->connectionName
    // $event->job
    $job = $event->job->payload();
    $obj = unserialize($job['data']['data']);
});
From what I can see, for the events I am interested in, the payload has a data key, which has another data key holding the serialised object I am after. This does not seem like the best way to interact with it, and I would like to see a better one.
Thanks
I am in the middle of a similar problem involving webhook delivery. Through a developer portal, we are allowing users to re-queue a webhook (to short-cut the wait on backed-off delivery attempts). Since this could create a second job for the same webhook, we sought a way to identify the original as out of date.
app/Jobs/DeliverWebhook.php constructor:
public function __construct(Webhook $webhook)
{
    $this->webhook = $webhook;
    $this->queued_at = Carbon::now();

    Cache::put(
        'DeliverWebhook.' . $this->webhook->id . '.QueuedAt',
        $this->queued_at,
        Carbon::now()->addDays(3)
    );
}
Here you can see we've attached a queued_at attribute to this instance of the job. (We could make this more unique with something like uniqid() or random_bytes(), to avoid potential double-click issues or similar hiccups when queueing.)
The second part is that we set a semi-unique cache key holding this queued_at time. I set it to expire in 3 days, past the end of our backed-off retry attempts.
Now, when a job is picked up for processing, I can check the job instance's queued_at attribute against the cached value, and delete the job if it is old.
In my AppServiceProvider boot method:
Queue::before(function ($event) {
    if ($event->job->queue == 'webhooks' && $event->job->getName() == 'DeliverWebhook') {
        $cache_key = 'DeliverWebhook.' . $event->job->instance->webhook->id . '.QueuedAt';
        if ($event->job->instance->queued_at < Cache::get($cache_key)) {
            $event->job->delete();
            throw new JobRequeuedException;
        }
    }
});
An exception is thrown at the end because the queue worker, by default, does not check if the job is deleted before calling $job->fire(). Throwing the exception forces the worker to skip fire() and jump into the handleJobException() method.
NOTE: I still need to test this appropriately.

Laravel Queues - Passing Data to the Queue

I have an array containing ~8,000 stock tickers that I'm trying to queue up; the queue is meant to receive the array of stock tickers ($symbols[]) and then pass each one to a worker/consumer (whichever jargon you prefer).
Here's what my QueueController currently looks like:
class QueueController extends \BaseController {

    public function stocks()
    {
        $symbols = $this->select_symbols();
        Queue::push('StockQueue', array('symbols' => $symbols));
    }
    ...
}
From my QueueController, I'm calling a method to retrieve the list of stock symbols and passing it to the StockQueue class as the $data:
public function fire($job, $data)
{
    $symbols = $data; // print_r shows all symbols...
    // (note: $symbol below is undefined, since the whole array arrives in one job)

    // Get Quote Data for Symbol
    $quote = $this->yql_get_quote($symbol);
    // Get Key Stats for Symbol
    $keystats = $this->yql_get_keystats($symbol);

    // Merge Quote and Keystats into an Array
    $array[] = $quote;
    $array[] = $keystats;

    // Save Data to DB
    $this->yql_save_results($array, $symbol);

    $job->delete();
}
This is not what I'm trying to achieve, though; what I need to do is pass each symbol, one by one, to the StockQueue class and have it process each one as its own task.
If I were to wrap the push in StockQueue->stocks() in a loop, it would (from what I understand) immediately push all ~8,000 jobs to the queue. Would this be detrimental, or is this the best way to do it? I haven't been able to find many examples of PHP-based RPC message queueing online, so I'm just as curious about the best practices as I am about the correct process.
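For illustration, the per-symbol version of the push might look roughly like this (a sketch reusing the Queue::push call from above):
public function stocks()
{
    $symbols = $this->select_symbols();

    // Push one job per symbol so each becomes its own task.
    foreach ($symbols as $symbol) {
        Queue::push('StockQueue', array('symbol' => $symbol));
    }
}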
With that being said, how can I fire up multiple workers for this queue? Say, I want 5 workers (depending on how many resources each one takes; I'll figure that out) to process these tasks in order to reduce the processing time by ~4/5ths. How would I do that?
Would I just launch php artisan queue:listen five times?
And, for clarity, I'm using beanstalkd and supervisord to do the message queue / monitoring.
I look forward to your advice and insight.
Yep, just run more workers. Beanstalkd can hold a number of connections open from lots of workers and makes sure they all get different jobs. Just make sure that each job completes successfully (if not, deal with it appropriately, or at least bury it to look at later) and give it enough time to complete, with some to spare, in the TTR (Time To Run) setting.
As for how to run more workers: yes, just increase the number of processes available in Supervisord (numprocs=5 in the [program:NAME] section) and have them start. I tended to have another (larger) pool of the same jobs that don't start automatically, so I could start a couple more manually through the Supervisord control as required.
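A minimal Supervisord program section along those lines might look like this (the program name, command and directory are assumptions, not from the answer):
[program:stockqueue]
command=php artisan queue:listen
directory=/path/to/app
numprocs=5
; process_name must include process_num when numprocs > 1
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true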

Reduce database calls for php web shop

I'm looking for a way to prevent repeated calls to the database when the item in question has already been loaded. The reason is that we have a lot of different areas that show popular items, latest releases, top rated, etc., and sometimes one item appears in multiple lists on the same page.
I wonder if it's possible to save the object instance in a static array associated with the class and then check whether the data is already in there, but then how do I point the new instance to the existing one?
Here's a draft of my idea:
$baseball = new Item($idOfTheBaseballItem);
$baseballAgain = new Item($idOfTheBaseballItem);

class Item
{
    static $arrItems = array();

    function __construct($id) {
        if (isset(self::$arrItems[$id])) {
            // Point this instance to the object in self::$arrItems[$id]
            // But how?
        }
        else {
            // Call the database
            self::$arrItems[$id] = $this;
        }
    }
}
If you have any other ideas or you just think I'm totally nuts, let me know.
You should know that static variables only exist within the request that created them: two users who load the same page and are served the same script still run in two different processes with separate memory spaces.
You should consider caching results; take a look at CodeIgniter's database caching.
What you are trying to achieve is similar to a singleton factory:
$baseball = getItem($idOfTheBaseballItem);
$baseballAgain = getItem($idOfTheBaseballItem);

function getItem($id) {
    static $items = array();
    if (!isset($items[$id])) {
        $items[$id] = new Item($id);
    }
    return $items[$id];
}

class Item {
    // this stays the same
}
P.S. Also take a look at memcache. A very simple way to remove database load is to create a /cache/ directory and save database results there for a few minutes, or until you deem the data old (this can be done in a number of ways, but most approaches are time-based). For instance:
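// Rough sketch of the /cache/ directory idea; the helper name and TTL are mine.
// Serve a cached result if it is fresh, otherwise run the query callback
// and store its result.
function cached_query($key, $ttlSeconds, $fetch) {
    $file = __DIR__ . '/cache/' . md5($key) . '.cache';
    if (file_exists($file) && (time() - filemtime($file)) < $ttlSeconds) {
        return unserialize(file_get_contents($file));
    }
    $data = $fetch(); // hit the database
    file_put_contents($file, serialize($data));
    return $data;
}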
You can't directly replace $this in a constructor. Instead, provide a static method like getById($id) that returns the object from the list, as sketched below.
And as stated above: this will work only per page load.
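A minimal sketch of that static-method approach (assuming Item loads itself from the database by id):
class Item
{
    private static $arrItems = array();

    public static function getById($id)
    {
        if (!isset(self::$arrItems[$id])) {
            self::$arrItems[$id] = new Item($id); // hits the database once
        }
        return self::$arrItems[$id];
    }

    private function __construct($id)
    {
        // Call the database and populate the object here.
    }
}

$baseball = Item::getById($idOfTheBaseballItem);
$baseballAgain = Item::getById($idOfTheBaseballItem); // same instance, no new query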

Concurrency Problem

I'm having what seems to be a concurrency problem while using MySQL and PHP + Propel 1.3. Below is a small example of the "save" method of a Propel object.
public function save(PropelPDO $con = null) {
    $con = Propel::getConnection();
    try {
        $con->beginTransaction();
        sleep(3); // ignore this, used for testing only
        parent::save($con);

        $foo = $this->getFoo(); // Propel object, triggers a SELECT
        // stuff is happening here...
        $foo->save($con);

        $con->commit();
    } catch (Exception $e) {
        $con->rollBack();
        throw $e;
    }
}
The problem is the $foo object. Let's say we get two calls of the example method one after another in a very short time. In some cases, if the second transaction reads $foo...
$foo = $this->getFoo();
... before the first transaction has had the chance to save it...
$foo->save($con);
... $foo read by the second transaction will be outdated and bad things will happen.
How can I force locking of the table the Foo objects are stored in, so that subsequent transactions can read from it only after the first one has finished its work?
EDIT: The context is a web application. In short, in some cases I want only the very first request to do some data modification (which happens between the fetching and saving of $foo). No subsequent request should be able to do the modification. Whether the modification occurs depends on the fetched $foo state (a table-row attribute). If two transactions fetch the same $foo, the modification will occur twice, which causes the problem.
When you load the existing row to the screen/application, load the LastChgDate too. When you save it, add "AND LastChgDate = <the loaded value>" to the UPDATE's WHERE clause and check the affected row count. If it is zero, return an error ("someone else has already saved this record") and roll back any other changes. With this logic in place (optimistic locking), you can only save a row if it is the same as when you loaded it. For new rows (INSERT) this is not necessary, because they are new.
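A sketch of that optimistic-locking UPDATE with PDO (untested; the table and column names are placeholders):
// Only update the row if it has not changed since we loaded it.
$stmt = $con->prepare(
    'UPDATE foo SET status = :status, last_chg_date = NOW()
     WHERE id = :id AND last_chg_date = :loaded');
$stmt->execute(array(
    ':status' => $newStatus,
    ':id'     => $fooId,
    ':loaded' => $loadedLastChgDate,
));
if ($stmt->rowCount() === 0) {
    // Someone else saved this record first; roll back and report it.
    throw new Exception('Someone else has already saved this record.');
}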
In MySQL, I think you can use SELECT ... FOR UPDATE to accomplish the lock.
Another option is to use the GET_LOCK and RELEASE_LOCK MySQL function calls to create named locks that you can use to control access to the resource.
There are some downsides to these approaches: I haven't used them much myself, and they are MySQL-specific, but they could work for you.
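For illustration, the SELECT ... FOR UPDATE variant inside the existing transaction might look like this (an untested sketch; PropelPDO extends PDO, and the table and column names are placeholders):
$con->beginTransaction();

// Locks the selected row until commit/rollback; a concurrent
// transaction running the same statement blocks here.
$stmt = $con->prepare('SELECT * FROM foo WHERE id = :id FOR UPDATE');
$stmt->execute(array(':id' => $fooId));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

// ...decide, based on $row, whether to perform the modification...

$con->commit(); // releases the row lock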
