How to implement tracing for long-running imports in Laravel - php

This is more an architectural question.
I'm about to code a bunch of Import implementations. They all expect some parameters (i.e. an CSV file) and then will take up quiet some time to proceed. In my previous project, I used to send those Imports in the background using an "shell_exec()"-command and than monitored a logfile in the browser to report about the status. My big hope now is, that Laravel takes over here to streamline all that manual work.
For now, my question would be about the proposed class architecture behind this.
My requirements for a bunch of imports are:
Each Import needs to run as a background process
Monitor progress in Browser (and logfile)
Start imports in console and via HTTP
Right now I plan to use a "Job" in L5.1 to implement the basic Import. What I'm struggling with is the implementation of some kind of "progress bar" and the monitoring of the (most recent) "log messages" in the browser. I don't need a real "live" view via sockets, but it should be possible to regularly update the progress view of a running Import.
Does anybody have some hints on how to implement this progress tracking?
My approach so far:
Read the CSV file, push each line onto the queue as its own element and monitor the queue. The log messages could trigger an event that populates a stack of the most recent log messages. (I may run into race conditions, because some lines may depend on another line having been processed first.)
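For illustration, a minimal sketch of the Job idea under Laravel 5.1, assuming a single queued job that processes the whole file and writes its progress to the cache so a controller can return it as JSON for polling; the class name, cache key and importLine() helper are made up for the example:

<?php
// app/Jobs/ImportCsv.php — hypothetical job, not a finished design
namespace App\Jobs;

use Illuminate\Contracts\Bus\SelfHandling;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Cache;

class ImportCsv extends Job implements SelfHandling, ShouldQueue
{
    use InteractsWithQueue, SerializesModels;

    protected $path;
    protected $importId;

    public function __construct($path, $importId)
    {
        $this->path = $path;
        $this->importId = $importId;
    }

    public function handle()
    {
        $lines = file($this->path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $total = count($lines);

        foreach ($lines as $i => $line) {
            $this->importLine(str_getcsv($line));

            // Progress lives in the cache; a normal controller action can read
            // this key and return JSON for the browser to poll periodically.
            Cache::put("import.{$this->importId}.progress",
                       (int) round(($i + 1) / $total * 100), 60);
        }
    }

    protected function importLine(array $row)
    {
        // the domain-specific import work goes here
    }
}

Dispatching could then happen from a controller and, with the DispatchesJobs trait, from an Artisan command via $this->dispatch(new ImportCsv($path, $importId)), which would cover both the HTTP and console entry points.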

I would make an ActiveBackgroundTask model like this:
handler_class_name
state
progress
latest_log_messages
result
Then create a cron task in your system to periodically check this table and start the tasks that appear in it in the created state. Each task is passed its id in the table so that it can periodically update the result and latest_log_messages fields.
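A rough sketch of what that could look like in Laravel, assuming an Eloquent model over the table above and an Artisan command run from cron every minute; the handler's run() method and all names are illustrative:

<?php
// Hypothetical Eloquent model for the table sketched above.
class ActiveBackgroundTask extends \Illuminate\Database\Eloquent\Model
{
    protected $fillable = [
        'handler_class_name', 'state', 'progress', 'latest_log_messages', 'result',
    ];
}

// Console command picked up by cron, e.g.:
//   * * * * * php /var/www/app/artisan tasks:run-pending
class RunPendingTasks extends \Illuminate\Console\Command
{
    protected $signature = 'tasks:run-pending';
    protected $description = 'Start background tasks that are still in the "created" state';

    public function handle()
    {
        foreach (ActiveBackgroundTask::where('state', 'created')->get() as $task) {
            $task->update(['state' => 'running']);

            // The handler receives the row id so it can update progress,
            // latest_log_messages and result itself while it runs.
            $handlerClass = $task->handler_class_name;
            $handler = new $handlerClass();
            $handler->run($task->id);
        }
    }
}

A progress page then only has to read the row and render progress and latest_log_messages.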
You can expand on this idea by, for example, standardizing the location of the log file for each task, so that not only the latest messages can be extracted but the full task log can also be downloaded.
In this case the state of each task could be checked very easily from every script in your system.
There will be the problem of detecting dead tasks that aborted due to a PHP error or exception. If you need this, you can store the PHP process PIDs, and the cron script can check whether tasks in the running state are actually still running.
Does that suit your needs?

Related

How to optimize the process of multiple processes trying to write to the same log file

A PHP website has very heavy traffic, say 5000 requests per second. Each of these requests tries to log data, but to a single file. Since at any point in time only one request can write to the file, the other requests queue up, which affects the overall response time. The data needs to be logged, that's important. How can I optimize this scenario?
I see two options.
Since your requirement is to write to one large log file and nowhere else, you can have your PHP scripts log these entries to a queue first. There are many message queuing platforms like RabbitMQ or ActiveMQ that you can use. Assuming these log entries are not large payloads, each user request that results in a log entry can be directed to insert a message into the queue. You then have another backend process (maybe a PHP script?) that simply listens to the queue and keeps writing to the log file continually. Here is a detailed explanation of how a client would consume the queue messages: https://www.rabbitmq.com/tutorials/tutorial-one-php.html
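A minimal publishing sketch with php-amqplib, along the lines of the linked tutorial; the queue name and payload are illustrative, and a separate consumer script would read from the same queue and append to the log file:

<?php
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

// Inside the web request: push the log entry onto the queue and return quickly.
$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Durable queue so entries survive a broker restart.
$channel->queue_declare('log_entries', false, true, false, false);

$entry = json_encode(array('time' => time(), 'message' => 'request handled'));
$channel->basic_publish(new AMQPMessage($entry), '', 'log_entries');

$channel->close();
$connection->close();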
You can also use something like Apache log4php, a logging framework that uses a file-appender mechanism. You will be able to easily push the information that is important to you by simply calling a method:
$logger->info("This is the message to be logged.");
These should handle the complexity of multiple requests logging to the same file. You can even have them break the files up by day, etc.
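For reference, a possible log4php configuration using its daily-file appender; the include path, file locations and logger name are assumptions, so check the log4php documentation for the exact parameters:

<?php
require_once 'log4php/Logger.php';

// Roll over to a new file each day, e.g. logs/app-2015-07-01.log
Logger::configure(array(
    'appenders' => array(
        'daily' => array(
            'class'  => 'LoggerAppenderDailyFile',
            'layout' => array('class' => 'LoggerLayoutSimple'),
            'params' => array(
                'file'        => 'logs/app-%s.log',
                'datePattern' => 'Y-m-d',
            ),
        ),
    ),
    'rootLogger' => array('appenders' => array('daily')),
));

$logger = Logger::getLogger('main');
$logger->info("This is the message to be logged.");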
Hope this helps!

Is there any way to make PHP on the server side perform some kind of actions on the data on its own?

I have this scenario:
A user submits a link to my PHP website and closes the browser. Now that the server has the link, it will analyse the submitted page for broken links and, once the analysis is complete, send an email to the user. I have a complete understanding of the second part, i.e. how to analyse the page for broken links and send the mail to the user. The only problem I have is the first part: how can I make the server keep running these actions on its own, even when there is no request coming from the client?
I have learned that "crontab" or a "fork" may work for me. What do you say about these? Is it possible to achieve what I want using them? What are the alternatives?
crontab would be the way to go for something like this.
Essentially you have two applications:
A web site where users submit data to a database.
An offline script, scheduled to run via cron, which checks for records in the database and performs the analysis, sending notifications of the results when complete.
Both of these applications share the same database, but are otherwise oblivious to each other.
A website itself isn't well suited to this sort of offline work; it's mainly a request/response system. A scheduled task, however, works well here. Unless the user is expecting an immediate response, the small delay of waiting for the next scheduled run of the offline task is fine.
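A stripped-down sketch of that offline script, assuming a submitted_links table and an existing analysis function; the table, columns, credentials and crontab interval are all illustrative:

#!/usr/bin/env php
<?php
// check_links.php — run from cron, e.g.:
//   * * * * * /usr/bin/php /var/www/app/check_links.php

$pdo = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret');

$pending = $pdo->query('SELECT id, url, email FROM submitted_links WHERE processed = 0');

foreach ($pending as $row) {
    // analyse_page_for_broken_links() stands in for the analysis logic you already have.
    $broken = analyse_page_for_broken_links($row['url']);

    mail($row['email'], 'Link check finished',
         count($broken) . " broken link(s) found on {$row['url']}");

    $update = $pdo->prepare('UPDATE submitted_links SET processed = 1 WHERE id = ?');
    $update->execute(array($row['id']));
}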
The server should run the script independently of the browser. Once the request is submitted, the php server runs the script and returns the result to the browser (if it has a result to return)
An alternative would be to add the request to a database and then use crontab to run the PHP script at a given interval. The script would then check the database to see if there's anything that needs to be processed. You could limit the script to one database entry every minute (or whatever works). This helps prevent performance problems if you have a lot of requests at once, but will be slower to send the email.
A typical approach would be to enter the link into a database when the user submits it. You would then use a cron job to execute a script periodically, which will process any pending links.
Exactly how to set up a cron job (or an equivalent scheduled task) depends on your server. If you have a host which provides a web-based admin tool (such as cPanel), there will often be a way to do it in there.
A PHP script will keep running after the client closes the browser (terminating the connection).
Just keep in mind that a PHP script's maximum execution time is limited by the "max_execution_time" directive.
Of course, here I assume the link submission happens by calling your script page... I'm not sure whether this is your use case...
For the sake of simplicity, a cronjob could work wonders. The user submits a link, and the web handler simply saves it into a DB (let me pretend here that the table is named "queued_links"). Then a cronjob scheduled to run every minute (for example) selects every link from queued_links, does the application logic (finds broken page links) and sends the email. It then also deletes the link from queued_links (or updates a flag to record that the link has already been processed).
For scale and speed, a cronjob doesn't fit as well as a message queue (see RabbitMQ, ActiveMQ, Gearman and beanstalkd; Gearman and beanstalkd are my favourite two, simple and a good fit with PHP). Instead of spawning a cronjob every minute, a queue processor listens for 'events' (think 'onLinkSubmission($link)') and processes the messages asynchronously, as soon as possible. The cronjob solution is just a simplified implementation of one of these MQ solutions; an MQ will give better / more predictable results, but at the cost of adding new services to maintain, etc.
Well, there are a couple of ways; the simplest would be:
When a user submits a request, save it somewhere (let's call it the jobs table) and inform the customer that the request has been received and that they will be updated once the site finishes processing it, or whatever suits you.
Now create one (or multiple) scripts, depending on the requirements, and run them from cron. Each script picks requests from the jobs table, processes them and does whatever is required.
Alternatively, you can evaluate the possibility of a message queue, or maybe use a job server for this.
So, it all depends on your requirements.

Gearman: is there still no way to retrieve custom data from a background worker?

First things first, I'm aware of this question:
Gearman: Sending data from a background worker to the client
What I want to know, is it still the case with Gearman? I'm planning on sending a batch of image URLs from a PHP web application to the gearman worker (also written in PHP; let's call it "The Main Worker") for processing asynchronously. This worker will then submit a separate task for each image to lower-tier workers (via addTask()), call runTasks() and wait for the tasks to finish, while listening to exceptions, accumulating error messages and updating the overall job status.
While I'm perfectly OK with retrieving the overall status from the Main Worker using jobStatus() calls, and just saying that all of the images were processed when [false, false, 0, 0] is returned, I definitely need to be able to inform the users that some of the images couldn't be retrieved from their respective URLs or stored on the server.
I suppose I could always just store the custom data in memcache, then retrieve it from the web app, but it just seems "dirtier" to me...
I'm not trying to get an actual result back, because from what I've seen in the manual on php.net, even exception handling can only be done when the task is submitted synchronously, not to mention custom data retrieval. I just hoped there might be something I'm missing.
If I remember correctly, we're using Ubuntu Server 12.04 with libgearman6 (v0.27) and PHP 5.3.10. The version of the gearman extension is 1.0.2. I think the database is irrelevant here, as I will not be using it in either of the workers. And I think we're not using persistent queues right now.
Since gearman won't keep any task information in memory after a task has finished (it just reports it back for a synchronous task), you won't be able to retrieve it in your web application without storing it in a third-party location. We usually use a simple web service in the application for this, letting the worker call back to the application when a task has completed or an error has occurred. This allows us to keep the business logic about what we'd like to do when such an error happens in the application, where it belongs, and lets our workers stay more general (we might need image resizing in many apps, but some apps might want to start several sub-tasks that depend on the image resizing being done first).
As you write, you could also let the worker write the state of the task directly to the database or to memcached, but I've found that letting the application itself handle the logic, instead of having to change and special-case the workers, works better. It also suits a worker framework well, letting you keep the same standardized way of handling callbacks across the actual worker code.
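A sketch of that callback idea on the worker side, using the pecl gearman extension and a plain cURL POST back to the application; the endpoint URL, payload fields and fetch_and_store_image() helper are all made up for the example:

<?php
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

$worker->addFunction('process_image', function (GearmanJob $job) {
    $payload = json_decode($job->workload(), true);

    // Hypothetical helper: returns null on success, or an error message.
    $error = fetch_and_store_image($payload['url']);

    // Report the per-image outcome back to the web application.
    $ch = curl_init('https://app.example.com/internal/image-callback');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode(array(
            'batch_id' => $payload['batch_id'],
            'url'      => $payload['url'],
            'error'    => $error,
        )),
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    curl_close($ch);
});

while ($worker->work());

The application endpoint then owns the decision of what to tell the user about failed images, which keeps the worker generic.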

Commercial PHP script, long running processes. daemons vs. cronjobs?

I'm putting together my first commercial PHP application, it's nothing really huge as I'm still eagerly learning PHP :)
Right now I'm still in the conceptual stage of planning my application, but I run into one problem all the time: the application is supposed to be self-hosted by my customers on their own servers, and it will include some very long-running scripts, depending on how much data each customer enters into the application.
Now I think I have two options: either use cronjobs, for example letting one or multiple cronjobs run at times every customer can set himself, OR implement the whole data processing as daemons that run in the background...
My question is, since it's a self-hosted application (and every server is different)... is it even recommended to try to write PHP that starts background processes on a customer's server, or is this more something that you can only do reliably on your own server...?
Or should I use cronjobs for these long running processes?
(depending on the amount of data my customers will enter in the application, a process could run 3+ hours)
Is that even a problem that can be solved, reliably, with PHP...? Excuse me if this is a weird question; I'm really not experienced with PHP daemons and/or long-running cronjobs created with PHP.
So to recap everything:
Commercial self-hosted application, including long-running processes: cronjobs or daemons? And is either, or maybe both, a reliable solution for a paid application that you can give to your customers with a clear conscience, because you know it will work reliably on all kinds of different servers...?
EDIT*
PS: Sorry, I forgot to mention that the application targets only Linux servers, so everything like Debian, Ubuntu etc etc.
Short answer: no, don't go for background processes if this will be a client-hosted solution. If you go towards the ASP concept (Application Service Provider... not Active Server Pages ;)) then you can do some wacky stuff with background processes and external apps connecting to your SQL servers and processing stuff for you.
What I suggest is to create a strong task management backbone and link that to a solid task processing infrastructure. I recommend you read an old post I wrote quite some time ago regarding background processes and a strategy I adopted to handle long-running processes:
Start & Stop PHP Script from Backend Administrative Webpage
Happy reading...
UPDATE
I realize that my old post is far from easy to understand so here goes:
You need 2 models, Job and JobQueue, and 2 controllers, JobProcessor and XYZController.
JobProcessor is called either when a user triggers a page or by a cronjob, as you wish. JobProcessor::process() is the key that starts the whole processing, or continues it. It loads the JobQueues and asks them if there is work to do. If there is, it asks the job queue to start/continue its job.
JobQueue model: used to queue several jobs one behind the other; it controls which job is current by keeping some kind of ID and STATE for the job that is running.
Job model: represents exactly what needs to be done. It contains, for example, the name of the controller that will process the data, the function to call to process the data, and a serialized configuration property that describes what must be done.
XYZController: the one that contains the processing method. When the processing method is called, the controller must load everything it needs into memory and then process each individual unit of work as fast as possible.
Example:
Call of index.php
Index.php creates a jobprocessor controller
Index.php calls the jobprocessor's process()
JobProcessor::Process() loads all the queues and processes them
For each JobQueue::Process(), the job queue loads its possible Jobs and detects whether one is currently running or not. If none is running, it starts the next one by calling Job::Process();
Job::Process() creates the XYZController that will work the task at hand. For example, my old system had an InvoicingController and a MassmailingController that worked hand in hand.
Job::Process() calls XYZController::Prepare() so that it loads the information it needs to process (for example, load a batch of emails to process, or a batch of invoices to create)
Job::Process() calls XYZController::RunWorkUnit() so that it processes a single unit of work (For example, create one invoice, send one email)
Job::Process() asks JobProcessingController::DoIStillHaveTimeToProcess() and if so, continues processing the next element.
Job::Process() runs out of time and calls XYZController::Cleanup() so that all resources are released
JobQueue::Process() ends and returns to JobController
JobController::Process() is about to end? Open a socket and call myself back so I can start another round of processing, until there is nothing left to do anymore
Handle the request from the user that started at step #1.
Ultimately, you can instead open a socket each time and ask the processor to do something, or you can queue a CronJob to call your processor. This way your users won't get stuck waiting for the 3/4 work units to complete each time.
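To make the flow above concrete, here is a compressed, self-contained sketch; the data is kept in memory instead of a database purely to show the control structure, and all names follow the description above:

<?php
interface WorkUnitController
{
    public function prepare(array $config);
    public function runWorkUnit();   // return false when no work is left
    public function cleanup();
}

class Job
{
    private $controllerClass;
    private $config;

    public function __construct($controllerClass, array $config)
    {
        $this->controllerClass = $controllerClass;
        $this->config = $config;
    }

    public function process($haveTimeLeft)
    {
        $class = $this->controllerClass;           // e.g. 'InvoicingController'
        $controller = new $class();
        $controller->prepare($this->config);

        // One small work unit at a time, so the time budget is checked between units.
        while ($controller->runWorkUnit()) {
            if (!$haveTimeLeft()) {
                break;                             // defer the rest for the next round
            }
        }
        $controller->cleanup();
    }
}

class JobQueue
{
    private $jobs = array();

    public function push(Job $job)
    {
        $this->jobs[] = $job;
    }

    public function process($haveTimeLeft)
    {
        while ($this->jobs && $haveTimeLeft()) {
            $job = array_shift($this->jobs);
            $job->process($haveTimeLeft);
        }
    }
}

class JobProcessor
{
    public function process(array $queues, $timeBudgetSeconds = 25)
    {
        $deadline = time() + $timeBudgetSeconds;
        $haveTimeLeft = function () use ($deadline) {
            return time() < $deadline;
        };

        foreach ($queues as $queue) {
            $queue->process($haveTimeLeft);
        }
        // If work is still pending here, trigger another round: via cron, or by
        // opening a socket and calling yourself back as described in the steps above.
    }
}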
It's worth noting that, in addition to running daemons or cron jobs, you can kick off long-running processes from a web request (but note that they must run outside of the web server's process group), and of course there is asynchronous message processing (which is essentially a variant on the batch approach).
All four of these approaches behave very differently in terms of how concurrency and timing are managed. The factors that differentiate them are the same ones you omitted from your question, so it's not really possible to answer.
Unfortunately, all of them rely on facilities which are very different between MS Windows and POSIX systems, so although PHP will run on both, if you want to sell your app on both platforms it's going to need two versions.
Maybe you should talk to your potential customer base and ask them what they want?
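As a footnote to the web-request option mentioned above: on the Linux-only targets in question, kicking off a process so it survives the request usually amounts to detaching it from the web server. This is only a sketch, and the script path is illustrative:

<?php
// Start the worker detached from the web server: discard its output, background it,
// and capture the PID so a watchdog can later check whether it is still alive.
$pid = (int) shell_exec(
    'nohup php /var/www/app/scripts/long_import.php > /dev/null 2>&1 & echo $!'
);
// Store $pid (e.g. in a tasks table) if you want to detect crashed runs later.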

Start & Stop PHP Script from Backend Administrative Webpage

I'm trying to create a webpage that will allow me to start and stop a PHP script. The script is part of the backend of a site, and will need to read data from, process that data in, and update a database on the same server the script lives on. Additionally, I'd like the webpage to allow an administrator to view the current status (running or stopped), and even link to logs created by the script.
I'm starting to go down the path of learning about PHP's exec, passthru, and related functions. Is this the correct path to take? Are there other ways to do this that would be more suitable? Is it possible to do this in a platform agnostic way? I'm developing on a LAMPP+CakePHP stack in Windows, and would like the functionality to exist on any webhost I choose.
I've done this in a recent job, but it's probably overkill for you. I built a job processor; it basically uses 2 tables in the database, 2 objects at a minimum and 2 controllers at a minimum.
The first part is the job processing unit. It is composed of a job processor controller that manages the requests to start or continue a job, and it comes with two active-record models, JobQueue and Job. You can remove the queue, but it's always practical to have queuing in such systems so you can say that 2, 3 or 4 jobs can execute at once.
The queue is only that, it's a slot that gets several jobs attached to it and it has a queue status to determine if it is running right now or not.
The job is a virtual object that maps to a job table describing what has to be done. In my implementation, I created an interface that must be implemented by the called controller, plus a field and a type in the database. The Job instantiates the controller class to call (not the job processor controller; another controller that manages the operation to do) and calls a method on it to start the task processing.
Now, to get tricky, I forced my system to run on a dedicated server just for that portion, because I didn't want the task to load the main server or jam Apache's processing queue. So I had two servers, and my Queue class was in charge of calling, via an IP address, a page on the other server to run the job on that server specifically. When the job was done, it called itself back using an HTTP request to restart processing and do the next task. If no task was left, it would simply die normally.
The advantage of doing it this way is that it doesn't require a cronjob (as long as your script is super stable and cannot crash), because it gets triggered by you when you want it; after that you can let it go, and it calls itself back with an fsockopen call to trigger another page view that triggers the next job.
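The "call itself back" part can be as small as a fire-and-forget socket request; the host and path here are placeholders:

<?php
// Trigger the next processing round without waiting for it to finish.
function trigger_next_round()
{
    $fp = @fsockopen('jobs.example.com', 80, $errno, $errstr, 5);
    if (!$fp) {
        return; // job server unreachable; a cron fallback could pick this up
    }

    fwrite($fp, "GET /jobs/process HTTP/1.1\r\n"
              . "Host: jobs.example.com\r\n"
              . "Connection: Close\r\n\r\n");

    // Close right away: we only want to start the next round, not read its output.
    fclose($fp);
}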
Work units
It is important to understand that if your jobs are very large, you should segment them. I used the principle of a "work unit" to describe one part of the job that has to be done any number of times. The queue processor then became a time manager too: if it detected that a job had taken more than X seconds, it would simply defer the remaining steps for later, call itself back and continue where it left off. That way, you don't need set_time_limit() and you don't jam your server while a 30-second script executes.
I hope this helps!
To run a script which runs continually, you need to think about the following:
Your PHP script should be launched as a CLI (command line) process by a job scheduler like cron or something else. Don't forget that your web server configuration defines a timeout on executed scripts.
To run 24 hours a day, you might imagine implementing an infinite loop. In that case, you can write a check like jobIsActive which, on every iteration, reads from a file or the database whether the job should be executed or not. Clicking the button just changes the job status (updates the file, the DB, ...). Both of your buttons can stop or activate the treatment, but they don't stop the infinite loop.
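A bare-bones sketch of such a loop, assuming the flag lives in a job_flags table; the table, credentials and run_one_batch() helper are all made up for the example:

#!/usr/bin/env php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret');

function jobIsActive(PDO $pdo)
{
    $stmt = $pdo->query("SELECT active FROM job_flags WHERE name = 'importer'");
    return (bool) $stmt->fetchColumn();
}

while (true) {
    if (jobIsActive($pdo)) {
        run_one_batch();   // hypothetical: one short pass of the actual treatment
    }
    sleep(10);             // the loop never stops; only the work is switched on/off
}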
An infinite loop isn't the most elegant solution; why not write an entry in the crontab to execute the job each night, while a click on a button can also fire it manually?
