I found that pthreads does not work in a web environment. I use PHP 7.1 with FPM on Debian Linux, together with Symfony 3.2. All I want to do is, for example:
The user makes a request and PUTs a file (about 1 GB).
The PHP server receives the file and processes it.
It immediately returns true to the user (a JsonResponse) without waiting for the uploaded file to be processed.
Later, when processing is finished (move, copy, duplicate, whatever you want), it fires an event or a callback from the background and notifies the user.
For now I have created a console command. I run (new Process('bin/console my:command'))->start(); in the background and do my processing there. But this feels like killing a fly with a bazooka, because I have to pass many variables to the command.
All I want is to create another thread and return to the user without waiting for the processing.
You may say this is a duplicate and point to pthreads. But the pthreads documentation states that it is intended for the CLI only. Also, the latest version of pthreads doesn't work with Symfony (fatal error).
I am stuck at this point and wonder whether I should stay with creating a process for each uploaded file or move to Python and Django.
You don't want threads. You want a job queue. Have a look at Gearman or similar things.
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
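For the upload case, a minimal sketch of the client side might look like this. It assumes the PECL gearman extension is installed and a gearmand server is listening on 127.0.0.1:4730. The job name process_upload, the payload fields, and both helper functions are made up for illustration.

```php
<?php
// Hypothetical job name; the worker must register a function under the same name.
const JOB_NAME = 'process_upload';

/** Build the workload string handed to the worker (JSON keeps it language-neutral). */
function buildWorkload(string $path, int $userId): string
{
    return json_encode(['path' => $path, 'user_id' => $userId]);
}

/** Queue the job and return Gearman's job handle without waiting for the worker. */
function queueUpload(string $path, int $userId): string
{
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730); // assumed gearmand location
    // doBackground() returns as soon as the server has accepted the job.
    return $client->doBackground(JOB_NAME, buildWorkload($path, $userId));
}

// In the controller: queue the job, then send the JSON response right away.
// $handle = queueUpload('/tmp/upload.bin', 42);
```

Everything the worker needs travels in the workload string, so there are no command-line arguments to assemble as with the console command approach.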
Related
I have a website, created using PHP and running on Apache. I want a subscriber to be able to log in and start a process on the server. They can then log out or close the browser without interrupting the process. Later they can log in and see the progress or see the results of the original process. What is the best way to accomplish this (having the process run until completion, after the browser is closed)?
Just looking for someone to point me in the right direction. A few people mentioned Gearman.
Gearman would be an ideal candidate, and I would use it for exactly the purpose you describe. It has everything you need out of the box to meet your requirements ("background" a long running, CPU-bound process to another machine, e.g. video encoding).
There is a Gearman PHP library, but you can write your worker code in a different language if it's better suited to doing the work.
For reporting progress information, I recommend having the worker write to Redis or Memcached - some kind of temporary storage that your web server can also access.
Check out the simple PHP example on the Gearman site. For learning, I recommend setting up a lab environment with three separate VMs: one for your web server (the client), one for the Gearman job queue (the server), and one for processing jobs (the workers).
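The progress reporting could be sketched roughly as follows. To keep the example runnable without extra services, a JSON file in the temp directory stands in for Redis or Memcached, which only works if the web server and the worker share a filesystem; with separate VMs you would swap in a real shared store. The job name encode_video and all helper names are hypothetical, and the worker loop assumes the PECL gearman extension.

```php
<?php
/** Location of the progress record; a stand-in for a Redis/Memcached key. */
function progressFile(string $jobHandle): string
{
    return sys_get_temp_dir() . '/progress_' . md5($jobHandle) . '.json';
}

/** Called by the worker after each unit of work. */
function writeProgress(string $jobHandle, int $done, int $total): void
{
    $data = json_encode(['done' => $done, 'total' => $total]);
    file_put_contents(progressFile($jobHandle), $data, LOCK_EX);
}

/** Called by the web application; returns null if nothing has been reported yet. */
function readProgress(string $jobHandle): ?array
{
    $file = progressFile($jobHandle);
    return is_file($file) ? json_decode(file_get_contents($file), true) : null;
}

/** Worker process: run from the CLI, not from a web request. */
function runWorker(): void
{
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730); // assumed gearmand location
    $worker->addFunction('encode_video', function (GearmanJob $job) {
        $total = 10; // pretend the job consists of 10 slices
        for ($i = 1; $i <= $total; $i++) {
            // ... process one slice of $job->workload() here ...
            writeProgress($job->handle(), $i, $total);
        }
    });
    while ($worker->work()) {
        // keep taking jobs until work() reports a failure
    }
}
```

The web application keeps the handle it got back when it submitted the job and passes it to readProgress() whenever the user checks in.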
First things first, I'm aware of this question:
Gearman: Sending data from a background worker to the client
What I want to know is whether that is still the case with Gearman. I'm planning on sending a batch of image URLs from a PHP web application to a Gearman worker (also written in PHP; let's call it "The Main Worker") for asynchronous processing. This worker will then submit a separate task for each image to lower-tier workers (via addTask()), call runTasks() and wait for the tasks to finish, while listening for exceptions, accumulating error messages and updating the overall job status.
While I'm perfectly ok with retrieving the overall status from the Main Worker using jobStatus() calls, then just say that all of the images were processed when [false, false, 0, 0] is returned, I definitely need to be able to inform the users that some of the images couldn't be retrieved from their respective URLs or stored on the server.
I suppose I could always just store the custom data in memcache, then retrieve it from the web app, but it just seems "dirtier" to me...
I'm not trying to get any result, because from what I've seen in the manual on php.net, even the exception handling can only be done when the task is submitted synchronously, not mentioning the custom data retrieval. I just hoped that there could be something I'm missing.
If I remember correctly, we're using Ubuntu Server 12.04 with libgearman6 (v 0.27) and PHP 5.3.10. The version of the gearman extension is 1.0.2. I think the database is irrelevant here, as I will not be using it in either of the workers. And I think we're not using persistent queues right now.
Since Gearman won't keep any task information in memory after a task has finished (it only reports it back for a synchronous task), you won't be able to retrieve it in your web application without storing it in a third-party location. We usually use a simple web service in the application for this, letting the worker call back to the application when a task has completed or an error has occurred. This allows us to keep the business logic about what we'd like to do when such an error happens in the application, where it belongs, and lets our workers be more general (we might need image resizing in many apps, but some apps might want to start several sub-tasks that depend on the image resizing being done first).
As you write, you may also let the worker write the state of the task directly to the database or to memcached, but I've found that letting the application itself handle the logic, instead of having to change and special-case the workers, works better. It's also well suited to a worker framework, letting you keep the same standardized way of handling callbacks across the actual worker code.
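A rough sketch of such a callback from the worker side, using only PHP's built-in HTTP stream wrapper. The URL and the result fields are assumptions; the application would expose whatever endpoint and payload suit it.

```php
<?php
/** Build the HTTP context for a JSON POST carrying the task result. */
function buildCallbackContext(array $result)
{
    return stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode($result),
        'timeout' => 5, // seconds; do not let a slow app stall the worker
    ]]);
}

/** POST the result to the application; returns false if the request failed. */
function notifyApplication(string $url, array $result): bool
{
    return @file_get_contents($url, false, buildCallbackContext($result)) !== false;
}

// At the end of a worker function (hypothetical endpoint and fields):
// notifyApplication('http://app.example/tasks/callback', [
//     'task'   => 'fetch_image',
//     'url'    => 'http://example.com/a.jpg',
//     'status' => 'failed',
//     'error'  => 'HTTP 404',
// ]);
```

The worker only reports what happened; deciding what to tell the user stays in the application.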
I am familiar with the various methods available within php for spawning new processes, forking, etc... Everything I have read urges against using pcntl_fork from within a web-accessible app. Can anyone tell me why this is not recommended?
At a fundamental level, I can see how if you are not careful, things could quickly get out of hand. But what if you are careful? In my case, I would like to pcntl_fork my parent script into a new child, run a short series of specific functions, and then close the child. Seems pretty straightforward, right? Would it still be dangerous for me to try this?
On a related note, can anyone talk about the overhead involved in doing this a different way... Calling proc_open() to launch an entirely new PHP process? Will I lose any possible speed increase by having to launch the new process?
Background: Consider a site with roughly 2,000 concurrent users running fastcgi.
Have you considered gearman for 'forking' new processes? It's also described as 'a distributed forking mechanism' so your workers do not need to be on the same machine.
Synchronous and asynchronous calls are also available.
You will find it here: http://gearman.org/ and it might be a candidate solution to the problem.
I would like to propose another possibility... Tell me what you think about this.
What if I created a pool of web servers whose sole job was to respond to job requests from the master application server? I would have something like this:
Master Application Server (Apache, PHP - FastCGI)
Application Worker Server (Apache, PHP - FastCGI)
Application Worker Server (Apache, PHP - FastCGI)
Application Worker Server (Apache, PHP - FastCGI)
Application Worker Server (Apache, PHP - FastCGI)
Instead of spawning new PHP processes on my master application server, I would send out job requests to my "workers" using asynchronous sockets. The workers would then run these jobs in realtime and send the result back to the main application server.
Has anyone tried this? Do you foresee any problems? It seems to me that this might work wonderfully.
The problem is not that the app is web-accessible.
The problem is that the web server (or here the FastCGI module) may not handle forks very well. Just try it yourself.
My app takes a long list of URLs and splits it into X parts (where X = $threads), so that I can start a thread.php for each part and work out which URLs it should handle. Each one then makes GET and POST requests to retrieve data.
I am using this:
$pid = [];
for ($x = 1; $x <= $threads; $x++) {
    // Start thread.php in the background; "echo $!" prints the PID of the new process.
    $pid[] = exec("/path/bin/php thread.php <options> > /dev/null & echo \$!");
}
That is my "threading" (I know it's not really threading; is it forking, or what?). I save the PIDs to a file so that I can later check whether thread N is still running and stop it.
Now I want to move out from php, I was thinking about using python because I'd like to learn more about it.
How can I achieve this kind of "threading" with python? (or ruby)
Or is there a better way to launch multiple background threads in python or ruby that runs in parallel (at the same time)?
The threads don't need to communicate with each other or with a main thread. They are independent: they make HTTP requests and interact with a MySQL database, and they may need to access or modify the same table entries (I haven't thought about this or how I will solve it yet).
The app works with "projects", each project has a "max threads" variable and I use a web interface to control it (so I could still use php for the interface [starting/stopping threads] in the new app).
I wanted to use
from threading import Thread
in Python, but I've been told those threads won't run in parallel, only one at a time.
The app is intended to run on linux web servers.
Any suggestion will be appreciated.
For Python 2.6+, consider the multiprocessing module:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
For Python 2.5, the same functionality is available via pyprocessing.
In addition to the example at the links above, here are some additional links to get you started:
multiprocessing Basics
Communication between processes with multiprocessing
You don't want threading. You want a work queue like Gearman that you can send jobs to asynchronously.
It's worth noting that this is a cross-platform, cross-language solution. There are bindings for many languages (including Python and PHP) provided officially, and many more unofficially with a bit of work with Google.
The original intent is effectively load balancing, but it works just as well with only one machine. Basically, you can create one or more Workers that listen for Jobs. You can control the number of Workers and the types of Jobs they can listen for.
If you insert five Jobs into the queue at the same time, and there happen to be five Workers waiting, each Worker will be handed one of the Jobs. If there are more Jobs than Workers, the extra Jobs wait in the queue and are picked up as Workers become free. Your Client (the thing that submits Jobs) can either wait for all of the Jobs it has created to complete, or it can simply place them in the queue and continue on.
I know PHP is not multithreaded, but I talked with a friend about this: if I have a large algorithmic problem I want to solve with PHP, isn't the solution simply to use the "curl_multi_xxx" interface and start n HTTP requests against the same server? This is what I would call PHP-style multithreading.
Are there any problems with this in the typical web server environment? The master request, which is waiting on "curl_multi_exec", shouldn't count any time against its maximum runtime or use much of its memory limit.
I have never seen this promoted anywhere as a way to keep a script from being killed by overly restrictive admin settings for PHP.
If I add this as a feature to a popular PHP system, will there be server admins hiring a Russian mafia hitman to get revenge for this hack?
"If I add this as a feature to a popular PHP system, will there be server admins hiring a Russian mafia hitman to get revenge for this hack?"
No, but it's still a terrible idea, if for no other reason than that PHP is supposed to render web pages, not run big algorithms. I see people trying to do this in ASP.NET all the time. There are two proper solutions.
Have your PHP script spawn a process that runs independently of the web server and updates a common data store (probably a database) with information about the progress of the task that your PHP scripts can access.
Have a constantly running daemon that checks for jobs in a common data store, which the PHP scripts can issue jobs to and use to view the progress of currently running jobs.
By using curl, you are adding a network timeout dependency into the mix. Ideally you would run everything from the command line to avoid timeout issues.
PHP does support forking (pcntl_fork). You can fork some processes and then monitor them with something like pcntl_waitpid. You end up with one "parent" process that monitors the children it spawned.
Keep in mind that while one process can start up, load everything, and then fork, you can't share things like database connections, so each forked process should establish its own. I've used forking for up to 50 processes.
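A minimal sketch of that parent/child pattern, assuming the script runs from the command line with the pcntl extension enabled. The function name runInChildren and the convention that each task returns an exit code are made up for illustration.

```php
<?php
/**
 * Fork one child per task, wait for all of them, and return each exit code.
 * $tasks maps a label to a callable that returns an int exit code (0..255).
 */
function runInChildren(array $tasks): array
{
    $pids = [];
    foreach ($tasks as $name => $task) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException("fork failed for $name");
        }
        if ($pid === 0) {
            // Child: open its own database connection here, never reuse the parent's.
            exit($task());
        }
        $pids[$pid] = $name; // parent remembers which child runs which task
    }

    $exitCodes = [];
    foreach ($pids as $pid => $name) {
        pcntl_waitpid($pid, $status); // blocks until this child has finished
        $exitCodes[$name] = pcntl_wexitstatus($status);
    }
    return $exitCodes;
}

// Usage (hypothetical tasks):
// $codes = runInChildren([
//     'first'  => function () { /* work */ return 0; },
//     'second' => function () { /* work */ return 0; },
// ]);
```

Each child exits as soon as its task returns, so it never falls through into the parent's code after the loop.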
If forking isn't available for your install of PHP, you can spawn a process as Spencer mentioned. Just make sure you spawn the process in such a way that it doesn't stop processing of your main script. You also want to get the process ID so you can monitor the spawned processes.
exec("nohup /path/to/php.script > /dev/null 2>&1 & echo $!", $output);
$pid = $output[0];
You can also use the above exec() setup to spawn a process started from a web page and get control back immediately.
Out of curiosity - what is your "large algorithmic problem" attempting to accomplish?
You might be better off writing it as an Amazon EC2 service and selling access to the service rather than the package itself.
Edit: you now mention "mass emails". There are already services that do this, they're generally known as "spammers". Please don't.
Lothar,
As far as I know, PHP doesn't run as a service the way its competitors do, so there is no way for PHP to know how much time has passed unless you constantly interrupt the process to check the elapsed time. So, in my opinion, no, you can't do that in PHP :)