Does PHP proc_nice leave Apache threads at new priority setting? - php

When executing proc_nice(), is it actually nice'ing Apache's thread?
If so, and if the current user (non-super user) can't renice to its original priority is killing the Apache thread appropriate (apache_child_terminate) on an Apache 2.0x server?
The issue is that I am trying to limit the impact of an app that allows the user to run Ad-Hack queries. The Queries can be massive and the resultant transform on the data requires a lot of Memory and CPU.
I've already re-written the process to be more stream based - helping with the memory consumption, but I would also like the process to run a lower priority. However I can't leave the Apache thread in low priority as we have a lot of high-priority web services running on this same box.
TIA

In that kind of situation, a solution if often to not do that kind of heavy work within the Apache processes, but either :
run an external PHP process, using something like shell_exec, for instance -- this is if you must work in synchronous mode (ie, if you cannot execute the task a couple of minutes later)
push the task to a FIFO system, and immediatly return a message to the user saying "your task will be processed soon"
and have some other process (launched via a crontab every minute, for instance) check that FIFO queue
and do the processing it there is something in the queue
That process, itself, can run in low priority mode.
As often as possible, especially if the heavy calculations take some time, I would go for the second solution :
It allows users to get some feedback immediatly : "the server has received your request, and will process it soon"
It doesn't keep Apaches's processes "working" for long : the heavy stuff is done by other processes
If, one day, you need such an amount of processing power that one server is not enough anymore, this kind of system will be easier to scale : just add a second server that'll pick from the same FIFO queue
If your server is really too loaded, you can stop processing from the queue, at least for some time, so the load can get better -- for instance, this can be usefull if your critical web-services are used a lot in a specific time-frame.
Another (nice-looking, but I haven't tried it yet) solution would be to use some kind of tool like, for instance, Gearman :
Gearman provides a generic application
framework to farm out work to other
machines or processes that are better
suited to do the work. It allows you
to do work in parallel, to load
balance processing, and to call
functions between languages. It can be
used in a variety of applications,
from high-availability web sites to
the transport of database replication
events. In other words, it is the
nervous system for how distributed
processing communicates.

Related

What is a typical Gearman flow for database modification?

Would appreciate some help understanding typical best practices in carrying out a series of tasks using Gearman in conjunction with PHP (among other things).
Here is the basic scenario:
A user uploads a set of image files through a web-based interface. The php code responding to the POST request generates an entry in a database for each file, mostly with null entries in the columns, queues a job for each to do analysis using Gearman, generates a status page and exits.
The Gearman worker gets a job for a file and starts a relatively long-running analysis. The result of that analysis is a set of parameters that need to be inserted back into the database record for that file.
My question is, what is the generally accepted method of doing this? Should I use a callback that will ultimately kick off a different php script that is going to do the modification, or should the worker function itself do the database modification?
Everything is currently running on the same machine; I'm planning on using Gearman for background scheduling, rather than for scaling by farming out to different machines, but in any case any of the functions could connect to the database wherever it is.
Any thoughts appreciated; just looking for some insights on how this typically gets structured and what might be considered best practice.
Are you sure you want to use Gearman? I only ask because it was the defacto PHP job server about 15 years ago but hasn't been a reliable solution for quite some time. I am not sure if things have drastically improved in the last 12 months, but last time I evaluated Gearman, it wasn't production capable.
Now, on to the questions.
what is the generally accepted method of doing this? Should I use a callback that will ultimately kick off a different php script that is going to do the modification, or should the worker function itself do the database modification?
You are going to follow this general pattern with any job queue:
Collect a unit of work. In your case, it will be 1 of the images and any information about who that image belongs to, user id, etc.
Submit the work to the job queue with this information.
Job Queue's worker process picks up the work, and starts processing it. This is where I would create records in the database as you can opt to not create them on job failure.
The job queue is going to track which jobs have completed and usually the status of completion. If you are using gearman, this is the gearmand process. You also need something pickup work and process that work, I will refer to this as the job worker. The job worker is where the concurrency happens which is what i think you were referring to when you said "kick off a different php script." You can just kick off a PHP script at an interval (with supervisord or a cronjob) for a kind of poll & fork approach. It's not the most efficient approach, but it doesn't sound like it will really matter for your applications use case. You could also use pcntl_fork or pthreads in PHP to get more control over your concurrent processes and implement a worker pool pattern, but it is much more complicated than just firing off a script. If you are interested in trying to implement some concurrency in PHP, I have a proof-of-concept job worker for beanstalkd available on GitHub that implements a worker pool with both fork and pthreads. I have also include a couple of other resources on the subject of concurrency.
Job Worker (pthreads)
Job Worker (fork)
PHP Daemon Example
PHP IPC Example

Temporary storage for collecting data prior to sending

I'm working on a composer package for PHP apps. The goal is to send some data after requests, queue jobs, other actions that are taken. My initial (and working) idea is to use register_shutdown_function to do it. There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API. Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute. The only issue I can see with this approach is managing concurrency on hight IO. Because many processes will be writing to the file every (n) ms, there's an issue with reading the file and removing lines that had been already sent.
Another option which I'm trying to desperately avoid is using the client database. This could potentially cause performance issues.
What would be the preferred way to do this?
Edit: the package is essentially a monitoring agent.
There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API
I'm not sure you can get around this, there will be additional overhead to doing more work within the context of a web request. I feel like using a job-queue based/asynchronous system is minimizing this for the client. Whether you choose a local file system write, or a socket write you'll have that extra overhead, but you'll be able to return to the client immediately and not block on the processing of that request.
Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
Isn't this the whole point?? :p To return to your client immediately, and then asynchronously complete the job at some point in the future? Using a job queue allows you to decouple and scale your worker pool and webserver separately. Your webservers can be pretty lean because heavy lifting is deferred to the workers.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute.
I would def recommend looking at a job queue opposed to rolling your own. This is pretty much solved and there are many extremely popular open source projects to handle this (any of the MQs) Will the minute cron job be doing the computation for the client? How do you scale that? If a file has 1000 entries, or you scale 10x and has 10000 will you be able to do all those computations in less than a minute? What happens if a server dies? How do you recover? Inter-process concurrency? Will you need to manage locks for each process? Will you use a separate file for each process and each minute? To bucket events? What happens if you want less than 1 minute runs?
Durability Guarantees
What sort of guarantees are you offering your clients? If a request returns can the client be sure that the job is persisted and it will be completed at sometime in the future?
I would def recommend choosing a worker queue, and having your webserver processes write to it. It's an extremely popular problem with so many resources on how to scale it, and with clear durability and performance guarantees.

Managing workers with RabbitMQ

I have implemented rabbitMQ in my current php application to handle asynchroneous jobs that are handled by workers. But my current problem is that how should i monitor and scale up or down the workers. Also, i want to add error handling in case all the workers die. I have thought of following two ways but don't know which one is the better:
At producer end, i would analyze the rabbitMQ queue size. If queue size (list of pending tasks) is more than a threshold, i would create one new worker everytime producer script executes but before that i would check the server load (using linux command uptime). If server load is less than a threshold then only new worker would be created. At consumer end (in worker.php), i would apply same method to scale up the workers and i would also check that if script is idle for a given time (i.e. there is no pending task in rabbit mq queue) then it would automatically die (to automate scaling down of workers).
Second method is to use background process or cron to monitor and scale/up down the workers. But i don't want to rely on cron (as i have very bad experiences with it) or background process because if background process crashes for some reason then there is no way to recover from it.
Please help.
I wouldn't recommend bothering to scale them down to nothing when there's no work to be done. The worker that's left (if you want to scale back to 1) will simply wait for something else to consume and it's not an expensive operation.
In terms of determining whether to scale up, I'd recommend leveraging the RabbitMQ Management HTTP API (http://hg.rabbitmq.com/rabbitmq-management/raw-file/3646dee55e02/priv/www-api/help.html). You can use the queue related aspects via a GET operation to get information about queues, including how many entries are currently waiting to be processed.
With that info, you can decide to scale if it either hits a certain threshold, or keeps increasing with every check for a certain amount of time, or something similar. This can be done from the consumer side.
In terms of error handling, I would recommend encapsulating the RabbitMQ connection aspect of your workers such that if a RabbitMQ exception occurs the connection is re-established from scratch and continues.
If it's a more serious type of exception that isn't RabbitMQ-related, you may need to catch it at such a level where the worker basically spawns a new worker before it dies. Then of course there are other types of exceptions (out of memory conditions, for example), where it really isn't feasible to try to continue and your program should just completely die.
It is very difficult to answer your question with any degree of accuracy since there are many aspects of the context which are not included.
How long do the tasks take to execute?
Why do you want to scale up/down? Why don't you have threads waiting for load in the first place?
That being said, coming from the world of Erland and functional programming (which is the language used to power RabbitMQ) I would like to suggest the concept of a SUPERVISOR thread. This thread would have the following responsibilities:
Spawn threads depending on the load/qty of requests
Discard threads depending on the load/qty of requests
Monitor the children threads and re-launch them as required reprocessing the same messages if necessary or discarding them
The Supervisor thread should be as easy as possible and should be built in such a way that it simply loops, sleeps and checks if all the threads that need to be alive actually are - it can then check the load and spawn up or kill off the workers as needed. Or in other words, spawn more and/or not-spawn depending on your needs.
You could easily use an exchange to send messages to both the supervisor and the worker queues where the supervisor would then be able to keep a record/count of the messages in the queue without having to write polling code to the server, it would simply listen to it's own queue. You can increase/dec the counter from the supervisor thread and manage everything from there.
Hope this helps.
See: http://docs.dotcloud.com/guides/daemons/
Regretfully I don't program in PHP and therefore cannot give you PHP-specific assistance, this is however the programming pattern that I recommend that you use. If PHP doesn't allow multi-threaded programming and/or threads then I would highly recommend that you use a language that does since you will not be able to scale and use the full power of the local machine unless you use multiple threads. As for the supervisor crashing, if you keep minimal work in the supervisor and delegate all responsibilities to children threads then the risk of a supervisor crash is minimal.
Perhaps this will help:
Philosophy:
http://soapatterns.org/design_patterns/service_agent
PHP-specific:
http://www.quora.com/PHP-programming-language-1/Is-there-an-actor-framework-for-php

Commercial PHP script, long running processes. daemons vs. cronjobs?

I'm putting together my first commercial PHP application, it's nothing really huge as I'm still eagerly learning PHP :)
Right now I'm still in the conceptual stage of planning my application but I run into one problem all the time, the application is supposed to be self-hosted by my customers, on their own servers and will include some very long running scripts, depending on how much data every customer enters in his application.
Now I think I have two options, either use cronjobs, like for example let one or multiple cronjobs run at a time that every customer can set himself, OR make the whole processing of data as daemons that run in the background...
My question is, since it's a self-hosted application (and every server is different)... is it even recommended to try to write php that starts background processes on a customers server, or is this more something that you can do reliably only on your own server...?
Or should I use cronjobs for these long running processes?
(depending on the amount of data my customers will enter in the application, a process could run 3+ hours)
Is that even a problem that can be solved, reliably, with PHP...? Excuse me if this should be a weird question, I'm really not experienced with PHP daemons and/or long running cronjobs created by php.
So to recap everything:
Commercial self-hosted application, including long running processes, cronjobs or daemons? And is either or maybe both also a reliable solution for a paid application that you can give to your customers with a clear conscience because you know it will work reliable on all kinds of different servers...?
EDIT*
PS: Sorry, I forgot to mention that the application targets only Linux servers, so everything like Debian, Ubuntu etc etc.
Short answer, no, don't go for background process if this will be a client hosted solution. If you go towards the ASP concept (Application Service Provider... not Active Server Pages ;)) then you can do some wacky stuff with background processes and external apps connecting to your sql servers and processing stuff for you.
What i suggest is to create a strong task management backbone and link that to a solid task processing infrastructure. I'll recommend you read an old post i did quite some time ago regarding background processes and a strategy i had adopted to fix long running processes:
Start & Stop PHP Script from Backend Administrative Webpage
Happy reading...
UPDATE
I realize that my old post is far from easy to understand so here goes:
You need 2 models: Job and JobQueue, 2 controller: JobProcessor, XYZProcessor
JobProcessor is called either by a user when a page triggers or using a cronjob as you wish. JobProcessor::process() is the key that starts the whole processing or continues it. It loads the JobQueues and asks the job queues if there is work to do. If there is work to do, it asks the jobqueue to start/continue it's job.
JobQueue Model: Used to queue several JOBS one behind each other and controls what job is currently current by keep some kind of ID and STATE about which job is running.
Job Model: Represents exactly what needs to be done, it contains for example the name of the controller that will process the data, the function to call to process the data and a serialized configuration property that describe what must be done.
XYZController: Is the one that contains the processing method. When the processing method is called, the controller must load everything it needs to memory and then process each individual unit of work as fast as possible.
Example:
Call of index.php
Index.php creates a jobprocessor controller
Index.php calls the jobprocessor's process()
JobProcessor::Process() loads all the queues and processes them
For each JobQueue::Process(), the job queue loads it's possible Jobs and detects if one is currently running or not. If none is running, it starts the next one by calling Job::Process();
Job::Process() creates the XYZController that will work the task at hand. For example, my old system had an InvoicingController and a MassmailingController that worked hand in hand.
Job::Process() calls XYZController::Prepare() so that it loads it's information to process. (For example, load a batch of emails to process, load a batch of invoices to create)
Job::Process() calls XYZController::RunWorkUnit() so that it processes a single unit of work (For example, create one invoice, send one email)
Job::Process() asks JobProcessingController::DoIStillHaveTimeToProcess() and if so, continues processing the next element.
Job::Process() runs out of time and calls XYZController::Cleanup() so that all resources are released
JobQueue::Process() ends and returns to JobController
JobController::Process() is about to end? Open a socket, call myself back so i can start another round of processing until i don't have anything to do anymore
Handle the request from the user that start in position #1.
Ultimately, you can instead open a socket each time and ask the processor to do something, or you can queue a CronJob to call your processor. This way your users won't get stuck waiting for the 3/4 work units to complete each time.
Its worth noting that, in addition to running daemons or cron jobs, you can kick off long running processes from a web request (but note that it must run outside of the webserver process group) and of course asynchronous message processing (which is essentially a variant on the batch approach).
All four of these approaches are very different in terms of how they behave, how concurrency and timing are managed. The factors which make them all different are the same ones you omitted from your question - so it's not really possible to answer.
Unfortunately all rely on facilities which are very different between MSWindows and POSIX systems - so although PHP will run on both, if you want to sell your app on both platforms it's going to need 2 versions.
Maybe you should talk to your potential customer base and ask them what they want?

Anatomy of a Distributed System in PHP

I've a problem which is giving me some hard time trying to figure it out the ideal solution and, to better explain it, I'm going to expose my scenario here.
I've a server that will receive orders
from several clients. Each client will
submit a set of recurring tasks that
should be executed at some specified
intervals, eg.: client A submits task
AA that should be executed every
minute between 2009-12-31 and
2010-12-31; so if my math is right
that's about 525 600 operations in a
year, given more clients and tasks
it would be infeasible to let the server process all these tasks so I
came up with the idea of worker
machines. The server will be developed
on PHP.
Worker machines are just regular cheap
Windows-based computers that I'll
host on my home or at my workplace,
each worker will have a dedicated
Internet connection (with dynamic IPs)
and a UPS to avoid power outages. Each
worker will also query the server every
30 seconds or so via web service calls,
fetch the next pending job and process it.
Once the job is completed the worker will
submit the output to the server and request
a new job and so on ad infinitum. If
there is a need to scale the system I
should just set up a new worker and the
whole thing should run seamlessly.
The worker client will be developed
in PHP or Python.
At any given time my clients should be
able to log on to the server and check
the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the
already processed tasks if for some
reason the server goes down.
The workers are not client-specific,
one worker should process jobs for
any given number of clients.
I've some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and all the like, but I would like to know if any of these is suitable for my problem, and if yes which do you recommend for a newbie is "distributed computing" (or is this parallel?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before I'm also having trouble with the general database design, partly because I still haven't chosen any particular R(D)DBMS but one issue that I've and I think it's agnostic to the DBMS I choose is related to the queuing system... Should I precalculate all the absolute timestamps to a specific job and have a large set of timestamps, execute and flag them as complete in ascending order or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute". The problem with this "clever" system is that some jobs will not be executed in order they should be because some workers could be waiting doing nothing while others are overloaded. What do you suggest?
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
#timdev:
The input will be a very small JSON encoded string, the output will also be a JSON enconded string but a bit larger (in the order of 1-5 KB).
The output will be computed using several available resources from the Web so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the R(D)DBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application
framework to farm out work to other
machines or processes that are better
suited to do the work. It allows you
to do work in parallel, to load
balance processing, and to call
functions between languages. It can be
used in a variety of applications,
from high-availability web sites to
the transport of database replication
events. In other words, it is the
nervous system for how distributed
processing communicates.
You can write both your client and the back-end worker code in PHP.
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario, you might even want to virtualize you windows machines to multiple worker nodes per machine depending on how much computing power you need.
Also the persistent queue system in gearman prevents jobs getting lost when a worker or the gearman server crashes. After a service restart the queue just continues where it has left off before crash/reboot, you don't have to take care of all this in your application and that is a big advantage and saves alot of time/code
Working out a custom solution might work but the advantages of gearman especially the persistent queue seem to me that this might very well be the best solution for you at the moment. I don't know about a windows binary for gearman though but i think it should be possible.
A simpler solution would be to have a single database with multiple php-nodes connected. If you use a proper RDBMS (MSql + InnoDB will do), you can have one table act as a queue. Each worker will then pull tasks from that to work on and write it back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of input/output data. If it's large, this may not be the best scheme.
I would avoid sqlite for this sort of task, although it is a very wonderful database for small apps, it does not handle concurrency very well, it has only one locking strategey which is to lock the entire database and keep it locked until a sinlge transaction is complete.
Consider Postgres which has industrial strength concurrency and lock management and can handle multiple simultanious transactions very nicely.
Also this sounds like a job for queuing! If you were in hte Java world I would recommend a JMS based archictecture for your solution. There is a 'dropr' project to do something similar in php but its all fairly new so it might not be suitable for your project.
Whichever technoligy you use you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to choosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will pull at every N seconds. The master will check if there is a job available, if so it has to flag that the job has been assigned to the worker X and return the appropriate input to the worker (all of this via HTTP).
Also, here you can store all the script files of the workers.
On the workers, I would strongly suggest you to install a Linux distro. On Linux it's easier to set up scheduled tasks and in general I think it's more appropriate for the job.
With Linux you can even create a live cd or iso image with a perfectly configured worker and install it fast and easy on all the machines you want.
Then set up a cron job that will RSync with the master server to update/modify the scripts. In this way you will change the files in just one place (the master server) and all the workers will get the updates.
In this configuration you don't care of the IPs or the number of workers because the workers are connecting to the master, not vice-versa.
The worker job is pretty easy: ask the API for a job, do it, send back the result via API. Rinse and repeat :-)
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop the work off the work queue, do the job and put the response into another queue. After they have done that, they ACK the original job request to say "its done". If a worker drops its connection, the job will be restored to the queue so another worker can do it.
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java to be clients, as they have capabilities to run scripts (execvp in C, System.Desktop.something in Java). Jobs could just be the name of a script and arguments to that script. You can have the clients return a status on the jobs. If the jobs failed, you could retry them. You can have the clients poll for jobs every minute (or every x seconds and make the server sort out the jobs)
PHP would work for the server.
MySQL would work fine for the database. I would just make two timestamps: start and end. On the server, I would look for WHEN SECONDS==0

Categories