I want to write a worker for beanstalkd in PHP, using a Zend Framework 2 controller. It starts via the CLI and will run forever, asking beanstalkd for jobs along these lines.
In simple pseudo-like code:
while (true) {
    $data = $beanstalk->reserve();

    $class  = $data->class;
    $params = $data->params;

    $job = new $class($params);
    $job();
}
The $job here has an __invoke() method, of course. However, some of these jobs might run for a long time. Some might use a considerable amount of memory. Some might have the $beanstalk object injected, to start new jobs themselves, or hold a Zend\Di\Locator instance to pull objects from the DIC.
I am worried about this setup in production environments over the long term: circular references might occur, and (at this moment) I do not explicitly do any garbage collection, while this process might run for weeks/months/years.*
*) In beanstalk, reserve is a blocking call: if no job is available, this worker will wait until it gets a response back from beanstalk.
My question: how will PHP handle this over the long term, and should I take any special precautions to keep the worker from failing?
This is what I have considered and what might help (please correct me if I am wrong, and add more if possible):
Use gc_enable() before starting the loop
Use gc_collect_cycles() in every iteration
Unset $job in every iteration
Explicitly unset references in a $job's __destruct()
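A minimal sketch of that last point, assuming a producer-style job as described above (class and property names are illustrative):
// Illustrative only: a job that drops its injected dependencies in
// __destruct(), so a job -> beanstalk reference cannot keep a cycle alive.
class ProducerJob
{
    private $beanstalk;
    private $params;

    public function __construct($beanstalk, $params)
    {
        $this->beanstalk = $beanstalk;
        $this->params = $params;
    }

    public function __invoke()
    {
        // ... put follow-up jobs onto the queue via $this->beanstalk ...
    }

    public function __destruct()
    {
        $this->beanstalk = null; // break the reference explicitly
        $this->params = null;
    }
}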
(NB: Update from here)
I ran some tests with arbitrary jobs. The jobs I included were: "simple", which just sets a value; "longarray", which creates an array of 1,000 values; "producer", where the loop injects $pheanstalk and the job adds three simple jobs to the queue (so there is now a reference from job to beanstalk); and "locatoraware", where a Zend\Di\Locator is given and all job types are instantiated (though not invoked). I added 10,000 jobs to the queue, then reserved and ran them all.
Results for "simplejob" (memory consumption per 1,000 jobs, with memory_get_usage())
0: 56392
1000: 548832
2000: 1074464
3000: 1538656
4000: 2125728
5000: 2598112
6000: 3054112
7000: 3510112
8000: 4228256
9000: 4717024
10000: 5173024
Then the same measurement with a random job type picked for each of the 10,000 jobs. The distribution of types:
["Producer"] => int(2431)
["LongArray"] => int(2588)
["LocatorAware"] => int(2526)
["Simple"] => int(2456)
Memory:
0: 66164
1000: 810056
2000: 1569452
3000: 2258036
4000: 3083032
5000: 3791256
6000: 4480028
7000: 5163884
8000: 6107812
9000: 6824320
10000: 7518020
The execution code from above is updated to this:
$baseMemory = memory_get_usage();
gc_enable();

for ($i = 0; $i <= 10000; $i++) {
    $data = $beanstalk->reserve();

    $class  = $data->class;
    $params = $data->params;

    $job = new $class($params);
    $job();

    $job = null;
    unset($job);

    if ($i % 1000 === 0) {
        gc_collect_cycles();
        echo sprintf('%8d: ', $i), memory_get_usage() - $baseMemory, PHP_EOL;
    }
}
As everybody notices, PHP's memory consumption is not kept level at a minimum, but increases over time.
I usually restart the script regularly, though you don't have to do it after every job is run (unless you want to; it's useful for clearing memory). You could, for example, run up to 100 or more jobs at a time, or until the script has used say 20MB of RAM, and then exit the script so it is instantly re-run.
My blog post at http://www.phpscaling.com/2009/06/23/doing-the-work-elsewhere-sidebar-running-the-worker/ has some example shell scripts for re-running the scripts.
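A minimal sketch of that pattern, reusing the pseudocode from the question (the caps are arbitrary): the worker exits after a fixed number of jobs or once memory crosses a threshold, and the surrounding shell script or supervisor restarts it.
// Sketch: process at most $maxJobs jobs, bail out early if memory grows,
// and rely on an outer loop (shell script, supervisord) to restart us.
$maxJobs   = 100;
$maxMemory = 20 * 1024 * 1024; // ~20 MB

for ($done = 0; $done < $maxJobs; $done++) {
    $data   = $beanstalk->reserve();
    $class  = $data->class;
    $params = $data->params;

    $job = new $class($params);
    $job();
    unset($job);

    if (memory_get_usage() > $maxMemory) {
        break; // exit and let the supervisor start a fresh process
    }
}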
I ended up benchmarking my current code base line by line, after which I came to this:
$job = $this->getLocator()->get($data->name, $params);
It uses Zend\Di dependency injection, whose instance manager tracks instances for the lifetime of the process. So after a job had been invoked and could have been discarded, the instance manager still kept it in memory. Not using Zend\Di to instantiate the jobs immediately resulted in static memory consumption instead of linear growth.
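In other words (a sketch of the idea, not the exact fix from my code base): bypass the locator when creating the job object itself, so the instance manager never gets a reference to it.
// Before: Zend\Di's instance manager keeps a reference to every job it
// creates, so invoked jobs are never freed.
// $job = $this->getLocator()->get($data->name, $params);

// After: instantiate the job class directly; once $job goes out of scope,
// nothing else holds a reference and it can be collected.
$class = $data->name;
$job   = new $class($params);
$job();
unset($job);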
For memory safety, don't do the looping over jobs inside PHP. Instead, create a simple bash script to do the looping:
while true ; do
    php do_jobs.php
done
where do_jobs.php contains something like:
// ...
$data = $beanstalk->reserve();
$class = $data->class;
$params = $data->params;
$job = new $class($params);
$job();
// ...
Simple, right? ;)
Related
I have a Symfony Command that uses the Doctrine Paginator on PHP 7.0.22. The command must process data from a large table, so I do it in chunks of 100 items. The issue is that after a few hundred loops it fills up 256M of RAM. As measures against OOM (out-of-memory) I use:
$em->getConnection()->getConfiguration()->setSQLLogger(null); - disables the sql logger, that fills memory with logged queries for scripts running many sql commands
$em->clear(); - detaches all objects from Doctrine at the end of every loop
I've added some dumps with memory_get_usage() to check what's going on, and it seems that the collector doesn't clean up as much as the command adds at every $paginator->getIterator()->getArrayCopy(); call.
I even tried to manually collect the garbage every loop with gc_collect_cycles(), but still no difference: the command starts at 18M and increases by ~2M every few hundred items. I also tried manually unsetting the results and the query builder... nothing. I removed all the data processing, kept only the select query and the paginator, and got the same behaviour.
Does anyone have any idea where I should look next?
Note: 256M should be more than enough for this kind of operations, so please don't recommend solutions that suggest increasing allowed memory.
The stripped-down execute() method looks something like this:
protected function execute(InputInterface $input, OutputInterface $output)
{
    // Remove SQL logger to avoid out-of-memory errors
    $em = $this->getEntityManager(); // method defined in base class
    $em->getConnection()->getConfiguration()->setSQLLogger(null);

    $firstResult = 0;

    // Get latest ID
    $maxId = $this->getMaxIdInTable('AppBundle:MyEntity'); // method defined in base class
    $this->getLogger()->info('Working for max media id: ' . $maxId);

    do {
        // Get data
        $dbItemsQuery = $em->createQueryBuilder()
            ->select('m')
            ->from('AppBundle:MyEntity', 'm')
            ->where('m.id <= :maxId')
            ->setParameter('maxId', $maxId)
            ->setFirstResult($firstResult)
            ->setMaxResults(self::PAGE_SIZE)
        ;
        $paginator = new Paginator($dbItemsQuery);
        $dbItems = $paginator->getIterator()->getArrayCopy();

        $totalCount = count($paginator);
        $currentPageCount = count($dbItems);

        // Clear Doctrine objects from memory
        $em->clear();

        // Update first result
        $firstResult += $currentPageCount;
        $output->writeln($firstResult);
    } while ($currentPageCount == self::PAGE_SIZE);

    // Finish message
    $output->writeln("\n\n<info>Done running <comment>" . $this->getName() . "</comment></info>\n");
}
The memory leak was generated by the Doctrine Paginator. I replaced it with a native query using Doctrine prepared statements, and that fixed it.
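A sketch of the direction that takes (table and column names are illustrative, DBAL 2.x style): fetch plain rows with a prepared statement instead of hydrating entities through the Paginator.
// Illustrative only: plain rows via a prepared statement; nothing is
// hydrated into the identity map, so nothing accumulates between pages.
$conn = $em->getConnection();
$stmt = $conn->prepare(
    'SELECT m.* FROM my_entity m WHERE m.id <= :maxId LIMIT :limit OFFSET :offset'
);
$stmt->bindValue('maxId', $maxId, \PDO::PARAM_INT);
$stmt->bindValue('limit', self::PAGE_SIZE, \PDO::PARAM_INT);
$stmt->bindValue('offset', $firstResult, \PDO::PARAM_INT);
$stmt->execute();
$dbItems = $stmt->fetchAll();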
Other things that you should take into consideration:
If you are replacing the Doctrine Paginator, you should rebuild the pagination functionality by adding a limit to your query.
Run your command with the --no-debug flag or --env=prod, or maybe both. The thing is that commands run in the dev environment by default. This enables some data collectors that are not used in the prod environment. See more on this topic in the Symfony documentation - How to Use the Console
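For example (the command name and console path are hypothetical):
php bin/console app:process-items --no-debug --env=prod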
Edit: in my particular case I was also using the eightpoints/guzzle-bundle bundle, which wraps the Guzzle HTTP library (I had some API calls in my command). This bundle was also leaking, apparently through some middleware. To fix this, I had to instantiate the Guzzle client independently, without the EightPoints bundle.
I have 1000 queues with specific names, and I want to process these queues with one broker. Is that possible?
The queue names are stored in a MySQL DB, so I should fetch them and run the broker for each one. Of course, the brokers should run asynchronously and should be able to pass a queued item to an idle broker. Is this possible? Or should I make 1000 files with specific queue names as brokers?
Update:
Here is a picture of my queues. The queues should run in parallel, not serially: the users are producers and the worker is the consumer that runs the send_message() method.
I can show you how to do it with the enqueue library. I must warn you, there is no way to consume messages asynchronously in one process, though you can run a few processes that each serve a set of queues. They could be divided into groups by queue importance.
Install the AMQP transport and consumption library:
composer require enqueue/amqp-ext enqueue/enqueue
Create a consumption script. I assume you already have an array of queue names fetched from the DB, stored in the $queueNames var. The example binds the same processor to all queues, but you can of course set different ones.
<?php
use Enqueue\AmqpExt\AmqpConnectionFactory;
use Enqueue\Consumption\QueueConsumer;
use Enqueue\Psr\PsrMessage;
use Enqueue\Psr\PsrProcessor;

// here's the list of queue names which you fetched from DB
$queueNames = ['foo_queue', 'bar_queue', 'baz_queue'];

$factory = new AmqpConnectionFactory('amqp://');
$context = $factory->createContext();

// create queues at RabbitMQ side; you can remove this if you do not need it
foreach ($queueNames as $queueName) {
    $queue = $context->createQueue($queueName);
    $queue->addFlag(AMQP_DURABLE);
    $context->declareQueue($queue);
}

$consumer = new QueueConsumer($context);

foreach ($queueNames as $queueName) {
    $consumer->bind($queueName, function (PsrMessage $psrMessage) use ($queueName) {
        echo 'Consume the message from queue: '.$queueName;

        // your processing logic.

        return PsrProcessor::ACK;
    });
}

$consumer->consume();
There is more in the enqueue documentation.
I have a computation-expensive backend process in Symfony2 / PHP that I would like to run multi-threaded.
Since I iterate over thousands of objects, I think I shouldn't start one thread per object. I would like to have a $cores variable that defines how many threads I want in parallel, then iterate through the loop and keep that many threads running. So every time a thread finishes, a new one with the next object should be started, until all objects are done.
Looking at the pthreads documentation and doing some Google searches, I can't find a usable example for this situation. All the examples I found run a fixed number of threads once; none of them iterate over thousands of objects.
Can someone point me into the right direction to get started? I understand the basics of setting up a thread and joining it, etc. but not how to do it in a loop with a wait condition.
The answer to the question is to use the Pool and Worker abstraction.
The basic idea is that you ::submit Threaded objects to the Pool, which stacks them onto the next available Worker, distributing your Threaded objects (round robin) across all Workers.
What follows is super simple code for PHP 7 (pthreads v3):
<?php
$jobs = [];
while (count($jobs) < 2000) {
    $jobs[] = mt_rand(0, 1999);
}

$pool = new Pool(8);

foreach ($jobs as $job) {
    $pool->submit(new class($job) extends Threaded {
        public function __construct(int $job) {
            $this->job = $job;
        }

        public function run() {
            var_dump($this->job);
        }
    });
}

$pool->shutdown();
?>
The jobs are pointless, obviously. In the real world, I guess your $jobs array keeps growing, so you can just swap foreach for some do {} while, and keep calling ::submit for new jobs.
In the real world, you will want to collect garbage in the same loop (just call Pool::collect with no parameters for default behaviour).
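A sketch of that shape, assuming jobs keep arriving from somewhere (fetchNextJob() and moreWorkExpected() are hypothetical stand-ins for however your jobs arrive and terminate):
$pool = new Pool(8);

do {
    // fetchNextJob() is hypothetical: returns an int job, or null when idle
    if (($job = fetchNextJob()) !== null) {
        $pool->submit(new class($job) extends Threaded {
            public function __construct(int $job) {
                $this->job = $job;
            }

            public function run() {
                var_dump($this->job);
            }
        });
    }

    // reap references to finished tasks using the default collector
    $pool->collect();
} while (moreWorkExpected()); // hypothetical termination condition

$pool->shutdown();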
Notably, none of this would be possible if it were really the case that PHP wasn't intended to work in multi-threaded environments... it definitely is.
That is the answer to the question, but it doesn't make it the best solution to your problem.
You have mentioned in comments that you assume 8 threads executing Symfony code will take up less memory than 8 processes. This is not the case, PHP is shared nothing, all the time. You can expect 8 Symfony threads to take up as much memory as 8 Symfony processes, in fact, a little bit more. The benefit of using threads over processes is that they can communicate, synchronize and (appear to) share with each other.
Just because you can, doesn't mean you should. The best solution for the task at hand is probably to use some ready made package or software intended to do what is required.
Studying this stuff well enough to implement a robust solution is something that will take a long time, and you wouldn't want to deploy that first solution ...
If you decide to ignore my advice, and give it a go, you can find many examples in the github repository for pthreads.
Joe has a good approach, but I found a different solution elsewhere that I am now using. Basically, I have two commands, one control and one worker command. The control command starts background processes and checks their results:
protected function process($worker, $entity, $timeout = 60)
{
    $min = $this->em->createQuery('SELECT MIN(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $max = $this->em->createQuery('SELECT MAX(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $batch_size = ceil((($max - $min) + 1) / $this->parallel);

    $pool = array();
    for ($i = $min; $i <= $max; $i += $batch_size) {
        $builder = new ProcessBuilder();
        $builder->setPrefix($this->getApplication()->getKernel()->getRootDir().'/console');
        $builder->setArguments(array(
            '--env='.$this->getApplication()->getKernel()->getEnvironment(),
            'maf:worker:'.$worker,
            $i, $i + $batch_size - 1
        ));
        $builder->setTimeout($timeout);
        $process = $builder->getProcess();
        $process->start();
        $pool[] = $process;
    }

    $this->output->writeln($worker.": started ".count($pool)." jobs");

    $running = 99;
    while ($running > 0) {
        $running = 0;
        foreach ($pool as $p) {
            if ($p->isRunning()) {
                $running++;
            }
        }
        usleep(250);
    }

    foreach ($pool as $p) {
        if (!$p->isSuccessful()) {
            $this->output->writeln('fail: '.$p->getExitCode().' / '.$p->getCommandLine());
            $this->output->writeln($p->getOutput());
        }
    }
}
Here $this->parallel is a variable I set to 6 on my 8-core machine; it defines the number of processes to start. Note that this method requires that I iterate over a specific entity (it splits by that), which is always true in my use cases.
It's not perfect, but it starts completely new processes instead of threads, which I consider the better solution.
The worker command takes min and max ID numbers and does the actual work for the set between those two.
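For completeness, the worker side is roughly this shape (the argument names and the entity are illustrative, not my actual code):
// Illustrative worker command: process only the entities in [min, max].
protected function execute(InputInterface $input, OutputInterface $output)
{
    $min = (int)$input->getArgument('min');
    $max = (int)$input->getArgument('max');

    $entities = $this->em->createQuery(
        'SELECT e FROM BM2SiteBundle:MyEntity e WHERE e.id >= :min AND e.id <= :max'
    )->setParameters(array('min' => $min, 'max' => $max))->getResult();

    foreach ($entities as $entity) {
        // ... the actual per-entity work goes here ...
    }
}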
This approach works as long as the data set is reasonably well distributed. If you have no data in the 1-1000 range but every ID between 1000 and 2000 is used, the first three processes would have nothing to do.
I have a simple web application written with the Laravel 4.2 framework. I have configured the Laravel queue component to add new queue items to a locally running beanstalkd server.
Essentially, there is a POST route that will add an item to the beanstalkd tube.
I then have supervisord set up to run artisan queue:listen as three separate processes. The issue I am seeing is that the different queue:listen processes end up spawning anywhere from one to three queue:worker processes for a single inserted job.
The end result being that one job inserted into the queue is sometimes being processed by multiple workers at the same time, something I am obviously trying to avoid.
The job code is relatively simple:
<?php
use App\Repositories\URLRepository;

class ProcessDataJob {
    private $urls;

    public function __construct(URLRepository $urls)
    {
        $this->urls = $urls;
    }

    public function fire($job, $data)
    {
        if (!isset($data['post']) || !is_array($data['post'])) {
            Log::error('[Job #'.$job->getJobId().'] $input was empty inside CreateAuditJob. Deleting job. Quitting. Bye!');
            $job->delete();
            return false;
        }
        $input = $data['post'];

        //
        // ... code that will take a few hours to run.
        //

        $job->delete();
        Log::info('[Job #'.$job->getJobId().'] ProcessDataJob was successful, deleting the job!');
        return true;
    }
}
The fun part is that most of the (duplicated) queue workers fail when deleting the job, with this left in the error log:
exception 'Pheanstalk_Exception_ServerException' with message 'Job 3248 NOT_FOUND: does not exist or is not reserved by client'
The ttr (time to run) is set to 172,800 seconds (48 hours), which is much larger than the time it would take for the job to complete.
What is the job's time_to_run when queued? If running the job takes longer than time_to_run seconds, the job is automatically re-queued and becomes eligible to be run by the next worker.
A reserved job has time_to_run seconds to be deleted, released or touched. Calling touch() restarts the timeout timer, so workers can use it to give themselves more time to finish. Alternatively, use a large enough value when queueing.
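With Pheanstalk that looks roughly like this ($chunksOfWork and processChunk() are placeholders for your actual work):
// Sketch: touch() the reserved job between work chunks so the TTR timer
// restarts and beanstalkd does not hand the job to another worker.
$job = $pheanstalk->reserve();

foreach ($chunksOfWork as $chunk) { // placeholder for your work units
    processChunk($chunk);           // placeholder for the actual work
    $pheanstalk->touch($job);       // restart the time-to-run countdown
}

$pheanstalk->delete($job);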
I've found the beanstalkd protocol document helpful
https://github.com/kr/beanstalkd/blob/master/doc/protocol.md
Since you are running with Laravel, verify your queue.php configuration file.
Change the ttr value from the default of 60 to something that works better for you.
'beanstalkd' => array(
    'driver' => 'beanstalkd',
    'host'   => 'localhost',
    'queue'  => 'default',
    'ttr'    => 600, // example
),
I am developing a networking application where I listen on a port and create a new socket and thread when a new connection request arrives; the architecture is working well but we are facing severe memory issues.
The problem is that even if I create a single thread, it does the work but the memory keeps on increasing.
To demonstrate the problem, please review the following code, where we start one thread of a class whose duty is to print a thread ID and a random number indefinitely.
class ThreadWorker extends Thread {
    public function run() {
        while (1) {
            echo $this->getThreadId()." => ".rand(1, 1000)."\r\n";
        }
    }
}

$th = new ThreadWorker();
$th->start();
I am developing it on Windows, and when I open the Task Manager the PHP.exe memory usage keeps increasing until the system becomes unresponsive.
Please note that the PHP script is executed from command line:
PHP.exe pthreads-test.php
OK, I think the problem is that the thread loop is extremely CPU-intensive. Avoid code like that. If you just want to echo a message, I recommend putting a sleep() instruction after it. Example:
class ThreadWorker extends Thread {
    public function run() {
        while (1) {
            echo $this->getThreadId()." => ".rand(1, 1000)."\r\n";
            sleep(1);
        }
    }
}
EDIT
It seems there is a way to force garbage collection in PHP. On the other hand, sleep() is not the proper way to stabilize CPU use. Normally threads do things like reading from files, sockets or pipes, i.e. they often perform I/O operations, which are normally blocking (they pause the thread until the I/O operation can proceed). This behaviour inherently yields the CPU and other resources to other threads, thus stabilizing the whole system.