I have a computationally expensive backend process in Symfony2 / PHP that I would like to run multi-threaded.
Since I iterate over thousands of objects, I think I shouldn't start one thread per object. I would like to have a $cores variable that defines how many threads I want in parallel, then iterate through the loop and keep that many threads running. So every time a thread finishes, a new one with the next object should be started, until all objects are done.
Looking at the pthreads documentation and doing some Google searches, I can't find a usable example for this situation. All the examples I found run a fixed number of threads once; none of them iterate over thousands of objects.
Can someone point me in the right direction to get started? I understand the basics of setting up a thread and joining it, but not how to do it in a loop with a wait condition.
The answer to the question is: use the Worker and Pool abstraction.
The basic idea is that you ::submit Threaded objects to the Pool, which stacks them onto the next available Worker, distributing your Threaded objects (round robin) across all Workers.
What follows is a very simple example for PHP 7 (pthreads v3):
<?php
$jobs = [];
while (count($jobs) < 2000) {
    $jobs[] = mt_rand(0, 1999);
}

$pool = new Pool(8);

foreach ($jobs as $job) {
    $pool->submit(new class($job) extends Threaded {
        public function __construct(int $job) {
            $this->job = $job;
        }

        public function run() {
            var_dump($this->job);
        }
    });
}

$pool->shutdown();
?>
The jobs are pointless, obviously. In the real world, I guess your $jobs array keeps growing, so you can just swap the foreach for some do {} while, and keep calling ::submit for new jobs.
In the real world, you will also want to collect garbage in the same loop (just call Pool::collect with no parameters for the default behaviour), as sketched below.
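A minimal sketch of that shape, where fetchNewJobs() and haveMoreWork() are hypothetical placeholders for however your application discovers pending work (the rest follows the pthreads v3 API used above):

$pool = new Pool(8);

do {
    foreach (fetchNewJobs() as $job) { // fetchNewJobs() is hypothetical
        $pool->submit(new class($job) extends Threaded {
            public function __construct(int $job) {
                $this->job = $job;
            }

            public function run() {
                var_dump($this->job);
            }
        });
    }

    // Reap finished Threaded objects so memory does not grow unbounded.
    $pool->collect();
} while (haveMoreWork()); // haveMoreWork() is hypothetical

$pool->shutdown();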
It is noteworthy that none of this would be possible if it really were the case that PHP wasn't intended to work in multi-threaded environments ... it definitely is.
That is the answer to the question, but it doesn't make it the best solution to your problem.
You have mentioned in comments that you assume 8 threads executing Symfony code will take up less memory than 8 processes. This is not the case; PHP is shared-nothing, all the time. You can expect 8 Symfony threads to take up as much memory as 8 Symfony processes, in fact a little bit more. The benefit of using threads over processes is that they can communicate, synchronize, and (appear to) share with each other.
Just because you can, doesn't mean you should. The best solution for the task at hand is probably to use some ready made package or software intended to do what is required.
Studying this stuff well enough to implement a robust solution is something that will take a long time, and you wouldn't want to deploy that first solution ...
If you decide to ignore my advice, and give it a go, you can find many examples in the github repository for pthreads.
Joe has a good approach, but I found a different solution elsewhere that I am now using. Basically, I have two commands: one control command and one worker command. The control command starts background processes and checks their results:
protected function process($worker, $entity, $timeout = 60) {
    // Split the entity's ID range into one batch per parallel process.
    $min = $this->em->createQuery('SELECT MIN(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $max = $this->em->createQuery('SELECT MAX(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $batch_size = ceil((($max - $min) + 1) / $this->parallel);

    $pool = array();
    for ($i = $min; $i <= $max; $i += $batch_size) {
        $builder = new ProcessBuilder();
        $builder->setPrefix($this->getApplication()->getKernel()->getRootDir().'/console');
        $builder->setArguments(array(
            '--env='.$this->getApplication()->getKernel()->getEnvironment(),
            'maf:worker:'.$worker,
            $i, $i + $batch_size - 1
        ));
        $builder->setTimeout($timeout);
        $process = $builder->getProcess();
        $process->start();
        $pool[] = $process;
    }
    $this->output->writeln($worker.": started ".count($pool)." jobs");

    // Poll until every process in the pool has finished.
    $running = 99;
    while ($running > 0) {
        $running = 0;
        foreach ($pool as $p) {
            if ($p->isRunning()) {
                $running++;
            }
        }
        usleep(250); // note: usleep() takes microseconds, so this polls very aggressively
    }

    // Report any processes that exited with an error.
    foreach ($pool as $p) {
        if (!$p->isSuccessful()) {
            $this->output->writeln('fail: '.$p->getExitCode().' / '.$p->getCommandLine());
            $this->output->writeln($p->getOutput());
        }
    }
}
where $this->parallel is a variable I set to 6 on my 8-core machine; it defines the number of processes to start. Note that this method requires that I iterate over a specific entity (it splits by that), which is always true in my use cases.
It's not perfect, but it starts completely new processes instead of threads, which I consider the better solution.
The worker command takes min and max ID numbers and does the actual work for the set between those two.
This approach works as long as the data set is reasonably well distributed. If you have no data in the 1-1000 range but every ID between 1000 and 2000 is used, the first three processes would have nothing to do.
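For illustration, here is a minimal sketch of what such a worker command's execute() might look like; the entity name Example and the doWork() helper are assumptions for this sketch, not the author's actual code:

// Hypothetical worker command body: processes all entities with IDs in [min, max].
protected function execute(InputInterface $input, OutputInterface $output) {
    $min = $input->getArgument('min');
    $max = $input->getArgument('max');

    $entities = $this->em->createQuery(
        'SELECT e FROM BM2SiteBundle:Example e WHERE e.id BETWEEN :min AND :max'
    )->setParameters(array('min' => $min, 'max' => $max))->getResult();

    foreach ($entities as $entity) {
        $this->doWork($entity); // doWork() stands in for the actual processing
    }
    $this->em->flush();
}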
Related
I’m trying to insert a large amount of data (30,000+ lines) into a MySQL database using Doctrine2 and the Symfony2 fixture bundle.
I looked for the right way to do it. I saw lots of questions about memory leaks and Doctrine, but no answer that satisfied me. The Doctrine clear() function often comes up.
So, I did various shapes of this:
while ($data = getData()) {
    $iteration++;
    $obj = new EntityObject();
    $obj->setName('henry');
    // Fill object...
    $manager->persist($obj);
    if ($iteration % 500 == 0) {
        $manager->flush();
        $manager->clear();
        // Also tried some sort of:
        // $manager->clear($obj);
        // $manager->detach($obj);
        // gc_collect_cycles();
    }
}
PHP memory still goes wild, right after the flush() (I’m sure of that). In fact, every time the entities are flushed, memory goes up by a certain amount depending on the batch size and the entities, until it reaches the deadly Allowed memory size exhausted error. With a very, very tiny entity it works, but memory consumption increases too much: several MB whereas it should be KB.
clear(), detach() and calling the GC don’t seem to have any effect at all. They only clear a few KB.
Is my approach flawed? Did I miss something, somewhere? Is it a bug?
More info:
Without flush(), memory barely moves;
Lowering the batch size does not change the outcome;
Data comes from a CSV that needs to be sanitized.
EDIT (partial solution):
#qooplmao brought a solution that significantly decreases memory consumption: disabling the Doctrine SQL logger with $manager->getConnection()->getConfiguration()->setSQLLogger(null);
However, it is still abnormally high and increasing.
I resolved my problem using this resource, as #Axalix suggested.
This is how I modified the code:
// IMPORTANT - Disable the Doctrine SQL logger
$manager->getConnection()->getConfiguration()->setSQLLogger(null);

// SUGGESTION - make getData a generator (using yield) to save more memory
while ($data = getData()) {
    $iteration++;
    $obj = new EntityObject();
    $obj->setName('henry');
    // Fill object...
    $manager->persist($obj);
    // IMPORTANT - Temporarily store the entities (the array must be defined outside the loop)
    $tempObjects[] = $obj;
    if ($iteration % 500 == 0) {
        $manager->flush();
        // IMPORTANT - detach the entities
        foreach ($tempObjects as $tempObject) {
            $manager->detach($tempObject);
        }
        $tempObjects = null;
        gc_enable();
        gc_collect_cycles();
    }
}

// Do not forget the last flush
$manager->flush();
And, last but not least, as I use this script with Symfony data fixtures, adding the --no-debug flag to the command is also very important. With it, memory consumption is stable.
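Regarding the generator suggestion in the comment above, here is a minimal sketch of what getData() could look like as a generator; the file name and the sanitizeRow() helper are assumptions for illustration:

// Hypothetical generator: yields one sanitized CSV row at a time instead of
// loading the whole file into memory.
function getData() {
    $handle = fopen('data.csv', 'r');
    while (($row = fgetcsv($handle)) !== false) {
        yield sanitizeRow($row); // sanitizeRow() is a placeholder
    }
    fclose($handle);
}

With a generator, the loop above would become foreach (getData() as $data) { ... } instead of calling getData() repeatedly.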
I found out that Doctrine logs all SQL queries during execution. I recommend disabling this with the code below; it can really save memory:
use Doctrine\ORM\EntityManagerInterface;

class SomeService // any service that receives the entity manager
{
    public function __construct(EntityManagerInterface $entity_manager)
    {
        $em_connection = $entity_manager->getConnection();
        $em_connection->getConfiguration()->setSQLLogger(null);
    }
}
My suggestion is to drop the Doctrine approach for bulk inserts. I really like Doctrine, but I just hate this kind of stuff with bulk inserts.
MySQL has a great thing called LOAD DATA. I would rather use it, or, if I have to sanitize my CSV first, do the LOAD after.
If you need to change the values, I would read the CSV into an array with $csvData = array_map("str_getcsv", file($csv));, change whatever you need in the array, and write it back out. After that, use the new .csv for the LOAD with MySQL, as sketched below.
My reasons for not using Doctrine for this are described above.
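A minimal sketch of that workflow; the table name my_table, the file names, the sanitizing step and the connection details are all illustrative assumptions:

// Read, sanitize, and rewrite the CSV, then bulk load it with LOAD DATA.
$csvData = array_map("str_getcsv", file('input.csv'));
foreach ($csvData as &$row) {
    $row[1] = trim($row[1]); // whatever sanitizing is needed
}
unset($row);

$fh = fopen('clean.csv', 'w');
foreach ($csvData as $row) {
    fputcsv($fh, $row);
}
fclose($fh);

// Bulk load; requires local_infile to be enabled on client and server.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', array(
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
));
$pdo->exec(
    "LOAD DATA LOCAL INFILE 'clean.csv'
     INTO TABLE my_table
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     LINES TERMINATED BY '\n'"
);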
$p = new Pool(10);

$tasks = array();
for ($i = 0; $i < 1000; $i++) {
    $tasks[$i] = new workerThread($i);
}

foreach ($tasks as $task) {
    $p->submit($task);
}

// shutdown will wait for the current queue to be completed
$p->shutdown();

// garbage collection check / read results
$p->collect(function($checkingTask){
    return $checkingTask->isGarbage();
});

class workerThread extends Collectable {
    public function __construct($i){
        $this->i = $i;
    }

    public function run(){
        echo $this->i;
        ob_flush();
        flush();
        $this->setGarbage(); // mark the task as done so Pool::collect() can reap it
    }
}
The code above is a simple example that can cause a crash. I'm trying to update the page in real time by putting ob_flush(); and flush(); in the Threaded object, and it mostly works as expected. The code above is not guaranteed to crash every time, but if you run it a couple more times, sometimes the script stops and Apache restarts with an error message: "httpd.exe Application error The instruction at "0x006fb17f" referenced memory at "0x028a1e20". The memory could not be "Written". Click on OK ."
I think it's caused by a flushing conflict when multiple threads try to flush at about the same time. What can I do to work around it and still flush whenever there's new output?
Multiple threads should not write to standard output; there is no safe way to do this.
Zend provides no facility to make it safe; it works by coincidence, and will always be unsafe.
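One workaround (my sketch, not from the answer above) is to let each task store its output and have only the main thread write it, after the pool has shut down:

// Each task records its result; nothing inside run() touches standard output.
class EchoTask extends Threaded {
    public $i;
    public $result;

    public function __construct($i) {
        $this->i = $i;
    }

    public function run() {
        $this->result = "task {$this->i} done";
    }
}

$pool = new Pool(4);
$tasks = array();
for ($i = 0; $i < 100; $i++) {
    $tasks[$i] = new EchoTask($i);
    $pool->submit($tasks[$i]);
}
$pool->shutdown();

// Only the main thread writes output, so there is no flushing conflict.
foreach ($tasks as $task) {
    echo $task->result, "\n";
}

This gives up the real-time flushing, but it avoids concurrent writes entirely.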
I am developing a networking application where I listen on a port and create a new socket and thread when a new connection request arrives; the architecture is working well but we are facing severe memory issues.
The problem is that even if I create a single thread, it does the work but the memory keeps on increasing.
To demonstrate the problem, please review the following code, where we start one thread of a class whose duty is to print a thread ID and a random number infinitely.
class ThreadWorker extends Thread {
    public function run() {
        while (1) {
            echo $this->getThreadId()." => ".rand(1, 1000)."\r\n";
        }
    }
}
$th = new ThreadWorker();
$th->start();
I am developing it on Windows, and when I open the Task Manager the PHP.exe memory usage keeps increasing until the system becomes unresponsive.
Please note that the PHP script is executed from command line:
PHP.exe pthreads-test.php
OK, I think the problem is that the thread loop is highly CPU-consuming. Avoid code like that. If you just want to echo a message, I recommend putting a sleep() instruction after it. Example:
class ThreadWorker extends Thread {
    public function run() {
        while (1) {
            echo $this->getThreadId()." => ".rand(1, 1000)."\r\n";
            sleep(1);
        }
    }
}
EDIT
It seems there's a way to force garbage collection in PHP. On the other hand, sleep() is not a proper way to stabilize CPU use. Normally threads do things like reading from files, sockets or pipes, i.e., they often perform I/O operations, which are normally blocking (they pause the thread until data is available or the I/O operation is possible). This behaviour inherently yields the CPU and other resources to other threads, thus stabilizing the whole system. A sketch of that pattern follows.
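For illustration, a minimal sketch (my assumption, not from the answer) of a thread that blocks on socket reads instead of spinning; the host and port are placeholders:

// Hypothetical worker that blocks on I/O instead of busy-looping.
class SocketWorker extends Thread {
    private $host;
    private $port;

    public function __construct($host, $port) {
        $this->host = $host;
        $this->port = $port;
    }

    public function run() {
        $conn = stream_socket_client("tcp://{$this->host}:{$this->port}");
        if ($conn === false) {
            return;
        }
        // fgets() blocks until a line arrives, yielding the CPU in the meantime.
        while (($line = fgets($conn)) !== false) {
            echo $this->getThreadId()." => ".trim($line)."\r\n";
        }
        fclose($conn);
    }
}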
I want to write a worker for beanstalkd in PHP, using a Zend Framework 2 controller. It starts via the CLI and will run forever, asking for jobs from beanstalkd, like this example.
In simple pseudo-like code:
while (true) {
    $data = $beanstalk->reserve();
    $class = $data->class;
    $params = $data->params;
    $job = new $class($params);
    $job();
}
The $job here has an __invoke() method, of course. However, some of these jobs might run for a long time. Some might use a considerable amount of memory. Some might have the $beanstalk object injected, to start new jobs themselves, or have a Zend\Di\Locator instance to pull objects from the DIC.
I am worried about this setup for production environments in the long term, as circular references might occur and (at this moment) I do not explicitly "do" any garbage collection, while this process might run for weeks/months/years *.
*) In beanstalk, reserve is a blocking call and if no job is available, this worker will wait until it gets any response back from beanstalk.
My question: how will PHP handle this in the long term, and should I take any special precautions to keep this from blocking?
These are the things I did consider and that might be helpful (but please correct me if I am wrong, and add more if possible):
Use gc_enable() before starting the loop
Use gc_collect_cycles() in every iteration
Unset $job in every iteration
Explicitly unset references in __destruct() from a $job
(NB: Update from here)
I ran some tests with arbitrary jobs. The jobs I included were: "simple", which just sets a value; "longarray", which creates an array of 1,000 values; "producer", which lets the loop inject $pheanstalk and adds three simple jobs to the queue (so there is now a reference from job to beanstalk); and "locatoraware", where a Zend\Di\Locator is given and all job types are instantiated (though not invoked). I added 10,000 jobs to the queue, then reserved all jobs in the queue. A sketch of one such test job follows.
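For illustration, a hypothetical reconstruction of what the "longarray" test job might look like (the real test code is not shown in the question):

// Hypothetical "longarray" job: builds a 1,000-element array, then is discarded.
class LongArray
{
    private $params;

    public function __construct($params)
    {
        $this->params = $params;
    }

    public function __invoke()
    {
        $array = array();
        for ($i = 0; $i < 1000; $i++) {
            $array[] = $i;
        }
    }
}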
Results for "simplejob" (memory consumption per 1,000 jobs, with memory_get_usage())
0: 56392
1000: 548832
2000: 1074464
3000: 1538656
4000: 2125728
5000: 2598112
6000: 3054112
7000: 3510112
8000: 4228256
9000: 4717024
10000: 5173024
Picking a random job, measuring the same as above. Distribution:
["Producer"] => int(2431)
["LongArray"] => int(2588)
["LocatorAware"] => int(2526)
["Simple"] => int(2456)
Memory:
0: 66164
1000: 810056
2000: 1569452
3000: 2258036
4000: 3083032
5000: 3791256
6000: 4480028
7000: 5163884
8000: 6107812
9000: 6824320
10000: 7518020
The execution code from above is updated to this:
$baseMemory = memory_get_usage();
gc_enable();
for ($i = 0; $i <= 10000; $i++) {
    $data = $beanstalk->reserve();
    $class = $data->class;
    $params = $data->params;
    $job = new $class($params);
    $job();
    $job = null;
    unset($job);
    if ($i % 1000 === 0) {
        gc_collect_cycles();
        echo sprintf('%8d: ', $i), memory_get_usage() - $baseMemory, "<br>";
    }
}
As everybody notices, the memory consumption in PHP is not levelled off and kept to a minimum, but increases over time.
I've usually restarted the script regularly, though you don't have to do it after every job is run (unless you want to, and it's useful to clear memory). You could, for example, run up to 100 jobs or more at a time, or until the script has used, say, 20 MB of RAM, and then exit the script, to be instantly re-run. A sketch of that check follows.
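A minimal sketch of that budget check, under the assumption that an external supervisor (or a shell loop like the one in a later answer) restarts the script on exit; the limits are illustrative:

// Exit after a job budget or memory ceiling; the wrapper re-runs the script.
$jobsRun = 0;
while ($data = $beanstalk->reserve()) {
    $class = $data->class;
    $job = new $class($data->params);
    $job();
    unset($job);

    $jobsRun++;
    if ($jobsRun >= 100 || memory_get_usage(true) > 20 * 1024 * 1024) {
        exit(0); // clean exit; the supervisor restarts us immediately
    }
}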
My blogpost at http://www.phpscaling.com/2009/06/23/doing-the-work-elsewhere-sidebar-running-the-worker/ has some example shell scripts of re-running the scripts.
I ended up benchmarking my current code base line by line, after which I came to this:
$job = $this->getLocator()->get($data->name, $params);
It uses Zend\Di dependency injection, whose instance manager tracks instances for the lifetime of the process. So after a job was invoked and could have been removed, the instance manager still kept it in memory. Not using Zend\Di for instantiating the jobs immediately resulted in static memory consumption instead of linear growth.
For memory safety, don't do the looping over jobs in PHP. Instead, just create a simple bash script to do the looping:
while true; do
    php do_jobs.php
done
where do_jobs.php contains something like:
// ...
$data = $beanstalk->reserve();
$class = $data->class;
$params = $data->params;
$job = new $class($params);
$job();
// ...
Simple, right? ;)
I'm having a terrible amount of problems with an XML parsing script leaking memory in PHP.
I've made a solution by rewriting my whole OOP code to non-OOP, which was mostly database checks and inserts, and that seemed to plug the hole. But I'm curious as to what caused it. I'm using Zend Framework, and once I removed all of the model stuff, there are no leaks.
Just to give you an idea how bad it was:
I'm running through some 30k items in the same number of files. So, one per file. It started out by using 5 MB!!! of RAM, when the file itself was only about 20 KB.
Could it be those referencing functions that I've read about? I thought that bug was fixed?!
EDIT
I found out that the leak was due to using the Zend Framework database classes. Is there a way to call a shutdown function after each iteration, so that it would clear the resources?
It's pretty difficult to answer this, as we have no code to work with.
Revert to the OOP version of your sources and create a small class like so:
abstract class MemoryLeakLogger
{
    public static $_logs = array();

    // Static, since the class is called statically below; microtime(true)
    // returns a float so the deltas can be computed with plain arithmetic.
    public static function Start($id, $action)
    {
        self::$_logs[$id] = array(
            'action'       => $action,
            'start_ts'     => microtime(true),
            'memory_start' => memory_get_usage()
        );
    }

    public static function End($id)
    {
        self::$_logs[$id]['end_ts']     = microtime(true);
        self::$_logs[$id]['memory_end'] = memory_get_usage();
    }

    public static function GetInformation()
    {
        return self::$_logs;
    }
}
and then within your application do the following:
MemoryLeakLogger::Start(":xml_parse_links_set_2", "parsing set two of links");
/*
* Here you would do the relative code
*/
MemoryLeakLogger::End(":xml_parse_links_set_2");
And so forth throughout your application. You will need to compute the offsets for memory usage and time taken per action; once your script has completed, just print the information in a readable fashion and look for peaks, as sketched below.
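A minimal sketch of that post-processing step, under the assumption that the class above is used as shown (with microtime(true) timestamps):

// Compute per-action memory and time deltas from the collected logs.
foreach (MemoryLeakLogger::GetInformation() as $id => $log) {
    $memoryDelta = $log['memory_end'] - $log['memory_start'];
    $timeDelta   = $log['end_ts'] - $log['start_ts'];
    printf("%s (%s): %d bytes, %.4f s\n", $id, $log['action'], $memoryDelta, $timeDelta);
}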
You can also use xdebug to trace your application.
Hope this helps