PHP pthreads segmentation fault if not storing array of stacked elements

I'm trying to use the code from the examples (Stacking.php):
$worker = new ExampleWorker("My Worker Thread");
$work = array();
while ($i++ < 10) {
    printf(
        "Stacking: %d/%d\n", $i, $worker->stack($work[] = new Work(array(rand()*100)))
    );
}
I would like to adapt this example: put it inside an infinite while loop that waits for events from a database and stacks new elements as they appear.
There will be a huge number of events to stack; I can't keep all of them in the $work array and would like to clean it up somehow, or not use it at all.
The problem is that when I change:
$worker->stack($work[] = new Work(array(rand()*100)))
to
$worker->stack(new Work(array(rand()*100)))
the PHP process segfaults after the first worker finishes.
How can I put $worker->stack() in an infinite loop without having to store a reference to each stacked Work object?

The short answer is: you cannot.
The details of how the implementation works and why are laid out here: https://gist.github.com/krakjoe/6437782
Anything I say here would just be a repetition of that; in any case, you will benefit from reading the document in its entirety.
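That said, the usual workaround is to keep the references pthreads needs but prune them as soon as a job is done, so the $work array stays small. Below is a minimal sketch of that idea; the $complete flag and the fetch_event() polling helper are illustrative assumptions, not part of the pthreads API.
<?php
class Work extends Stackable {
    public $complete = false;
    public function __construct($data) {
        $this->data = $data;
    }
    public function run() {
        /* ... the real job goes here ... */
        $this->complete = true; // illustrative flag, set once the job is done
    }
}

$worker = new ExampleWorker("My Worker Thread");
$worker->start();

$work = array();
while (true) {
    // fetch_event() stands in for your database polling
    if ($event = fetch_event()) {
        $job = new Work(array($event));
        $work[] = $job;
        $worker->stack($job);
    }
    // Drop references to finished jobs so $work does not grow without bound
    foreach ($work as $i => $job) {
        if ($job->complete) {
            unset($work[$i]);
        }
    }
    usleep(100000);
}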

Related

What is the best way to process large data in PHP

I have a daily cron job which gets an XML file from a web service. Sometimes it is large, containing more than 10K products' information, and the XML is around 14MB in size.
What I need to do is parse the XML into objects and then process them. The processing is quite complicated: it's not just inserting them directly into the database, I need to do a lot of operations on them, and finally put them into many database tables.
It is all in one PHP script. I don't have any experience dealing with large data.
The problem is that it takes a lot of memory, and a very long time. I set my localhost PHP memory_limit to 4G and, after running for 3.5 hours, it finished successfully. But my production host does not allow that much memory.
I did some research but I am still confused about the right way to deal with this situation.
Here is a sample of my code:
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    //it will loop over 10K
    foreach($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        //my processing code here, it will call other functions to do a lot of things
        processing($data);
    }
    unset($results);
}
As a start, don't use SimpleXMLElement on the whole document. SimpleXMLElement loads everything into memory and is not efficient for large data. Here is a snippet from real code. You'll need to adapt it to your case, but hopefully you'll get the general idea.
$reader = new XMLReader();
$reader->xml($xml);
// Get cursor to first article
while ($reader->read() && $reader->name !== 'article');
// Iterate articles
while ($reader->name === 'article') {
    $doc = new DOMDocument('1.0', 'UTF-8');
    $article = simplexml_import_dom($doc->importNode($reader->expand(), true));
    processing($article);
    $reader->next('article');
}
$reader->close();
$article is a SimpleXMLElement which can be processed further.
This way you save a lot of memory, because only a single article node is held in memory at a time.
Additionally, if each processing() call takes a long time, you can turn it into a background process that runs separately from the main script, so that several processing() calls can run in parallel.
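If you go the background-process route, one simple (if crude) way is to hand each article node off to a separate CLI script and let the shell background it. This is only a sketch: process_article.php and the temp-file handoff are assumptions, and in real code you would want to cap how many workers run at once.
$tmp = tempnam(sys_get_temp_dir(), 'article_');
file_put_contents($tmp, $article->asXML());
// The trailing "&" backgrounds the worker; redirecting output detaches it from this script.
exec('php process_article.php ' . escapeshellarg($tmp) . ' > /dev/null 2>&1 &');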
Key hints:
Dispose of data as you go through the processing.
Disposing of data means overwriting it with blank data; by the way, unset() is slower than overwriting with null (see the snippet right after these hints).
Use functions or static methods, and avoid creating OOP instances as much as possible.
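As an illustration of the "overwrite instead of unset" hint (the relative speed of the two is the answerer's claim, so measure it in your own environment):
foreach ($results->xpath('//i:Item') as $data) {
    processing($data);
    $data = null; // release the reference by overwriting it rather than calling unset()
}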
One extra question: how long does it take to loop over your XML without doing the [lots of things]?
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    //it will loop over 10K
    foreach($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        //my processing code here, it will call other functions to do a lot of things
        //processing($data);
    }
    //unset($results); // no need
}

PHP fork process - getting child output in parent

I want to achieve the following:
Initialize an array. The child process adds some elements to the array. The parent process adds some elements to the array. Finally, before exiting, print all elements.
Following is the code that I wrote:
<?php
$values = array();
$pid = pcntl_fork();
if (!$pid) {
    sleep(2);
    $values[] = "Put by child";
    exit(0);
}
$values[] = "Put by parent";
pcntl_waitpid($pid, $status);
print_r($values);
?>
However, it only prints one value: "Put by parent". Can someone please explain this behavior and suggest the right code?
Regards,
JP
(sorry for crossposting)
I suggest a look at socket_create_pair().
The PHP manual has a very short and easy example of interprocess communication (IPC) between a fork() parent and its child.
And using serialize() and unserialize(), you could even transfer complex data types like arrays...
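A minimal sketch of that idea applied to the question's array (buffer size and error handling are simplified):
<?php
$values = array();
socket_create_pair(AF_UNIX, SOCK_STREAM, 0, $pair);
list($parentSock, $childSock) = $pair;

$pid = pcntl_fork();
if (!$pid) {
    // Child: build its part of the array and send it back serialized
    socket_close($parentSock);
    sleep(2);
    socket_write($childSock, serialize(array("Put by child")));
    socket_close($childSock);
    exit(0);
}

// Parent: add its own value, wait for the child, then merge in what the child sent
socket_close($childSock);
$values[] = "Put by parent";
pcntl_waitpid($pid, $status);
$values = array_merge($values, unserialize(socket_read($parentSock, 4096))); // 4096 is an arbitrary buffer size
socket_close($parentSock);
print_r($values);
?>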
Forked children will gain their own dedicated copy of their memory space as soon as they write anywhere to it - this is "copy-on-write". While shmop does provide access to a common memory location, the actual PHP variables and whatnot defined in the script are NOT shared between the children.
Doing $x = 7; in one child will not make the $x in the other children also become 7. Each child will have its own dedicated $x that is completely independent of everyone else's copy.
A local Unix domain socket is easiest: have the parent open one with fsockopen() for each child immediately before the fork. That way you can have one communication channel per child: http://php.net/manual/en/transports.unix.php
You could also use shared memory, or open a bi-directional communications channel between the two processes and build a little API to send data back and forth.
As long as the parent and children know the key(s) of the shared memory segment, it is fine to do a shmop_open() before pcntl_fork(). But remember that pcntl_fork() returns 0 in the child process and -1 on failure to create the child (check your code near the comment /*confusion*/). The parent will receive in $pid the PID of the child process just created.
Check it here:
http://php.net/manual/es/function.pcntl-fork.php
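A rough sketch of that pattern (the ftok() key, fixed segment size, and naive read are simplifications):
<?php
$key  = ftok(__FILE__, 'a'); // both processes derive the same key
$size = 1024;
$shm  = shmop_open($key, 'c', 0644, $size);

$pid = pcntl_fork();
if (!$pid) {
    // Child: write its contribution into the shared segment
    shmop_write($shm, serialize(array("Put by child")), 0);
    exit(0);
}

// Parent: wait for the child, then read back what it wrote
pcntl_waitpid($pid, $status);
$childValues = unserialize(rtrim(shmop_read($shm, 0, $size), "\0"));
shmop_delete($shm);
print_r(array_merge(array("Put by parent"), $childValues));
?>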
The child's code is missing the print_r() statement.
The parent won't print what the child added to $values, as the addition was done after the child process had been fork()ed off, and by then the child had its own copy of the process' memory.
From the fork-tag's excerpt (emphasis by me):
The fork() function is the Unix/Linux/POSIX way of creating a new process by duplicating the calling process.
This behaviour of forking is different from threading where all threads share the same address space.
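In other words, if each process prints its own copy you will see both values, just from two separate address spaces; here is the question's code with that one change:
<?php
$values = array();
$pid = pcntl_fork();
if (!$pid) {
    // Child: this $values is a private copy-on-write copy of the parent's array
    sleep(2);
    $values[] = "Put by child";
    print_r($values); // prints only "Put by child"
    exit(0);
}
// Parent: its copy never sees the child's write
$values[] = "Put by parent";
pcntl_waitpid($pid, $status);
print_r($values); // prints only "Put by parent"
?>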

Memory usage increasing inside loop: are Magento functions the cause?

My platform is PHP 5.2, Apache, Magento EE 1.9 and CentOS.
I have a pretty basic script which is fetching about 60,000 rows of data from an MS-SQL database using PHP's mssql_* functions. The data is then processed a bit via data from Magento and finally written to a text file.
Really simple stuff...
$result = mssql_query($query);
while($row = mssql_fetch_assoc($result)) {
    $member = $row; // Copied so I can modify it
    // Do some stuff with each row... e.g.:
    $customer = Mage::getModel("customer/customer");
    $customer->loadByEmail($member["email"]);
    $customerId = $customer->getId();
    // Some more stuff like that...
    $ordersCollection = Mage::getResourceModel('sales/order_collection');
    // ...........
    // Some more stuff like that...
    $wishList = Mage::getModel('wishlist/wishlist')->loadByCustomer($customer);
    // ...........
    // Write straight to a file
    fwrite($fp, implode("\t", $member) . "\r\n");
    // Probably not even necessary
    unset($member);
}
The problem is, the memory usage of my script increases with each iteration of the loop (about 10MB for every 300 rows), with a theoretical peak of about 2GB (though it hasn't got there yet).
I've taken great pains to ensure that I'm not leaving any data in memory. No huge arrays are building up, no variables are being added to, everything is either unset() or directly overwritten with each iteration of the loop.
So my question is: could the Magento functions be causing memory leaks?
And if so, how do I stop them from doing so?
Ideally this script should be totally "passive": just grab the query results, modify them a bit (very temporary memory needed for this) then dump them straight to a file and destroy the memory. But this is not happening!
Thanks
Exclude all Mage:: calls from your code and just dump the data to the file without processing, and see what happens to memory while doing this. Then start adding the Mage:: calls back one by one and see when it breaks.
This way you'll find the culprit. Then you need to start digging into its implementation and see what could be going wrong. You could also consider doing the processing without relying on the Mage:: calls at all: just write plain code to deal with the data in self-contained functions/classes and compare how things turn out when you exclude Mage:: entirely from the process.
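One cheap way to run that bisection is to log memory_get_usage() every few hundred rows while you comment the Mage:: calls in and out; whichever call makes the number keep climbing is your leak. A sketch:
$i = 0;
while ($row = mssql_fetch_assoc($result)) {
    // ... the Mage:: calls currently under suspicion ...
    if (++$i % 300 === 0) {
        // Log real memory growth every 300 rows
        error_log(sprintf("row %d: %.1f MB", $i, memory_get_usage(true) / 1048576));
    }
}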
Yes: PHP has a long history of non-ideal behavior when it comes to memory management and code that pushes the edges of its object-oriented model.
You can try an alternate method of querying for your data that wastes less memory, or you can read up on how the Magento core team deals with this same issue.

What's wrong with my concurrent programming logic?

I wrote a web spider to spider pages concurrently. For each link that the spider finds, I want to fork off a new child that starts the process all over again.
I don't want to overload the target server, so I created a static array that all objects can access. Each child can add its PID to the array, and either parent or child should check the array to see if $maxChildren has been reached and, if so, patiently wait until any child finishes.
As you see, I have $maxChildren set to 3. I am expecting to see 3 simultaneous processes at any given time. However, that's not the case. The Linux top command shows 12 to 30 processes at any given time. In concurrent programming, how can I regulate the number of simultaneous processes? My logic is currently inspired by how Apache handles its max children, but I'm not exactly sure how that works.
As pointed out in one of the answers, globally accessing the static variable brings up issues with race conditions. To deal with this, the $children array takes the unique $pid of the process as both the key and its value, thereby creating a unique entry. My thinking is that since any object can only deal with one $children[$pid] value, locking is not necessary. Is this not true? Is there a chance that two processes could try to unset or add the same value at some point?
private static $children = array();

private $maxChildren = 3;

public function concurrentSpider($url) {
    // STEP 1:
    // Download the $url
    $pageData = http_get($url, $ref = '');
    if (!$this->checkIfSaved($url)) {
        $this->save_link_to_db($url, $pageData);
    }

    // STEP 2:
    // extract all hyperlinks from this url's page data
    $linksOnThisPage = $this->harvest_links($url, $pageData);

    // STEP 3:
    // Check the links array from STEP 2 to see if this page has
    // already been saved or is excluded because of any other
    // logic from the excluded_link() function
    $filteredLinks = $this->filterLinks($linksOnThisPage);
    shuffle($filteredLinks);

    // STEP 4: loop through each of the links and
    // repeat the process
    foreach ($filteredLinks as $filteredLink) {
        $pid = pcntl_fork();
        switch ($pid) {
            case -1:
                print "Could not fork!\n";
                exit(1);
            case 0:
                if ($this->checkIfSaved($filteredLink)) {
                    exit();
                }
                //$pid = getmypid();
                print "In child with PID: " . getmypid() . " processing $filteredLink \n";
                $var[$pid]->concurrentSpider($filteredLink);
                sleep(2);
                exit(1);
            default:
                // Add an element to the children array
                self::$children[$pid] = $pid;
                // If the maximum number of children has been
                // achieved, wait until one or more return
                // before continuing.
                while (count(self::$children) >= $this->maxChildren) {
                    //print count(self::$children) . " children \n";
                    $pid = pcntl_waitpid(-1, $status);
                    unset(self::$children[$pid]);
                }
        }
    }
}
This is written in PHP. I know that the pcntl_waitpid() function with an argument of -1 waits for any child to complete regardless of the parent (http://php.net/manual/en/function.pcntl-waitpid.php).
What's wrong with my logic and how can I correct it so that only $maxChildren processes are running simultaneously? I'm also open to improving the logic in general if you have suggestions.
First thing to note: if this is truly a global being shared among multiple threads, it's possible that multiple threads are adding to it at once and you're running afoul of a race condition. You need some sort of concurrency control to ensure that only one process is accessing your global array at once.
Also, try the simple debugging trick of having each process write out (to the console or to a file) its PID and the full contents of the global array each time a new spider is forked. It will help you to check your assumptions (which are plainly wrong at some point) and figure out what's going wrong.
EDIT: (In response to the comments)
I'm not a PHP developer, but based on the fact that you're using an OS tool that counts OS-level processes, I'd guess that your fork is spawning multiple processes, but your static array is global only within the current process. Implementing system-wide shared memory is a lot more complicated!
If you just want to count something and ensure that instances of a shared resource don't grow out of control, look into semaphores, and see if you can find a way in PHP to create a named semaphore object that can be shared between multiple instances of your spider.
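PHP's System V semaphore functions can express that kind of cap directly, because sem_get() takes a max_acquire count and the semaphore is shared between processes. A rough sketch (the key and limit are illustrative); note that the extra processes still exist, they simply block until a slot frees up:
// At most 3 processes may hold this named semaphore at any one time.
$sem = sem_get(ftok(__FILE__, 's'), 3);

$pid = pcntl_fork();
if ($pid === 0) {
    sem_acquire($sem);  // block here until fewer than 3 spiders are working
    // ... spider one URL ...
    sem_release($sem);
    exit(0);
}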
Use a real programming language ;)
Step 1 is kind of bad: why are you downloading if it might already be in the DB? Put the download inside the if, and see if you can put a mutex around it, or maybe do something in SQL to imitate one.
I hope harvest_links uses a proper HTML parser with CSS selector support (I like Fizzler for .NET). I guess a regular expression would be fine if it's just to get links, but it is possible to mess that up.
I see step 4 and I don't think it's bad, but personally I'd do it a different way.
I'd have something like step one insert url, page, and a flag into a DB. Then I'd have another process (or the same one) ask the DB for unprocessed pages, setting the flag to one value if processing errors and another if it succeeds. That way, if something fails or the process exits (shutdown, crash, power outage, etc.), it can pick up where it left off easily and doesn't need to re-scan every page; it just asks the database for the next link and redoes what it didn't finish (a sketch follows below).
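A sketch of that flag-based queue; the queue table, its columns, and the PDO connection are made-up names, and SELECT ... FOR UPDATE assumes a database/engine that supports row locking (e.g. InnoDB):
// Claim the next unprocessed URL atomically, then record the outcome.
$pdo->beginTransaction();
$row = $pdo->query("SELECT url FROM queue WHERE status = 'new' LIMIT 1 FOR UPDATE")->fetch();
if ($row) {
    $pdo->prepare("UPDATE queue SET status = 'processing' WHERE url = ?")->execute(array($row['url']));
}
$pdo->commit();

if ($row) {
    try {
        // ... download the page, harvest links, insert any new URLs with status 'new' ...
        $pdo->prepare("UPDATE queue SET status = 'done' WHERE url = ?")->execute(array($row['url']));
    } catch (Exception $e) {
        $pdo->prepare("UPDATE queue SET status = 'error' WHERE url = ?")->execute(array($row['url']));
    }
}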
PHP doesn't support multithreading, therefore it doesn't support mutexes or any other synchronization methods. As others have said in their answers, this will lead to a race condition.
You'll have to write a wrapper in C or bash. That way, the PHP script can submit targets to the wrapper, and the wrapper will handle scheduling.
Another approach is to rewrite your spider in Python or Ruby, both of which support multithreading. That will eliminate the need for interprocess communication.
Edit: On second thought, the best way is to write the wrapper in Python or Ruby and reuse your existing PHP code as a black box. That's a compromise of the solutions above.
If the spider is for practical purposes, you might want to google "curl multithread"
cURL Multi Threading with PHP

PHP foreach stack - is it possible that functions called in a foreach loop are still running when the next iteration is called

I am having problems with cURL not being able to connect to a server that returns an XML feed, and am not sure if my code is stacking up and causing the problem. Is it possible the final function called in this foreach loop is still running when the next loop iteration comes round?
Is it possible to make sure all functions in the loop complete before the next iteration begins, or does foreach do this by default anyway? I tried having process_xml() return true and running a test in the loop: if ($this->process_xml($xml_array)) continue;
but it didn't seem to have an effect and seems like a bad idea anyway.
foreach($arrayOfUrls as $url){
    //retrieve xml from url as string.
    if($url_xml_string = $this->getFeedStringUsing_cURL($url)){
        $xml_object = simplexml_load_string($url_xml_string);
        $xml_array = $this->feedStringToArray($xml_object);
        //process the xml.
        $this->process_xml($xml_array);
    }
}
No, this is not possible. Each statement is executed and finished before the next statement is run.
and am not sure if my code is stacking up
Not sure? If it's important to you, why don't you find out? Without knowing what OS you are running on, it's rather hard to advise how you'd go about that, but netstat might be a good starting point.
Is it possible the final function called in this foreach loop is still running
It's highly improbable: PHP scripts run in a single thread of execution unless you tell them not to. However, the curl extension allows you to define callbacks into your PHP code which run before the operation completes, and the curl_multi_* family of functions also allows you to run PHP code while requests are in progress.
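For completeness, here is a bare-bones curl_multi sketch that fetches all the feeds in parallel and then runs them through the existing processing pipeline (error handling omitted; getFeedStringUsing_cURL would no longer be needed):
$mh = curl_multi_init();
$handles = array();
foreach ($arrayOfUrls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until every request has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $url_xml_string = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    $xml_array = $this->feedStringToArray(simplexml_load_string($url_xml_string));
    $this->process_xml($xml_array);
}
curl_multi_close($mh);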
