For a while now, I've been storing serialized objects from forked processes in /tmp with file_put_contents.
Once all child processes wrap up, I'm simply using file_get_contents and unserializing the data to rebuild my object for processing.
So my question is: is there a better way of storing my data without writing to /tmp?
Outside of storing the data in a file, the only other native solutions that come to mind are shm http://www.php.net/manual/en/function.shm-attach.php or stream socket pairs http://www.php.net/manual/en/function.stream-socket-pair.php
Either of these should be doable if the collected data is unimportant after the script has run. The idea behind both of them is simply to open a communication channel between your parent and child processes. That said, my personal opinion is that unless the file system is actually causing you some sort of issue, it is by far the least complicated solution.
SHM
The idea with shm is that instead of storing the serialized objects in a file, you would store them in an shm segment protected for concurrency by a semaphore. Forgive the code; it is rough, but it should be enough to give you the general idea.
/*** Configurations ***/
$blockSize = 1024; // Size of block in bytes
$shmVarKey = 1; //An integer specifying the var key in the shm segment
/*** In the children processes ***/
//First you need to get a semaphore, this is important to help make sure you don't
//have multiple child processes accessing the shm segment at the same time.
$sem = sem_get(ftok(__FILE__, 's')); // same key in every process, so they all share one semaphore
//Then you need your shm segment
$shm = shm_attach(ftok(__FILE__, 'm'), $blockSize); // same key the parent uses below
if (!$sem || !$shm) {
//error handling goes here
}
//if multiple forks hit this line at roughly the same time, the first one gets the lock
//everyone else waits until the lock is released before trying again.
sem_acquire($sem);
$data = shm_has_var($shm, $shmVarKey) ? shm_get_var($shm, $shmVarKey) : array();
//Here you could key the data array by probably whatever you are currently using to determine file names.
$data['child specific id'] = 'my data'; // can be an object, array, anything that is php serializable, though resources are wonky
shm_put_var($shm, $shmVarKey, $data); // important to note that php handles the serialization for you
sem_release($sem);
/*** In the parent process ***/
$shm = shm_attach(ftok(__FILE__, 'm'), $blockSize);
$data = shm_get_var($shm, $shmVarKey);
foreach ($data as $key => $value)
{
//process your data
}
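Once the parent has consumed the data, it is worth cleaning up; otherwise the segment and semaphore can outlive the script. A short sketch, assuming the parent also obtained $sem with the same key the children used:
shm_remove($shm); // mark the segment for removal
shm_detach($shm);
sem_remove($sem); // assumes the parent did sem_get(ftok(__FILE__, 's')) as well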
Stream Socket Pair
I personally love using these for inter-process communication. The idea is that prior to forking, you create a stream socket pair. This results in two read/write sockets being created that are connected to each other. One of them should be used by the parent, and one by the child. You would have to create a separate pair for each child, and it will change your parent's model a little bit in that it will need to manage the communication more in real time.
Fortunately the PHP docs for this function have a great example: http://us2.php.net/manual/en/function.stream-socket-pair.php
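Boiled down, the pattern looks roughly like this (a sketch, not the manual's exact example; it assumes the pcntl extension, one pair per child, and the fgets() framing assumes the serialized payload contains no newlines):
$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
$pid = pcntl_fork();
if ($pid == -1) {
    die('could not fork');
} elseif ($pid == 0) {
    // Child: write its result, serialized, to its end of the pair
    fclose($pair[1]);
    fwrite($pair[0], serialize(array('pid' => getmypid(), 'data' => 'my data')) . "\n");
    fclose($pair[0]);
    exit(0);
}
// Parent: read the child's result from the other end
fclose($pair[0]);
$childData = unserialize(trim(fgets($pair[1])));
fclose($pair[1]);
pcntl_waitpid($pid, $status);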
You could use a shared memory cache such as memcached, which would be faster, but depending on what you're doing and how sensitive/important the data is, a file-based solution may be your best option.
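For the memcached route, a rough sketch using the Memcached extension (the server address and key names are made up, and $myObject/$childPids stand in for whatever you already track; Memcached serializes values for you):
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// In each child: store its result under a child-specific key
$cache->set('worker_result_' . getmypid(), $myObject, 3600);

// In the parent, after the children have exited: collect the results
foreach ($childPids as $pid) {
    $result = $cache->get('worker_result_' . $pid);
    if ($result !== false) {
        $objects[$pid] = $result;
    }
}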
Related
I want to achieve the following:
Initialize an array. Child process adds some elements to the array. Parent process adds some elements to the array. Finally before exiting, print all elements.
Following is the code that I wrote:
<?php
$values=array();
$pid = pcntl_fork();
if (!$pid) {
sleep(2);
$values[]="Put by child";
exit(0);
}
$values[]="Put by parent";
pcntl_waitpid($pid, $status);
print_r($values);
?>
However, it only prints one value: "Put by parent". Can someone please explain this behavior and suggest the right code?
Regards,
JP
(sorry for crossposting)
I suggest a look at socket_create_pair().
The PHP manual has a very short and easy example of interprocess communication (IPC) between a fork() parent and its child.
And using serialize() and unserialize(), you could even transfer complex data types like arrays...
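Applied to the code above, a minimal sketch might look like this (it assumes the sockets extension; on Windows you would need AF_INET instead of AF_UNIX):
$sockets = array();
socket_create_pair(AF_UNIX, SOCK_STREAM, 0, $sockets);

$values = array();
$pid = pcntl_fork();
if (!$pid) {
    // Child: send its contribution to the parent, serialized
    sleep(2);
    socket_write($sockets[1], serialize(array("Put by child")));
    socket_close($sockets[1]);
    exit(0);
}
$values[] = "Put by parent";
pcntl_waitpid($pid, $status);

// Merge in whatever the child sent before it exited
$values = array_merge($values, unserialize(socket_read($sockets[0], 4096, PHP_BINARY_READ)));
print_r($values);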
Forked children will gain their own dedicated copy of their memory space as soon as they write anywhere to it - this is "copy-on-write". While shmop does provide access to a common memory location, the actual PHP variables and whatnot defined in the script are NOT shared between the children.
Doing $x = 7; in one child will not make the $x in the other children also become 7. Each child will have its own dedicated $x that is completely independent of everyone else's copy.
A local (Unix domain) socket is easiest. Have the parent open one with fsockopen for each child immediately before the fork; that way you can have one comm channel per child: http://php.net/manual/en/transports.unix.php
You could also use shared memory, or open a bi-directional communications channel between the two processes and build a little API to send data back and forth.
As long as the parent and children know the key(s) of the shared memory segment, it is OK to do a shmop_open before pcntl_fork. But remember that pcntl_fork returns 0 in the child process and -1 on failure to create the child (check your code near the comment /confusion/). The parent will have in $pid the PID of the child process just created.
Check it here:
http://php.net/manual/es/function.pcntl-fork.php
The child's code is missing a print_r() statement.
The parent won't print what the child added to $values, as the addition was done after the child process had been fork()ed off, and with this it had gotten its own copy of the process's memory.
From the fork-tag's excerpt (emphasis by me):
The fork() function is the Unix/Linux/POSIX way of creating a new process by duplicating the calling process.
This behaviour of forking is different from threading where all threads share the same address space.
My platform is PHP 5.2, Apache, Magento EE 1.9 and CentOS.
I have a pretty basic script which is fetching about 60,000 rows of data from an MS-SQL database using PHP's mssql_*() functions. The data is then processed a bit via data from Magento and finally written to a text file.
Really simple stuff...
$result = mssql_query($query);
while($row = mssql_fetch_assoc($result)) {
$member = $row; // Copied so I can modify it
// Do some stuff with each row... e.g.:
$customer = Mage::getModel("customer/customer");
$customer->loadByEmail($member["email"]);
$customerId = $customer->getId();
// Some more stuff like that...
$ordersCollection = Mage::getResourceModel('sales/order_collection');
// ...........
// Some more stuff like that...
$wishList = Mage::getModel('wishlist/wishlist')->loadByCustomer($customer);
// ...........
// Write straight to a file
fwrite($fp, implode("\t", $member) . "\r\n");
// Probably not even necessary
unset($member);
}
The problem is, the memory usage of my script increases with each iteration of the loop (about 10MB for every 300 rows), with a theoretical peak of about 2GB (though it hasn't got there yet).
I've taken great pains to ensure that I'm not leaving any data in memory. No huge arrays are building up, no variables are being added to, everything is either unset() or directly overwritten with each iteration of the loop.
So my question is: could the Magento functions be causing memory leaks?
And if so, how do I stop them from doing so?
Ideally this script should be totally "passive": just grab the query results, modify them a bit (very temporary memory needed for this) then dump them straight to a file and destroy the memory. But this is not happening!
Thanks
Exclude all Mage:: from your code and just dump data to the file without processing. And see what happens to the memory while doing this. Then start adding the Mage:: functions back one by one and see when it breaks.
This way you'll find the culprit. Then you need to start digging into its implementation and see what could go wrong. You could also consider doing the processing without relying on your Mage:: calls: just write plain code to deal with the data in self-contained functions/classes and compare how things turn out if you exclude Mage:: entirely from the process.
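A quick way to narrow it down is to log memory growth around each suspect call inside the loop, something like this (a plain-PHP sketch, nothing Magento-specific; the 1 MB threshold is arbitrary):
$before = memory_get_usage();

$customer = Mage::getModel("customer/customer");
$customer->loadByEmail($member["email"]);

$grew = memory_get_usage() - $before;
if ($grew > 1048576) { // flag any iteration where this call grew memory by more than ~1 MB
    error_log("loadByEmail grew memory by $grew bytes for " . $member["email"]);
}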
Yes: PHP has a long history of non-ideal behavior when it comes to memory management and code that pushes the edges of its object-oriented model.
You can try an alternate method of querying for your data that wastes less memory, or you can read up on how the Magento core team deals with this same issue.
I would like to implement a quick and efficient serialization mechanism between PHP requests for virtual named resources that would unlock when the script is finished, either normally or due to error. I had eaccelerator_lock() and its corresponding eaccelerator_unlock() in the past, but eaccelerator doesn't implement that function anymore. What I want to do is something like:
lock_function("my-named-resource");
..
my_might_abort_abruptly_function();
..
unlock_function("my-named-resource");
Other PHP scripts calling lock_function() with the exact same parameter should block until this script calls unlock_function() or aborts. The resource name is unknown before the processing (it's a generated string) and can't be constrained to a small set (i.e., the locking mechanism should have good granularity). I would like to avoid try/catch code, because there are circumstances in which catch is not called. Also, any mechanism depending on manual usleep() spinning (instead of native OS blocking) should be avoided.
Mine is the only running application in the server. The system is a CentOS 6 Linux with PHP 5.3.3, Apache 2.2.15 and I have full control over it.
I explored the following alternatives:
semaphores: they are not well implemented in PHP; Linux allows arrays of thousands, while PHP only allocates one per id.
flock(): my resources are virtual, and flock() would only lock whole/real/existing files; I'd need to pre-create thousands of files and choose one to lock with a hash function. The granularity would depend on the number of files.
dio_fcntl(): I could attempt to reproduce the idea of flock() with a single file and fcntl(F_SETLK). This would have the advantage of good granularity without the need for many files; the file could even be zero bytes long! (F_SETLK can lock beyond the end of the file.) Alas! The problem is that nowhere in the documentation does it say that dio_fcntl() will release its resources when the script terminates.
database lock: I could implement some key locking in a database with good key-locking granularity, although this is too database-dependent. It would not be as quick, either.
implement my own PHP extension: I'd really like to avoid that path.
The thing is, I think someone somewhere should have thought of this before me. What would be a good choice? Is there another solution I'm not seeing?
Thanks in advance. Guillermo.
You can always go old school and touch a file when your script starts and remove it when complete.
You could register_shutdown_function to remove the file.
The existence or absence of the file would indicate the locked state of the resource.
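A rough sketch of that idea ($resource_name stands for the generated name from the question and the path scheme is made up; note that the file_exists()/touch() pair is not atomic, so two requests racing on the same name could in theory both get through):
$lockfile = '/tmp/lock-' . md5($resource_name);

if (file_exists($lockfile)) {
    exit("resource is locked by another request\n");
}
touch($lockfile);

// Remove the lock even if the script errors out or exits early
register_shutdown_function(function () use ($lockfile) {
    @unlink($lockfile);
});

my_might_abort_abruptly_function();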
It turns out dio_open() does release the resources upon script termination. So I ended up writing the following functions:
$lockfile = $writable_dir."/serialized.lock";
function serialize_access($name)
{
$f = serialize_openfile();
if( !$f ) return false;
$h = serialize_gethash($name);
return dio_fcntl($f, F_SETLKW, array("whence"=>SEEK_SET,"start"=>$h, "length"=>1, "type"=>F_WRLCK)) >= 0;
}
function serialize_release($name)
{
$f = serialize_openfile();
if( !$f ) return false;
$h = serialize_gethash($name);
@dio_fcntl($f, F_SETLK, array("whence"=>SEEK_SET,"start"=>$h, "length"=>1, "type"=>F_UNLCK));
}
function serialize_gethash($name)
{
// Very good granularity (2^31)
return crc32($name) & 0x7fffffff;
}
function serialize_openfile()
{
global $lockfile, $serialize_file;
if( !isset($serialize_file) )
{
$serialize_file = false;
if( extension_loaded("dio") )
{
$serialize_file = @dio_open($lockfile,O_RDWR);
if( !$serialize_file )
{
// Do not attempt to create the file with dio_open()
// because the file permissions get all mangled.
$prev = umask(0);
$temp = fopen($lockfile,"a");
if( $temp )
{
$serialize_file = @dio_open($lockfile,O_RDWR);
fclose($temp);
}
umask($prev);
}
}
}
return $serialize_file;
}
It seems to work very well.
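Usage then mirrors the lock/unlock pseudo-code from the question:
serialize_access("my-named-resource");
my_might_abort_abruptly_function();
serialize_release("my-named-resource");
// if the script aborts before the release, the fcntl() lock is dropped
// automatically when the dio file descriptor is closed at termination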
implement my own PHP extension
You might want to check the ninja-mutex library, which does exactly what you want.
I wrote a web spider to spider pages concurrently. For each link that the spider finds, I want to fork off a new child that starts the process all over again.
I don't want to overload the target server, so I created a static array that all objects can access. Each child can add its PID to the array, and either parent or child should check the array to see if $maxChildren has been met, and if so, patiently wait until any child finishes.
As you can see, I have $maxChildren set to 3. I am expecting to see 3 simultaneous processes at any given time. However, that's not the case. The Linux top command shows 12 to 30 processes at any given time. In concurrent programming, how can I regulate the number of simultaneous processes? My logic is currently inspired by how Apache handles its max children, but I'm not exactly sure how that works.
As pointed out in one of the answers, globally accessing the static variable brings up issues with race conditions. To deal with this, the $children array takes the unique $PID of the process as both the key and its value, thereby creating a unique value. My thinking is that since any object can only deal with one $children[$pid] value, locking is not necessary. Is this not true? Is there a chance that two processes could try to unset or add the same value at some point?
private static $children = array();
private $maxChildren = 3;
public function concurrentSpider($url) {
// STEP 1:
// Download the $url
$pageData = http_get($url, $ref = '');
if (!$this->checkIfSaved($url)) {
$this->save_link_to_db($url, $pageData);
}
// STEP 2:
// extract all hyperlinks from this url's page data
$linksOnThisPage = $this->harvest_links($url, $pageData);
// STEP 3:
// Check the links array from STEP 2 to see if this page has
// already been saved or is excluded because of any other
// logic from the excluded_link() function
$filteredLinks = $this->filterLinks($linksOnThisPage);
shuffle($filteredLinks);
// STEP 4: loop through each of the links and
// repeat the process
foreach ($filteredLinks as $filteredLink) {
$pid = pcntl_fork();
switch ($pid) {
case -1:
print "Could not fork!\n";
exit(1);
case 0:
if ($this->checkIfSaved($filteredLink)) {
exit();
}
//$pid = getmypid();
print "In child with PID: " . getmypid() . " processing $filteredLink \n";
$this->concurrentSpider($filteredLink);
sleep(2);
exit(1);
default:
// Add an element to the children array
self::$children[$pid] = $pid;
// If the maximum number of children has been
// achieved, wait until one or more return
// before continuing.
while (count(self::$children) >= $this->maxChildren) {
//print count(self::$children) . " children \n";
$pid = pcntl_waitpid(-1, $status);
unset(self::$children[$pid]);
}
}
}
}
This is written in PHP. I know that the pcntl_waitpid function with argument of -1 waits for any child to complete regardless of the parent (http://php.net/manual/en/function.pcntl-waitpid.php).
What's wrong with my logic and how can I correct it so that only $maxChildren processes are running simultaneously? I'm also open to improving the logic in general if you have suggestions.
First thing to note: if this is truly a global being shared among multiple threads, it's possible that multiple threads are adding to it at once and you're running afoul of a race condition. You need some sort of concurrency control to ensure that only one process is accessing your global array at once.
Also, try the simple debugging trick of having each process write out (to the console or to a file) its PID and the full contents of the global array each time a new spider is forked. It will help you to check your assumptions (which are plainly wrong at some point) and figure out what's going wrong.
EDIT: (In response to the comments)
I'm not a PHP developer, but if I had to guess, based on the fact that you're using an OS tool that counts OS-level processes, I'd guess that your fork is spawning multiple processes, but your static array is global within the current process. Implementing system-wide shared memory is a lot more complicated!
If you just want to count something and ensure that instances of a shared resource don't grow out of control, look into semaphores, and see if you can find a way in PHP to create a named semaphore object that can be shared between multiple instances of your spider.
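PHP's System V semaphore functions can serve as that system-wide counter. A rough sketch (the ftok() key and the limit of 3 are arbitrary; each process re-obtains the semaphore by the same key):
// In every process, before doing any work:
// a max_acquire of 3 means at most three holders at any one time.
$sem = sem_get(ftok(__FILE__, 's'), 3);

sem_acquire($sem);   // blocks until one of the 3 slots frees up
// ... download and process one URL here ...
sem_release($sem);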
Use a real programming language ;)
Step 1 is kind of bad: why are you downloading if it might already be in the DB? Put that inside the if, and see if you can put a mutex around it, or maybe do something in SQL to imitate one.
I hope harvest_links uses a proper HTML processor with CSS selector support (I like Fizzler for .NET). I guess a regular expression would be fine if it's just to get links, but it is possible to mess up.
I see step 4 and I don't think it's bad, but personally I'd do it a different way.
I'd have something like step 1 insert url, page and a flag into a DB. Then I'd have another process (or the same one) ask the DB for unprocessed pages and set the flag to one value if it errors and another if it succeeds. That way, if something fails or the process exits (shutdown, crash, power outage, etc.), it can pick up easily and doesn't need to scan every page to find where it left off. It just asks the database for the next link and redoes what it didn't finish.
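A sketch of that flag-based approach with PDO (the table and column names are made up, process_page() is a placeholder for the actual work, and FOR UPDATE assumes a transactional engine such as InnoDB):
$db = new PDO('mysql:host=localhost;dbname=spider', 'user', 'pass');

// Atomically claim one unprocessed page
$db->beginTransaction();
$row = $db->query("SELECT id, url FROM pages WHERE flag = 'new' LIMIT 1 FOR UPDATE")->fetch();
if ($row) {
    $db->prepare("UPDATE pages SET flag = 'working' WHERE id = ?")->execute(array($row['id']));
}
$db->commit();

if ($row) {
    try {
        process_page($row['url']); // download, harvest links, save, etc.
        $flag = 'done';
    } catch (Exception $e) {
        $flag = 'error';
    }
    $db->prepare("UPDATE pages SET flag = ? WHERE id = ?")->execute(array($flag, $row['id']));
}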
PHP doesn't support multithreading, therefore it doesn't support mutexes or any other synchronization methods. As others have said in their answers, this will lead to a race condition.
You'll have to write a wrapper in C or bash. That way, the PHP script can submit targets to the wrapper, and the wrapper will handle scheduling.
Another approach is to rewrite your spider in Python or Ruby, both of which support multithreading. That will eliminate the need for interprocess communication.
Edit: On second thought, the best way is to write the wrapper in Python or Ruby and reuse your existing PHP code as a black box. That's a compromise of the solutions above.
If the spider is for practical purposes, you might want to google "curl multithread"
cURL Multi Threading with PHP
Apologies if this has been covered before - I did my searching but possibly may not know the correct terms to have used.
This process is handled with PHP.
Here's the situation:
I have a large array of file names. The script I have opens these files and enters their content into a database. Processing these files one at a time takes over 24 hours, and these files are updated on a daily basis.
Breaking the single large array into four smaller arrays and running concurrent processes finishes the job before the 24 hour window elapses, but sometimes one or two processes will finish hours before the others because file sizes vary on a daily basis.
Much like people who stock retail shelves (who else has worked that nightmare before?) pitch in to help out with what's left after finishing their own tasks, I'd like to have a script in place where these "agents" do the same.
Here's some basics of what I have figured out - it could be wrong, and I'm not too proud to protest if I am :-)
$files = array('file1','file2','file3','file4','file5');
//etc... on to over 4k elements
while($file = array_pop($files)){
//Something in here... I have no idea what.
}
Ideas? Something like four function calls or four loops within that overarching 'while' has crossed my mind, but I'm pretty sure it's going to wait on executing subsequent calls until the previous one(s) finish.
Any help is appreciated. I'm seriously stuck on this one!
Thanks!
A database-backed message queue seems the obvious solution but I think that's overkill in this case. I would simply put the files to be processed into a single dedicated queue directory, then use the DirectoryIterator class to scan it. Something like this:
while (true) {
look in the queue directory for a file
if you don't find one, exit the script, all processing is done
if you find one, rename it or move it to a work directory
if the rename/move command succeeded, process the file
if the rename/move command failed, one of the other threads got it first
}
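In actual PHP, that loop might look something like this (the directory paths and process_file() are placeholders):
$queueDir = '/path/to/queue';
$workDir  = '/path/to/work';

while (true) {
    $file = null;
    foreach (new DirectoryIterator($queueDir) as $item) {
        if ($item->isFile()) {
            $file = $item->getFilename();
            break;
        }
    }
    if ($file === null) {
        exit; // queue is empty, all processing is done
    }
    // rename() is atomic on the same filesystem, so only one worker wins
    if (@rename($queueDir . '/' . $file, $workDir . '/' . $file)) {
        process_file($workDir . '/' . $file);
    }
    // if the rename failed, another worker grabbed the file first; just loop again
}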
Edit:
Regarding launching the workers, you could use a simple shell script to spawn the PHP processes in the background:
NUM_WORKERS=5
for WORKER in $(seq 1 ${NUM_WORKERS})
do
echo "starting worker ${WORKER}"
php -f /path/to/my/process.php &
done
Then, create a cron entry to run this launcher, for example, at midnight:
0 0 * * * /path/to/launcher.sh
You want what's called a "message queue". Something like beanstalkd
You'll basically create a list of messages that include your individual filenames. You'll then create a set of processors to process them. Each processor will handle one file then go back to the queue to see if there are more messages/files waiting to be processed.
EDIT:
Here's an analogy to help explain message queues. Your first idea is like a human manager taking a stack of files, dividing them into four piles and then handing each of his four employees a pile to process. A message queue is more like this: the manager puts all the files on a table and tells each employee to take a single file from the table and process it. He tells them when they're done with the first file to keep taking files until there are no more files on the table. When all the files are done, the employees can go home.
One employee might end up with really large files and only handle a few, while another employee might get smaller files and handle many. It doesn't matter how many each employee handles, they'll all keep working until the table is empty.
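In code, each "employee" is just a loop that keeps pulling jobs until the queue is empty. A sketch using the Pheanstalk client for beanstalkd ($files and process_file() are placeholders, and the method names are taken from Pheanstalk's README, so they may differ between client versions):
$pheanstalk = new Pheanstalk('127.0.0.1');

// Producer: put every filename on the queue once
foreach ($files as $file) {
    $pheanstalk->useTube('files')->put($file);
}

// Worker (run as many of these as you like): keep taking files until none are left
while ($job = $pheanstalk->watch('files')->reserve(60)) {
    process_file($job->getData()); // open the file and load it into the database
    $pheanstalk->delete($job);
}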
I would have a socket server master script that hands out file paths to x number of slave scripts, until there are no files left to process. This way, all the slave scripts will keep running, and you can hand out file paths dynamically as they are requested.
Something like this:
master.php
<?php
// load the array of files to process (however you do this)
$fileList = file('filelist.txt');
// Create a listening socket on localhost
$serverSocket = stream_socket_server('tcp://127.0.0.1:7878');
$sockets = array($serverSocket);
$clients = array();
// Loop while there are still files to process
while (count($fileList)) {
// Run a select() call on the existing sockets' read buffers
// Skip to next iteration if no sockets are waiting for handling
$read = $sockets;
$write = $except = NULL;
if (stream_select($read, $write, $except, 1) < 1) {
continue;
}
// Loop sockets with data to read
foreach ($read as $socket) {
if ($socket == $serverSocket) {
// Accept new clients
$sockets[] = $clients[] = stream_socket_accept($serverSocket);
} else if (trim(fgets($socket)) == 'next') {
// Hand out a new file path to the client
fwrite($socket, array_shift($fileList)."\n");
if (!count($fileList)) {
break 2;
}
}
}
}
// When we're done, disconnect the clients
foreach ($clients as $socket) {
@fclose($socket);
}
// ...and close the listen socket
@fclose($serverSocket);
slave.php
<?php
$socket = fsockopen('127.0.0.1', 7878);
while (!feof($socket)) {
// Get a new file path from the master
fwrite($socket,"next\n");
$path = trim(fgets($socket));
if (is_file($path)) {
// Process the file at $path here
}
}
You then just need to start master.php, then when it is running, you can start however many instances of slave.php as you want, and they will all keep running until there are no more files to process.
Obviously, this has no error handling, but it should provide a basic framework to get you started. This relies on blocking function calls (stream_select() and fgets()) to avoid a race condition - this may or may not be sufficient for your purposes.