Making a large processing job smaller - PHP

This is the code I'm using as I work my way to a solution.
public function indexAction()
{
    //id3 options
    $options = array("version" => 3.0, "encoding" => Zend_Media_Id3_Encoding::ISO88591, "compat" => true);
    //path to collection
    $path = APPLICATION_PATH . '/../public/Media/Music/'; //currently approx. 2000 files
    //inner iterator
    $dir = new RecursiveDirectoryIterator($path, RecursiveDirectoryIterator::SKIP_DOTS);
    //iterator
    $iterator = new RecursiveIteratorIterator($dir, RecursiveIteratorIterator::SELF_FIRST);
    foreach ($iterator as $file) {
        if (!$file->isDir() && $file->getExtension() === 'mp3') {
            //real path to mp3 file
            $filePath = $file->getRealPath();
            Zend_Debug::dump($filePath); //current results: accepted path, no errors
            $id3 = new Zend_Media_Id3v2($filePath, $options);
            $data = array(); //reset per file so tags don't carry over between files
            foreach ($id3->getFramesByIdentifier("T*") as $frame) {
                $data[$frame->identifier] = $frame->text;
            }
            Zend_Debug::dump($data); //currently can scan the whole collection without timing out, but APIC data not being processed
        }
    }
}
The problem: Process a file system of mp3 files in multiple directories. Extract id3 tag data to a database (3 tables) and extract the cover image from the tag to a separate file.
I can handle the actual extraction and data handling. My issue is with output.
With the way that Zend Framework 1.x handles output buffering, outputting an indicator that the files are being processed is difficult. In an old style PHP script, without output buffering, you could print out a bit of html with every iteration of the loop and have some indication of progress.
I would like to be able to process each album's directory, output the results and then continue on to the next album's directory. Only requiring user intervention on certain errors.
Any help would be appreciated.
JavaScript is not the solution I'm looking for. I feel that this should be possible within the constructs of PHP and a ZF 1 MVC.
I'm doing this mostly for my own enlightenment; it seems a very good way to learn some important concepts.
[EDIT]
OK, how about some ideas on how to break this down into smaller chunks? Process one chunk, commit, process the next chunk, that kind of thing. In or out of ZF.
[EDIT]
I'm beginning to see the problem with what I'm trying to accomplish. It seems that output buffering is not just happening in ZF, it's happening everywhere from ZF all the way to the browser. Hmmmmm...

Introduction
This is a typical example of what you should not do, because:
You are trying to parse ID3 tags with PHP, which is slow, and trying to parse multiple files at once would make it even slower.
RecursiveDirectoryIterator would load all the files in a folder and its subfolders; from what I see there is no limit. It can be 2,000 files today and 100,000 the next day, so total processing time is unpredictable and can easily run to hours in some cases.
High dependence on a single file system: with your current architecture the files are stored on the local system, so it would be difficult to split the files across machines and do proper load balancing.
You are not checking whether a file's information has already been extracted, which results in duplicated loops and extraction.
No locking system: this process can be started multiple times simultaneously, resulting in generally slow performance on the server.
Solution 1: With Current Architecture
My advice is not to use a loop or RecursiveDirectoryIterator to process the files in bulk.
Target each file as soon as it is uploaded or transferred to the server. That way you are only working with one file at a time, which spreads out the processing time.
Solution 2: Job Queue (Proposed Solution)
Your problem is exactly what job queues are designed to solve. You are also not limited to implementing the parsing in PHP; you can take advantage of C or C++ for performance.
Advantages
Transfer jobs to other machines or processes that are better suited to do the work
Allows you to do work in parallel and to load-balance processing
Reduces the latency of page views in high-volume web applications by running time-consuming tasks asynchronously
Multiple languages: client in PHP, server in C
Examples I have tested:
ZeroMQ
Gearman
Beanstalkd
Expected client process:
Connect to the job queue, e.g. Gearman
Connect to the database, e.g. MongoDB or Redis
Loop over the folder path
Check the file extension
If the file is an mp3, generate a file hash, e.g. with sha1_file
Check if the file has already been sent for processing
Send the hash and file path to the job server
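A rough client-side sketch of those steps, assuming the pecl gearman and phpredis extensions (the job name 'extract_id3', the Redis set 'mp3:queued' and the paths are illustrative, not part of this answer):
<?php
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // Gearman job server
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);    // tracks which hashes were already queued
$path = '/path/to/Media/Music/';
$dir = new RecursiveDirectoryIterator($path, RecursiveDirectoryIterator::SKIP_DOTS);
foreach (new RecursiveIteratorIterator($dir) as $file) {
    if ($file->isDir() || $file->getExtension() !== 'mp3') {
        continue;
    }
    $hash = sha1_file($file->getRealPath());
    if ($redis->sIsMember('mp3:queued', $hash)) {
        continue; // already sent for processing
    }
    // fire-and-forget: the job server queues it, a worker picks it up later
    $client->doBackground('extract_id3', json_encode(array(
        'hash' => $hash,
        'path' => $file->getRealPath(),
    )));
    $redis->sAdd('mp3:queued', $hash);
}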
Expected server process:
Connect to the job queue, e.g. Gearman
Connect to the database, e.g. MongoDB or Redis
Receive the hash / file
Extract the ID3 tag
Update the DB with the ID3 tag information
Finally, this processing can be done on multiple servers in parallel.
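And a minimal worker-side counterpart (again a sketch, not the author's code: the pecl gearman extension, with the 'extract_id3' job name matching the client sketch above and the ID3 handling mirroring the question's code):
<?php
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('extract_id3', function (GearmanJob $job) {
    $payload = json_decode($job->workload(), true);
    $id3 = new Zend_Media_Id3v2($payload['path']);
    $data = array();
    foreach ($id3->getFramesByIdentifier('T*') as $frame) {
        $data[$frame->identifier] = $frame->text;
    }
    // ...write $data (and the APIC cover image) to the database here...
});
while ($worker->work()); // block and wait for jobs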

One solution would be to use a job queue, such as Gearman. Gearman is an excellent solution for this kind of problem and is easily integrated with Zend Framework (http://blog.digitalstruct.com/2010/10/17/integrating-gearman-into-zend-framework/).
It will allow you to create a worker to process each "chunk", allowing your process to continue unblocked while the job is processed; very handy for long-running processes such as music/image processing, etc. http://gearman.org/index.php?id=getting_started

I'm not familiar with how Zend Framework works, so I will give you general advice. When working with a process that is iterative and possibly long-running, it is generally advisable to move the long process into a background process or, in a web context, into a cron job.
If the process you want is for a single site, you can implement something like this in your cron job (note: rough pseudo-code):
<?php
$targetdir = "/path/to/mp3";
$logdir = "/path/to/log/";
//check if a current state file exists. If it does, the previous cronjob is still running.
//we should stop this process so that it doesn't duplicate work, which might introduce random bugs
if (file_exists($logdir."current-state")) {
    exit;
}
//start process, write state to logdir
file_put_contents($logdir."current-log", "process started at ".date("Y-m-d H:i:s")."\n");
file_put_contents($logdir."current-state", "started\t".date("Y-m-d H:i:s"));
$dirh = opendir($targetdir);
while (($file = readdir($dirh)) !== false) {
    //let's ignore the current and parent dir
    if (in_array($file, array('.', '..'))) continue;
    //do whatever process you want to do here:
    //you might want to write another log entry, too:
    file_put_contents($logdir."current-log", "processing file {$file}\n", FILE_APPEND);
}
closedir($dirh);
file_put_contents($logdir."current-log", "process finished at ".date("Y-m-d H:i:s")."\n", FILE_APPEND);
//process is finished, delete current-state:
unlink($logdir."current-state");
Next, in your web-facing PHP file, you can add a snippet to, say, the admin page, or the footer, or whatever page you want, to see the progress:
<?php
if (file_exists($logdir."current-state")) {
    echo "<strong>there is a background process running.</strong>";
} else {
    echo "<strong>no background process running.</strong>";
}

I would suggest using a plugin.
class Postpone extends Zend_Controller_Plugin_Abstract
{
    private $tail;
    private $callback;

    function __construct($callback = array())
    {
        $this->callback = $callback;
    }

    public function setRequest(Zend_Controller_Request_Abstract $request)
    {
        /*
         * We use the layout, which essentially contains some html and a placeholder for the action output.
         * We put a marker into this placeholder in order to figure out "the tail" -- the part of the layout that goes after the placeholder.
         */
        $mark = '---cut-here--';
        $layout = $this->getLayout();
        $layout->content = $mark;
        /*
         * Now we have it.
         */
        $this->tail = preg_replace("/.*$mark/s", '', $layout->render());
    }

    public function postDispatch(Zend_Controller_Request_Abstract $request)
    {
        $response = $this->getResponse();
        $response->sendHeaders();
        /*
         * The layout generates its output to the default section of the response.
         * This output includes "the tail".
         * We don't need this tail shown right now, because we have a callback to run.
         * So we remove it here for a while, but we'll show it later.
         */
        echo substr($this->getResponse()->getBody('default'), 0, -strlen($this->tail));
        /*
         * Since we have just echoed the result, we don't need it in the response. Do we?
         */
        Zend_Controller_Front::getInstance()->returnResponse(true);
        $response->clearBody();
        /*
         * Now to business.
         * We execute that calculation-intensive callback.
         */
        if (!empty($this->callback) && is_callable($this->callback)) {
            call_user_func($this->callback);
        }
        /*
         * We sure don't want to leave the tail behind.
         * Output it so the html looks consistent.
         */
        echo $this->tail;
    }

    /**
     * Returns the layout object
     */
    function getLayout()
    {
        $layout_plugin = Zend_Controller_Front::getInstance()->getPlugin('Zend_Layout_Controller_Plugin_Layout');
        return $layout = $layout_plugin->getLayout();
    }
}
class IndexController extends Zend_Controller_Action
{
    /*
     * This is a calculation-intensive action
     */
    public function indexAction()
    {
        /*
         * Zend_Layout in its current implementation accumulates the whole action output inside itself.
         * This fact hampers our intention to output the result gradually.
         * What we do here is defer execution of our intensive calculation, in the form of a callback, into the Postpone plugin.
         * The scenario is:
         * 1. Application starts.
         * 2. Layout starts.
         * 3. Action gets executed (except the callback) and its output is collected by the layout.
         * 4. Layout output goes to the response.
         * 5. Postpone::postDispatch outputs the first part of the response (without the tail).
         * 6. Postpone::postDispatch calls the callback. Its output goes straight to the browser.
         * 7. Postpone::postDispatch prints the tail.
         */
        $this->getFrontController()
             ->registerPlugin(new Postpone(function ()
             {
                 /*
                  * A calculation imitation.
                  * Put your actual calculations here.
                  */
                 echo str_repeat(" ", 5000);
                 foreach (range(1, 500) as $x) {
                     echo "<p>$x</p><br />\n";
                     usleep(61500);
                     flush();
                 }
             }), 1000);
    }
}

Related

What's wrong with my concurrent programming logic?

I wrote a web spider to spider pages concurrently. For each link that the spider finds, I want to fork off a new child that starts the process all over again.
I don't want to overload the target server so I created a static array that all objects can access. Each child can add their PID to the array, and either parent or child should check the array to see if $maxChildren have been met, and if so, patiently wait until any child finishes.
As you see, I have $maxChildren set to 3. I am expecting to see 3 simultaneous processes at any given time. However, that's not the case. The linux top command shows 12 to 30 processes at any given time. In concurrent programming, how can I regulate the number of simultaneous processes? My logic is currently inspired by how Apache handles its max children, but I'm not exactly sure how that works.
As pointed out in one of the answers, globally accessing the static variable brings up issues with race conditions. To deal with this, the $children array takes the unique $PID of the process as both the key and its value, thereby creating a unique value. My thinking is that since any object can only deal with one $children[$pid] value, locking is not necessary. Is this not true? Is there a chance that two processes could try to unset or add the same value at some point?
private static $children = array();
private $maxChildren = 3;

public function concurrentSpider($url) {
    // STEP 1:
    // Download the $url
    $pageData = http_get($url, $ref = '');
    if (!$this->checkIfSaved($url)) {
        $this->save_link_to_db($url, $pageData);
    }
    // STEP 2:
    // extract all hyperlinks from this url's page data
    $linksOnThisPage = $this->harvest_links($url, $pageData);
    // STEP 3:
    // Check the links array from STEP 2 to see if this page has
    // already been saved or is excluded because of any other
    // logic from the excluded_link() function
    $filteredLinks = $this->filterLinks($linksOnThisPage);
    shuffle($filteredLinks);
    // STEP 4: loop through each of the links and
    // repeat the process
    foreach ($filteredLinks as $filteredLink) {
        $pid = pcntl_fork();
        switch ($pid) {
            case -1:
                print "Could not fork!\n";
                exit(1);
            case 0:
                if ($this->checkIfSaved($filteredLink)) {
                    exit();
                }
                //$pid = getmypid();
                print "In child with PID: " . getmypid() . " processing $filteredLink \n";
                $var[$pid]->concurrentSpider($filteredLink);
                sleep(2);
                exit(1);
            default:
                // Add an element to the children array
                self::$children[$pid] = $pid;
                // If the maximum number of children has been
                // achieved, wait until one or more return
                // before continuing.
                while (count(self::$children) >= $this->maxChildren) {
                    //print count(self::$children) . " children \n";
                    $pid = pcntl_waitpid(-1, $status);
                    unset(self::$children[$pid]);
                }
        }
    }
}
This is written in PHP. I know that the pcntl_waitpid function with argument of -1 waits for any child to complete regardless of the parent (http://php.net/manual/en/function.pcntl-waitpid.php).
What's wrong with my logic and how can I correct it so that only $maxChildren processes are running simultaneously? I'm also open to improving the logic in general if you have suggestions.
First thing to note: if this is truly a global being shared among multiple threads, it's possible that multiple threads are adding to it at once and you're running afoul of a race condition. You need some sort of concurrency control to ensure that only one process is accessing your global array at once.
Also, try the simple debugging trick of having each process write out (to the console or to a file) its PID and the full contents of the global array each time a new spider is forked. It will help you to check your assumptions (which are plainly wrong at some point) and figure out what's going wrong.
EDIT: (In response to the comments)
I'm not a PHP developer, but if I had to guess, based on the fact that you're using an OS tool that counts OS-level processes, I'd guess that your fork is spawning multiple processes, but your static array is global within the current process. Implementing system-wide shared memory is a lot more complicated!
If you just want to count something and ensure that instances of a shared resource don't grow out of control, look into semaphores, and see if you can find a way in PHP to create a named semaphore object that can be shared between multiple instances of your spider.
Use a real programming language ;)
Step 1 is kind of bad: why are you downloading the page if it might already be in the db? Put the download inside the if, and see if you can put a mutex around it. Maybe do something in SQL to imitate one.
I hope harvest_links uses a proper HTML processor with CSS selector support (I like Fizzler for .NET). I guess a regular expression would be fine if it's just to get links, but it is possible to mess up.
I see step 4 and I don't think it's bad, but personally I'd do it a different way.
I'd have something like step one insert url, page, and a flag into a db. Then I'd have another process (or the same one) ask the db for unprocessed pages and set the flag to one value if it errors and another if it succeeds. This way, if something fails or the process exits (shutdown, crash, power out, etc.), it can pick up again easily and doesn't need to scan every page to find where it left off. It just asks the database for the next link and redoes what it didn't finish.
PHP doesn't support multithreading, therefore it doesn't support mutexes or any other synchronization methods. As others have said in their answers, this will lead to a race condition.
You'll have to write a wrapper in C or bash. That way, the PHP script can submit targets to the wrapper, and the wrapper will handle scheduling.
Another approach is to rewrite your spider in Python or Ruby, both of which support multithreading. That will eliminate the need for interprocess communication.
Edit: On second thought, the best way is to write the wrapper in Python or Ruby and reuse your existing PHP code as a black box. That's a compromise of the solutions above.
If the spider is for practical purposes, you might want to google "curl multithread"
cURL Multi Threading with PHP
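For reference, the curl_multi approach those searches point to boils down to something like this (a minimal sketch; the URLs are placeholders): several transfers run concurrently in one PHP process, with no forking involved.
<?php
$urls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
// run all transfers until every handle has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $url => $ch) {
    $pageData = curl_multi_getcontent($ch); // process/save $pageData here
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);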

How can I run a php script exactly once - No sessions

I have the following question: how can I run a php script only once? Before people start to reply that this is indeed a similar or duplicate question, please continue reading...
The situation is as follows, I'm currently writing my own MVC Framework and I've come up with a module based system so I can easily add new functionality to my framework. In order to do so, I created a /ROOT/modules directory in which one could add the new modules.
So as you can imagine, the script needs to read the directory, read all the PHP files, parse them and then be able to execute the new functionality; however, it has to do this for every browser request. That makes this task about O(nAmountOfRequests * nAmountOfModules), which is rather big on websites with a large number of user requests every second.
Then I figured, what if I introduce a session variable like $_SESSION['modulesLoaded'] and simply check whether it's set or not. This would reduce the load to O(nUniqueAmountOfRequests * nAmountOfModules), but this is still a large Big O if the only thing I want to do is read the directory once.
What I have now is the following:
/** Load the modules */
require_once(ROOT . DIRECTORY_SEPARATOR . 'modules' . DIRECTORY_SEPARATOR . 'module_bootloader.php');
Which exists of the following code:
<?php
//TODO: Make sure that the foreach only executes once for all the requests instead of every request.
if (!array_key_exists('modulesLoaded', $_SESSION)) {
    foreach (glob('*.php') as $module) {
        require_once($module);
    }
    $_SESSION['modulesLoaded'] = '1';
}
So now the question: is there a solution, like a superglobal variable, that I can access and that exists for all requests, so that instead of the previous Big Os I can get a Big O that consists only of nAmountOfModules? Or is there another way of easily reading the module files only once?
Something like:
if (isFirstRequest) {
    foreach (glob('*.php') as $module) {
        require_once($module);
    }
}
At the most basic form, if you want to run it once, and only once (per installation, not per user), have your intensive script change something on the server state (add a file, change a file, change a record in a database), then check against that every time a request to run it is issued.
If you find a match, it would mean the script was already run, and you can continue with the process without having to run it again.
When called, lock a file; at the end of the script, delete the file. That way the work is only done once, and since the lock file is no longer needed, it vanishes into nirvana.
This naturally works the other way round, too:
<?php
$checkfile = __DIR__ . '/.checkfile';
clearstatcache(false, $checkfile);
if (is_file($checkfile)) {
    return; // script did run already
}
touch($checkfile);
// run the rest of your script.
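The locking variant described first ("lock the file, delete it at the end") could look roughly like this with flock() (a sketch; the lock-file path is illustrative):
<?php
$lockfile = __DIR__ . '/.lockfile';
$fp = fopen($lockfile, 'c');
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    exit; // another instance already holds the lock
}
// ...run the long job here...
flock($fp, LOCK_UN);
fclose($fp);
unlink($lockfile); // not needed any longer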
Just cache the array to a file and, when you upload new modules, delete the file. The cache will have to rebuild itself and then you're all set again.
// If the $cache file does not exist or unserialize fails, rebuild it and save it
if (!is_file($cache) or (($cached = unserialize(file_get_contents($cache))) === false)) {
    // rebuild your array here into $cached
    $cached = call_user_func(function () {
        // rebuild your array here and return it
    });
    // store the $cached data into the $cache file (serialized, since we unserialize it above)
    file_put_contents($cache, serialize($cached), LOCK_EX);
}
// Now you have a $cache file that holds your cached data
// Keep using the $cached variable now, as it should hold your data
This should do it.
PS: I'm currently rewriting my own framework and do the same thing to store such data. You could also use a SQLite DB to store all such data your framework needs but make sure to test performance and see if it fits your needs. With proper indexes, SQLite is fast.
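If you go the SQLite route, a minimal PDO sketch might look like this (the file path and table name are illustrative assumptions, not part of this answer):
<?php
$db = new PDO('sqlite:' . __DIR__ . '/framework-cache.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v TEXT)');
// store the rebuilt module list
$stmt = $db->prepare('REPLACE INTO cache (k, v) VALUES (?, ?)');
$stmt->execute(array('modules', serialize($modules)));
// load it back on later requests
$stmt = $db->prepare('SELECT v FROM cache WHERE k = ?');
$stmt->execute(array('modules'));
$modules = ($row = $stmt->fetchColumn()) ? unserialize($row) : null;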

Multiple "agents" handling a single array

Apologies if this has been covered before - I did my searching but possibly may not know the correct terms to have used.
This process is handled with PHP.
Here's the situation:
I have a large array of file names. The script I have opens these files and enters their content into a database. Processing these files one at a time takes over 24 hours, and these files are updated on a daily basis.
Breaking the single large array into four smaller arrays and running concurrent processes finishes the job before the 24 hour window elapses, but sometimes one or two processes will finish hours before the others because file sizes vary on a daily basis.
Much like people who stock retail shelves (who else has worked that nightmare before?) pitch in to help out with what's left after finishing their own tasks, I'd like to have a script in place where these "agents" do the same.
Here's some basics of what I have figured out - it could be wrong, and I'm not too proud to protest if I am :-)
$files = array('file1','file2','file3','file4','file5');
//etc... on to over 4k elements
while ($file = array_pop($files)) {
    //Something in here... I have no idea what.
}
Ideas? Something like four function calls or four loops within that overarching 'while' has crossed my mind, but I'm pretty sure it's going to wait on executing subsequent calls until the previous one(s) finish.
Any help is appreciated. I'm seriously stuck on this one!
Thanks!
A database-backed message queue seems the obvious solution but I think that's overkill in this case. I would simply put the files to be processed into a single dedicated queue directory, then use the DirectoryIterator class to scan it. Something like this:
while (true) {
    look in the queue directory for a file
    if you don't find one, exit the script, all processing is done
    if you find one, rename it or move it to a work directory
    if the rename/move command succeeded, process the file
    if the rename/move command failed, one of the other threads got it first
}
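Translated into PHP, that pseudo-code might look roughly like this (a sketch; the directory paths are placeholders):
<?php
$queueDir = '/path/to/queue';
$workDir  = '/path/to/work';
while (true) {
    $file = null;
    foreach (new DirectoryIterator($queueDir) as $entry) {
        if ($entry->isFile()) {
            $file = $entry->getFilename();
            break;
        }
    }
    if ($file === null) {
        break; // queue is empty, all processing is done
    }
    // rename() is atomic on the same filesystem, so only one worker wins the file
    if (@rename("$queueDir/$file", "$workDir/$file")) {
        // process "$workDir/$file" here
    }
    // if the rename failed, another worker claimed it first; just loop again
}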
Edit:
Regarding launching the workers, you could use a simple shell script to spawn the PHP processes in the background:
NUM_WORKERS=5
for WORKER in $(seq 1 ${NUM_WORKERS})
do
echo "starting worker ${WORKER}"
php -f /path/to/my/process.php &
done
Then, create a cron entry to run this launcher, for example, at midnight:
0 0 * * * /path/to/launcher.sh
You want what's called a "message queue". Something like beanstalkd
You'll basically create a list of messages that include your individual filenames. You'll then create a set of processors to process them. Each processor will handle one file then go back to the queue to see if there are more messages/files waiting to be processed.
EDIT:
Here's an analogy to help explain message queues. Your first idea is like a human manager taking a stack of files, dividing them into four piles and then handing each of his four employees a pile to process. A message queue is more like this: the manager puts all the files on a table and tells each employee to take a single file from the table and process it. He tells them when they're done with the first file to keep taking files until there are no more files on the table. When all the files are done, the employees can go home.
One employee might end up with really large files and only handle a few, while another employee might get smaller files and handle many. It doesn't matter how many each employee handles, they'll all keep working until the table is empty.
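In code, the producer/worker split with beanstalkd might look roughly like this, assuming the pda/pheanstalk client library (its API differs between major versions; this follows the older `new Pheanstalk(...)` style) and an illustrative tube name:
<?php
// producer: put every filename on the queue (the "table" in the analogy)
$queue = new Pheanstalk\Pheanstalk('127.0.0.1');
foreach ($files as $file) {
    $queue->useTube('mp3-files')->put($file);
}
<?php
// worker: run several copies; each keeps taking files until the tube is empty
$queue = new Pheanstalk\Pheanstalk('127.0.0.1');
while ($job = $queue->watch('mp3-files')->ignore('default')->reserve(5)) {
    $file = $job->getData();
    // ...process $file here...
    $queue->delete($job); // done, remove it from the queue
}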
I would have a socket server master script that hands out file paths to x number of slave scripts, until there are no files left to process. This way, all the slave scripts will keep running, and you can hand out file paths dynamically as they are requested.
Something like this:
master.php
<?php
// load the array of files to process (however you do this)
$fileList = file('filelist.txt');

// Create a listening socket on localhost
$serverSocket = stream_socket_server('tcp://127.0.0.1:7878');
$sockets = array($serverSocket);
$clients = array();

// Loop while there are still files to process
while (count($fileList)) {
    // Run a select() call on the existing sockets' read buffers
    // Skip to next iteration if no sockets are waiting for handling
    if (stream_select($read = $sockets, $write = NULL, $except = NULL, 1) < 1) {
        continue;
    }
    // Loop sockets with data to read
    foreach ($read as $socket) {
        if ($socket == $serverSocket) {
            // Accept new clients
            $sockets[] = $clients[] = stream_socket_accept($serverSocket);
        } else if (trim(fgets($socket)) == 'next') {
            // Hand out a new file path to the client
            fwrite($socket, array_shift($fileList)."\n");
            if (!count($fileList)) {
                break 2;
            }
        }
    }
}

// When we're done, disconnect the clients
foreach ($clients as $socket) {
    @fclose($socket);
}
// ...and close the listen socket
@fclose($serverSocket);
slave.php
<?php
$socket = fsockopen('127.0.0.1', 7878);
while (!feof($socket)) {
    // Get a new file path from the master
    fwrite($socket, "next\n");
    $path = trim(fgets($socket));
    if (is_file($path)) {
        // Process the file at $path here
    }
}
You then just need to start master.php, and once it is running you can start as many instances of slave.php as you want; they will all keep running until there are no more files to process.
Obviously, this has no error handling, but it should provide a basic framework to get you started. This relies on blocking function calls (stream_select() and fgets()) to avoid a race condition - this may or may not be sufficient for your purposes.

PHP Singleton class for all requests

I have a simple problem. I use PHP for the server part and have HTML output. My site shows the status of another server. So the flow is:
Browser user goes on www.example.com/status
Browser contacts www.example.com/status
PHP server receives the request and asks for the status on www.statusserver.com/status
PHP receives the data, transforms it into readable HTML output and sends it back to the client
Browser user can see the status.
Now, I've created a singleton class in PHP which accesses the status server only every 8 seconds. So it updates the status every 8 seconds. If a user requests an update in between, the server returns the locally (on www.example.com) stored status.
That's nice, isn't it? But then I did an easy test and started 5 browser windows to see if it works. Here it comes: the PHP server created a singleton class for each request. So now 5 clients are requesting the status from the status server every 8 seconds. This means every 8 seconds I have 5 calls to the status server instead of one!
Isn't there a possibility to provide only one instance to all users within an Apache server? That would solve the problem in case 1000 users are connecting to www.example.com/status....
thx for any hints
=============================
EDIT:
I already use caching on the hard drive:
public function getFile($filename)
{
    $diff = (time() - filemtime($filename));
    //echo "diff:$diff<br/>";
    if ($diff > 8) {
        //echo 'greater than 8<br/>';
        self::updateFile($filename);
    }
    if (is_readable($filename)) {
        try {
            $returnValue = @ImageCreateFromPNG($filename);
            if ($returnValue == '') {
                sleep(1);
                return self::getFile($filename);
            } else {
                return $returnValue;
            }
        } catch (Exception $e) {
            sleep(1);
            return self::getFile($filename);
        }
    } else {
        sleep(1);
        return self::getFile($filename);
    }
}
This is the call in the singleton. I ask for a file and save it to the hard drive, but all the requests call it at the same time and start requesting the status server.
I think the only solution would be a standalone application which updates the file every 8 seconds. All requests should just read the file and no longer be able to update it.
This standalone could be a Perl script or something similar...
PHP requests are handled by different processes and each of them has its own state; there isn't any resident process like in other web development frameworks. You should handle that behavior directly in your class, for instance with some caching.
The method which queries the server status should have this logic:
public function getStatus() {
    if (!$status = $cache->load()) {
        // cache miss
        $status = // do your query here
        $cache->save($status); // store the result in cache
    }
    return $status;
}
In this way only one request out of X will fetch the real status. The X value depends on your cache configuration.
Some cache libraries you can use:
APC
Memcached
Zend_Cache which is just a wrapper for actual caching engines
Or you can store the result in a plain text file, check the mtime of the file on every request, and rewrite the file if more than xx seconds have passed.
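Since the question is ZF-based, the getStatus() pattern above could, for example, be backed by Zend_Cache (a sketch; the lifetime, cache dir and cache id are illustrative):
$cache = Zend_Cache::factory(
    'Core',
    'File',
    array('lifetime' => 8, 'automatic_serialization' => true),
    array('cache_dir' => '/tmp/')
);
if (($status = $cache->load('remote_status')) === false) {
    // cache miss: only this request hits the status server
    $status = file_get_contents('http://www.statusserver.com/status');
    $cache->save($status, 'remote_status');
}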
Update
Your code is pretty strange, why all those sleep calls? Why a try/catch block when ImageCreateFromPNG does not throw?
You're asking a different question. Since PHP is not an application server and cannot store state across processes, your approach is correct. I suggest you use APC (it uses shared memory, so it would be at least 10x faster than reading a file) to share status across different processes. With this approach your code could become:
public function getFile($filename)
{
    $latest_update = apc_fetch('latest_update');
    if (false == $latest_update) {
        // cache expired or first request
        apc_store('latest_update', time(), 8); // 8 is the ttl in seconds
        // fetch file here and save on local storage
        self::updateFile($filename);
    }
    // here you can process the file
    return $your_processed_file;
}
With this approach the code in the if branch will be executed by two different processes only if a process is blocked just after the if line, which should not happen because it is almost an atomic operation.
Furthermore, if you want to guarantee that, you could use something like semaphores, but that would be an oversized solution for this kind of requirement.
Finally, IMHO 8 seconds is a small interval; I'd use something bigger, at least 30 seconds, but this depends on your requirements.
As far as I know it is not possible in PHP. However, you surely can serialize and cache the object instance.
Check out http://php.net/manual/en/language.oop5.serialization.php
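A minimal illustration of that idea (the file path is a placeholder):
// store the instance once it holds fresh data
file_put_contents('/tmp/status-singleton.ser', serialize($statusInstance), LOCK_EX);
// later requests rebuild it from the cache instead of hitting the status server
$statusInstance = unserialize(file_get_contents('/tmp/status-singleton.ser'));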

Speeding up a PHP App

I have a list of data that needs to be processed. The way it works right now is this:
A user clicks a process button.
The PHP code takes the first item that needs to be processed, takes 15-25 secs to process it, moves on to the next item, and so on.
This takes way too long. What I'd like instead is that:
The user clicks the process button.
A PHP script takes the first item and starts to process it.
Simultaneously another instance of the script takes the next item and processes it.
And so on, so around 5-6 of the items are being processed simultaneously and we get 6 items processed in 15-25 secs instead of just one.
Is something like this possible?
I was thinking that I use CRON to launch an instance of the script every second. All items that need to be processed will be flagged as such in the MySQL database, so whenever an instance is launched through CRON, it will simply take the next item flagged to be processed and remove the flag.
Thoughts?
Edit: To clarify something, each 'item' is stored in a MySQL database table as a separate row. Whenever processing starts on an item, it is flagged as being processed in the db, hence each new instance will simply grab the next row which is not being processed and process it. Hence I don't have to supply the items as command line arguments.
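For illustration, one way each cron-launched instance could claim the next unprocessed row without two instances grabbing the same item (a sketch; the table and column names are assumptions, not from the question):
<?php
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$workerId = getmypid() . '-' . uniqid();
// atomically flag one pending row as ours; only one instance can win a given row
$claim = $db->prepare(
    "UPDATE items SET status = 'processing', worker = :w
     WHERE status = 'pending' ORDER BY id LIMIT 1"
);
$claim->execute(array(':w' => $workerId));
if ($claim->rowCount() === 1) {
    $item = $db->prepare("SELECT * FROM items WHERE worker = :w AND status = 'processing'");
    $item->execute(array(':w' => $workerId));
    $row = $item->fetch(PDO::FETCH_ASSOC);
    // ...the 15-25 seconds of processing for $row goes here...
    $db->prepare("UPDATE items SET status = 'done' WHERE id = :id")
       ->execute(array(':id' => $row['id']));
}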
Here's one solution, not the greatest, but will work fine on Linux:
Split the processing PHP into a separate CLI script in which:
The command line inputs include `$id` and `$item`
The script writes its PID to a file in `/tmp/$id.$item.pid`
The script echos results as XML or something that can be read into PHP to stdout
When finished the script deletes the `/tmp/$id.$item.pid` file
Your master script (presumably on your webserver) would do:
`exec("nohup php myprocessing.php $id $item > /tmp/$id.$item.xml");` for each item
Poll the `/tmp/$id.$item.pid` files until all are deleted (sleep/check poll is enough)
If they are never deleted kill all the processing scripts and report failure
If successful read the from `/tmp/$id.$item.xml` for format/output to user
Delete the XML files if you don't want to cache for later use
A backgrounded nohup started application will run independent of the script that started it.
This interested me sufficiently that I decided to write a POC.
test.php
<?php
$dir = realpath(dirname(__FILE__));
$start = time();

// Time in seconds after which we give up and kill everything
$timeout = 25;
// The unique identifier for the request
$id = uniqid();
// Our "items" which would be supplied by the user
$items = array("foo", "bar", "0xdeadbeef");

// We exec a nohup command that is backgrounded which returns immediately
foreach ($items as $item) {
    exec("nohup php proc.php $id $item > $dir/proc.$id.$item.out &");
}

echo "<pre>";
// Run until timeout or all processing has finished
while (time() - $start < $timeout)
{
    echo (time() - $start), " seconds\n";
    clearstatcache(); // Required since PHP will cache for file_exists
    $running = array();
    foreach ($items as $item)
    {
        // If the pid file still exists the process is still running
        if (file_exists("$dir/proc.$id.$item.pid")) {
            $running[] = $item;
        }
    }
    if (empty($running)) break;
    echo implode(',', $running), " running\n";
    flush();
    sleep(1);
}

// Clean up if we timed out
if (!empty($running)) {
    clearstatcache();
    foreach ($items as $item) {
        // Kill process of anything still running (i.e. that has a pid file)
        if (file_exists("$dir/proc.$id.$item.pid")
            && $pid = file_get_contents("$dir/proc.$id.$item.pid")) {
            posix_kill($pid, 9);
            unlink("$dir/proc.$id.$item.pid");
            // Would want to log this in the real world
            echo "Failed to process: ", $item, " pid ", $pid, "\n";
        }
        // delete the useless data
        unlink("$dir/proc.$id.$item.out");
    }
} else {
    echo "Successfully processed all items in ", time() - $start, " seconds.\n";
    foreach ($items as $item) {
        // Grab the processed data and delete the file
        echo(file_get_contents("$dir/proc.$id.$item.out"));
        unlink("$dir/proc.$id.$item.out");
    }
}
echo "</pre>";
?>
proc.php
<?php
$dir = realpath(dirname(__FILE__));
$id = $argv[1];
$item = $argv[2];

// Write out our pid file
file_put_contents("$dir/proc.$id.$item.pid", posix_getpid());

for ($i = 0; $i < 80; ++$i)
{
    echo $item, ':', $i, "\n";
    usleep(250000);
}

// Remove our pid file to say we're done processing
unlink("$dir/proc.$id.$item.pid");
?>
Put test.php and proc.php in the same folder of your server, load test.php and enjoy.
You will of course need nohup (unix) and PHP cli to get this to work.
Lots of fun, I may find a use for it later.
Use an external work queue like Beanstalkd, which your PHP script writes a bunch of jobs to. You have as many worker processes pulling jobs from beanstalkd and processing them as fast as possible. You can spin up as many workers as you have memory/CPU for. Your job body should contain as little information as possible, maybe just some IDs which you hit the DB with. beanstalkd has a slew of client APIs and itself has a very basic API; think memcached.
We use beanstalkd to process all of our background jobs, I love it. Easy to use, its very fast.
There is no multithreading in PHP, however you can use fork.
php.net:pcntl-fork
Or you could execute a system() command and start another process which is multithreaded.
Can you implement threading in JavaScript on the client side? It seems to me I've seen a JavaScript library (from Google perhaps?) that implements it. Google it and I'm sure you'll find something. I've never done it, but I know it's possible. Anyway, your client-side JavaScript could activate (via Ajax) a PHP script once for each item in separate threads. That might be easier than trying to do it all on the server side.
-don
If you are running a high traffic PHP server you are INSANE if you do not use Alternative PHP Cache: http://php.net/manual/en/book.apc.php . You do not have to make code modifications to run APC.
Another useful technique that can work along with APC is using the Smarty template system which allows you to cache output so that pages do not have to be rebuilt.
To solve this problem, I've used two different products; Gearman and RabbitMQ.
The benefit of putting your jobs into some sort of queuing software like Gearman or Rabbit is that you have multiple machines, they can all participate in processing items off the queue(s).
Gearman is easier to set up, so I'd suggest poking around with it a bit first. If you find you need something more heavy-duty with queue robustness, look into RabbitMQ.
http://www.danga.com/gearman/
http://pear.php.net/package/Net_Gearman (PEAR library)
You can use pcntl_fork() and family to fork a process - however you may need something like IPC to communicate back to the parent process that the child process (the one you fork'd) is finished.
You could have them write to shared memory, like via memcache or a DB.
You could also have the child process write the completed data to a file that the parent process keeps checking - as each child process completes, the file is created/written to/updated, and the parent process can grab them, one at a time, and throw them back to the caller/client.
The parent's job is to control the queue, to make sure the same data isn't processed twice, and also to sanity-check the children (better to kill that runaway process and start over... etc.)
Something else to keep in mind - on Windows platforms you are going to be severely limited - I don't even think you have access to pcntl_* unless you compiled PHP with support for it.
Also, can you cache the data once it's been processed, or is it unique data every time? That would surely speed things up..?
