File parsing with Beanstalkd Queue - php

I am currently rewriting a file uploader. The parsing scripts for the different data types are existing Perl scripts, while the uploader itself is written in PHP. As it stands, the uploader allows a single file upload only; once the file is on the server, it calls the Perl script that matches the uploaded file's data type. We have over 20 data types.
What I have done so far is write a new system that allows multiple file uploads. It first lets you validate your attributes before upload, compresses the files using zipjs, uploads the zipped archive, uncompresses it on the server, and then, for each file, calls the appropriate parser.
I am at the part where, for each file, I need to put the parser call into the queue. I cannot run multiple parsers at once. A rough sketch is below.
// using the pheanstalk library
foreach ($files as $file) {
    // job payload: the parser command the worker should run for this file
    $job = "exec('location/to/file/parser.pl $file');";
    $this->pheanstalk->useTube('testtube')->put($job);
}
Depending on the file, parsing may take 2 minutes or 20 minutes. When I put the jobs on the queue, I need to make sure that the parser for file2 fires only after the parser for file1 finishes. How can I accomplish that? Thanks.

Beanstalk doesn't have the notion of dependencies between jobs. You seem to have two jobs:
Job A: Parse file 1
Job B: Parse file 2
If you need job B to run only after job A, the most straightforward way to do this is for Job A to create Job B as its last action.
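One way to implement that chaining with Pheanstalk is sketched below; it assumes $pheanstalk is an already connected Pheanstalk instance, and the JSON payload format and the runParser() helper are made up for illustration, not part of the original setup.
// Worker sketch: parse the current file, then enqueue the next file as the very last action.
$job = $pheanstalk->watch('testtube')->reserve();
$payload = json_decode($job->getData(), true);   // e.g. {"file":"file1","next":"file2"}

runParser($payload['file']);                     // blocks until this file's parser has finished

if (!empty($payload['next'])) {
    // Job B is only created once Job A has completed successfully.
    $pheanstalk->useTube('testtube')->put(json_encode(['file' => $payload['next']]));
}

$pheanstalk->delete($job);                       // remove the finished job from the queue
That way only one parser runs at a time, and the ordering falls out of the chaining rather than out of anything beanstalkd does for you.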

I have achieved what I wanted, which was to request more time if the parser takes longer than a minute. The worker is a PHP script, and I can get the process id when I execute the parser executable via exec(). I am currently using the code snippet below in my worker.
$job = $pheanstalk->watch( $tubeName )->reserve();
// do some more stuff here ... then
// while the parser is running on the server
while( file_exists( "/proc/$pid" ) )
{
    // make sure the job is still reserved on the queue server
    if( $job ) {
        // get the time left on the queue server for the job
        $jobStats = $pheanstalk->statsJob( $job );
        // when there is not enough time, request more
        if( $jobStats['time-left'] < 5 ){
            echo "requested more time for the job at ".$jobStats['time-left']." secs left \n";
            $pheanstalk->touch( $job );
        }
    }
    // avoid hammering the queue server while waiting for the parser
    sleep( 1 );
}
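Once the while loop exits (i.e. the parser's /proc entry is gone), you would normally also delete the job, e.g. with $pheanstalk->delete( $job );, so that beanstalkd does not hand it back out to another worker once its time-to-run expires.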

Related

Concurrent writing to zip file

I have a process written in PHP. This process gets a file from the internet and puts it inside a zip file. The target zip file is chosen by an algorithm: there are 4096 zip files, and the target is selected based on a hash of the processed URL.
I have another program that launches HTTP requests, so I can run the script concurrently (around 110 processes).
My question is simple. Since the processes are scheduled pseudorandomly, two of them can easily try to add files to the same zip file at the same moment.
Is that possible? Will the file get corrupted if two processes try to add files at the same time?
Locking the file, or something like that, would be a possible solution.
I was thinking of using semaphores, but from what I have read, PHP semaphores don't work under Windows.
I have seen this possible solution:
if ( !function_exists('sem_get') ) {
    // fallback for platforms without SysV semaphores (e.g. Windows): emulate them with flock()
    function sem_get($key) { return fopen(__FILE__.'.sem.'.$key, 'w+'); }
    function sem_acquire($sem_id) { return flock($sem_id, LOCK_EX); }
    function sem_release($sem_id) { return flock($sem_id, LOCK_UN); }
}
Anyway, the question is whether it is safe to add files to a zip file from two or more different PHP processes at the same time.
Short answer: No! The zip algorithm analyses and compresses one stream at a time.
This is tough under Windows. It's far from easy in Linux! I would be tempted to create a db table with a unique index, and use that index number to determine a filename, or at least flag that a file is being written to.
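If you stay with the file-locking idea from the question, a minimal sketch would be to take an exclusive flock() on a per-archive lock file around the whole ZipArchive update; the lock-file naming and the $url / $downloadedContent variables below are illustrative, not from the original post.
// Sketch: serialise writers per target archive with an exclusive lock file.
$zipPath  = '/data/archives/'.hexdec(substr(md5($url), 0, 3)).'.zip'; // one of 4096 archives
$lockPath = $zipPath.'.lock';

$lock = fopen($lockPath, 'c');          // create the lock file if it does not exist yet
if (flock($lock, LOCK_EX)) {            // blocks until no other process holds the lock
    $zip = new ZipArchive();
    if ($zip->open($zipPath, ZipArchive::CREATE) === true) {
        $zip->addFromString(basename($url), $downloadedContent);
        $zip->close();                  // the archive is rewritten here, still under the lock
    }
    flock($lock, LOCK_UN);
}
fclose($lock);
Since every writer for a given archive has to acquire the same lock first, only one process rewrites that zip file at a time, which avoids the corruption the question worries about.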

Download a large XML file from an external source in the background, with the ability to resume download if incomplete

Some background information
The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, and then have the download resume the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay up to date, so perhaps a bit more of the file each time if that is possible. I have researched SimpleXML, XMLReader and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, therefore I would like a different approach, namely downloading it as described above.
If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same directly from the external server, limited to 50k iterations, it takes 15 seconds to read just that small part, so it seems it is not possible to parse it directly from that server.
Possible solutions
I think it is best to use cURL. But then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or for different solutions on how I can parse a 50 MB external XML file into a MySQL database.
I suspect a cron job every hour would be the best solution, but I am not sure how well that would be supported by web hosting companies, and I have no clue how to set something like that up. But if that is the best solution, and the majority thinks so, I will have to do my research in that area too.
If a Java applet/JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.
Summary
What's the best solution for downloading parts of a file in the background, resuming the download each time my website is loaded, until it's completed?
If the above solution would be moronic to even try, what language/software would you use to achieve the same thing (download a large file every hour)?
Thanks in advance for all answers, and sorry for the long story/question.
Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for the files I already have, generates a list of the possible downloads for the last four days, then downloads the next XML file in line.
<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time - 345600;
echo 'Downloading: '."\n";
for ($i = $four_days_ago; $i <= $current_time; ) {
    $date->setTimestamp($i);
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H')."_full.xml";
        if (!glob($temp_filename)) {
            $temp_url = 'http://www.external-site-example.com/'.$date->format('Y/m/d/H').".xml";
            echo $temp_filename.' --- '.$temp_url.'<br>'."\n";
            break; // with a break here, this loop only returns the next file you should download
        }
    }
    $i += 3600;
}
if (empty($temp_url)) {
    exit('Nothing to download - everything from the last four days is already here.');
}
set_time_limit(300);
$Start = getTime();
$objInputStream = fopen($temp_url, "rb");
$objTempStream = fopen($temp_filename, "w+b");
stream_copy_to_stream($objInputStream, $objTempStream, (1024*200000));
$End = getTime();
echo '<br>It took '.number_format(($End - $Start), 2).' secs to download "'.$temp_filename.'".';

function getTime() {
    $a = explode(' ', microtime());
    return (double) $a[0] + $a[1];
}
?>
Edit 2: I just wanted to inform you that there is a way to do what I asked, only it wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options; see http://www.google.no/search?q=poormanscron
You need to have a scheduled, offline task (e.g., cronjob). The solution you are pursuing is just plain wrong.
The simplest thing that could possibly work is a php script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.
You could try fopen:
<?php
$handle = fopen("http://www.example.com/test.xml", "rb");
$contents = stream_get_contents($handle);
fclose($handle);
?>
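If you do want the chunked, resumable download the question describes, cURL can request a byte range so that each run appends the next slice to a partial file on disk. A minimal sketch, assuming the remote server honours Range requests (the URL, chunk size and .part naming below are illustrative, not from the original post):
<?php
// Sketch: fetch the next ~0.5 MB of a remote file and append it to a local .part file.
$url       = 'http://www.external-site-example.com/2012/01/01/12.xml'; // illustrative URL
$partFile  = '2012_01_01_12_full.xml.part';
$chunkSize = 512 * 1024;

$offset = file_exists($partFile) ? filesize($partFile) : 0;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, $offset.'-'.($offset + $chunkSize - 1)); // resume where the last run stopped
$data   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($data !== false && ($status == 206 || ($status == 200 && $offset == 0))) {
    file_put_contents($partFile, $data, FILE_APPEND);
}
// when the server returns 416 (range not satisfiable), the file is complete and can be
// renamed and handed to XMLReader as usual
?>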

PHP queue file implementation

For a project I am working on I need a queue which will be too large to hold in normal memory. I had been implementing it as a simple file: the script would read the whole file, take the first few (~100) lines, process them, then write back the updated queue with the new instructions added and the old ones removed. However, the queue has now become too large to hold in memory like this, so I need something different. Preferably, someone can tell me a way to peel off just the first few lines of a file without having to read the rest of the data. I had thought about using a database (probably MySQL, with insert timestamps to sort on), but I would strongly prefer to avoid that for load and bandwidth reasons (several servers would all have to send and receive a lot of data from the DB). The language I'm working in is PHP, but really this question is more about Unix files, I suppose. Any help would be appreciated.
Sucking out the first line of a file is pretty trivial (fopen() followed by an fgets()). Re-writing the file to remove completed jobs would be very painful, especially if you've got multiple concurrent servers working off the same queue file.
One alternative would be to use a separate file for each job. If you have some concurrency-safe method of generating an incrementing ID for these files, then it'd be a simple matter of picking out the file with the lowest id for the oldest job, and generating a new id for each new job. You'd have to figure out some file locking, though, to keep two or more servers from grabbing the same file at the same time.
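A minimal sketch of that file-per-job idea is below; the directory layout, the microtime-based naming scheme and the use of a non-blocking flock() to skip jobs another server already holds are all assumptions for illustration.
<?php
// Sketch: one file per job; the lowest-sorting file name is the oldest job.
$queueDir = '/var/spool/myqueue';

// Enqueue: a microtime-based prefix gives roughly incrementing, sortable names
// (assumption: the uniqid() suffix is enough to avoid collisions between servers).
function enqueueJob($queueDir, $payload) {
    $name = sprintf('%s/%.6f-%s.job', $queueDir, microtime(true), uniqid('', true));
    file_put_contents($name, $payload);
}

// Dequeue: take the oldest file we can lock; a non-blocking flock() skips jobs
// that another server is already processing.
function dequeueJob($queueDir) {
    foreach (glob($queueDir.'/*.job') as $file) {   // glob() returns names in sorted order
        $fp = fopen($file, 'r');
        if ($fp && flock($fp, LOCK_EX | LOCK_NB)) {
            $payload = stream_get_contents($fp);
            unlink($file);                          // remove the job before releasing the lock
            flock($fp, LOCK_UN);
            fclose($fp);
            return $payload;
        }
        if ($fp) {
            fclose($fp);
        }
    }
    return null;                                    // queue is empty, or everything is locked
}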
I had the same problems while I was working on the enqueue/fs transport. I failed to find a way to modify a small portion at the beginning of the file without copying the whole thing into memory and saving it back. It is possible, however, to do that at the end of the file: you can read a portion and then truncate it. That is not really a queue but a stack, so if you rely on message ordering this will not be a solution. In my case, the file is locked while it is being read, and the lock is released afterwards.
This is how you could write messages to a queue file:
<?php
$rawMessage = 'this is your message to put on the queue, as a string';
$queueFile = fopen('/path/to/queue/file', 'a+');
// pad the message with spaces so its length is a multiple of the frame size (64 bytes);
// that makes it easier to read messages back out of the file frame by frame.
// lock file
$rawMessage = str_repeat(' ', 64 - (strlen($rawMessage) % 64)).$rawMessage;
fwrite($queueFile, $rawMessage);
// release lock
This is how you could read messages from a queue file:
<?php
$queueFile = fopen('/path/to/queue/file', 'c+');
// lock file
$frame = readFrame($queueFile, 1);
// chop the frame we just read off the end of the file
ftruncate($queueFile, fstat($queueFile)['size'] - strlen($frame));
rewind($queueFile);
$rawMessage = substr(trim($frame), 1);
// release lock
function readFrame($file, $frameNumber)
{
    $frameSize = 64;
    $offset = $frameNumber * $frameSize;
    fseek($file, -$offset, SEEK_END);
    $frame = fread($file, $frameSize);
    if ('' == $frame) {
        return '';
    }
    // a frame containing the start-of-message marker is complete on its own;
    // otherwise keep reading earlier frames and prepend them.
    if (false !== strpos($frame, '|{')) {
        return $frame;
    }
    return readFrame($file, $frameNumber + 1).$frame;
}
For the locking I'd suggest using the Symfony LockHandler component, or simply using enqueue/fs as it is.

How to schedule PHP script on Job Finish (from CLI)

I am processing a big .gz file using PHP (transferring data from the gz file into MySQL).
It takes about 10 minutes per .gz file.
I have a lot of .gz files to be processed.
After PHP finishes with one file, I have to manually change the PHP script to select another .gz file and then run the script again manually.
I want it to automatically run the next job to process the next file.
The gz files are named 1, 2, 3, 4, 5, and so on.
I could simply make a loop, something like this (process files 1-5):
for ($i = 1; $i <= 5; $i++)
{
    $file = gzfile($i.'.gz');
    // ...gz content processing...
}
However, since the gz files are really big, I cannot do that: with this loop, PHP would process several big gz files in a single script job, which takes a lot of memory.
What I want is that after PHP finishes one job, a new job is started to process the next file.
Maybe it's going to be something like this:
$file = gzfile($_GET['filename'].'.gz');
// ...gz content processing...
Thank You
If you clean up after processing and free all memory using unset(), you could simply wrap the whole script in a foreach (glob(...) as $filename) loop. Like this:
<?php
foreach (glob(...) as $filename) {
    // your script code here
    unset($thisVar, $thatVar, ...);
}
?>
What you should do is:
1. Schedule a cron job to run your PHP script every x minutes.
2. When the script runs, check whether a lock file is in place; if not, create one and start processing the next unprocessed gz file, otherwise abort (a sketch of this check follows below).
3. Wait for the queue to get cleared.
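A minimal sketch of that lock-file check; the lock-file path, the gz directory and the .done marker convention are assumptions for illustration, not from the answer above.
<?php
// Sketch: run from cron every few minutes; only one instance processes files at a time.
$lockFile = '/tmp/gz-import.lock';

$lock = fopen($lockFile, 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit; // another instance is still running, so abort
}

// pick the next unprocessed file: a .gz without a matching .done marker
foreach (glob('/data/gz/*.gz') as $gzFile) {
    if (!file_exists($gzFile.'.done')) {
        $lines = gzfile($gzFile);
        // ...gz content processing / MySQL inserts here...
        touch($gzFile.'.done'); // mark this file as processed
        break;                  // one file per run keeps memory usage low
    }
}

flock($lock, LOCK_UN);
fclose($lock);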
You should call the PHP script with an argument, from a shell script. Here's the documentation on how to use command-line parameters in PHP: http://php.net/manual/en/features.commandline.php
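For example, a sketch of reading the file number from $argv (the script name process_gz.php is made up for illustration):
<?php
// process_gz.php - the shell script calls "php process_gz.php 1", "php process_gz.php 2", ...
if ($argc < 2) {
    fwrite(STDERR, "Usage: php process_gz.php <file-number>\n");
    exit(1);
}

$file = gzfile($argv[1].'.gz');
// ...gz content processing...
Because the shell invokes the PHP binary once per file, each file gets its own process and all memory is released between jobs.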
Alternatively (I can't try it right now), you could give unset($file) a chance after processing each gzip file:
for ($i = 1; $i <= 5; $i++)
{
    $file = gzfile($i.'.gz');
    // ...gz content processing...
    unset($file);
}

Creating files on a time (hourly) basis

I am experimenting with the Twitter streaming API.
I use Phirehose to connect to Twitter and fetch the data, but I am having problems storing it in files for further processing.
Basically, what I want to do is create a file named
date("YmdH").".txt"
for every hour of connection.
Here is what my code looks like right now (not handling the hourly change of files):
public function enqueueStatus($status)
{
    $data = json_decode($status, true);
    if (isset($data['text']) /* more conditions here */) {
        $fp = fopen("/tmp/$time.txt");
        fwrite($fp, $status);
        fclose($fp);
    }
}
Help is as always much appreciated :)
You want the 'append' mode in fopen - this will either append to a file or create it.
if (isset($data['text']) /* more conditions here */) {
    $fp = fopen("/tmp/" . date("YmdH") . ".txt", "a");
    fwrite($fp, $status);
    fclose($fp);
}
From the Phirehose googlecode wiki:
As of Phirehose version 0.2.2 there is an example of a simple "ghetto queue" included in the tarball (see files ghetto-queue-collect.php and ghetto-queue-consume.php) that shows how statuses could be easily collected on to the filesystem for processing and then picked up by a separate process (consume).
This is a complete working sample of what you want to do, and the rotation time interval is configurable. Additionally, there's another script to consume and process the written files.
Now if only I could find a way to stop the whole script; my log keeps filling up (the script continues execution) even if I close the browser tab :P
