Decompressing an LZO stream in PHP - php

I have a number of LZO-compressed log files on Amazon S3, which I want to read from PHP. The AWS SDK provides a nice StreamWrapper for reading these files efficiently, but since the files are compressed, I need to decompress the content before I can process it.
I have installed the PHP-LZO extension which allows me to do lzo_decompress($data), but since I'm dealing with a stream rather than the full file contents, I assume I'll need to consume the string one LZO compressed block at a time. In other words, I want to do something like:
$s3 = S3Client::factory( $myAwsCredentials );
$s3->registerStreamWrapper();
$stream = fopen("s3://my_bucket/my_logfile", 'r');
$compressed_data = '';
while (!feof($stream)) {
$compressed_data .= fread($stream, 1024);
// TODO: determine if we have a full LZO block yet
if (contains_full_lzo_block($compressed_data)) {
// TODO: extract the LZO block
$lzo_block = get_lzo_block($compressed_data);
$input = lzo_decompress( $lzo_block );
// ...... and do stuff to the decompressed input
}
}
fclose($stream);
The two TODOs are where I'm unsure what to do:
Inspecting the data stream to dtermine whether I have a full LZO block yet
Extracting this block for decompression
Since the compression was done by Amazon (s3distCp) I don't have control over the block size, so I'll probably need to inspect the incoming stream to determine how big the blocks are -- is this a correct assumption?
(ideally, I'd use a custom StreamFilter directly on the stream, but I haven't been able to find anyone who has done that before)

Ok executing a command via PHP can be done in many different ways, something like:
$command = 'gunzip -c /path/src /path/dest';
$escapedCommand = escapeshellcmd($command);
system($escapedCommand);
or also
shell_exec('gunzip -c /path/src /path/dest');
will do the work.
Now it's a matter of what command to execute, under Linux there's a nice command line tool called lzop which extracts orcompresses lzop files.
You can use it via something like:
lzop -dN sources.lzo
So you final code might be something as easy as:
shell_exec('lzop -dN s3://my_bucket/my_logfile');

Related

Broadcast stream with PHP within localhost

Maybe I'm asking the impossible but I wanted to clone a stream multiple times. A sort of multicast emulation. The idea is to write every 0.002 seconds a 1300 bytes big buffer into a .sock file (instead of IP:port to avoid overheading) and then to read from other scripts the same .sock file multiple times.
Doing it through a regular file is not doable. It works only within the same script that generates the buffer file and then echos it. The other scripts will misread it badly.
This works perfectly with the script that generates the chunks:
$handle = #fopen($url, 'rb');
$buffer = 1300;
while (1) {
$chunck = fread($handle, $buffer);
$handle2 = fopen('/var/tmp/stream_chunck.tmp', 'w');
fwrite($handle2, $chunck);
fclose($handle2);
readfile('/var/tmp/stream_chunck.tmp');
}
BUT the output of another script that reads the chunks:
while (1) {
readfile('/var/tmp/stream_chunck.tmp');
}
is messy. I don't know how to synchronize the reading process of chunks and I thought that sockets could make a miracle.
It works only within the same script that generates the buffer file and then echos it. The other scripts will misread it badly
Using a single file without any sort of flow control shouldn't be a problem - tail -F does just that. The disadvantage is that the data will just accululate indefinitely on the filesystem as long as a single client has an open file handle (even if you truncate the file).
But if you're writing chunks, then write each chunk to a different file (using an atomic write mechanism) then everyone can read it by polling for available files....
do {
while (!file_exists("$dir/$prefix.$current_chunk")) {
clearstatcache();
usleep(1000);
}
process(file_get_contents("$dir/$prefix.$current_chunk"));
$current_chunk++;
} while (!$finished);
Equally, you could this with a database - which should have slightly lower overhead for the polling, and simplifies the garbage collection of old chunks.
But this is all about how to make your solution workable - it doesn't really address the problem you are trying to solve. If we knew what you were trying to achieve then we might be able to advise on a more appropriate solution - e.g. if it's a chat application, video broadcast, something else....
I suspect a more appropriate solution would be to use mutli-processing, single memory model server - and when we're talking about PHP (which doesn't really do threading very well) that means an event based/asynchronous server. There's a bit more involved than simply calling socket_select() but there are some good scripts available which do most of the complicated stuff for you.

uploading large object to Cloudfiles returns different md5

So I have this code and I'm trying to upload large files as per https://github.com/rackspace/php-opencloud/blob/master/docs/userguide/ObjectStore/Storage/Object.md to Rackspace:
$src_path = 'pathtofile.zip'; //about 700MB
$md5_checksum = md5_file($src_path); //result is f210775ccff9b0e4f686ea49ac4932c2
$trans_opts = array(
'name' => $md5_checksum,
'concurrency' => 6,
'partSize' => 25000000
);
$trans_opts['path'] = $src_path;
$transfer = $container->setupObjectTransfer($trans_opts);
$response = $transfer->upload();
Which allegedly uploads the file just fine
However when I try to download the file as recommended here https://github.com/rackspace/php-opencloud/blob/master/docs/userguide/ObjectStore/USERGUIDE.md:
$name = 'f210775ccff9b0e4f686ea49ac4932c2';
$object = $container->getObject($name);
$objectContent = $object->getContent();
$pathtofile = 'destinationpathforfile.zip';
$objectContent->rewind();
$stream = $objectContent->getStream();
file_put_contents($pathtofile, $stream);
$md5 = md5_file($pathtofile);
The result of md5_file ends up being different from 'f210775ccff9b0e4f686ea49ac4932c2'....moreover the downloaded zip ends up being unopenable/corrupted
What did I do wrong?
It's recommended that you only use multipart uploads for files over 5GB. For files under this threshold, you can use the normal uploadObject method.
When you use the transfer builder, it segments your large file into smaller segments (you provide the part size) and concurrently uploads each one. When this process has finished, a manifest file is created which contains a list of all these segments. When you download the manifest file, it collates them all together, effectively pretending to be the big file itself. But it's just really an organizer.
To get back to answering your question, the ETag header of a manifest file is not calculated how you may think. What you're currently doing is taking the MD5 checksum of the entire 700MB file, and comparing it against the MD5 checksum of the manifest file. But these aren't comparable. To quote the documentation:
the ETag header is calculated by taking the ETag value of each segment, concatenating them together, and then returning the MD5 checksum of the result.
There are also downsides to using this DLO operation that you need to be aware of:
End-to-end integrity is not assured. The eventual consistency model means that although you have uploaded a segment object, it might not appear in the container list immediately. If you download the manifest before the object appears in the container, the object will not be part of the content returned in response to a GET request.
If you think there's been an error in transmission, perhaps it's because a HTTP request failed along the way. You can use retry strategies (using the backoff plugin) to retry failed requests.
You can also turn on HTTP logging to check every network transaction to help with debugging. Be careful, though, using the above with echo out the HTTP request body (>25MB) into STDOUT. You might want to use this instead:
use Guzzle\Plugin\Log\LogPlugin;
use Guzzle\Log\ClosureLogAdapter;
$stream = fopen('php://output', 'w');
$logSubscriber = new LogPlugin(new ClosureLogAdapter(function($m) use ($stream) {
fwrite($stream, $m . PHP_EOL);
}), "# Request:\n{url} {method}\n\n# Response:\n{code} {phrase}\n\n# Connect time: {connect_time}\n\n# Total time: {total_time}", false);
$client->addSubscriber($logSubscriber);
As you can see, you're using a template to dictate what's outputted. There's a full list of template variables here.

stream audio/video files from gridFS on the browser

I have been trying to read an audio file from mongoDB which i have stored using GridFS. I could download the file in the system and play from it but I wanted to stream those audio/video files from the DB itself and play it in the browser. Is there anyway to do that without downloading the file to the system? Any help would be good.
The PHP GridFS support has a MongoGridFSFile::getResource() function that allows you to get the stream as a resource - which doesn't load the whole file in memory. Combined with fread/echo or stream_copy_to_stream you can prevent the whole file from being loaded in memory. With stream_copy_to_stream, you can simply copy the GridFSFile stream's resource to the STDOUT stream:
<?php
$m = new MongoClient;
$images = $m->my_db->getGridFS('images');
$image = $images->findOne('mongo.png');
header('Content-type: image/png;');
$stream = $image->getResource();
stream_copy_to_stream( $stream, STDOUT );
?>
Alternatively, you can use fseek() on the returned $stream resource to only send back parts of the stream to the client. Combined with HTTP Range requests, you can do this pretty efficiently.
If the other recipe fails, for example with NginX and php-fpm, because STDOUT is not available in fpm, you can use
fpassthru($stream);
instead of
stream_copy_to_stream( $stream, STDOUT );
So a complete solution looks like:
function img($nr)
{
$mongo = new MongoClient();
$img = $mongo->ai->getGridFS('img')->findOne(array('metadata.nr'=>$nr));
if (!$img)
err("not found");
header('X-Accel-Buffering: no');
header("Content-type: ".$img->file["contentType"]);
header("Content-length: ".$img->getSize());
fpassthru($img->getResource());
exit(0);
}
FYI:
In this example:
File is not acccessed by the filename, instead it is accessed by a number stored in the metadata. Hint: You can set an unique index to ensure, that no number can be used twice.
Content-Type is read from GridFS, too, so you do not need to hardcode this.
NginX caching is switched off to enable streaming.
This way you can even handle other things like video or html pages. If you want to enable NginX caching, perhaps only output X-Accel-Buffering on bigger sizes.

PHP readfile on a file which is increasing in size

Is it possible to use PHP readfile function on a remote file whose size is unknown and is increasing in size? Here is the scenario:
I'm developing a script which downloads a video from a third party website and simultaneously trans-codes the video into MP3 format. This MP3 is then transferred to the user via readfile.
The query used for the above process is like this:
wget -q -O- "VideoURLHere" | ffmpeg -i - "Output.mp3" > /dev/null 2>&1 &
So the file is fetched and encoded at the same time.
Now when the above process is in progress I begin sending the output mp3 to the user via readfile. The problem is that the encoding process takes some time and therefore depending on the users download speed readfile reaches an assumed EoF before the whole file is encoded, resulting in the user receiving partial content/incomplete files.
My first attempt to fix this was to apply a speed limit on the users download, but this is not foolproof as the encoding time and speed vary with load and this still led to partial downloads.
So is there a way to implement this system in such a way that I can serve the downloads simultaneously along with the encoding and also guarantee sending the complete file to the end user?
Any help is appreciated.
EDIT:
In response to Peter, I'm actually using fread(read readfile_chunked):
<?php
function readfile_chunked($filename,$retbytes=true) {
$chunksize = 1*(1024*1024); // how many bytes per chunk
$totChunk = 0;
$buffer = '';
$cnt =0;
$handle = fopen($filename, 'rb');
if ($handle === false) {
return false;
}
while (!feof($handle)) {
//usleep(120000); //Used to impose an artificial speed limit
$buffer = fread($handle, $chunksize);
echo $buffer;
ob_flush();
flush();
if ($retbytes) {
$cnt += strlen($buffer);
}
}
$status = fclose($handle);
if ($retbytes && $status) {
return $cnt; // return num. bytes delivered like readfile() does.
}
return $status;
}
readfile_chunked($linkToMp3);
?>
This still does not guarantee complete downloads as depending on the users download speed and the encoding speed, the EOF() may be reached prematurely.
Also in response to theJeztah's comment, I'm trying to achieve this without having to make the user wait..so that's not an option.
Since you are dealing with streams, you probably should use stream handling functions :). passthru comes to mind, although this will only work if the download | transcode command is started in your script.
If it is started externally, take a look at stream_get_contents.
Libevent as mentioned by Evert seems like the general solution where you have to use a file as a buffer. However in your case, you could do it all inline in your script without using a file as a buffer:
<?php
header("Content-Type: audio/mpeg");
passthru("wget -q -O- http://localhost/test.avi | ffmpeg -i - -f mp3 -");
?>
I don't think there's any of being notified about there being new data, short of something like inotify.
I suggest that if you hit EOF, you start polling the modification time of the file (using clearstatcache() between calls) every 200 ms or so. When you find the file size has increased, you can reopen the file, seek to the last position and continue.
I can highly recommend using libevent for applications like this.
It works perfect for cases like this.
The PHP documentation is a bit sparse for this, but you should be able to find more solid examples around the web.

gzdeflate() and large amount of data

I've been building a class to create ZIP files in PHP. An alternative to ZipArchive assuming it is not allowed in the server. Something to use with those free servers.
It is already sort of working, build the ZIP structures with PHP, and using gzdeflate() to generate the compressed data.
The problem is, gzdeflate() requires me to load the whole file in the memory, and I want the class to work limitated to 32MB of memory. Currently it is storing files bigger than 16MB with no compression at all.
I imagine I should make it compress data in blocks, 16MB by 16MB, but I don't know how to concatenate the result of two gzdeflate().
I've been testing it and it seems like it requires some math in the last 16-bits, sort of buff->last16bits = (buff->last16bits & newblock->first16bits) | 0xfffe, it works, but not for all samples...
Question: How to concatenate two DEFLATEd streams without decompressing it?
PHP stream filters are used to perform such tasks. stream_filter_append can be used while reading from or writing to streams. For example
$fp = fopen($path, 'r');
stream_filter_append($fp, 'zlib.deflate', STREAM_FILTER_READ);
Now fread will return you deflated data.
This may or may not help. It looks like gzwrite will allow you to write files without having them completely loaded in memory. This example from the PHP Manual page shows how you can compress a file using gzwrite and fopen.
http://us.php.net/manual/en/function.gzwrite.php
function gzcompressfile($source,$level=false){
// $dest=$source.'.gz';
$dest='php://stdout'; // This will stream the compressed data directly to the screen.
$mode='wb'.$level;
$error=false;
if($fp_out=gzopen($dest,$mode)){
if($fp_in=fopen($source,'rb')){
while(!feof($fp_in))
gzwrite($fp_out,fread($fp_in,1024*512));
fclose($fp_in);
}
else $error=true;
gzclose($fp_out);
}
else $error=true;
if($error) return false;
else return $dest;
}

Categories