Processing huge YAML files via PHP

I need to process a huge YAML file (about 450 MB) and get the data into a database. I tried to use "spyc", but the file is too big to parse in one go.
Every chapter starts with the line --- !de.db.net,DB::Util::M10lDocument, and I need the content of every chapter as an array. I don't know how to split the file at those chapter boundaries.
Is it possible to read the complete file just block by block?
Does anyone have an idea how to work with that big file?

--- is the document boundary marker in a YAML stream. Using a YAML parser that processes the file as a stream lets you work through it in document-sized chunks, as long as each individual document fits in available memory.
The yaml_parse_file function provided by the yaml PECL extension can parse a single document out of a stream of documents. There is no built-in way to iterate over the documents (e.g. foreach support), but you can implement your own loop that fetches sequential documents and halts when yaml_parse_file returns false, indicating that the requested document was not found.
<?php
// Parse one document at a time: the second argument to yaml_parse_file()
// selects the document by position, and false is returned once the
// requested document does not exist.
$docNum = 0;
while (false !== ($doc = yaml_parse_file('example.yaml', $docNum))) {
    var_dump($doc); // replace with your own processing / database insert
    $docNum++;
}

Related

League CSV package - reading one line at a time from a resource/stream

I'm using The PHP League CSV importer/exporter to import a large CSV file in Laravel. Since the file is large, I would like to stream it to the CSV parser and handle it one line at a time, without loading every line into memory.
Laravel uses flysystem for the underlying filesystem, and I am using that to obtain a PHP resource to the source CSV.
What I don't understand is how - if it is at all possible - I can feed that resource stream into League CSV so that it reads one line at a time for me to process, before reading in the next line. All the documentation seems to imply that a CSV file is always read fully into memory, and that is what I want to avoid.
Do I need to use callbacks? If so, how can I be sure the stream resource is only being read one line at a time as needed, and not all at once?
I'm guessing I start by creating a stream reader?
use League\Csv\Reader;
$reader = Reader::createFromStream($resource, 'r');
You can iterate over the rows without loading the whole file by using the IteratorAggregate interface of the Reader. So you basically just do
foreach ($reader as $row) {
    // do stuff
}
If you are using a mac to read or create the CSV file you will need to add this to your code for it to work correctly:
if (!ini_get("auto_detect_line_endings")) {
    ini_set("auto_detect_line_endings", '1');
}
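Putting those pieces together, a minimal sketch (it assumes $resource is the readable stream handle you obtained from flysystem, e.g. via readStream(), and importRow() is a hypothetical per-record handler of your own):
<?php

use League\Csv\Reader;

// $resource is the readable PHP stream obtained from flysystem.
$reader = Reader::createFromStream($resource);

// Rows are pulled from the stream lazily as the iterator advances,
// so only the current line is held in memory.
foreach ($reader as $row) {
    importRow($row); // $row is an array of the columns on this line
}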

CakePHP 3 CakePDF Plugin: Generating a large PDF by looping through content?

So my problems stems from trying to generate large PDF files but being hit by the memory limit / execution timeouts in PHP, the data is too great in volume to simply extend these limits so that solution is out of the question.
I have a background shell task running which handles all of this rendering and then alerts the user once the PDF has been completed.
In theory I would have a loop within this shell which would take in a chunk of data and render it to file, then take the next chunk and do the same. Once out of data to render, the file would then be written and completed ready to be served. This way the memory limit of PHP would not be hit as a manageable chunk will only ever be loaded.
I am currently using the CakePDF (v3.5) plugin for CakePHP 3 (v3.5.13) but am struggling to find a solution which allows rendering some data and then adding more data to the same PDF.
Has anyone managed this before, or is it out of scope of the plugin? Would another solution be to create multiple PDF files and then merge them together after all the separate PDFs have been created?
This is more of a theoretical question if this would work and if anyone has managed it before. I don't have much code to show but if more detail is required then give me a shout and I will try and get something for you or some example code!
Thanks
I don't have direct experience with that version of CakePdf, but under CakePHP 2.x I use the wkhtmltopdf engine, which takes an .html file as input to produce the PDF.
If your shell generates that .html in chunks, it is easy to append to.
Of course wkhtmltopdf is likely to put some load on the machine to produce the PDF, but since it's a binary, it happens outside of PHP's memory/time constraints.
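A rough sketch of that chunked approach (fetchDataChunks() and renderChunk() are hypothetical placeholders for your own data source and partial view, TMP is CakePHP's tmp directory constant, and it assumes the wkhtmltopdf binary is on the PATH):
<?php
$htmlPath = TMP . 'report.html';
$pdfPath  = TMP . 'report.pdf';

// Build the HTML source incrementally so only one chunk of data
// is ever held in PHP's memory at a time.
file_put_contents($htmlPath, '<html><body>');
foreach (fetchDataChunks() as $chunk) {
    file_put_contents($htmlPath, renderChunk($chunk), FILE_APPEND);
}
file_put_contents($htmlPath, '</body></html>', FILE_APPEND);

// Let the wkhtmltopdf binary do the heavy lifting outside PHP's
// memory and execution-time limits.
exec('wkhtmltopdf ' . escapeshellarg($htmlPath) . ' ' . escapeshellarg($pdfPath));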
That's certainly out of the scope of the plugin; it's built around the idea of rendering a single view to a single file. The interface doesn't support chunked creation of a single file, and if I'm not mistaken, none of the supported engines support that either, at least not in a straightforward and efficient manner when it comes to large documents.
There's certainly lots of ways to do this, creating multiple PDFs and merging/concatenating them afterwards might be one of them, generating the source content in chunks, and passing it to a PDF renderer that can handle lots of content efficiently might be another one, and surely there also might be libraries out there that do explicitly support chunked creation of PDFs...
I thought I would post what I ended up doing for anyone in the future.
I used CakePDF to generate smaller PDFs, which I stored in a tmp directory. These all stay under PHP's execution-time and memory limits, as I don't believe altering those provides a good solution. In this step I also saved the names of all of the generated PDFs for use in the next step.
The code for this looked something like:
$is_last_pdf = false;
$tmp_file_list = [];
while (!$is_last_pdf) {
    // Generate a PDF in here with a portion of the data
    $CakePdf = new CakePdf();
    $CakePdf->template('page', 'default');
    $CakePdf->viewVars(compact('data', 'other_stuff'));
    // Save the file name to the array for the merge step
    $tmp_file_list[] = $file_name;
    // Update the flag: once there is no more data, this was the last PDF
    $is_last_pdf = !check_for_more_data();
}
From this I used Ghostscript from within the shell to merge all of the PDF files; the code for this looked something like this:
$output_path = 'output.pdf';

// Build a space-separated list of all the files to merge
$file_list = implode(' ', $tmp_file_list);

// Execute Ghostscript to merge all the files into the `output.pdf` file
exec('gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=' . $output_path . ' ' . $file_list);
All of the code here was in the Shell file responsible for creating the PDF.
Hope this helps someone :)

Broadcast stream with PHP within localhost

Maybe I'm asking the impossible, but I want to clone a stream multiple times - a sort of multicast emulation. The idea is to write a 1300-byte buffer into a .sock file every 0.002 seconds (instead of using IP:port, to avoid overhead) and then to read the same .sock file from other scripts multiple times.
Doing it through a regular file is not workable. It works only within the same script that generates the buffer file and then echoes it; the other scripts misread it badly.
This works perfectly with the script that generates the chunks:
$handle = @fopen($url, 'rb');
$buffer = 1300;
while (1) {
    $chunk = fread($handle, $buffer);
    $handle2 = fopen('/var/tmp/stream_chunck.tmp', 'w');
    fwrite($handle2, $chunk);
    fclose($handle2);
    readfile('/var/tmp/stream_chunck.tmp');
}
BUT the output of another script that reads the chunks:
while (1) {
    readfile('/var/tmp/stream_chunck.tmp');
}
is messy. I don't know how to synchronize the reading of the chunks, and I thought that sockets might work a miracle.
It works only within the same script that generates the buffer file and then echoes it. The other scripts will misread it badly
Using a single file without any sort of flow control shouldn't be a problem - tail -F does just that. The disadvantage is that the data will just accumulate indefinitely on the filesystem as long as a single client has an open file handle (even if you truncate the file).
But if you're writing chunks, then write each chunk to a different file (using an atomic write mechanism) then everyone can read it by polling for available files....
$current_chunk = 0;    // or wherever the reader wants to start from
do {
    // Wait for the next chunk file to appear
    while (!file_exists("$dir/$prefix.$current_chunk")) {
        clearstatcache();
        usleep(1000);
    }
    process(file_get_contents("$dir/$prefix.$current_chunk"));
    $current_chunk++;
} while (!$finished);
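And for the writing side, a minimal sketch of the atomic write mentioned above ($handle is the source stream from the generating script, and the $dir/$prefix naming is carried over from the reader loop; writing to a temporary name and then renaming means readers never see a half-written chunk, since rename() is atomic on the same filesystem):
$chunk_number = 0;
while (!feof($handle)) {
    $chunk = fread($handle, 1300);
    if ($chunk === '' || $chunk === false) {
        continue;
    }

    // Write the chunk under a temporary name first...
    file_put_contents("$dir/$prefix.tmp", $chunk);

    // ...then rename it into place so it appears to readers atomically.
    rename("$dir/$prefix.tmp", "$dir/$prefix." . $chunk_number);
    $chunk_number++;
}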
Equally, you could do this with a database - which should have slightly lower overhead for the polling, and simplifies the garbage collection of old chunks.
But this is all about how to make your solution workable - it doesn't really address the problem you are trying to solve. If we knew what you were trying to achieve then we might be able to advise on a more appropriate solution - e.g. if it's a chat application, video broadcast, something else....
I suspect a more appropriate solution would be a multi-processing, single-memory-model server - and when we're talking about PHP (which doesn't really do threading very well) that means an event-based/asynchronous server. There's a bit more involved than simply calling socket_select(), but there are some good scripts available which do most of the complicated stuff for you.
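As a very rough sketch of what such an event-based broadcaster can look like with stream_select() (the socket path and chunk size are assumptions; real code would need per-client write buffering and error handling):
<?php
// Listen on a Unix domain socket; every connected client receives
// each chunk of the source stream as it arrives.
$server  = stream_socket_server('unix:///var/tmp/broadcast.sock', $errno, $errstr);
$source  = fopen($url, 'rb');   // the stream being "multicast"
$clients = [];

while (true) {
    $read   = array_merge([$server, $source], $clients);
    $write  = null;
    $except = null;

    if (stream_select($read, $write, $except, null) === false) {
        break;
    }

    foreach ($read as $stream) {
        if ($stream === $server) {
            // New client connected: remember it.
            $clients[] = stream_socket_accept($server);
        } elseif ($stream === $source) {
            // New data on the source: fan it out to every client.
            $chunk = fread($source, 1300);
            foreach ($clients as $key => $client) {
                if (@fwrite($client, $chunk) === false) {
                    fclose($client);
                    unset($clients[$key]);
                }
            }
        }
    }
}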

Decompressing an LZO stream in PHP

I have a number of LZO-compressed log files on Amazon S3, which I want to read from PHP. The AWS SDK provides a nice StreamWrapper for reading these files efficiently, but since the files are compressed, I need to decompress the content before I can process it.
I have installed the PHP-LZO extension which allows me to do lzo_decompress($data), but since I'm dealing with a stream rather than the full file contents, I assume I'll need to consume the string one LZO compressed block at a time. In other words, I want to do something like:
$s3 = S3Client::factory( $myAwsCredentials );
$s3->registerStreamWrapper();
$stream = fopen("s3://my_bucket/my_logfile", 'r');
$compressed_data = '';
while (!feof($stream)) {
$compressed_data .= fread($stream, 1024);
// TODO: determine if we have a full LZO block yet
if (contains_full_lzo_block($compressed_data)) {
// TODO: extract the LZO block
$lzo_block = get_lzo_block($compressed_data);
$input = lzo_decompress( $lzo_block );
// ...... and do stuff to the decompressed input
}
}
fclose($stream);
The two TODOs are where I'm unsure what to do:
Inspecting the data stream to determine whether I have a full LZO block yet
Extracting this block for decompression
Since the compression was done by Amazon (s3distCp) I don't have control over the block size, so I'll probably need to inspect the incoming stream to determine how big the blocks are -- is this a correct assumption?
(ideally, I'd use a custom StreamFilter directly on the stream, but I haven't been able to find anyone who has done that before)
OK, executing a command via PHP can be done in many different ways, something like:
$command = 'gunzip -c /path/src /path/dest';
$escapedCommand = escapeshellcmd($command);
system($escapedCommand);
or also
shell_exec('gunzip -c /path/src /path/dest');
will do the work.
Now it's a matter of what command to execute; under Linux there's a nice command-line tool called lzop which compresses or extracts .lzo files.
You can use it via something like:
lzop -dN sources.lzo
So your final code might be something as easy as:
shell_exec('lzop -dN s3://my_bucket/my_logfile');
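One assumption worth flagging: the s3:// wrapper only exists inside PHP, so an external binary like lzop cannot open that path itself. A minimal sketch that still leans on the lzop binary, feeding it the S3 stream over a pipe and letting it write the decompressed data to a local temporary file (it trades pure in-memory streaming for a temp file, but keeps PHP's memory use flat, and assumes the file uses the lzop container format):
<?php
$s3 = S3Client::factory($myAwsCredentials);
$s3->registerStreamWrapper();

$in  = fopen('s3://my_bucket/my_logfile', 'r');
$out = '/tmp/my_logfile.decompressed';

// lzop reads the compressed data on stdin; its decompressed output is
// redirected to a local file by the shell.
$proc = proc_open(
    'lzop -dc > ' . escapeshellarg($out),
    [0 => ['pipe', 'r']],   // we write into lzop's stdin
    $pipes
);

stream_copy_to_stream($in, $pipes[0]);
fclose($pipes[0]);
proc_close($proc);          // waits for lzop to finish
fclose($in);

// Now process the decompressed log line by line.
$fh = fopen($out, 'r');
while (($line = fgets($fh)) !== false) {
    // ...... and do stuff to the decompressed input
}
fclose($fh);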

Merge multiple doc or rtf files into a single doc or rtf file by using php script

I would like to merge multiple doc or rtf files into a single file which should be the same format of multiple files.
What I mean is that if a user selects multiple rtf template files from a list box and clicks on a button on web page, the output should be a single rtf file which combines multiple rtf template files, I should use php for this.
I haven't decided the format of template files, but it should be either rtf or doc, and also I assume that template file has some images as well.
I have spent many hours researching libraries for this, but still can't find one.
Please help me out here!! :(
Thanks in advance.
If you are searching for a solution for handling RTF documents only, you can find a PHP package to merge multiple RTF documents here:
www.rtftools.com
Here is a short example on how to merge multiple documents together :
include('path/to/RtfMerger.phpclass');

$merger = new RtfMerger('sample1.rtf', 'sample2.rtf'); // You can specify docs to be merged to the class constructor...
$merger->Add('sample3.rtf');   // ...or by using the Add() method
$merger[] = 'sample4.rtf';     // ...or by using the array access methods
$merger->SaveTo('output.rtf'); // Will save files 'sample1' to 'sample4' into 'output.rtf'
This package allows you to handle documents that are bigger than the available memory.
I've been working on a similar project and haven't managed to find any PHP (or other open-source) libraries for manipulating MS Word files. The way I approach it is kind of complicated, but it works. Here's how I would do it (assuming you have a Linux server):
Setup:
Install JODConverter and OpenOffice
Start open office as a server (see http://www.artofsolving.com/node/10)
Approach (ie. what to do in your PHP code):
Convert your MSWord or RTF files into ODT format by calling JODConverter via backticks or exec()
Unzip each file into a temporary directory of its own
Read the content.xml file from each unzipped document using a DOM parser
Extract the <office:text> contents from each, and concatenate
Put this concatenated xml back into the right spot in one of the content.xml files
Re-zip the contents of that temporary directory and give it an .odt extension
Use JODConverter to convert this file back to MSWord again
As I said, it's not pretty, but it does the job.
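A compressed sketch of those steps (the jodconverter command line and file names are assumptions about your setup; ZipArchive is used here to read and replace content.xml in place instead of unzipping to a directory, and real code would need error handling at every step):
<?php
// 1. Convert each source document to ODT (assumes a jodconverter CLI is installed).
$sources = ['a.doc', 'b.rtf'];
$odts    = [];
foreach ($sources as $i => $src) {
    $odts[$i] = "/tmp/part$i.odt";
    exec('jodconverter ' . escapeshellarg($src) . ' ' . escapeshellarg($odts[$i]));
}

// 2-4. Pull the <office:text> body out of every content.xml and concatenate it.
$ns   = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
$body = '';
foreach ($odts as $odt) {
    $zip = new ZipArchive();
    $zip->open($odt);
    $dom = new DOMDocument();
    $dom->loadXML($zip->getFromName('content.xml'));
    $text = $dom->getElementsByTagNameNS($ns, 'text')->item(0);
    foreach ($text->childNodes as $child) {
        $body .= $dom->saveXML($child); // serialization carries the needed xmlns declarations
    }
    $zip->close();
}

// 5-6. Put the concatenated body back into the first document's content.xml
// and overwrite it inside that .odt (an .odt is just a zip archive).
$zip = new ZipArchive();
$zip->open($odts[0]);
$dom = new DOMDocument();
$dom->loadXML($zip->getFromName('content.xml'));
$text = $dom->getElementsByTagNameNS($ns, 'text')->item(0);
while ($text->firstChild) {
    $text->removeChild($text->firstChild);
}
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($body);
$text->appendChild($fragment);
$zip->addFromString('content.xml', $dom->saveXML());
$zip->close();

// 7. Convert the merged ODT back to MS Word.
exec('jodconverter ' . escapeshellarg($odts[0]) . ' ' . escapeshellarg('merged.doc'));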
If you're looking to go down the RTF route, this question may also help: Concatenate RTF files in PHP (REGEX)
