PDFLib in PHP hogging resources and not flushing to file

I just inherited a PHP project that generates large PDF files and usually chokes after a few thousand pages and several gigs of server memory. The project was using PDFLib to generate these files 'in memory'.
I was tasked with fixing this, so the first thing I did was to send PDFLib output to a file instead of building it in memory. The problem is, it still seems to be building the PDFs in memory, and much of that memory never seems to be returned to the OS. Eventually, the whole thing chokes and dies.
When I task the program with building only snippets of the large PDFs, it seems that the data is not fully flushed to the file on end_document(). I get no errors, yet the PDF is not readable and opening it in a hex editor makes it obvious that the stream is incomplete.
I'm hoping that someone has experienced similar difficulties.

Solved! I needed to call PDF_delete_textflow() on each textflow, as they are given document scope and don't go away until the document is closed, which never happened since all available memory was exhausted before that point.
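For anyone hitting the same wall, here's a minimal sketch of the pattern with the PDFlib procedural PHP binding (the page size, option lists, and the $blocks array are placeholder assumptions, not from the original code):

<?php
// Sketch only: write straight to a file and release each textflow after use.
$p = pdf_new();
pdf_begin_document($p, "/tmp/large-report.pdf", "");

foreach ($blocks as $text) {                 // $blocks: hypothetical text chunks
    pdf_begin_page_ext($p, 595, 842, "");    // A4 page in points

    $tf = pdf_create_textflow($p, $text, "fontname=Helvetica fontsize=10 encoding=unicode");
    pdf_fit_textflow($p, $tf, 50, 50, 545, 792, "");

    pdf_end_page_ext($p, "");

    // Textflows have document scope, so free them explicitly instead of
    // letting them pile up until end_document().
    pdf_delete_textflow($p, $tf);
}

pdf_end_document($p, "");
pdf_delete($p);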

You have to make sure that you are closing each page as well as closing the document. This is done by calling end_page_ext() at the end of every written page.
Additionally, if you are importing pages from another PDF, you have to call close_pdi_page() after each imported page and close_pdi_document() when you're done with each imported document.
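A rough sketch of how those calls pair up when importing pages with PDI (the file names and page size are placeholder assumptions):

$p = pdf_new();
pdf_begin_document($p, "/tmp/merged.pdf", "");

foreach (array("a.pdf", "b.pdf") as $source) {       // placeholder source files
    $doc = pdf_open_pdi_document($p, $source, "");
    $pages = (int) pdf_pcos_get_number($p, $doc, "length:pages");

    for ($n = 1; $n <= $pages; $n++) {
        $page = pdf_open_pdi_page($p, $doc, $n, "");

        pdf_begin_page_ext($p, 595, 842, "");
        pdf_fit_pdi_page($p, $page, 0, 0, "adjustpage");
        pdf_end_page_ext($p, "");          // close every page you write

        pdf_close_pdi_page($p, $page);     // release the imported page
    }

    pdf_close_pdi_document($p, $doc);      // release the imported document
}

pdf_end_document($p, "");
pdf_delete($p);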

Related

Create a PDF file on the fly and stream it while it is not yet finished?

We want to merge a lot of PDF files into one big file and send it to the client. However, the resources on our production server are very restricted, so merging all files in memory first and then sending the finished PDF file results in our script being killed because it exhausts its available memory.
The only solution (besides getting a better server, obviously) would be starting to stream the PDF file before it is fully created to bypass the memory limit.
However I wonder if that is even possible. Can PDF files be streamed before they're fully created? Or doesn't the PDF file format allow streaming unfinished files because some headers or whatever have to be set after the full contents are certain?
If it is possible, which PDF library supports creating a file as a stream? Most libraries that I know of (like TCPDF) seem to create the full file in memory and only then output the finished result somewhere (e.g. via the $tcpdf->Output() method).
The PDF file format itself can certainly be streamed; there's nothing in the format that prevents it.
As an example, we recently had a customer that required reading a single page over an HTTP connection from a remote PDF, without downloading or reading the whole PDF. We're able to do this by making many small HTTP requests for specific content within the PDF. We use the trailer at the end of the PDF and the cross-reference table to find the required content without having to parse the whole PDF.
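To make that concrete, here's a hedged PHP sketch of reading just the tail of a remote PDF with an HTTP Range request to find the startxref offset (the URL and the 2KB tail size are assumptions; a full reader would keep issuing Range requests to walk the xref table and fetch individual objects):

$url = "https://example.com/big.pdf";   // placeholder URL

// Ask for the total size first.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$size = (int) curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);

// Fetch only the last 2KB, which holds the trailer and the startxref pointer.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, ($size - 2048) . "-" . ($size - 1));
$tail = curl_exec($ch);
curl_close($ch);

if (preg_match('/startxref\s+(\d+)/', $tail, $m)) {
    $xrefOffset = (int) $m[1];   // byte offset of the cross-reference table
    // Further Range requests starting here let you read the xref table and
    // pull only the objects for the page you need.
}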
If I understand your problem correctly, it looks like the library you're currently using loads each PDF into memory before creating or streaming out the merged document.
If we look at this problem a different way, the better solution would be for the PDF library to only take references to the PDFs to be merged, then when the merged PDF is being created or streamed, pull in the content and resources from the PDFs to be merged, as-and-when required.
I'm not sure how many PHP libraries there are that can do this as I'm not too up-to-date with PHP, but I know there are probably a few C/C++ libraries that may be able to do this. I understand PHP can use extensions to call these libraries. Only downside is that they'll likely have commercial licenses.
Disclaimer: I work for the Mako SDK R&D group, hence why I know for sure there are some libraries which will do this. :)

PHP script to delete zero byte files

I'm having a problem with zero byte files. Sometimes, randomly it seems, the server I'm working with adds zero byte files into a directory. These files break another script. I can delete the files manually with no problem, but because of the extremely tight controls on the server, I can't do things like run batch scripts or cron jobs.
What I think I need is a small script on the front page (the only page, actually) that will run a script every time someone visits. It won't get huge traffic. The script would target a specific directory and delete zero byte files.
I've been experimenting with just something as basic as finding and displaying file sizes, and I'm not having much luck. I've even searched online for solutions to similar problems and I haven't found anything.
I don't expect you to do my coding for me (although I wouldn't turn it down! ; ) ), but if someone could help me with a simple way of even just displaying ONLY the zero byte file names, I might be able to proceed on my own from there. I just can't find a way that makes sense to me. And sorry to say, I have essentially no control over the server.
You can use the DirectoryIterator class to loop through the files in the specified directory and unlink() the zero-byte ones.
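For example, something along these lines (the directory path is a placeholder; it only lists the zero-byte files, and you can uncomment the unlink() once the output looks right):

$dir = '/path/to/watched/dir';   // placeholder path

foreach (new DirectoryIterator($dir) as $file) {
    if ($file->isFile() && $file->getSize() === 0) {
        echo $file->getFilename(), "\n";     // show what would be deleted
        // unlink($file->getPathname());     // uncomment to actually delete
    }
}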

Handle large .plist files with CFPropertyList

I'm using CFPropertyList from https://github.com/rodneyrehm/CFPropertyList for handling content I add with PHP.
It all worked fine, but now that all the content is added my file is about 700KB, which is not big, but seems big enough to make Apache crash when trying to save the file.
child pid 1278 exit signal Segmentation fault
I see in CacheGrind that a lot of time in my application is spent in calls to CFPropertyList->import() and CFDictionary->toXML(), so where could the bottleneck be?
Am I making too many changes at once? Should I load() and save() in between changes more often to avoid having too many changes saved at once?
Any clue?
I do not think that it's the size that causes the problem, but a bug in PHP. Segfaults occur only if there is a serious bug in PHP itself.
The next steps:
First, upgrade to the latest PHP version (5.3.6).
If the segfault no longer happens, you're done.
If it still happens:
Reproduce the issue with a PHP script no longer than 20 lines.
Report the issue to bugs.php.net.
When you implement a searchNode() function for a document of unknown size, you should always use a "depth" parameter to avoid stepping down through the document and calling your function an enormous number of times in a recursive loop.
Unbounded recursion like that can also cause a segfault in PHP that doesn't end in a fatal error or warning.
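As an illustration only (searchNode() and the array structure here are hypothetical, not part of CFPropertyList's API), a depth-limited recursive search might look like this:

// Hypothetical recursive search with an explicit depth cap, so a cyclic or
// unexpectedly deep structure bails out instead of recursing until PHP segfaults.
function searchNode($node, $key, $depth = 0, $maxDepth = 100) {
    if ($depth > $maxDepth || !is_array($node)) {
        return null;                              // stop instead of recursing forever
    }
    if (array_key_exists($key, $node)) {
        return $node[$key];
    }
    foreach ($node as $child) {
        $found = searchNode($child, $key, $depth + 1, $maxDepth);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}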

PHP script: How big is too big?

I'm developing a webapp in PHP, and the core library is 94kb in size at this point. While I think I'm safe for now, how big is too big? Is there a point where the script's size becomes an issue, and if so can this be ameliorated by splitting the script into multiple libraries?
I'm using PHP 5.3 and Ubuntu 10.04 32bit in my server environment, if that makes any difference.
I've googled the issue, and everything I can find pertains to PHP upload size only.
Thanks!
Edit: To clarify, the 94kb file is a single file that contains all my data access and business logic, and a small amount of UI code that I have yet to extract to its own file.
Do you mean you have one file that is 94KB in size, or that your whole library is 94KB in total?
Regardless, as long as you aren't piling everything into one file and you're organizing your library into different files, your file sizes should remain manageable.
If a single PHP file is starting to hit a few hundred KB, you have to think about why that file is getting so big and refactor the code to make sure that everything is logically organized.
I've used PHP applications that probably included several megabytes worth of code; the main thing if you have big programs is to use a code caching tool such as APC on your production server. That will cache the compiled (to byte code) PHP code so that it doesn't have to process every file for every page request and will dramatically speed up your code.
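If you do split that one 94KB file into one class per file, a small autoloader keeps things manageable without a wall of require statements (the lib/ layout here is an assumption, not from the question):

// Assumes one class per file under lib/, e.g. lib/ReportMapper.php.
spl_autoload_register(function ($class) {
    $path = __DIR__ . '/lib/' . $class . '.php';
    if (is_file($path)) {
        require $path;
    }
});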

dompdf memory issues

I'm using DOMPDF to generate about 500 reports from one script. It's running out of memory after about 10-15 PDFs have been generated.
In debugging, it looks like it's loading 8M every time it gets to the font loading stuff, but this seems like something that should be handled with the font caching code.
Any ideas of what's going wrong here? I'd like to post a simple code snippet, but most of it is abstracted into multiple layers, so it's not just a simple copy/paste.
If you're using dompdf 0.6 beta, the memory error is the result of an infinite loop that dompdf enters when rendering tables. This is a known issue that I haven't been able to resolve.
Relevant URLs:
http://code.google.com/p/dompdf/issues/detail?id=34
http://code.google.com/p/dompdf/issues/detail?id=91
(The error you see is: PHP Fatal error: Allowed memory size of 268435456 bytes exhausted)
First, if this is for anything remotely commercial, just get Prince XML. It's substantially better and faster than any other HTML-to-PDF solution (and I've looked at them all). The cost will quickly be recouped in saved developer time.
Second, the quickest solution is probably to print each report in a separate process to solve any memory leak problems. If this is running from the command line, have the outer loop be something like a shell script that starts a process for each report. If it's run from the web, fork a process for each report if you're on an OS that can do that.
Take a look at Convert HTML + CSS to PDF with PHP?.
As indicated by cletus, the quickest solution for you with DOMPDF is probably going to be rendering each report in a separate process. You can write a master script that calls a child script (using exec) which performs the actual rendering. As you can see in this discussion on the DOMPDF support group, it does seem to have the potential to provide a bit of a boost in performance.
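A rough sketch of that master/child split (render_report.php and the report IDs are placeholder names, not from the question):

// master.php: one child process per report, so dompdf's memory is returned
// to the OS when each child exits.
foreach ($reportIds as $id) {                                  // hypothetical list of report IDs
    exec('php render_report.php ' . escapeshellarg($id), $output, $status);
    if ($status !== 0) {
        error_log("Report $id failed with exit code $status");
    }
}
// render_report.php (the child) builds exactly one PDF with dompdf and exits.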
It's difficult to say what's going on otherwise regarding memory usage without some kind of example that demonstrates the problem. I don't believe there is much optimization of DOMPDF and the underlying CPDF rendering engine for multiple instances in a single script. So the font is probably being loaded into memory each time, even though a static variable could be used to cache that data.
