How to Handing EXTREMELY Large Strings in PHP When Generating a PDF

How to Handing EXTREMELY Large Strings in PHP When Generating a PDF - php

I've got a report that can generate over 30,000 records if given a large enough date range. From the HTML side of things, a resultset this large is not a problem since I implement a pagination system that limits the viewable results to 100 at a given time.
My real problem occurs once the user presses the "Get PDF" button. When this happens, I essentially re-run the portion of the report that prints the data (the results of the report itself are stored in a 'save' table so there's no need to re-run the data-gathering logic), and store the results in a variable called $html. Keep in mind that this variable now contains 30,000 records of data plus the HTML needed to format it correctly on the PDF. Once I've got this HTML string created, I pass it to TCPDF to try and generate the PDF file for the user. However, instead of generating the PDF file, it just craps out without an error message (the 'Generating PDf...') dialog disappears and the system acts like you never asked it to do anything.
Through tests, I've discovered that the problem lies in the size of the $html variable being passed in. If the report under 3K records, it works fine. If it's over that, the HTML side of the report will print but not the PDF.
Helpful Info
PHP 5.3
TCPDF for PDF generation (also tried PS2PDF)
Script Memory Limit: 500 MB
How would you guys handle this scale of data when generating a PDF of this size?

Here is how I solved this issue: I noticed that some of the strings that I was having in my HTML output had some slight encoding issues - I ran htmlentities on those particular strings as I was querying the database for them and that cleared the problem.
Don't know if this was what was causing your problem, but my experience was very similar - when I was trying to output an HTML table that had a large size, with about 80.000 rows, TCPDF would display the page header but nothing table-related. This behaviour would be the same with different sets of data and different table structures.
After many attempts I started adding my own pagination - every 15 table rows, I would break the page and add a new table to the following page. That's when I noticed that every once and a while I would get blank pages between a lot of full and correct ones. That's when I realised that there must be a problem with those particular subsets of data, and discovered the encoding issue. It may be that you had something similar and TCPDF was not making it clear what your problem was.

Are you using the writeHTML method?
I went through the performance recommendations here: http://www.tcpdf.org/performances.php
It says "Split large HTML blocks in smaller pieces;".
I found that if my blocks of HTML went over 20,000 characters the PDF would take well over 2 minutes to generate.
I simply split my html up into the blocks and called writeHTML for each block and it improved dramatically. A file that wouldn't generate in 2 minutes before now takes 16 seconds.

TCPDF seems to be a native implementation of PDF generation in PHP. You may have better performance using a compiled library like PDFlib or a command-line app like htmldoc. The latter will have the best chances of generating a large PDF.
Also, are you breaking the output PDF into multiple pages? I.e. does TCPDF know to take a single HTML document and cut it into multiple pages, or are you generating multiple HTML files for it to combine into a single PDF document? That may also help.

I would break the PDF into parts, just like pagination.
1) Have "Get PDF" button on every paginated HTML page and allow downloading of records from that HTML page only.
2) Limit the maximum number of records that can be downloaded. If the maximum limit reaches, split the PDF and let the user to download multiple PDFs.

Related

TCPDF runs slow and consumes huge amounts of memory

TCPDF is an open source PDF generator I'm using in a web app of mine for generating reports. These reports are standard pre-made HTML tables, which TCPDF reads and converts into a PDF. Lately The reports have been running multiple pages and are taking much longer than I'd like to load.
The problems
TCPDF is currently generating the PDFs at 0.48s per page.
TCPDF uses on average 3-4 MB RAM per page during generation, often timing out the half gig php limit I have set. (512/4 = 128 = my page limit), even though the final file will be below 1MB.
What I've tried
Initially I thought the long wait time may have been down to database
calls for the required info and my php script generating the HTML, however timestamps on each page of the report, and also printing time
stamps at the completion of database calls and HTML generation rule out that possibility.
First thing I tried was updating TCPDF because I am running a 2010 version, this actually increased the load time by a factor of 4x! (however the new version was more memory efficient)
I've tried re-writing the HTML, removing all images and including all CSS in a style tag at the top (previously it was contained in each HTML tag), however this actually slowed the generation by 50%!
Question
Is there anyone whoes had a similar problem that can point out some other common edits which can improve the loading?
What are my alternatives from using HTML tables and TCPDF that can provide a more efficient way of generating the PDF?
Worst case scenario, I think I might have to make a separate PDF generating machine which does every report page by page, mashes them together, and Emails it when its done but that sounds naaasty.

How do I create a large pdf, semi quickly

I need to generate a large PDF, 2480 pages to be exact.
Currently I am using indesign, and while the output is exactly what I want.
I would rather not be involved in the document creation process.
It takes 31 minutes for indesign to execute the data merge, generate the pdf, save the pdf, and to save the pdf.indd file. (I dont really need the pdf.indd file, but I would rather not have to recreate the data merge if something were to happen to the pdf)
I am hoping for a php, or similar solution. Currently my data is stored in MySQL.
The majority of the pdf is static text, with 19 dynamically driven text fields.
There is one image on the pdf, 75x100px # 72dpi.
The output needs to be exact, the pdf file is printed and cut in half at 4.25 inches.
I have tried TCPDF, while it is fast at generating upto 50 pages, after that it would rather die than give me an output. I have also played with mPDF, and found it to be, ..., not as friendly. I have also considered generating many small files and using some utility to merge the smaller pdf's into one large pdf. Though that seems like driving around the mountain.
Any thoughts would be helpful.

You certainly can create documents directly with PHP, but it can be difficult. One method is to use one of the various PDF classes to create the document, as you have found. Another is to create images (using ImageMagic, GD, etc.) and convert those to PDF. (This method is less efficient, as you are creating raster graphics making the whole PDF page one giant graphic.)
However, I think you should consider simply scripting InDesign. InDesign has the capability to read data in via XML and create the document. This way, the design of your document isn't dependent on your programming abilities and you can still have the power of programmatically creating the document.

When it comes to huge number of pages in PDF, LaTeX is always the best answer. Nothing can really handle huge PDF generation as fast, accurate and elegant as LaTeX.
Check this question to see how to retrieve your data from the database.

Bulk template based pdf generation in PHP using pdftk

I am doing a bulk generation of pdf files based on templates and I ran into big performance issues pretty fast.
My current scenario is as follows:
get data to be filled from db
create fdf based on single data row and pdf form
write .fdf file to disk
merge the pdf with fdf using pdftk (fill_form with flatten command)
continue iterating over rows until all .pdf's are generated
all the generated files are merged together in the end and the single pdf is given to the client
I use passthru to give the raw output to the client (saves time writing file), but this is just a little performance improvements. The total operation time is about 50 seconds for 200 records and I would like to get down to at least 10 seconds in some way.
The ideal scenario would be operating all these pdfs in memory and not writing every single one of them to separate file but then the output would be impossible to do as I can't pass that kind of data to external tool like pdftk.
One other idea was to generate one big .fdf file with all those rows, but it looks like that is not allowed.
Am I missing something very trivial here?
I'm thanksfull for any advice.
PS. I know I could use some good library like pdflib but I am considering only open licensed libraries now.
EDIT:
I am up to figuring out the syntax to build an .fdf file with multiple pages using the same pdf as a template, spent few hours and couldn't find any good documentation.

After beeing faced with the same problem for a long time (wanted to generate my pdfs based on LaTeX) i finally decided to switch to another crude but effective technique:
i generate my pdfs in two steps: first i generate html with a template engine like twig or smarty. second i use mpdf to generate pdfs out of it. I tryed many other html2pdf frameworks and ended up using mpdf, it's very mature and is developed since a long time (frequent updates, rich functionality). the benefit using this technique: you can use css to design your documents (mpdf completely features css) - which comes along with the css benefit (http://www.csszengarden.com) and generate dynamic tables very easy.
Mpdf parses the html tables and looks for the theader, tfooter element and puts it on each page if your tables are bigger than one page size. Also you have the possibility to define page header and page footer elements with dynamic entities like page nr and so on.
i know, using this detour seems to be a workaround, but to be honest, no latex, pdf whatever engine is as strong and simple as html!

Try a different less complex library like fpdf (http://www.fpdf.org/)
I find it quite good and lite.
Always find libraries that are small and only do what you need them to do.
The bigger the library the more resources it consumes.

This won't help your multiple-page problem, but I notice that pdftk accepts the - character to mean 'read from standard input'.
You may be able to send the .fdf to the pdftk process via it's stdin, in order to avoid having to write them to disk.

Multipage invoice/document in PHP

TL;DR: I have problems with PHP generating PDF's longer than 1 page total.
Hello again. My goal is to create a script that will basically get all the important data and create a A4 format PDF invoice/document for printing/mailing/archiving. Generating PDF document is fine as long as the document does not overflow.
I want the invoice pages to be outlined with a border, it should contain:
the needed stuff required for the invoice to be valid
billed products/other information
place for supplier/customer signature and stamp or other data
All the pages HAVE to contain header and footer (company logo) and footer (page # of # - Invoice/Document ID - Date and Time - office ID - Printer ID, assigned personnel, whatever someone can ask), as well as border around the document body (under header, above footer).
Everything is fine as long as the document size is not bigger than
$pageSize-$pageMargins-$header-$footer-$invoiceDataBlock-$signaturesBlock
which is basically just like 10 cm for the actual invoiced items. If the document is bigger, I actually create attachment for the invoice manually using spreadsheet editor.
The question is: What can I do to create a multipage PDF document that has no problems like invoiced items overlaying the header/footer? I need to know when to continue on the next page. How do I know this? What is the best way to accomplish this task?
Thank you in advance!

I've used both FPDF and TCPDF to generate multi-page invoice files. They are roughly the same in terms of how they work. (I started with FPDF, then switched to TCPDF when I needed to include Unicode characters, which FPDF didn't support at the time.)
As Eugen suggested, you can hand-roll your own headers and footers more easily than using the functions built in to either FPDF or TCPDF.
My strategy for making sure I don't overwrite footers is simply to be careful with the data included on the invoice. When adding new SKUs, I test long names to make sure they will fit in their field in the invoice PDF. For items that must be variable-length, I put unknown content onto its own line to reduce possible impact:
Domain registration (2 years)
↳ example.com
As I generate each page of the invoice, I keep track of how many lines I've used. I know I can safely put 20 lines of items, and I know my maximum single item is 2 lines, then when I get to 20 lines, I start a new page. 15 items means 1 page. 25 items means two pages. The item counter goes up, and every time I hit the 20 line limit, it generates the next page and resets the page item counter.
Note that I'm not including any code in this answer because you didn't include any code in your question. If you'd like help with implementation, I suspect that will be grounds for an additional question. :-)

Use TCPDF. It has a very handy SetY() / GetY() pair of functions, that allows you to know, where on the page you are. You can use this to know when to do a page break.
Hint: Do not use the Header/Footer capabilities - they are clunky. Draw your own headers/footers.
Edit
As from discussion below, here are some details: To avoid overlaying you have 2 possibilities
Use getStringHeight() and calculate
Use Transactions
The first version draws its rationale from the fact, that of all objects you typically use in generating a PDF a text-flow is the only one, of which you cannot tell beforehand the height it will use. getStringHeight() provides you with a good enough estimate, so you know before adding the element, if it will fit on the page (leaving enough room on the bottom for the footer). So basically you extend your drawing loop to calculate the height of each element and test, if you need to start a new page first. This allows also for some sort of keeptogether, e.g. if the remaining space after a section title is too low, start a new page before, to keep section title and section body together.
The second version is even easier: In TCPDF you can use transactions simialr to a Database: Start a transaction, draw, if the result is not to your liking roll back, else commit. We found this to be quite a performance hog, ultimately deciding against it for long textual reports, but a 2-page invoice is a very different beast.

PHPExcel large data sets with multiple tabs - memory exhausted

Using PHPExcel I can run each tab separately and get the results I want but if I add them all into one excel it just stops, no error or any thing.
Each tab consists of about 60 to 80 thousand records and I have about 15 to 20 tabs. So about 1600000 records split into multiple tabs (This number will probably grow as well).
Also I have tested the 65000 row limitation with .xls by using the .xlsx extension with no problems if I run each tab it it's own excel file.
Pseudo code:
read data from db
start the PHPExcel process
parse out data for each page (some styling/formatting but not much)
(each numeric field value does get summed up in a totals column at the bottom of the excel using the formula SUM)
save excel (xlsx format)
I have 3GB of RAM so this is not an issue and the script is set to execute with no timeout.
I have used PHPExcel in a number of projects and have had great results but having such a large data set seems to be an issue.
Anyone every have this problem? work around? tips? etc...
UPDATE:
on error log --- memory exhausted
Besides adding more RAM to the box is there any other tips I could do?
Anyone every save current state and edit excel with new data?

I had the exact same problem and googling around did not find a valuable solution.
As PHPExcel generates Objects and stores all data in memory, before finally generating the document file which itself is also stored in memory, setting higher memory limits in PHP will never entirely solve this problem - that solution does not scale very well.
To really solve the problem, you need to generate the XLS file "on the fly". Thats what i did and now i can be sure that the "download SQL resultset as XLS" works no matter how many (million) row are returned by the database.
Pity is, i could not find any library which features "drive-by" XLS(X) generation.
I found this article on IBM Developer Works which gives an example on how to generate the XLS XML "on-the-fly":
http://www.ibm.com/developerworks/opensource/library/os-phpexcel/#N101FC
Works pretty well for me - i have multiple sheets with LOTS of data and did not even touch the PHP memory limit. Scales very well.
Note that this example uses the Excel plain XML format (file extension "xml") so you can send your uncompressed data directly to the browser.
http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Excel_XML_Spreadsheet_example
If you really need to generate an XLSX, things get even more complicated. XLSX is a compressed archive containing multiple XML files. For that, you must write all your data on disk (or memory - same problem as with PHPExcel) and then create the archive with that data.
http://en.wikipedia.org/wiki/Office_Open_XML
Possibly its also possible to generate compressed archives "on the fly", but this approach seems really complicated.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.