PDF Parser PHP Library Not Working - php

I'm using the PDF Parser PHP library to parse the text from several PDFs. It works perfectly for the majority of them, but it seems to just time out and stop working for certain PDFs.
This is the code I'm using (straight from their demo page):
<?php
include 'vendor/autoload.php';
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;
?>
When I replace 'document.pdf' with the URL to this file, it works perfectly as expected.
However, when I replace 'document.pdf' with the URL to this file, it just times out with a blank page.
Any ideas why it would work for one file and not the other?
Thanks in advance for any advice!

Yes, I saw this "ghost" error too: nothing shows up in error_log and nothing is caught by try/catch, so it is very hard to diagnose. If you increase memory_limit in php.ini, it goes away. It is either poor garbage collection in the library or memory ballooning. I think the latter, because my loop failed after 4 PDFs, but when I quadrupled the available RAM it didn't fail after 60.
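A minimal sketch of the workaround described above, assuming the failure really is memory exhaustion. The 512M value and the log format are illustrative choices, and raiseMemoryLimit() and describeFatalError() are hypothetical helpers, not part of PDF Parser. Memory-limit and time-limit failures are fatal errors that try/catch does not intercept, so the sketch reports them from a shutdown function instead:

```php
<?php
// Sketch only: the 512M value and the log format are assumptions.

/**
 * Raise PHP's memory limit for the current request and return the previous setting.
 */
function raiseMemoryLimit(string $limit): string
{
    $previous = ini_get('memory_limit');
    ini_set('memory_limit', $limit);
    return (string) $previous;
}

/**
 * Describe a fatal error (as returned by error_get_last()), or return null
 * when there is no error or the error is not fatal.
 */
function describeFatalError(?array $error): ?string
{
    $fatalTypes = [E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR];
    if ($error === null || !in_array($error['type'], $fatalTypes, true)) {
        return null;
    }
    return sprintf('Fatal: %s in %s:%d', $error['message'], $error['file'], $error['line']);
}

raiseMemoryLimit('512M');

// Log the last fatal error, if any, when the script shuts down.
register_shutdown_function(function (): void {
    $message = describeFatalError(error_get_last());
    if ($message !== null) {
        error_log($message);
    }
});
```

Run this before the parseFile() call. If the shutdown handler logs an "Allowed memory size" message for the failing PDF, that would support the memory explanation.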

Related

PDF manipulation - images are distorted after few consecutive operations on PDF file

I've run into this weird issue with PDF file handling. Not sure if SO is the right place to ask this, but I couldn't find any specific sites for this. I hope that someone can shed some light on the issue.
This happens with the following specific process. If any of the steps is omitted, the issue is not observed.
I have a PHP application that serves PDF files to users. These files are created by authors in MS Word 2007, then printed to protected PDF (using pdf995, most likely, I can confirm if needed).
I'll call this initial PDF file the 'source' from here on.
Upon request, the source file is processed in PHP the following way:
we decrypt it using qpdf:
qpdf --decrypt "source.pdf" "tmp_output.pdf"
Then we add a security label / watermark to it, encrypt it and output it to the browser using mPDF 6.0:
$mpdf = new mPDF();
$mpdf->SetImportUse();
$pagecount = $mpdf->SetSourceFile($fpath);
if ($pagecount) {
    for ($i = 1; $i <= $pagecount; $i++) {
        $tplId = $mpdf->ImportPage($i);
        $mpdf->UseTemplate($tplId);
        $html = '[security label / watermark contents...]';
        $mpdf->WriteHTML($html);
    }
}
$mpdf->SetProtection(array('copy', 'print'), '', 'password', 128);
$mpdf->Output('final_output.pdf', 'I');
With the exact steps described above, images that were pasted into the Word doc appear distorted in the output.
In the source PDF and in tmp_output (the qpdf-decrypted file), the pasted images look correct.
The distortion doesn't take place if any of the following occurs:
Word doc printed to PDF without protection
mPDF output is not protected.
As you can see, there are too many factors, so I don't know where to look for a bug.
Each component works correctly on its own and I cannot find any info on the issue. Any insights are greatly appreciated.
EDIT 1
After some more testing, it appears that this only happens to screenshots taken from a web browser, Windows Explorer or MS Word. I cannot reproduce it with screenshots from Gimp.
It appears that something along the way attempts to convert white to alpha and fails.
The current version (6.1) of mPDF has a bug: it does not correctly handle escaped PDF strings (imported via FPDI) when they have to be encrypted.
A pull request that fixes this issue is available here.
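Until that fix is applied, one possible workaround follows from the question's own observation that the distortion does not occur when the mPDF output is not protected: let mPDF write an unprotected file and apply the encryption with qpdf, which is already part of the pipeline. This is an untested assumption, not a confirmed fix. The file names and the buildQpdfEncryptCommand() helper are hypothetical, and the permission flags are only an approximation of array('copy','print'):

```php
<?php
// Sketch only: file names and the password are placeholders, and this route
// is an untested assumption, not a confirmed fix for the mPDF bug.

/**
 * Build a qpdf command that applies 128-bit encryption with an empty user
 * password, allowing printing and text extraction but no modification.
 */
function buildQpdfEncryptCommand(string $inFile, string $outFile, string $ownerPassword): string
{
    return sprintf(
        'qpdf --encrypt %s %s 128 --print=full --extract=y --modify=none -- %s %s',
        escapeshellarg(''),
        escapeshellarg($ownerPassword),
        escapeshellarg($inFile),
        escapeshellarg($outFile)
    );
}

// Usage, after mPDF has written an unprotected file with Output($file, 'F'):
// exec(buildQpdfEncryptCommand('tmp_stamped.pdf', 'final_output.pdf', 'password'), $out, $status);
echo buildQpdfEncryptCommand('tmp_stamped.pdf', 'final_output.pdf', 'password'), PHP_EOL;
```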

PHPRtfLite - RTF file opens as raw

I am using PHPRtfLite library (http://sigma-scripts.de/phprtflite/docs/index.html) to produce an RTF file using PHP and Yii.
So far, I've made a simple "Hello world" function.
Yii::import('ext.phprtf.PHPRtfLite');
Yii::registerAutoloader(array('PHPRtfLite','registerAutoloader'), true);
$rtf = new PHPRtfLite();
$sect = $rtf->addSection();
$sect->writeText('Hello world!', new PHPRtfLite_Font(), new PHPRtfLite_ParFormat());
//save rtf document
$rtf->sendRtf('takis.rtf');
The file is created successfully, but when I open it (in either WordPad or MS Word) I do not see the actual content of the file but the raw code of the RTF:
{\rtf\ansi\deff0\fs20
{\fonttbl{\f0 Times New Roman;}}
{\colortbl;\red0\green0\blue0;}
{\info
}
\paperw11907 \paperh16840 \deftab1298 \margl1701 \margr1701 \margt567 \margb1134 \pgnstart1\ftnnar \aftnnrlc \ftnstart1 \aftnstart1
\pard \ql {\fs20 Hello world!}
}
Do you have any idea on how to solve this?
Thank you very much in advance.
To answer my own question, in case someone is having the same issue in the coming future...
It seems to be a problem with the sendRtf function. Now, I save the created file locally:
$rtf->save('takis.rtf');
and then generate a link for the user to download the file. This works pretty well.
I have experienced the same thing myself. I'm not sure whether the cause was the same for you, but in my case there was an extra newline at the beginning of the PHP file, before the <?php tag. When I used sendRtf to download the file from the browser, that newline also ended up in the RTF file, making it invalid, and as a result the raw RTF code was displayed. When using save, such extra characters don't reach the file.
So one thing to check in similar situations: open the RTF file in Notepad and examine the beginning of the file.
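The same check can be done in code. This is a minimal sketch, assuming a valid file must begin with the bytes {\rtf, as in the output shown in the question. looksLikeRtf() and bytesBeforeRtfSignature() are hypothetical helpers, not part of PHPRtfLite:

```php
<?php
// Sketch only: these helpers are hypothetical and not part of PHPRtfLite.

/**
 * Return true if the data begins with the RTF signature "{\rtf".
 * Stray output before it (a newline, a BOM) can make editors show raw code.
 */
function looksLikeRtf(string $data): bool
{
    return strncmp($data, '{\\rtf', 5) === 0;
}

/**
 * Return the number of bytes that precede the RTF signature, or null if
 * the signature is not found at all.
 */
function bytesBeforeRtfSignature(string $data): ?int
{
    $pos = strpos($data, '{\\rtf');
    return $pos === false ? null : $pos;
}

var_dump(looksLikeRtf("{\\rtf\\ansi\\deff0\\fs20 }"));        // a clean file
var_dump(bytesBeforeRtfSignature("\n{\\rtf\\ansi\\deff0 }")); // one stray newline in front
```

Run it on the contents of the downloaded file, for example via file_get_contents(). On the server side, calling headers_sent($file, $line) just before sendRtf may also help locate where stray output started, although output buffering can hide it.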

Web page garbled (encoding?) when downloading from within PHP

I am trying to download this page (http://www.360.ru/) from within PHP. However, when I write the file out and view it, the content is garbled/corrupt. A different page from the same site downloads without problems (http://www.360.ru/goods/category/3/466/), and both work perfectly well in Chrome & Firefox (which both report the encoding as UTF-8). I cannot think what the problem could be. Here is my PHP code:
<?php
file_put_contents('/temp/out.html', fopen("http://www.360.ru/", 'r'));
file_put_contents('/temp/out2.html', fopen("http://www.360.ru/goods/category/3/466/", 'r'));
exit;
?>
When I open the two files, "out.html" is garbled, corrupt and "out2.html" is perfectly okay. Any help would be really appreciated. Thanks!
Ah, figured it out - the first page was gzipped. Using gzopen instead of fopen fixed the problem. Hope this helps others...
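If it is not known in advance which pages are gzipped, the body can be checked for the gzip magic bytes (0x1f 0x8b) before it is written out. This is a minimal sketch, assuming the zlib extension is available. decodeIfGzipped() is a hypothetical helper name:

```php
<?php
// Sketch only: decodeIfGzipped() is a hypothetical helper; requires the zlib extension.

/**
 * Return the body unchanged unless it starts with the gzip magic bytes
 * (0x1f 0x8b), in which case return the decompressed content.
 */
function decodeIfGzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    $decoded = gzdecode($body);
    if ($decoded === false) {
        throw new RuntimeException('Body has a gzip header but could not be decompressed.');
    }
    return $decoded;
}

// Usage (the URL and path come from the question):
// $html = decodeIfGzipped(file_get_contents('http://www.360.ru/'));
// file_put_contents('/temp/out.html', $html);
```

This handles both pages from the question with the same code path, instead of switching between fopen and gzopen per URL.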

HTML2PDF in PHP - convert utilities & scripts - examples & demos

I have a quite complicated HTML/CSS layout which I would like to convert to PDF on my server. I have already tried DOMPDF; unfortunately it did not convert the HTML with the correct layout. I have considered HTMLDOC, but I have heard that it ignores CSS to a large extent, so I suppose the layout would break apart with that tool too.
My question therefore is - are there any online demos for other tools (e.g. wkhtmltopdf) that I could use to verify how my HTML is converted, before spending the rest of my life installing & testing them one by one?
Unfortunately, I can't change the HTML layout to fit those tools. Or better said - I could, if any of them got close to an acceptable result...
Not really an answer to the question above, but I'll try to share some of my experience; maybe it will help someone somewhere in the future.
wkhtmltopdf is really THE ONLY solution that worked for me and produced what I call acceptable results. Some minor modifications to the CSS still had to be made, but it worked really well when it comes to rendering the content. All the other packages are really only suitable if you have a rather simple document with one basic table etc. There is no chance of getting them to produce fair results on complex docs with design elements, CSS, multiple overlapping images etc. If complex documents are in play - do not waste the time (like I did) - go straight to wkhtmltopdf.
Beware - the wkhtmltopdf installation is tricky. It was not as easy for me as the guys said in their comments (one of the reasons might be that I am not too familiar with Linux). The static binary did not work for me for some reason I can't explain. I suspect that there was a problem with the version - apparently there are different builds for different OSes and processors, so maybe I had the wrong one. To install the non-static version you first of all need root access to the server, that's obvious. I installed it with apt-get using PuTTY, which went quite well. I was lucky that my server already had all the prerequisites for wkhtmltopdf, so this was the easy part for me :) (btw, you don't have to care about symbolic links or wrappers as many tutorials tell you - I spent hours trying to figure out how to do that part, in the end I gave up and everything works well anyway)
After the install I got the quite famous Cannot connect to X server error. This is because wkhtmltopdf has to run headless on a 'virtual' X server. Getting around this was also quite simple (if one does not care about the symbolic links). I installed Xvfb with apt-get install xvfb. This also went quite well for me, no problems.
After completing this I was able to run wkhtmltopdf. Beware - it took me some time to figure out that trying to run xvfb was the wrong way - instead you have to run xvfb-run. My PHP code now looks like this: exec("xvfb-run wkhtmltopdf --margin-left 16 /data/web/example.com/source.html /data/web/example.com/target.pdf"); (notice the --margin-left 16 command line option for wkhtmltopdf - it makes my content more centered; I left it in place to demonstrate how you can use command line options).
I also wanted to protect the generated PDF files from editing (print protection is also possible). After doing some research I found this class from ID Security Suite. First of all I have to say - IT'S OLD (I am running PHP 5+). However, I made some improvements to it. It's a wrapper around the FPDF library, so there is a file called fpdf.php in the package. I replaced this file with the latest FPDF version I got from here, which made the PHP warnings more manageable. I also changed $pdf =& new FPDI_Protection(); by removing the & sign, as I was getting a deprecated warning for it. However, there are more of those to come. Instead of searching and modifying the code I just turned the error reporting level to 0 with error_reporting(0); (although turning off only the warnings should be sufficient). Now someone will say that this is not "good practice". I am using this whole stuff on an internal system, so I do not really have to care. For sure the scripts could be modified to match the latest requirements, but I didn't want to spend more hours working on it.
Be careful where the script says $pdf->SetProtection(array('print'), '', $password); (I allowed printing my documents, as you can see). It took me a while to figure out that the first argument is the permissions. The second is the USER PASSWORD - if you provide this then the docs will require a password to open (I left this blank). The third is the OWNER PASSWORD - this is what you need to make the docs "secured" against editing, copying etc.
My whole code now looks like:
// get the HTML content of the file we want to convert
$invoice = file_get_contents("http://www.example.com/index.php?s=invoices-print&invoice_no=".$_GET['invoice_no']);
// replace the CSS style from a print version to a specially modified PDF version
$invoice = str_replace('href="design/css/base.print.css"','href="design/css/base.pdf.css"',$invoice);
// write the modified file to disk
file_put_contents("docs/invoices/tmp/".$_GET['invoice_no'].".html", $invoice);
// do the PDF magic
exec("xvfb-run wkhtmltopdf --margin-left 16 /data/web/domain.com/web/docs/invoices/tmp/".$_GET['invoice_no'].".html /data/web/domain.com/web/docs/invoices/".$_GET['invoice_no'].".pdf");
// delete the temporary HTML data - we do not need that anymore since our PDF is created
unlink("docs/invoices/tmp/".$_GET['invoice_no'].".html");
// workaround the warnings
error_reporting(0);
// script from ID Security Suite
function pdfEncrypt ($origFile, $password, $destFile){
    require_once('libraries/fpdf/FPDI_Protection.php');
    $pdf = new FPDI_Protection();
    $pdf->FPDF('P', 'in');
    //Calculate the number of pages from the original document.
    $pagecount = $pdf->setSourceFile($origFile);
    //Copy all pages from the old unprotected pdf in the new one.
    for ($loop = 1; $loop <= $pagecount; $loop++) {
        $tplidx = $pdf->importPage($loop);
        $pdf->addPage();
        $pdf->useTemplate($tplidx);
    }
    //Protect the new pdf file: allow printing only (no copying, editing, etc.).
    //The empty second argument means no password is needed to open the file.
    $pdf->SetProtection(array('print'), '', $password);
    $pdf->Output($destFile, 'F');
    return $destFile;
}
//Password for the PDF file (I suggest using the email address of the purchaser).
$password = md5(date("Ymd")).md5(date("Ymd"));
//Name of the original file (unprotected).
$origFile = "docs/invoices/".$_GET['invoice_no'].".pdf";
//Name of the destination file (password protected and printing rights removed).
$destFile = "docs/invoices/".$_GET['invoice_no'].".pdf";
//Encrypt the book and create the protected file.
pdfEncrypt($origFile, $password, $destFile );
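One caveat about the listing above: $_GET['invoice_no'] goes straight into file paths and into the exec() command line, which is risky on anything reachable from outside. Here is a minimal sketch of a safer variant, assuming invoice numbers consist of digits only (an assumption, adjust the pattern to your own numbering). validateInvoiceNo() and buildWkhtmltopdfCommand() are hypothetical helper names:

```php
<?php
// Sketch only: the digits-only rule and the helper names are assumptions for illustration.

/**
 * Accept only invoice numbers made of digits, so the value is safe to use
 * in file names. Throws if anything else is supplied.
 */
function validateInvoiceNo(string $raw): string
{
    if (!preg_match('/\A[0-9]+\z/', $raw)) {
        throw new InvalidArgumentException('Invalid invoice number.');
    }
    return $raw;
}

/**
 * Build the xvfb-run/wkhtmltopdf command with every path shell-escaped.
 */
function buildWkhtmltopdfCommand(string $htmlPath, string $pdfPath): string
{
    return sprintf(
        'xvfb-run wkhtmltopdf --margin-left 16 %s %s',
        escapeshellarg($htmlPath),
        escapeshellarg($pdfPath)
    );
}

$invoiceNo = validateInvoiceNo('20150042'); // in the real script: $_GET['invoice_no']
$base = '/data/web/domain.com/web/docs/invoices';
echo buildWkhtmltopdfCommand("$base/tmp/$invoiceNo.html", "$base/$invoiceNo.pdf"), PHP_EOL;
```

The returned string can be passed to exec() in place of the concatenated command.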
Hope this helps someone to save some time in the future. This whole solution took me like 12 hours to implement into our invoicing system. If there was better info on wkhtmltopdf for users like me, who are not that familiar with Linux/UNIX, I could have saved some of the hours spent on this.
However - what doesn't kill you makes you stronger :) So I am a bit more perfect now that I made this run :)

TCPDF outputs weird characters in IE8

Today I started experimenting with PHP-based PDF generators. I tried TCPDF and it works fine for the most part, although it seems to be a little slow. But when I load the PHP file that generates my PDF in Internet Explorer 8, I see lines and lines of weird characters. Chrome however recognizes it as a PDF.
I'm assuming that I have to set a special MIME type to tell IE that it should interpret the page output as a PDF file. If yes, how can I do this?
Sending the "application/pdf" or "application/octet-stream" MIME type might help. Keep in mind that "application/octet-stream" will force a download of the file and might prevent it from opening in the browser.
In case you wonder, you can do it like this:
header('Content-type: application/octet-stream');
I had this problem too. What I did to get it to work was to add
exit();
at the end of pdf output.
You need to handle IE differently for dynamic-generated content. See this article,
http://support.microsoft.com/default.aspx?scid=kb;en-us;293792
In my code, I do this,
if (isset($_SERVER['HTTP_USER_AGENT']) && $_SERVER['HTTP_USER_AGENT'] == 'contype') {
    header('Content-Type: application/pdf');
    exit;
}
This problem may also explain the slowness you mentioned, because without this logic your page actually sends the whole PDF multiple times.
@Pieter: I was experiencing the same issue using tcpdf (with fpdi) while loading the page that generates the PDF through an ajax call. I changed the JavaScript to load the page using window.location instead, and the issue went away and the performance was much better. I believe the other two posters are correct that the document header is causing the issue. In my case, because of the ajax call, the header was not being applied to the whole document. Hope this helps.
I found this to be a problem too, and for me this all hinged on the code:
if (php_sapi_name() != 'cli') {
on line 7249 of the tcpdf.php file.
I commented out this 'if' statement (and the related '}') and now everything works fine in my other browser and in IE8.
Hope this helps
