I was wondering if there is a way for PHP to check whether a PDF file stored locally on the server is corrupted. We have a PHP application that deals with a lot of scanned documents converted to PDF, and it would be nice to check which of them are corrupted so we can alert the user.
I tried to look around but with no luck.
There are versions of pdflib available which can read PDFs - you could simply try to open and read each page with that.
The problem is there are many ways a PDF file can be corrupt.
Maybe your best solution would be to find a PDF reading lib and try to extract the first word from each page or something. That would at least catch some basic types of corruption.
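As a rough illustration of catching the most basic kind of corruption without any library, here is a minimal sketch in plain PHP. It only checks that the file has a "%PDF-" header near the start and the "startxref" and "%%EOF" markers near the end, which is enough to spot truncated uploads and non-PDF files. The function name pdfLooksIntact() and the 1024-byte windows are my own assumptions, and a file that passes can still be damaged inside.

```php
<?php
// Hypothetical helper: a structural sanity check only, not a full validation.
function pdfLooksIntact(string $path): bool
{
    $size = @filesize($path);
    if ($size === false || $size < 16) {
        return false; // missing, unreadable, or far too small to be a PDF
    }

    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return false;
    }

    // The header should sit at (or very near) the start of the file.
    $head = fread($fh, 1024);

    // The trailer markers should sit near the end of the file.
    $tailLen = min(1024, $size);
    fseek($fh, -$tailLen, SEEK_END);
    $tail = fread($fh, $tailLen);
    fclose($fh);

    return strpos($head, '%PDF-') !== false
        && strpos($tail, 'startxref') !== false
        && strpos($tail, '%%EOF') !== false;
}
```

A truncated scan typically fails the tail check, so this can run as a cheap first pass before handing the file to a real PDF reading library.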
Related
We want to merge a lot of PDF files into one big file and send it to the client. However, the resources on our production server are very restricted, so merging all files in memory first and then sending the finished PDF file results in our script being killed because it exhausts its available memory.
The only solution (besides getting a better server, obviously) would be starting to stream the PDF file before it is fully created to bypass the memory limit.
However I wonder if that is even possible. Can PDF files be streamed before they're fully created? Or doesn't the PDF file format allow streaming unfinished files because some headers or whatever have to be set after the full contents are certain?
If it is possible, which PDF library supports creating a file as a stream? Most libraries that I know of (like TCPDF) seem to create the full file in memory and only at the end output the finished result somewhere (i.e. via the $tcpdf->Output() method).
The PDF file format is entirely able to be streamed. There's certainly nothing that'll prevent it anyway.
As an example, we recently had a customer that required reading a single page over an HTTP connection to a remote PDF, without downloading or reading the whole PDF. We're able to do this by making many small HTTP requests for specific content within the PDF. We use the trailer at the end of the PDF and the cross reference table to find the required content without having to parse the whole PDF.
If I understand your problem, it looks like the library you're currently using loads each PDF into memory before creating or streaming out the merged document.
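To make the trailer idea concrete, here is a minimal sketch in plain PHP that reads only the last couple of kilobytes of a local file and pulls out the byte offset stored after the "startxref" keyword, which is where the cross reference table starts. This is not the Mako SDK's code; the function name readStartXrefOffset() and the 2048-byte window are assumptions for illustration. Over HTTP the same thing would be done with range requests rather than fseek().

```php
<?php
// Hypothetical helper: returns the offset of the cross reference table,
// or null if the tail of the file does not contain a usable "startxref".
function readStartXrefOffset(string $path, int $tailBytes = 2048): ?int
{
    $size = @filesize($path);
    if ($size === false || $size === 0) {
        return null;
    }

    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return null;
    }

    // Read only the end of the file; the rest is never loaded.
    $len = min($tailBytes, $size);
    fseek($fh, -$len, SEEK_END);
    $tail = fread($fh, $len);
    fclose($fh);

    // Use the last "startxref", since updated PDFs can contain several.
    $pos = strrpos($tail, 'startxref');
    if ($pos === false) {
        return null;
    }
    if (!preg_match('/^startxref\s+(\d+)/', substr($tail, $pos), $m)) {
        return null;
    }

    return (int) $m[1];
}
```

From that offset a reader can seek straight to the cross reference table and then to individual objects, which is why the whole file never has to be parsed.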
If we look at this problem a different way, the better solution would be for the PDF library to only take references to the PDFs to be merged, then when the merged PDF is being created or streamed, pull in the content and resources from the PDFs to be merged, as-and-when required.
I'm not sure how many PHP libraries there are that can do this as I'm not too up-to-date with PHP, but I know there are probably a few C/C++ libraries that may be able to do this. I understand PHP can use extensions to call these libraries. Only downside is that they'll likely have commercial licenses.
Disclaimer: I work for the Mako SDK R&D group, hence why I know for sure there are some libraries which will do this. :)
I have relatively sensitive data in .docx, .xlsx and PDF files that all need to be converted to a single PDF file locally. Sending these files off to phpdocx or Google Docs or anything like this is not an option.
The only other option I am seeing is OpenOffice / LibreOffice but I am not satisfied with how they are converting the documents.
Is there any other alternative anyone is aware of? Thanks!
Definitely a difficult task. The very recent release of LibreOffice 3.6 has fixes to its docx processing, if that might help, but you haven't specified what actual problems you encountered when you tried OpenOffice.
If you have time to experiment (and bring in any tools/languages you need to get the job done) you could try LibreOffice to produce PDFs, then use one of the many PDF libs to stitch the PDFs into the single file you require.
You could also look at ODFConverter which has traditionally been much better with DOCX than either OpenOffice or LibreOffice. This would allow you to go docx -> odt -> pdf. I think it can do xlsx also. Then do the PDF stitching again.
I suggest testing the stages manually at first and if promising, try something like JODConverter (requires Java) to allow you to automate the process via scripts.
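If you want to script the LibreOffice stage directly, a minimal sketch could look like the following. It assumes the soffice binary is on the PATH and uses LibreOffice's headless command-line conversion; the function names buildLibreOfficeCommand() and convertToPdf() are made up for this example. The resulting PDF lands in the output directory under the input file's base name.

```php
<?php
// Hypothetical helper: builds the shell command for a headless conversion.
function buildLibreOfficeCommand(string $inputFile, string $outputDir, string $binary = 'soffice'): string
{
    return sprintf(
        '%s --headless --convert-to pdf --outdir %s %s',
        escapeshellcmd($binary),
        escapeshellarg($outputDir),
        escapeshellarg($inputFile)
    );
}

// Hypothetical helper: runs the conversion and reports success by exit code.
function convertToPdf(string $inputFile, string $outputDir): bool
{
    exec(buildLibreOfficeCommand($inputFile, $outputDir) . ' 2>&1', $output, $exitCode);
    return $exitCode === 0;
}
```

Everything stays on your own machine, and the per-file PDFs can then be stitched together with whichever PDF library you settle on.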
Good luck.
What is, according to you, the best way to convert uploaded files of any kind (.doc, .docx, ...) into a PDF file using nothing but PHP? Is it even possible to do so?
I looked at FPDF, but this creates the pdf files from text.
Another solution previously given was to use the PDFlib library on your server, but unfortunately, my server doesn't support this library...
What is the best way to convert the files my users upload on my site to PDF files?
A simpler approach would be to restrict uploads to the .pdf format programmatically and require your users to upload only .pdf files. Provide a link on the upload page to a free PDF printer (e.g. CutePDF Writer) that the user can install to create .pdf documents from any file that can be printed.
Trying to do it through PHP will be problematic because the uploads could be generated from many different programs that would be impossible to cater for in their entirety. e.g. How would it handle Scribus or ABC Flowcharter or any other 'non-standard' application someone used to create a document?
Much better to filter the upload upfront.
The best server-side PDF generator from those I tried was, so far, wkhtmltopdf, a WebKit-based, self-contained invisible browser that can render any HTML+CSS and generate a PDF from it. Reasonably fast and fairly reliable, has some useful PDF options, such as page size, orientation, etc.
The second part of the job in your case is to convert documents to HTML prior to feeding them to wkhtmltopdf. If possible, have your users upload the docs in HTML (Word and Co. can export (crappy) HTML). If this is not an option, you will have to find a tool just for that, which, in my opinion, is much easier than finding a tool that converts Word docs directly into PDF.
Another good thing about wkhtmltopdf is that you can feed the output of your PHP script to it using the output-buffering functions (ob_start(), ob_get_clean() and friends).
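A minimal sketch of that output-buffering approach, assuming wkhtmltopdf is installed and on the PATH: the script's HTML is captured with ob_start()/ob_get_clean() and piped to wkhtmltopdf, where "-" as input and output means stdin and stdout. The function names renderToHtml() and htmlToPdf() are hypothetical.

```php
<?php
// Hypothetical helper: captures whatever the callback echoes as a string.
function renderToHtml(callable $render): string
{
    ob_start();
    $render();
    return (string) ob_get_clean();
}

// Hypothetical helper: pipes HTML into wkhtmltopdf and returns the PDF bytes,
// or null if the process could not be started or exited with an error.
function htmlToPdf(string $html, string $binary = 'wkhtmltopdf'): ?string
{
    $spec = [
        0 => ['pipe', 'r'], // stdin: the HTML
        1 => ['pipe', 'w'], // stdout: the PDF
        2 => ['pipe', 'w'], // stderr: kept quiet by --quiet
    ];
    $proc = proc_open(escapeshellcmd($binary) . ' --quiet - -', $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }

    fwrite($pipes[0], $html);
    fclose($pipes[0]);

    $pdf = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    return proc_close($proc) === 0 ? $pdf : null;
}
```

Typical use would be $pdf = htmlToPdf(renderToHtml(function () { include 'report.php'; })); where report.php stands in for whatever script produces your HTML.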
PHPExcel is the best simple way to create xls, xlsx and pdf files with PHP (for docx, its sibling project PHPWord is the closer fit). It's a lot easier to use and has clear documentation.
Use Microsoft Office to render Microsoft Office documents, if you care about accuracy at all. This is easily done by invoking Office over COM.
Get access to your server, and install what you need. Doing so would be far easier than monkeying around with sub-par solutions.
Well... I can think of one way of doing it quite easily, but it doesn't involve using PHP.
Upload your documents to a folder on your server that is browsable by your users.
EG: http://mysite.com/docs/
Then get your users to install a virtual printer driver such as Primo PDF
http://www.primopdf.com/index.aspx
then they can load the document into their browser, and print to PDF for offline browsing.
If this is not an option, and you're dealing with Office documents that conform to the OpenXML standard, you could attempt to parse the XML doc into a PHP page for display in the browser, then use JavaScript to trigger a print.
Unfortunately, it does still depend on your user having a PDF printer installed.
Alternatively, you could just load the docs natively, and print to your own PDF printer, then upload the PDFs to the web server for download.
I can't think of any easy way of doing this otherwise, without installing all sorts of different document parser tool-kits and doing a huge amount of behind the scenes work.
I've built a web application incorporating the fpdf library which allows clients to upload pdf files which my system then combines into a monthly report (adding a cover, contents page etc.).
Last month I got this error:
FPDF Error: Error while decompressing stream
I've googled it and the only people who have encountered it before seem to be German!
The error handler is at line 241 of fpdi_pdf_parser.php and refers to "case '/FlateDecode':" and other things I don't understand.
I traced the problem to a single pdf file which appeared normal but consistently caused the problem. I created a new version of the pdf by screen grabbing from the old one and when I uploaded that everything worked.
As I say I got round the problem but don't really understand how and don't want to run into the same thing again.
Any ideas what was going on?
Thanks in advance.
PDF files can be compressed in different ways with different algorithms. If your application accepts any file, it is possible that you got a corrupt one that FPDF was not able to decompress. Even in such scenarios (I mean corrupt files) other PDF parsers/readers may be able to recover the file and show the content (or some part of it), but that does not mean the file is valid.
It is also possible that this file contains some specific feature from the PDF specification that is not supported by FPDF. If it is an option for you to post the offending file it might be possible to narrow down the issue a bit more.
Usually in such cases it helps to install or update the zlib module for PHP. The problem can also arise from images inserted into the PDF document (see the image requirements at http://www.fpdf.org/en/doc/image.htm).
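To check both suspects quickly, here is a minimal sketch in plain PHP. It confirms that the zlib extension is loaded and then tries to inflate a chunk of data the way a /FlateDecode stream is normally inflated (zlib format, via gzuncompress()). The function name canInflate() is hypothetical, and pulling the raw stream bytes out of the PDF is left out.

```php
<?php
// Hypothetical helper: true if the data inflates cleanly, false if it is
// corrupt or truncated. Throws if the zlib extension is not available.
function canInflate(string $streamData): bool
{
    if (!extension_loaded('zlib')) {
        throw new RuntimeException('The zlib extension is not loaded.');
    }

    // gzuncompress() returns false (and raises a warning) on bad data.
    return @gzuncompress($streamData) !== false;
}
```

If zlib is present and a particular stream still fails to inflate, the file itself is the likely culprit, which would match re-creating the PDF making the error go away.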
Looking for a way to enable someone to upload a single file which will be series of image files (all gif) merged together as one big file. Here is what I need to do:
Using VB6, want to merge the image files (potentially dozens of them) into a single file
Upload file to a PHP Script (easy enough)
Have PHP break apart the single file and write image files
I know how to handle the uploading of the file. I also know how to write the image files in PHP. What I am unsure of is the merging/un-merging operation.
In theory, I should just be able to use VB6 to merge all images using binary read/writing. However, does anyone know the series of binary codes that prefix each .gif file so PHP can pick up on that, or do I need to write some sort of binary separator in between each merged image?
I could surely tinker with this myself, but I thought some of you smarter-than-me coders may have already done this, and/or could provide a link, some code, or some 'things to consider'.
Thanks.
Instead of merging/un-merging, if the whole purpose is to avoid the overhead of sending dozens of files, why not zip them and unzip them in PHP?
That should be far easier than the merging operation you're proposing.
Here's a free Zip/Unzip library for Windows: Info-ZIP
Here's some sample code that uses Info-ZIP: Zip and Unzip Using VB5 or VB6
Here's PHP's documentation on the ZIP module: php.net/zip
Here's an example of how to use "unzip" command through PHP, rather than using the Zip module: Zipping and Unzipping Files with PHP
Google is your friend :)
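For the PHP side, a minimal sketch using the Zip module's ZipArchive class might look like this. It walks the uploaded archive, keeps only entries that start with the GIF signature ("GIF87a" or "GIF89a"), and writes them into a target directory. The function name extractGifs() is hypothetical, and basename() is used so an entry name cannot write outside the target directory.

```php
<?php
// Hypothetical helper: extracts GIF entries from a zip, returns written paths.
function extractGifs(string $zipPath, string $targetDir): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open archive: $zipPath");
    }

    $written = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        $data = $zip->getFromIndex($i);

        // Skip anything that is not actually a GIF, whatever its extension.
        if ($data === false || !preg_match('/^GIF8[79]a/', $data)) {
            continue;
        }

        $dest = $targetDir . DIRECTORY_SEPARATOR . basename($name);
        file_put_contents($dest, $data);
        $written[] = $dest;
    }

    $zip->close();
    return $written;
}
```

In the upload handler this would be called with $_FILES[...]['tmp_name'] as the archive path, so no custom binary separator is needed between images.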